Backend-agnostic ontology / knowledge-graph toolkit — build a clean KG from documents or tables, then search it with one-shot GraphRAG. Zero infra, any graph DB.

xgen-ontology

Backend-agnostic ontology / knowledge-graph toolkit. Turn documents or tables into a clean knowledge graph — extract, resolve, dedup, induce the is-a hierarchy, govern predicates, score quality — then search it with one-shot GraphRAG. Zero infra (the whole thing runs on a pure-Python in-memory backend), zero lock-in (load into any SPARQL 1.1 store), zero hard deps in the core.

from xgen_ontology import build_from_csv

onto = build_from_csv({                       # no LLM, no DB, no API key
    "products": "product_id,name,color_id\n1,Widget,10\n2,Gadget,20",
    "colors":   "color_id,name\n10,Red\n20,Blue",
})
print(onto.stats())          # {'classes': 2, 'instances': 4, 'relations': 2, ...}
print(onto.to_turtle())      # standard RDF/Turtle
print(onto.search("what color is Widget").answer)
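
The tabular stage is described as detecting foreign keys by same-name, normalized-name, or value-overlap matching. The sketch below illustrates that idea on the two tables from the example; it is not the library's implementation. The helper names (parse, normalize, detect_fks), the "first column is the primary key" rule, and the 0.8 overlap threshold are all assumptions made for illustration.

```python
import csv
import io


def parse(text):
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def normalize(name):
    """Fold case and drop separators so 'Color_ID' and 'colorid' compare equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def detect_fks(tables, min_overlap=0.8):
    """Return (table, column, target_table, reason) for likely foreign keys.

    Assumption: the first column of each table is its primary key.
    """
    parsed = {name: parse(text) for name, text in tables.items()}
    pks = {name: next(iter(rows[0])) for name, rows in parsed.items() if rows}
    found = []
    for name, rows in parsed.items():
        if not rows:
            continue
        for col in rows[0]:
            if col == pks[name]:
                continue  # a table's own key is not a foreign key
            values = {row[col] for row in rows}
            for target, pk in pks.items():
                if target == name:
                    continue
                if col == pk:
                    reason = "same-name"
                elif normalize(col) == normalize(pk):
                    reason = "normalized-name"
                else:
                    target_values = {row[pk] for row in parsed[target]}
                    overlap = len(values & target_values) / len(values)
                    if overlap < min_overlap:
                        continue
                    reason = "value-overlap"
                found.append((name, col, target, reason))
    return found


tables = {
    "products": "product_id,name,color_id\n1,Widget,10\n2,Gadget,20",
    "colors": "color_id,name\n10,Red\n20,Blue",
}
fks = detect_fks(tables)
print(fks)  # [('products', 'color_id', 'colors', 'same-name')]
```

In this toy version each detected pair would become an ObjectProperty from the products class to the colors class; the name columns are left alone because their values never overlap with the other table's key.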

Build from prose with any LLM, mix tables and text freely:

from xgen_ontology import build_from_documents, CallableLLM

llm = CallableLLM(lambda p, system="": my_model(system, p))     # OpenAI / Anthropic / vLLM / …
onto = build_from_documents({
    "policy.txt": "Rule A applies to Acme Bank since 2020.",
    "colors.csv": "color_id,name\n10,Red\n20,Blue",
}, llm=llm)
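
The example wraps a plain callable so the pipeline can call generate(prompt, system) on it. A minimal sketch of that adapter pattern follows, assuming the protocol is a single generate method; LLMLike, FunctionLLM, and fake_model are hypothetical names for illustration and are not taken from the library's source.

```python
from typing import Callable, Protocol


class LLMLike(Protocol):
    """Hypothetical protocol: anything with generate(prompt, system) -> str."""

    def generate(self, prompt: str, system: str = "") -> str: ...


class FunctionLLM:
    """Adapter that turns a plain function into an LLMLike object."""

    def __init__(self, fn: Callable[..., str]) -> None:
        self._fn = fn

    def generate(self, prompt: str, system: str = "") -> str:
        # Forward the call unchanged; the wrapped function decides argument order.
        return self._fn(prompt, system=system)


def fake_model(system: str, prompt: str) -> str:
    """Stand-in for a real model client; it only echoes its inputs."""
    return f"[{system}] {prompt.upper()}"


llm: LLMLike = FunctionLLM(lambda p, system="": fake_model(system, p))
reply = llm.generate("extract entities", system="You are an extractor.")
print(reply)  # [You are an extractor.] EXTRACT ENTITIES
```

Because the pipeline would depend only on the method signature, swapping in an OpenAI, Anthropic, or vLLM client means changing the wrapped function and nothing else.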

Two halves of the lifecycle

Build — documents/tables → a clean graph

The pipeline is a sequence of independently-importable, backend-agnostic stages:

Stage      What it does
tabular    table → ontology with no LLM: table→Class, FK→ObjectProperty (same-name / normalized-name / value-overlap detection), column→DataProperty, dimension rows→instances; large fact/junction tables stay schema-only
extract    one LLM call per chunk batch → schema and instances, tagged to source chunks; junk (base64/degenerate) filtered first
resolve    entity resolution: fold case/whitespace/unicode + similar surface forms, guarding dates/ids and number-conflicting names
govern     predicate governance: fold surface variants of a relation, anchor to the schema vocabulary
dedup      merge synonymous classes/properties/instances using rule keys, LLM synonym groups, and embedding cosine clusters
hierarchy  keep only genuine is-a edges ("being linked is not being a subclass"), break cycles, then SCS context profiles with property inheritance
quality    a graph-reviewer score: completeness · integrity · grounding · shape
community  Louvain modularity clustering (pure Python)
emit       Turtle (zero-dep) or OWL/RDF-XML (rdflib)
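
To make the resolve stage concrete, here is a small sketch of folding surface forms while guarding names whose numbers differ. It assumes NFKC normalization, casefolding, and a difflib similarity threshold of 0.9; the function names and the threshold are illustrative choices, not the library's actual rules.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def fold(name):
    """Normalize unicode (NFKC), fold case, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(text.split())


def numbers(name):
    """All digit runs in a name, in order of appearance."""
    return re.findall(r"\d+", name)


def same_entity(a, b, threshold=0.9):
    """Decide whether two surface forms should be merged."""
    fa, fb = fold(a), fold(b)
    if fa == fb:
        return True
    # Guard: dates, ids, and names like "Rule 1" / "Rule 2" look similar
    # as strings but are different things, so differing digits block a merge.
    if numbers(fa) != numbers(fb):
        return False
    return SequenceMatcher(None, fa, fb).ratio() >= threshold


print(same_entity("Acme  Bank", "ACME bank"))      # True: case and whitespace
print(same_entity("Ａｃｍｅ Bank", "acme bank"))    # True: full-width letters
print(same_entity("Acme Bank", "Acme Banks"))      # True: similar surface form
print(same_entity("Rule 1", "Rule 2"))             # False: numbers conflict
print(same_entity("2020-01-01", "2020-01-02"))     # False: dates are guarded
```

The guard runs before the similarity check on purpose: a one-character difference in an id or date is exactly the case where string similarity is most misleading.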

Search — one-shot GraphRAG

Rather than running an iterative ReAct loop, search fires several retrieval strategies at once and fuses the results:

  1. vector / lexical passages (what it says)
  2. graph label-linking → 1-hop relations (how entities connect)
  3. class enumeration — the complete "list/count" a vector index can't give
  4. HippoRAG: entities of the retrieved chunks → 1-hop expansion
  5. evidence assembled with MMR diversity + adaptive top-k (the decisive minority — a warning/exception — survives instead of being crowded out)
  6. one LLM synthesis; honest evidence_nodes = only the nodes the answer cites

res = onto.search("which regulation applies to Acme Bank", llm=llm)
res.answer          # the synthesis
res.relations       # graph relations used
res.evidence_nodes  # nodes the answer actually cites (honest highlight)
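
Step 5 relies on MMR (maximal marginal relevance) so that a relevant but dissimilar passage is not crowded out by near-duplicates. The sketch below shows the textbook MMR selection loop on toy 2-D vectors; the vectors, the names, and lam=0.5 are made up for illustration, and the adaptive top-k part is not covered.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k=2, lam=0.5):
    """Pick k candidate names, trading relevance against redundancy.

    score = lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected)
    """
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:

        def score(name):
            relevance = cosine(query, remaining[name])
            redundancy = max(
                (cosine(remaining[name], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


query = [1.0, 0.3]
candidates = {
    "rule": [1.0, 0.05],       # most relevant passage
    "rule_dup": [1.0, 0.0],    # near-duplicate of "rule"
    "exception": [0.3, 1.0],   # less similar to the query, but new information
}

by_relevance = sorted(
    candidates, key=lambda n: cosine(query, candidates[n]), reverse=True
)[:2]
print(by_relevance)                   # ['rule', 'rule_dup']
print(mmr(query, candidates, k=2))    # ['rule', 'exception']
```

Plain top-k spends both slots on the same rule, while MMR keeps the exception, which is the behavior the step describes for the "decisive minority".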

Any graph DB, or none

The algorithms only ever talk to small protocols (GraphStore, VectorStore, LLM, GraphSink, Morphology, Embedder), never to a database:

# zero infra — pure-Python in-memory (default)
onto.search("…")

# load into any SPARQL 1.1 store (Fuseki, GraphDB, Blazegraph, Virtuoso, …)
from xgen_ontology import fuseki
store = fuseki("http://localhost:3030", "ds", user="admin", password="…")
onto.push(store, graph="urn:my-graph")           # write
onto.search("…")                                  # or search a remote store via SparqlGraph

SparqlGraph is stdlib-only (urllib) and uses the portable FILTER(CONTAINS(...)) form, so it works on any SPARQL 1.1 endpoint, not only on stores with a jena-text full-text index.
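
As an illustration of what a portable label lookup can look like, the sketch below builds a SPARQL 1.1 query with FILTER(CONTAINS(...)) and prepares a urllib request for it without sending anything. The use of rdfs:label, the /ds/sparql endpoint path, and the helper names are assumptions; the query SparqlGraph actually sends may differ.

```python
import urllib.parse
import urllib.request


def sparql_string(text):
    """Quote text as a SPARQL string literal, escaping special characters."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def label_search_query(term, graph=None, limit=10):
    """Case-insensitive substring match on rdfs:label (assumed predicate)."""
    needle = sparql_string(term.lower())
    pattern = (
        f"?s rdfs:label ?label . "
        f"FILTER(CONTAINS(LCASE(STR(?label)), {needle}))"
    )
    if graph is not None:
        pattern = f"GRAPH <{graph}> {{ {pattern} }}"
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f"SELECT ?s ?label WHERE {{ {pattern} }} LIMIT {int(limit)}"
    )


def build_request(endpoint, query):
    """Prepare a SPARQL 1.1 Protocol POST with a form-encoded query."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


query = label_search_query("Widget", graph="urn:my-graph")
request = build_request("http://localhost:3030/ds/sparql", query)  # assumed path
print(query)
# Sending is left out on purpose; urllib.request.urlopen(request) would do it.
```

CONTAINS, LCASE, and STR are standard SPARQL 1.1 functions, which is why a query of this shape does not depend on a store-specific text index. The trade-off is that a substring filter scans labels instead of using an index.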

Install

pip install xgen-ontology                 # core, zero deps
pip install "xgen-ontology[rdf]"          # + rdflib (OWL / RDF-XML emit & parse)
pip install "xgen-ontology[korean]"       # + kiwipiepy (Korean morphological dedup)
pip install "xgen-ontology[vector]"       # + qdrant-client (embedding adapters)

Run the demos from a source checkout without installing the package:

python examples/build_csv.py
python examples/build_and_search.py

Design — algorithms as a library

  • dependencies = [] — the core needs nothing but the standard library. The in-memory graph indexes labels with BM25 (CJK character n-grams, so Korean/CJK search works with no morphological analyzer); the Turtle writer is hand-rolled.
  • English-neutral by default — no hardcoded language. Korean morphology, name→URI translation and the extraction/synthesis prompts are all pluggable; the defaults assume nothing about your domain or language.
  • Bring your own everything: LLM (generate(prompt, system)), embedder, morphology, graph store. The bundled EchoLLM lets the whole pipeline run with no API key.

src/xgen_ontology/
  models.py        # Class/Property/Concepts (T-Box), Instance/Relation/DataValue (A-Box), Node/Chunk
  protocols.py     # LLM / GraphStore / VectorStore / GraphSink / Morphology / Embedder
  text.py          # tokenizer + BM25 (CJK n-grams), IRI-safe slugging
  build/
    tabular.py     # table -> ontology (no LLM)
    extract.py     # document -> ontology (LLM)
    resolve.py     # entity resolution
    govern.py      # predicate governance
    dedup.py       # rule + LLM + vector dedup
    hierarchy.py   # is-a cleaning + SCS inheritance
    quality.py     # graph-reviewer score
    community.py   # Louvain
    emit.py        # Turtle / OWL
    pipeline.py    # OntologyBuilder (wires the stages)
  backends/
    memory.py      # InMemoryGraph / InMemoryVector / InMemoryGraphSink (zero infra)
    sparql.py      # SparqlGraph — any SPARQL 1.1 store (read + write)
  search/          # fusion + one-shot GraphRAG
  ontology.py      # Ontology — the hub (search / emit / push / quality / communities)
  facade.py        # build_from_csv / build_from_documents / build_from_triples
examples/  tests/
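
The tree lists text.py as a tokenizer plus BM25 with CJK n-grams. The sketch below shows one way that combination can work with the standard library only: Latin-script words stay whole, CJK runs are split into character bigrams, and labels are ranked with a common BM25 formula. The character ranges, the bigram size, and the k1/b values are assumptions, not the package's actual settings.

```python
import itertools
import math
import re
from collections import Counter


def is_cjk(ch):
    """Rough check for kana, CJK ideographs, and Hangul syllables."""
    return (
        "\u3040" <= ch <= "\u30ff"
        or "\u4e00" <= ch <= "\u9fff"
        or "\uac00" <= ch <= "\ud7a3"
    )


def tokenize(text, n=2):
    """Lowercased words for non-CJK text, character n-grams for CJK runs."""
    tokens = []
    for run in re.findall(r"\w+", text.lower()):
        for cjk, group in itertools.groupby(run, key=is_cjk):
            segment = "".join(group)
            if cjk and len(segment) > n:
                tokens.extend(segment[i:i + n] for i in range(len(segment) - n + 1))
            else:
                tokens.append(segment)
    return tokens


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.ids = list(docs)
        self.tfs = [Counter(tokenize(docs[i])) for i in self.ids]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lengths) / len(self.lengths) if self.lengths else 0.0
        self.df = Counter(term for tf in self.tfs for term in tf)

    def idf(self, term):
        n = len(self.ids)
        return math.log(1 + (n - self.df[term] + 0.5) / (self.df[term] + 0.5))

    def search(self, query, top_k=3):
        terms = tokenize(query)
        scored = []
        for doc_id, tf, length in zip(self.ids, self.tfs, self.lengths):
            score = 0.0
            for term in terms:
                f = tf.get(term, 0)
                if f:
                    norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
                    score += self.idf(term) * f * (self.k1 + 1) / norm
            if score > 0:
                scored.append((score, doc_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in scored[:top_k]]


labels = {
    "n1": "빨간색 위젯",   # "red widget" in Korean
    "n2": "파란색 가젯",   # "blue gadget" in Korean
    "n3": "Red Widget",
}
index = BM25(labels)
print(tokenize("빨간색 Widget"))   # ['빨간', '간색', 'widget']
print(index.search("빨간"))        # ['n1']
print(index.search("widget"))      # ['n3']
```

Character n-grams let a query such as "빨간" match inside "빨간색" without a morphological analyzer. The cost is that a query shorter than the n-gram size matches nothing in this simplified version.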

Roadmap

  • async pipeline (parallel chunk extraction + parallel search seeds)
  • Neo4j / property-graph GraphStore adapter; Qdrant VectorStore adapter
  • RDF-star / qualified statements (n-ary relations, provenance) in emit
  • reranker / cross-encoder hook for search

License

TBD.
