Backend-agnostic ontology / knowledge-graph toolkit — parse + chunk documents (or tables), build a clean KG, then search it with one-shot GraphRAG. Zero infra, any graph DB.

xgen-ontology

Backend-agnostic ontology / knowledge-graph toolkit. Turn documents or tables into a clean knowledge graph — extract, resolve, dedup, induce the is-a hierarchy, govern predicates, score quality — then search it with one-shot GraphRAG. Zero infra (the whole thing runs on a pure-Python in-memory backend), zero lock-in (load into any SPARQL 1.1 store), zero hard deps in the core.

from xgen_ontology import build_from_csv

onto = build_from_csv({                       # no LLM, no DB, no API key
    "products": "product_id,name,color_id\n1,Widget,10\n2,Gadget,20",
    "colors":   "color_id,name\n10,Red\n20,Blue",
})
print(onto.stats())          # {'classes': 2, 'instances': 4, 'relations': 2, ...}
print(onto.to_turtle())      # standards RDF/Turtle
print(onto.search("what color is Widget").answer)
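To make the table→Class / FK→ObjectProperty mapping concrete, here is a minimal sketch of same-name foreign-key detection on the two tables above. It is an illustration only, not xgen-ontology's implementation: `header`, `detect_same_name_fks`, and the rule "the first column of each table is its primary key" are assumptions made for this example.

```python
import csv
import io


def header(text: str) -> list[str]:
    """Return the column names from the first row of a CSV string."""
    return next(csv.reader(io.StringIO(text)))


def detect_same_name_fks(tables: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (source_table, column, target_table) for same-name key matches.

    Assumption (hypothetical rule for this sketch): the first column of each
    table is its primary key, and a non-key column with the same name in
    another table is a foreign key pointing at it.
    """
    headers = {name: header(text) for name, text in tables.items()}
    primary_keys = {name: cols[0] for name, cols in headers.items()}
    links = []
    for source, cols in headers.items():
        for col in cols[1:]:  # skip the table's own key column
            for target, pk in primary_keys.items():
                if target != source and col == pk:
                    links.append((source, col, target))
    return links


tables = {
    "products": "product_id,name,color_id\n1,Widget,10\n2,Gadget,20",
    "colors": "color_id,name\n10,Red\n20,Blue",
}
links = detect_same_name_fks(tables)
print(links)  # [('products', 'color_id', 'colors')]
```

Each detected link is what would become an object property between the two classes; the normalized-name and value-overlap strategies named in the stage table are not covered here.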

Build from prose with any LLM, mix tables and text freely — raw documents are parsed and chunked for you:

from xgen_ontology import build_from_files, build_from_text, CallableLLM

llm = CallableLLM(lambda p, system="": my_model(system, p))     # OpenAI / Anthropic / vLLM / …

# from files on disk (txt/md/html/csv built-in; pdf/docx/xlsx via the [files] extra)
onto = build_from_files(["policy.pdf", "products.csv"], llm=llm)

# or from a single raw string (auto boundary-aware chunking)
onto = build_from_text("Rule A applies to Acme Bank since 2020. ...", llm=llm)
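The LLM contract is described in this README as a single `generate(prompt, system)` method. The sketch below shows how such a contract can be expressed and satisfied with a plain callable. The names `TextLLM`, `FunctionLLM`, and `fake_model` are hypothetical stand-ins written for this example; they are not the library's own classes.

```python
from typing import Callable, Protocol


class TextLLM(Protocol):
    """Hypothetical one-method contract: prompt in, completion text out."""

    def generate(self, prompt: str, system: str = "") -> str: ...


class FunctionLLM:
    """Adapts any callable taking (prompt, system=...) to the contract above."""

    def __init__(self, fn: Callable[..., str]) -> None:
        self._fn = fn

    def generate(self, prompt: str, system: str = "") -> str:
        return self._fn(prompt, system=system)


def fake_model(system: str, prompt: str) -> str:
    """Stand-in for a real client call (OpenAI, Anthropic, vLLM, ...)."""
    return f"[{system or 'no system'}] {prompt.upper()}"


# Same lambda shape as in the README example, wired to the stand-in model.
llm: TextLLM = FunctionLLM(lambda p, system="": fake_model(system, p))
reply = llm.generate("extract entities", system="be terse")
print(reply)  # [be terse] EXTRACT ENTITIES
```

Because the contract is structural, any object with a matching `generate` method satisfies it, so a real client can be swapped in without subclassing.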

Two halves of the lifecycle

Build — documents/tables → a clean graph

The pipeline is a sequence of independently-importable, backend-agnostic stages:

| Stage | What it does |
| --- | --- |
| parse | extract text from files: txt/md/html/csv built-in (zero-dep), pdf/docx/xlsx via [files] |
| chunk | boundary-aware chunking (paragraph→sentence→char) with overlap, stable chunk ids for provenance |
| tabular | table → ontology with no LLM: table→Class, FK→ObjectProperty (same-name / normalized-name / value-overlap detection), column→DataProperty, dimension rows→instances; large fact/junction tables stay schema-only |
| extract | one LLM call per chunk batch → schema and instances, tagged to source chunks; junk (base64/degenerate) filtered first |
| resolve | entity resolution: fold case/whitespace/unicode + similar surface forms, guarding dates/ids and number-conflicting names |
| govern | predicate governance: fold surface variants of a relation, anchor to the schema vocabulary |
| dedup | merge synonymous classes/properties/instances using rule keys, LLM synonym groups, and embedding cosine clusters |
| hierarchy | keep only genuine is-a edges ("being linked is not being a subclass"), break cycles, then SCS context profiles with property inheritance |
| quality | a graph-reviewer score: completeness · integrity · grounding · shape |
| community | Louvain modularity clustering (pure Python) |
| emit | Turtle (zero-dep) or OWL/RDF-XML (rdflib) |
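As an illustration of the chunk stage, the sketch below splits text by paragraph, falls back to sentences and then to fixed-width character slices, packs the pieces with a one-unit overlap, and derives each chunk id from a content hash so the same text always yields the same id. This is one plausible reading of "boundary-aware chunking with stable ids"; the function names, the regexes, and the SHA-1 id scheme are assumptions, not the library's code.

```python
import hashlib
import re


def split_units(text: str, max_chars: int) -> list[str]:
    """Split into paragraphs, then sentences, then hard slices as needed."""
    units = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            units.append(para)
            continue
        for sent in re.split(r"(?<=[.!?])\s+", para):
            if len(sent) <= max_chars:
                units.append(sent)
            else:  # last resort: fixed-width character slices
                units.extend(sent[i:i + max_chars] for i in range(0, len(sent), max_chars))
    return units


def chunk(text: str, max_chars: int = 80, overlap: int = 1) -> list[dict]:
    """Greedily pack units into chunks, repeating the last `overlap` units."""
    groups, current = [], []
    for unit in split_units(text, max_chars):
        if current and len(" ".join(current + [unit])) > max_chars:
            groups.append(current)
            tail = current[-overlap:] if overlap else []
            # Keep the overlap only if it still fits next to the new unit.
            current = tail if len(" ".join(tail + [unit])) <= max_chars else []
        current.append(unit)
    if current:
        groups.append(current)
    chunks = []
    for group in groups:
        body = " ".join(group)
        # Content-derived id (assumed scheme): same text, same id.
        chunk_id = hashlib.sha1(body.encode("utf-8")).hexdigest()[:12]
        chunks.append({"id": chunk_id, "text": body})
    return chunks


chunks = chunk("A one. B two. C three. D four.", max_chars=16)
print([c["text"] for c in chunks])
# ['A one. B two.', 'B two. C three.', 'C three. D four.']
```

Hash-based ids keep provenance links valid across rebuilds as long as the chunk text is unchanged; a position-based id would shift whenever earlier text is edited.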

Search — one-shot GraphRAG

Search is not an iterative ReAct loop. It fires several retrieval strategies at once and fuses the results:

  1. vector / lexical passages (what it says)
  2. graph label-linking → 1-hop relations (how entities connect)
  3. class enumeration — the complete "list/count" a vector index can't give
  4. HippoRAG: entities of the retrieved chunks → 1-hop expansion
  5. evidence assembled with MMR diversity + adaptive top-k (the decisive minority — a warning/exception — survives instead of being crowded out)
  6. one LLM synthesis; honest evidence_nodes = only the nodes the answer cites

res = onto.search("which regulation applies to Acme Bank", llm=llm)
res.answer          # the synthesis
res.relations       # graph relations used
res.evidence_nodes  # nodes the answer actually cites (honest highlight)
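Step 5 relies on MMR (maximal marginal relevance) to keep a relevant but dissimilar passage from being crowded out by near-duplicates. The sketch below is a generic MMR selection over toy 2-d embeddings; the vectors, the passage ids, and the `lam` default are made up for illustration, and the adaptive top-k part is not covered.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query: list[float], candidates: dict[str, list[float]],
        k: int = 2, lam: float = 0.5) -> list[str]:
    """Select k ids, trading relevance against redundancy.

    Each step picks the candidate maximizing
    lam * sim(query, item) - (1 - lam) * max(sim(item, already selected)).
    """
    selected: list[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(cid: str) -> float:
            relevance = cosine(query, remaining[cid])
            redundancy = max(
                (cosine(remaining[cid], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy embeddings (hypothetical): two near-duplicate rules and one exception.
query = [1.0, 0.3]
passages = {
    "rule": [1.0, 0.0],
    "rule_dup": [0.99, -0.05],
    "exception": [0.5, 0.9],
}
print(mmr(query, passages, k=2, lam=0.5))  # ['rule', 'exception']
print(mmr(query, passages, k=2, lam=1.0))  # ['rule', 'rule_dup']
```

With `lam=1.0` the selection degenerates to plain top-k by relevance and the duplicate wins the second slot; with `lam=0.5` the exception survives.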

Any graph DB, or none

The algorithms only ever talk to small protocols (GraphStore, VectorStore, LLM, GraphSink, Morphology, Embedder), never to a database:

# zero infra — pure-Python in-memory (default)
onto.search("…")

# load into any SPARQL 1.1 store (Fuseki, GraphDB, Blazegraph, Virtuoso, …)
from xgen_ontology import fuseki
store = fuseki("http://localhost:3030", "ds", user="admin", password="…")
onto.push(store, graph="urn:my-graph")           # write
onto.search("…")                                  # or search a remote store via SparqlGraph

SparqlGraph is stdlib-only (urllib) and uses portable FILTER(CONTAINS(...)), so it works on any SPARQL 1.1 endpoint — not just jena-text.
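For readers unfamiliar with the portable approach, here is a rough sketch of a label search that uses only standard SPARQL 1.1 functions and a urllib request. The query shape, the `rdfs:label` predicate, the helper names, and the `/ds/sparql` endpoint path are assumptions for illustration; SparqlGraph's actual queries may differ.

```python
import urllib.parse
import urllib.request


def label_search_query(term: str, limit: int = 10) -> str:
    """Build a label search using only standard SPARQL 1.1 functions."""
    # Escape backslashes and quotes so the term is a valid string literal.
    escaped = term.lower().replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?s ?label WHERE {\n"
        "  ?s rdfs:label ?label .\n"
        f'  FILTER(CONTAINS(LCASE(STR(?label)), "{escaped}"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Prepare a SPARQL Protocol POST; pass it to urllib.request.urlopen to run."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


query = label_search_query('Acme "Bank"')
# Assumed endpoint path for a local dataset named "ds"; nothing is sent here.
request = build_request("http://localhost:3030/ds/sparql", query)
print(query)
print(request.get_method(), request.full_url)
```

A `CONTAINS` filter scans every label, so it trades speed for portability; a vendor full-text index is faster but ties the query to one store.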

Install

pip install xgen-ontology                 # core, zero deps
pip install "xgen-ontology[files]"        # + pypdf / python-docx / openpyxl (parse pdf/docx/xlsx)
pip install "xgen-ontology[rdf]"          # + rdflib (OWL / RDF-XML emit & parse)
pip install "xgen-ontology[korean]"       # + kiwipiepy (Korean morphological dedup)
pip install "xgen-ontology[vector]"       # + qdrant-client (embedding adapters)

Run the demos straight from a repository checkout, with no install:

python examples/build_csv.py
python examples/build_and_search.py

Design — algorithms as a library

  • dependencies = [] — the core needs nothing but the standard library. The in-memory graph indexes labels with BM25 (CJK character n-grams, so Korean/CJK search works with no morphological analyzer); the Turtle writer is hand-rolled.
  • Language-neutral by default: no hardcoded language. Korean morphology, name→URI translation and the extraction/synthesis prompts are all pluggable; the defaults assume nothing about your domain or language.
  • Bring your own everything — LLM (generate(prompt, system)), embedder, morphology, graph store. The bundled EchoLLM lets the whole pipeline run with no API key.
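To show why character n-grams let CJK search work without a morphological analyzer, the sketch below tokenizes CJK runs into overlapping bigrams and ranks labels with Okapi BM25. It is a simplified stand-in written for this README, not the contents of text.py; the Unicode ranges, the parameter defaults, and the sample labels are assumptions.

```python
import math
import re
from collections import Counter

# Hiragana/Katakana, CJK unified ideographs, Hangul syllables (assumed coverage).
CJK = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7a3]+")
WORD = re.compile(r"[a-z0-9]+")


def tokenize(text: str, n: int = 2) -> list[str]:
    """Lowercased Latin words plus overlapping character n-grams for CJK runs."""
    text = text.lower()
    tokens = WORD.findall(text)
    for run in CJK.findall(text):
        if len(run) < n:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens


def bm25_rank(query: str, docs: dict[str, str],
              k1: float = 1.5, b: float = 0.75) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs sorted by descending Okapi BM25 score."""
    if not docs:
        return []
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs or 1.0
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


labels = {"bank": "은행 규제", "product": "제품 색상", "widget": "Widget color"}
# "은행의" is an inflected form of "은행"; the shared bigram still matches.
ranked = bm25_rank("은행의", labels)
print(ranked[0][0])  # bank
```

The inflected query matches because it shares the bigram "은행" with the label, which is the effect a morphological analyzer would otherwise provide.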
Package layout:

src/xgen_ontology/
  models.py        # Class/Property/Concepts (T-Box), Instance/Relation/DataValue (A-Box), Node/Chunk
  protocols.py     # LLM / GraphStore / VectorStore / GraphSink / Morphology / Embedder
  text.py          # tokenizer + BM25 (CJK n-grams), IRI-safe slugging
  build/
    parse.py       # file -> text (txt/md/html/csv; pdf/docx/xlsx optional)
    chunk.py       # boundary-aware chunking
    tabular.py     # table -> ontology (no LLM)
    extract.py     # document -> ontology (LLM)
    resolve.py     # entity resolution
    govern.py      # predicate governance
    dedup.py       # rule + LLM + vector dedup
    hierarchy.py   # is-a cleaning + SCS inheritance
    quality.py     # graph-reviewer score
    community.py   # Louvain
    emit.py        # Turtle / OWL
    pipeline.py    # OntologyBuilder (wires the stages)
  backends/
    memory.py      # InMemoryGraph / InMemoryVector / InMemoryGraphSink (zero infra)
    sparql.py      # SparqlGraph — any SPARQL 1.1 store (read + write)
  search/          # fusion + one-shot GraphRAG
  ontology.py      # Ontology — the hub (search / emit / push / quality / communities)
  facade.py        # build_from_csv / build_from_documents / build_from_triples
examples/  tests/

Roadmap

  • async pipeline (parallel chunk extraction + parallel search seeds)
  • Neo4j / property-graph GraphStore adapter; Qdrant VectorStore adapter
  • RDF-star / qualified statements (n-ary relations, provenance) in emit
  • reranker / cross-encoder hook for search

License

MIT © jinsoo96. See LICENSE.
