Backend-agnostic ontology / knowledge-graph toolkit — parse + chunk documents (or tables), build a clean KG, then search it with one-shot GraphRAG. Zero infra, any graph DB.

xgen-ontology

Backend-agnostic ontology / knowledge-graph toolkit. Turn documents or tables into a clean knowledge graph — extract, resolve, dedup, induce the is-a hierarchy, govern predicates, score quality — then search it with one-shot GraphRAG. Zero infra (the whole thing runs on a pure-Python in-memory backend), zero lock-in (load into any SPARQL 1.1 store), zero hard deps in the core.

from xgen_ontology import build_from_csv

onto = build_from_csv({                       # no LLM, no DB, no API key
    "products": "product_id,name,color_id\n1,Widget,10\n2,Gadget,20",
    "colors":   "color_id,name\n10,Red\n20,Blue",
})
print(onto.stats())          # {'classes': 2, 'instances': 4, 'relations': 2, ...}
print(onto.to_turtle())      # standard RDF/Turtle
print(onto.search("what color is Widget").answer)
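The example above works without an LLM because `color_id` appears in both tables. As a rough illustration of same-name foreign-key detection (not the library's actual code), the sketch below assumes the first column of each table is its primary key; `parse_header` and `detect_same_name_fks` are hypothetical helper names.

```python
import csv
import io


def parse_header(csv_text):
    """Return the header row of a CSV string."""
    return next(csv.reader(io.StringIO(csv_text)))


def detect_same_name_fks(tables):
    """Guess foreign keys by name: a non-key column in one table that has
    the same name as the first column (assumed primary key) of another."""
    headers = {name: parse_header(text) for name, text in tables.items()}
    primary_keys = {name: cols[0] for name, cols in headers.items()}
    links = []
    for name, cols in headers.items():
        for target, pk in primary_keys.items():
            if target != name and pk in cols[1:]:
                links.append((name, pk, target))
    return links


tables = {
    "products": "product_id,name,color_id\n1,Widget,10\n2,Gadget,20",
    "colors": "color_id,name\n10,Red\n20,Blue",
}
print(detect_same_name_fks(tables))  # [('products', 'color_id', 'colors')]
```

Each tuple reads as (source table, column, target table); in ontology terms that link would become an object property between the two classes.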

Build from prose with any LLM, mix tables and text freely — raw documents are parsed and chunked for you:

from xgen_ontology import build_from_files, build_from_text, CallableLLM

# my_model is your own function (system, prompt) -> str wrapping OpenAI / Anthropic / vLLM / …
llm = CallableLLM(lambda p, system="": my_model(system, p))

# from files on disk (txt/md/html/csv built-in; pdf/docx/xlsx via the [files] extra)
onto = build_from_files(["policy.pdf", "products.csv"], llm=llm)

# or from a single raw string (auto boundary-aware chunking)
onto = build_from_text("Rule A applies to Acme Bank since 2020. ...", llm=llm)
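To make the `CallableLLM` line above concrete, here is a minimal sketch of what such an adapter could look like, assuming the only contract is a `generate(prompt, system)` method. `CallableAdapter` and the placeholder `my_model` are hypothetical stand-ins written for this illustration, not the library's classes.

```python
from typing import Callable


class CallableAdapter:
    """Hypothetical stand-in: wraps any function as an object that
    exposes generate(prompt, system)."""

    def __init__(self, fn: Callable[..., str]):
        self._fn = fn

    def generate(self, prompt: str, system: str = "") -> str:
        return self._fn(prompt, system=system)


def my_model(system: str, prompt: str) -> str:
    """Placeholder model; replace the body with a real client call."""
    return f"[{system or 'no system'}] {prompt}"


llm = CallableAdapter(lambda p, system="": my_model(system, p))
print(llm.generate("Extract entities.", system="You are an extractor."))
```

The lambda only reorders arguments, so any client that takes a system prompt and a user prompt can be plugged in the same way.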

Two halves of the lifecycle

Build — documents/tables → a clean graph

The pipeline is a sequence of independently-importable, backend-agnostic stages:

| Stage | What it does |
| --- | --- |
| parse | extract text from files: txt/md/html/csv built-in (zero-dep), pdf/docx/xlsx via `[files]` |
| chunk | boundary-aware chunking (paragraph→sentence→char) with overlap, stable chunk ids for provenance |
| tabular | table → ontology with no LLM: table→Class, FK→ObjectProperty (same-name / normalized-name / value-overlap detection), column→DataProperty, dimension rows→instances; large fact/junction tables stay schema-only |
| extract | one LLM call per chunk batch → schema and instances, tagged to source chunks; junk (base64/degenerate) filtered first |
| resolve | entity resolution: fold case/whitespace/unicode + similar surface forms, guarding dates/ids and number-conflicting names |
| govern | predicate governance: fold surface variants of a relation, anchor to the schema vocabulary |
| dedup | merge synonymous classes/properties/instances: rule keys, LLM synonym groups, and embedding cosine clusters |
| hierarchy | keep only genuine is-a edges ("being linked is not being a subclass"), break cycles, then SCS context profiles with property inheritance |
| quality | a graph-reviewer score: completeness · integrity · grounding · shape |
| community | Louvain modularity clustering (pure Python) |
| emit | Turtle (zero-dep) or OWL/RDF-XML (rdflib) |
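The chunk stage is the easiest one to picture in code. The sketch below is an illustration under stated assumptions, not the package's implementation: it splits on blank lines, then on sentence punctuation, then on raw character windows, packs units greedily, carries the last unit over as overlap when it fits, and derives ids from a content hash so the same text always gets the same id. The function names and the `max_chars` / `overlap` parameters are hypothetical.

```python
import hashlib
import re


def split_units(text, max_chars):
    """Split into paragraphs, then sentences, then character windows,
    so that no unit is longer than max_chars."""
    units = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_chars:
            units.append(para)
            continue
        for sent in re.split(r"(?<=[.!?])\s+", para):
            if len(sent) <= max_chars:
                units.append(sent)
            else:  # last resort: hard character windows
                units.extend(sent[i:i + max_chars]
                             for i in range(0, len(sent), max_chars))
    return units


def chunk(text, max_chars=200, overlap=1):
    """Greedily pack units into chunks of at most max_chars characters,
    repeating the last `overlap` units in the next chunk when they fit."""
    chunks, current = [], []
    for unit in split_units(text, max_chars):
        if current and len(" ".join(current + [unit])) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap else []
            if len(" ".join(current + [unit])) > max_chars:
                current = []  # overlap would overflow, so drop it
        current.append(unit)
    if current:
        chunks.append(" ".join(current))
    # Content-derived id: identical chunk text always maps to the same id.
    return [(hashlib.sha1(c.encode("utf-8")).hexdigest()[:12], c) for c in chunks]


text = ("Rule A applies to Acme Bank. It took effect in 2020.\n\n"
        "Rule B applies to everyone else. It has one exception.")
for chunk_id, body in chunk(text, max_chars=40):
    print(chunk_id, body)
```

Hashing the content is one simple way to get stable ids; a position-based id would change whenever earlier text is edited.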

Search — one-shot GraphRAG

Not an iterative ReAct loop — fire several retrieval strategies at once and fuse:

vector / lexical passages (what it says)
graph label-linking → 1-hop relations (how entities connect)
class enumeration — the complete "list/count" a vector index can't give
HippoRAG: entities of the retrieved chunks → 1-hop expansion
evidence assembled with MMR diversity + adaptive top-k (the decisive minority — a warning/exception — survives instead of being crowded out)
one LLM synthesis; honest evidence_nodes = only the nodes the answer cites

res = onto.search("which regulation applies to Acme Bank", llm=llm)
res.answer          # the synthesis
res.relations       # graph relations used
res.evidence_nodes  # nodes the answer actually cites (honest highlight)
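The MMR step in the evidence assembly is what lets a lower-scoring exception survive next to near-duplicate passages. Below is a minimal sketch of maximal marginal relevance, assuming word-level Jaccard overlap as the similarity measure and made-up relevance scores; the library's own similarity function and its adaptive top-k are not shown in the source and are not reproduced here.

```python
def jaccard(a, b):
    """Word-overlap similarity between two strings, in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def mmr_select(candidates, k, lam=0.7):
    """candidates: list of (text, relevance). Repeatedly pick the item that
    maximises lam * relevance - (1 - lam) * similarity to what is already picked."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            text, relevance = item
            redundancy = max((jaccard(text, s) for s, _ in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [text for text, _ in selected]


candidates = [
    ("Rule A applies to Acme Bank", 0.90),
    ("Rule A applies to Acme Bank since 2020", 0.85),
    ("Exception: Rule A is suspended for small banks", 0.60),
]
picked = mmr_select(candidates, k=2, lam=0.5)
print(picked)
```

With these illustrative numbers the second pick is the exception rather than the near-duplicate, because the duplicate's overlap penalty outweighs its higher relevance.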

Any graph DB, or none

The algorithms only ever talk to small protocols (GraphStore, VectorStore, LLM, GraphSink, Morphology, Embedder), never to a database:

# zero infra — pure-Python in-memory (default)
onto.search("…")

# load into any SPARQL 1.1 store (Fuseki, GraphDB, Blazegraph, Virtuoso, …)
from xgen_ontology import fuseki
store = fuseki("http://localhost:3030", "ds", user="admin", password="…")
onto.push(store, graph="urn:my-graph")           # write
onto.search("…")                                  # or search a remote store via SparqlGraph

SparqlGraph is stdlib-only (urllib) and uses portable FILTER(CONTAINS(...)) rather than a store-specific full-text index, so it works on any SPARQL 1.1 endpoint, not only on Fuseki with jena-text.
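For readers unfamiliar with the portable form, the sketch below shows roughly what a label lookup with FILTER(CONTAINS(...)) and a stdlib-only request could look like. It is an assumption-laden illustration, not SparqlGraph's actual query: the use of `rdfs:label`, the helper names, and the `/ds/sparql` endpoint path are all hypothetical, and the request is only built, never sent.

```python
import urllib.parse
import urllib.request


def label_search_query(term, limit=10):
    """Build a SPARQL 1.1 query that matches label substrings,
    case-insensitively, using only standard functions."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?s ?label WHERE {\n"
        "  ?s rdfs:label ?label .\n"
        f'  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{escaped}")))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request(endpoint, query):
    """Prepare a form-encoded POST as described by the SPARQL 1.1 Protocol."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


query = label_search_query("Acme Bank")
req = build_request("http://localhost:3030/ds/sparql", query)  # assumed endpoint path
print(req.get_method(), req.full_url)
print(query)
```

Passing `req` to `urllib.request.urlopen` would execute it against a running store. The trade-off of CONTAINS is that it scans labels instead of using an index, which is the price of portability.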

Install

pip install xgen-ontology                 # core, zero deps
pip install "xgen-ontology[files]"        # + pypdf / python-docx / openpyxl (parse pdf/docx/xlsx)
pip install "xgen-ontology[rdf]"          # + rdflib (OWL / RDF-XML emit & parse)
pip install "xgen-ontology[korean]"       # + kiwipiepy (Korean morphological dedup)
pip install "xgen-ontology[vector]"       # + qdrant-client (embedding adapters)

Run the demos straight from a source checkout, with no install:

python examples/build_csv.py
python examples/build_and_search.py

Design — algorithms as a library

dependencies = [] — the core needs nothing but the standard library. The in-memory graph indexes labels with BM25 (CJK character n-grams, so Korean/CJK search works with no morphological analyzer); the Turtle writer is hand-rolled.
Language-neutral by default: no hardcoded language. Korean morphology, name→URI translation and the extraction/synthesis prompts are all pluggable; the defaults assume nothing about your domain or language.
Bring your own everything — LLM (generate(prompt, system)), embedder, morphology, graph store. The bundled EchoLLM lets the whole pipeline run with no API key.
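The first point above says CJK search works without a morphological analyzer because labels are indexed as character n-grams. A minimal sketch of that idea follows, assuming character bigrams for CJK runs, lowercase word tokens for Latin text, and a textbook BM25 formula; the character ranges, parameter defaults and function names are illustrative choices, not taken from the package's `text.py`.

```python
import math
import re
from collections import Counter

# Hiragana/Katakana, CJK ideographs, Hangul syllables (illustrative ranges).
CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]+")
WORD = re.compile(r"[A-Za-z0-9]+")


def tokenize(text, n=2):
    """Lowercase Latin words plus overlapping character n-grams for CJK runs."""
    tokens = [w.lower() for w in WORD.findall(text)]
    for run in CJK_RUN.findall(text):
        if len(run) < n:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indices sorted by BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


docs = ["금융감독 규정", "제품 색상 목록", "Acme Bank regulation"]
print(tokenize("감독규정"))
print(bm25_rank("감독규정", docs))
```

The query is written without a space while the document has one, yet they still share the bigrams 감독 and 규정, which is the effect an analyzer-free index relies on.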

src/xgen_ontology/
  models.py        # Class/Property/Concepts (T-Box), Instance/Relation/DataValue (A-Box), Node/Chunk
  protocols.py     # LLM / GraphStore / VectorStore / GraphSink / Morphology / Embedder
  text.py          # tokenizer + BM25 (CJK n-grams), IRI-safe slugging
  build/
    parse.py       # file -> text (txt/md/html/csv; pdf/docx/xlsx optional)
    chunk.py       # boundary-aware chunking
    tabular.py     # table -> ontology (no LLM)
    extract.py     # document -> ontology (LLM)
    resolve.py     # entity resolution
    govern.py      # predicate governance
    dedup.py       # rule + LLM + vector dedup
    hierarchy.py   # is-a cleaning + SCS inheritance
    quality.py     # graph-reviewer score
    community.py   # Louvain
    emit.py        # Turtle / OWL
    pipeline.py    # OntologyBuilder (wires the stages)
  backends/
    memory.py      # InMemoryGraph / InMemoryVector / InMemoryGraphSink (zero infra)
    sparql.py      # SparqlGraph — any SPARQL 1.1 store (read + write)
  search/          # fusion + one-shot GraphRAG
  ontology.py      # Ontology — the hub (search / emit / push / quality / communities)
  facade.py        # build_from_csv / build_from_documents / build_from_triples
examples/  tests/

Roadmap

async pipeline (parallel chunk extraction + parallel search seeds)
Neo4j / property-graph GraphStore adapter; Qdrant VectorStore adapter
RDF-star / qualified statements (n-ary relations, provenance) in emit
reranker / cross-encoder hook for search

License

MIT © jinsoo96. See LICENSE.

Release history

0.2.0 (this version): Jun 23, 2026
0.1.0: Jun 23, 2026

File details

Details for the file xgen_ontology-0.2.0.tar.gz.

File metadata

Download URL: xgen_ontology-0.2.0.tar.gz
Upload date: Jun 23, 2026
Size: 50.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for xgen_ontology-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`ffb6dc3e7862b49b3e7d1397b01419e2bfc6638a683aeae8bd7560a00cd24a6f`
MD5	`29170a63493372d40bdf11aea72c0034`
BLAKE2b-256	`de9cbafbde88a4c8899a8a557e9915a9e53e6036a0fffba9e12d77d024ae9e32`

File details

Details for the file xgen_ontology-0.2.0-py3-none-any.whl.

File metadata

Download URL: xgen_ontology-0.2.0-py3-none-any.whl
Upload date: Jun 23, 2026
Size: 56.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for xgen_ontology-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f4b4210279760bc08a1207ec9ccdfef86f2b41ad721f6dcc88f9facd32851844`
MD5	`7655db0724a82a6ffa2f60eac5046541`
BLAKE2b-256	`721372da353372b92f2e43a856606b7b4605532f4e8f54bcfa481aff345bdaaf`
