Backend-agnostic ontology / knowledge-graph toolkit — build a clean KG from documents or tables, then search it with one-shot GraphRAG. Zero infra, any graph DB.
xgen-ontology
Backend-agnostic ontology / knowledge-graph toolkit. Turn documents or tables into a clean knowledge graph — extract, resolve, dedup, induce the is-a hierarchy, govern predicates, score quality — then search it with one-shot GraphRAG. Zero infra (the whole thing runs on a pure-Python in-memory backend), zero lock-in (load into any SPARQL 1.1 store), zero hard deps in the core.
from xgen_ontology import build_from_csv
onto = build_from_csv({ # no LLM, no DB, no API key
"products": "product_id,name,color_id\n1,Widget,10\n2,Gadget,20",
"colors": "color_id,name\n10,Red\n20,Blue",
})
print(onto.stats()) # {'classes': 2, 'instances': 4, 'relations': 2, ...}
print(onto.to_turtle()) # standard RDF/Turtle
print(onto.search("what color is Widget").answer)
Build from prose with any LLM, mix tables and text freely:
from xgen_ontology import build_from_documents, CallableLLM
llm = CallableLLM(lambda p, system="": my_model(system, p)) # OpenAI / Anthropic / vLLM / …
onto = build_from_documents({
"policy.txt": "Rule A applies to Acme Bank since 2020.",
"colors.csv": "color_id,name\n10,Red\n20,Blue",
}, llm=llm)
Two halves of the lifecycle
Build — documents/tables → a clean graph
The pipeline is a sequence of independently-importable, backend-agnostic stages:
| Stage | What it does |
|---|---|
| tabular | table → ontology with no LLM: table→Class, FK→ObjectProperty (same-name / normalized-name / value-overlap detection), column→DataProperty, dimension rows→instances; large fact/junction tables stay schema-only |
| extract | one LLM call per chunk batch → schema and instances, tagged to source chunks; junk (base64/degenerate) filtered first |
| resolve | entity resolution: fold case/whitespace/unicode + similar surface forms, guarding dates/ids and number-conflicting names |
| govern | predicate governance: fold surface variants of a relation, anchor to the schema vocabulary |
| dedup | merge synonymous classes/properties/instances — rule keys, LLM synonym groups, and embedding cosine clusters |
| hierarchy | keep only genuine is-a edges ("being linked is not being a subclass"), break cycles, then SCS context profiles with property inheritance |
| quality | a graph-reviewer score: completeness · integrity · grounding · shape |
| community | Louvain modularity clustering (pure Python) |
| emit | Turtle (zero-dep) or OWL/RDF-XML (rdflib) |
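To make the tabular stage's FK detection more concrete, here is a minimal sketch of the same-name and value-overlap checks named in the table. It is not the library's implementation: `read_table`, `detect_foreign_keys`, the `min_overlap` threshold, and the "first column is the primary key" rule are all assumptions made for illustration.

```python
import csv
import io


def read_table(text):
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def detect_foreign_keys(tables, min_overlap=0.8):
    """Guess (table, column, referenced_table) links.

    Hypothetical heuristic: treat the first column of each table as its
    primary key, then link any other table's column that either shares the
    key's name or whose values mostly appear among the key's values.
    """
    parsed = {name: read_table(text) for name, text in tables.items()}

    # Collect each table's assumed primary key and the values it holds.
    primary = {}
    for name, rows in parsed.items():
        if rows:
            key = next(iter(rows[0]))  # first column name
            primary[name] = (key, {r[key] for r in rows})

    links = []
    for name, rows in parsed.items():
        if not rows:
            continue
        for column in rows[0]:
            values = {r[column] for r in rows if r[column]}
            if not values:
                continue
            for target, (key, key_values) in primary.items():
                if target == name:
                    continue
                same_name = column == key
                overlap = len(values & key_values) / len(values)
                if same_name or overlap >= min_overlap:
                    links.append((name, column, target))
    return links


tables = {
    "products": "product_id,name,color_id\n1,Widget,10\n2,Gadget,20",
    "colors": "color_id,name\n10,Red\n20,Blue",
}
print(detect_foreign_keys(tables))  # [('products', 'color_id', 'colors')]
```

In an ontology, a link like this would become an ObjectProperty from the products class to the colors class; the normalized-name check from the table is left out here to keep the sketch short.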
Search — one-shot GraphRAG
Not an iterative ReAct loop — fire several retrieval strategies at once and fuse:
- vector / lexical passages (what it says)
- graph label-linking → 1-hop relations (how entities connect)
- class enumeration — the complete "list/count" a vector index can't give
- HippoRAG: entities of the retrieved chunks → 1-hop expansion
- evidence assembled with MMR diversity + adaptive top-k (the decisive minority — a warning/exception — survives instead of being crowded out)
- one LLM synthesis; honest evidence_nodes = only the nodes the answer cites
res = onto.search("which regulation applies to Acme Bank", llm=llm)
res.answer # the synthesis
res.relations # graph relations used
res.evidence_nodes # nodes the answer actually cites (honest highlight)
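The MMR diversity step from the list above can be sketched as follows. This is the textbook maximal marginal relevance selection, not necessarily how the library assembles evidence; `cosine`, `mmr_select`, the `lam` trade-off, and the toy vectors are illustrative assumptions, and adaptive top-k is not covered.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, candidates, k=3, lam=0.7):
    """Pick up to k candidate ids, trading relevance against redundancy.

    candidates: list of (id, vector) pairs.
    lam=1.0 is pure relevance ranking; lower values favour diversity.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:

        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Similarity to the closest already-selected item.
            redundancy = max(
                (cosine(vec, chosen) for _, chosen in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [item_id for item_id, _ in selected]


query = [1.0, 1.0]
passages = [
    ("rule", [1.0, 0.9]),
    ("rule_paraphrase", [1.0, 0.8]),  # nearly the same content as "rule"
    ("exception", [0.2, 1.0]),        # less similar to the query, but different
]
print(mmr_select(query, passages, k=2, lam=1.0))  # ['rule', 'rule_paraphrase']
print(mmr_select(query, passages, k=2, lam=0.5))  # ['rule', 'exception']
```

With pure relevance the paraphrase takes the second slot; with a diversity penalty the distinct passage survives, which is the "decisive minority" effect described above.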
Any graph DB, or none
The algorithms only ever talk to small protocols (GraphStore, VectorStore,
LLM, GraphSink, Morphology, Embedder), never to a database:
# zero infra — pure-Python in-memory (default)
onto.search("…")
# load into any SPARQL 1.1 store (Fuseki, GraphDB, Blazegraph, Virtuoso, …)
from xgen_ontology import fuseki
store = fuseki("http://localhost:3030", "ds", user="admin", password="…")
onto.push(store, graph="urn:my-graph") # write
onto.search("…") # or search a remote store via SparqlGraph
SparqlGraph is stdlib-only (urllib) and uses portable FILTER(CONTAINS(...)), so
it works on any SPARQL 1.1 endpoint — not just jena-text.
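As an illustration of the portable FILTER(CONTAINS(...)) approach, the sketch below builds a label search that relies only on SPARQL 1.1 built-ins, so no full-text index is needed. The function names, the choice of rdfs:label, and the endpoint path are assumptions for this example; the queries SparqlGraph actually issues may differ.

```python
from urllib.parse import urlencode


def label_search_query(term, graph=None, limit=20):
    """Build a case-insensitive label search using only SPARQL 1.1 built-ins."""
    # Escape characters that would break a double-quoted SPARQL string literal.
    literal = (
        term.lower()
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    body = (
        "?node rdfs:label ?label . "
        f'FILTER(CONTAINS(LCASE(STR(?label)), "{literal}"))'
    )
    if graph:
        # Restrict the match to one named graph.
        body = f"GRAPH <{graph}> {{ {body} }}"
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f"SELECT ?node ?label WHERE {{ {body} }} LIMIT {int(limit)}"
    )


def query_url(endpoint, query):
    """The SPARQL 1.1 Protocol allows sending a query via GET as ?query=..."""
    return endpoint + "?" + urlencode({"query": query})


query = label_search_query("acme", graph="urn:my-graph")
print(query)
# Assumed endpoint path; adjust to whatever your store exposes.
print(query_url("http://localhost:3030/ds/sparql", query))
```

A substring filter scans every label, so it is slower than an indexed text search on large graphs; the trade-off is that it runs unchanged on any compliant endpoint.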
Install
pip install xgen-ontology # core, zero deps
pip install "xgen-ontology[rdf]" # + rdflib (OWL / RDF-XML emit & parse)
pip install "xgen-ontology[korean]" # + kiwipiepy (Korean morphological dedup)
pip install "xgen-ontology[vector]" # + qdrant-client (embedding adapters)
Run the demos with no install:
python examples/build_csv.py
python examples/build_and_search.py
Design — algorithms as a library
- dependencies = []: the core needs nothing but the standard library. The in-memory graph indexes labels with BM25 (CJK character n-grams, so Korean/CJK search works with no morphological analyzer); the Turtle writer is hand-rolled.
- English-neutral by default: no hardcoded language. Korean morphology, name→URI translation and the extraction/synthesis prompts are all pluggable; the defaults assume nothing about your domain or language.
- Bring your own everything: LLM (generate(prompt, system)), embedder, morphology, graph store. The bundled EchoLLM lets the whole pipeline run with no API key.
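The "bring your own" idea can be illustrated with structural typing. The sketch below assumes only the generate(prompt, system) shape mentioned above; `TextLLM`, `CannedLLM`, and `summarize` are hypothetical names for this example, not the definitions in protocols.py.

```python
from typing import Protocol


class TextLLM(Protocol):
    """Anything with a matching generate() method satisfies this protocol."""

    def generate(self, prompt: str, system: str = "") -> str: ...


class CannedLLM:
    """Offline stand-in: returns a fixed reply and records what it was asked."""

    def __init__(self, reply: str) -> None:
        self.reply = reply
        self.calls: list[tuple[str, str]] = []

    def generate(self, prompt: str, system: str = "") -> str:
        self.calls.append((system, prompt))
        return self.reply


def summarize(llm: TextLLM, text: str) -> str:
    """An algorithm written against the protocol, not against a vendor SDK."""
    return llm.generate(f"Summarize: {text}", system="Be brief.")


llm = CannedLLM("Rule A applies to Acme Bank.")
print(summarize(llm, "Rule A applies to Acme Bank since 2020."))
print(llm.calls)
```

Because `CannedLLM` never inherits from `TextLLM`, swapping in a real model client only requires an object with the same method, which also makes the calling code easy to test offline.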
src/xgen_ontology/
  models.py        # Class/Property/Concepts (T-Box), Instance/Relation/DataValue (A-Box), Node/Chunk
  protocols.py     # LLM / GraphStore / VectorStore / GraphSink / Morphology / Embedder
  text.py          # tokenizer + BM25 (CJK n-grams), IRI-safe slugging
  build/
    tabular.py     # table -> ontology (no LLM)
    extract.py     # document -> ontology (LLM)
    resolve.py     # entity resolution
    govern.py      # predicate governance
    dedup.py       # rule + LLM + vector dedup
    hierarchy.py   # is-a cleaning + SCS inheritance
    quality.py     # graph-reviewer score
    community.py   # Louvain
    emit.py        # Turtle / OWL
    pipeline.py    # OntologyBuilder (wires the stages)
  backends/
    memory.py      # InMemoryGraph / InMemoryVector / InMemoryGraphSink (zero infra)
    sparql.py      # SparqlGraph: any SPARQL 1.1 store (read + write)
  search/          # fusion + one-shot GraphRAG
  ontology.py      # Ontology: the hub (search / emit / push / quality / communities)
  facade.py        # build_from_csv / build_from_documents / build_from_triples
examples/  tests/
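The "tokenizer + BM25 (CJK n-grams)" note for text.py refers to a common trick: splitting CJK runs into overlapping character n-grams so that search works without a morphological analyzer. The sketch below shows the general technique with a standard BM25 formula; the regex ranges, function names, and parameters are assumptions, not the contents of text.py.

```python
import math
import re
from collections import Counter

# Kana, CJK unified ideographs, and Hangul syllables (a deliberately small set).
CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7a3]+")
WORD = re.compile(r"[a-z0-9]+")


def tokenize(text, n=2):
    """ASCII words as-is; each CJK run as overlapping character n-grams."""
    text = text.lower()
    tokens = WORD.findall(text)
    for run in CJK_RUN.findall(text):
        if len(run) < n:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return document indexes ordered by descending BM25 score."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / max(len(tokenized), 1)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


print(tokenize("지식 그래프 search"))  # ['search', '지식', '그래', '래프']

docs = ["지식 그래프 구축", "그래프 데이터베이스", "red widget"]
print(bm25_rank("그래프", docs))  # [0, 1, 2]
```

Both Korean documents contain the bigrams of the query; the shorter one ranks first because BM25 normalizes by document length.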
Roadmap
- async pipeline (parallel chunk extraction + parallel search seeds)
- Neo4j / property-graph GraphStore adapter; Qdrant VectorStore adapter
- RDF-star / qualified statements (n-ary relations, provenance) in emit
- reranker / cross-encoder hook for search
License
TBD.