A high-performance graph database library with Python bindings written in Rust
KGLite — Lightweight Knowledge Graph for Python
KGLite is an embedded knowledge graph for Python: pip install, no
server, no setup. It speaks Cypher, loads pandas DataFrames, and
ships with the connective tissue for AI agents — an MCP server so
Claude / Cursor / any MCP-capable LLM can query your graph as a
tool, a describe() method that emits a compact XML schema for
system prompts, and a code_tree parser that turns any source
directory into a graph of functions, classes, calls, and imports
across 9 languages.
Three storage modes scale from in-memory (millisecond queries on
small graphs) to mmap-backed on disk (1 B+ edges, Wikidata-scale).
Bundled dataset wrappers turn pip install kglite into a queryable
Wikidata or petroleum-domain graph in one line.
Why KGLite?
- Built for LLM agents: describe() XML schema, bundled MCP server, an agent-oriented query surface (cypher(), graph.select(...).traverse(...)), and structural validators (CALL orphan_node({type: ...}) YIELD node) for data-integrity checks that compose with the rest of Cypher.
- One-line public datasets: wikidata.open(path) and sodir.open(path) handle fetch, parallel build, and caching; re-runs reload the cached graph instantly.
- Codebase → graph in one line: kglite.code_tree.build(".") parses Python, Rust, TypeScript, Go, Java, C#, C++, and more into Function / Class / Module nodes with CALLS / DEFINES / IMPORTS edges.
- Scales without leaving Python: in-memory for prototyping, mmap-backed for notebook-scale, disk-mode CSR for graphs too large for RAM. Same API across modes.
- Query with Cypher: MATCH, MERGE, OPTIONAL MATCH, aggregations, parameters, semantic search via text_score().
- DataFrames in, DataFrames out: bulk-load from pandas, query results as DataFrames.
Quick Start
pip install kglite
import pandas as pd
import kglite
# Three storage modes — pick by graph size:
# default (in-memory) — small/medium graphs, fastest queries
# storage="mapped" — mmap columns, RAM-friendly as you grow
# storage="disk", path=… — 100M+ nodes, Wikidata-scale, loaded lazily
graph = kglite.KnowledgeGraph()
# Bulk-load nodes from a DataFrame (also: add_nodes_bulk, from_blueprint,
# load_ntriples, or Cypher CREATE for ad-hoc inserts).
people = pd.DataFrame({
    "id": ["alice", "bob", "eve"],
    "name": ["Alice", "Bob", "Eve"],
    "age": [28, 35, 41],
    "city": ["Oslo", "Bergen", "Trondheim"],
})
graph.add_nodes(people, node_type="Person", unique_id_field="id", node_title_field="name")
# Bulk-load relationships the same way (also: add_connections_bulk,
# add_connections_from_source for auto-filter by loaded types).
knows = pd.DataFrame({"src": ["alice", "bob"], "tgt": ["bob", "eve"]})
graph.add_connections(knows, connection_type="KNOWS",
                      source_type="Person", source_id_field="src",
                      target_type="Person", target_id_field="tgt")
# Query — returns a ResultView (lazy; data stays in Rust until accessed).
result = graph.cypher("""
MATCH (p:Person) WHERE p.age > 30
RETURN p.name AS name, p.city AS city
ORDER BY p.age DESC
""")
for row in result:
    print(row['name'], row['city'])
# Or get a pandas DataFrame directly.
df = graph.cypher("MATCH (p:Person) RETURN p.name, p.age ORDER BY p.age", to_df=True)
# Persist to disk and reload.
graph.save("my_graph.kgl")
loaded = kglite.load("my_graph.kgl")
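The Quick Start does not say how add_connections treats an edge whose endpoint id was never loaded as a node, so a pre-flight check in plain Python can be a cheap safeguard before bulk-loading. The sketch below is independent of KGLite; dangling_endpoints() is a hypothetical helper, and the "mallory" edge is made-up test data.

```python
def dangling_endpoints(node_ids, edges):
    """Return the (src, tgt) pairs that reference an id missing from node_ids."""
    known = set(node_ids)
    return [(src, tgt) for src, tgt in edges if src not in known or tgt not in known]


node_ids = ["alice", "bob", "eve"]
edges = [("alice", "bob"), ("bob", "eve"), ("eve", "mallory")]

# "mallory" was never loaded as a node, so that edge is reported.
print(dangling_endpoints(node_ids, edges))  # [('eve', 'mallory')]
```

With DataFrames, the same check could be fed from people["id"] and zip(knows["src"], knows["tgt"]).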
Try it instantly: ready-to-query datasets
Two bundled wrappers turn well-known public sources into queryable
graphs without writing a loader. Each call handles the fetch +
build + cache cycle, returns a KnowledgeGraph you can cypher()
against, and respects a per-dataset cooldown so re-running just
loads the cached graph in seconds. KGLite is independent of the
upstream organisations — see each module docstring for
non-affiliation notes.
Wikidata
Single-stream latest-truthy.nt.bz2 from
dumps.wikimedia.org —
parallel-decoded with a bit-level block scanner, parsed, built into a
queryable graph in one call:
from kglite.datasets import wikidata
g = wikidata.open("/data/wd") # full graph
g = wikidata.open("/data/wd", entity_limit_millions=100) # 100M slice
g = wikidata.open("/data/wd", storage="memory",                # in-memory, fast tests
                  entity_limit_millions=10)
Sodir (Norwegian Offshore Directorate)
Petroleum-domain graph from the public ArcGIS REST FeatureServer at factmaps.sodir.no — 33 baseline node types (Field, Wellbore, Discovery, Licence, Stratigraphy, …), ~480 k nodes, parallel-fetched and built in seconds:
from kglite.datasets import sodir
g = sodir.open("/data/sodir") # in-memory by default; ~30s first run
g = sodir.open("/data/sodir", complement_blueprint="my_extras.json") # extend
Two-tier cooldown — cheap row-count probes every 14 days; full per-dataset re-fetch every 30 days. Add a complement blueprint to extend the baseline (new node types, custom edges) without touching the canonical schema; the file is persisted into the workdir on first use and auto-loaded after.
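One way to picture the two-tier cooldown is as a small decision function over two timestamps. This is a sketch of the described policy only, not KGLite's implementation; next_action() and its return labels are hypothetical names, and the 14-day and 30-day intervals are the ones stated above.

```python
from datetime import datetime, timedelta

PROBE_INTERVAL = timedelta(days=14)    # cheap row-count probe
REFETCH_INTERVAL = timedelta(days=30)  # full per-dataset re-fetch


def next_action(now, last_probe, last_full_fetch):
    """Decide what a two-tier cooldown would do at time `now`."""
    if now - last_full_fetch >= REFETCH_INTERVAL:
        return "full_refetch"
    if now - last_probe >= PROBE_INTERVAL:
        return "probe"
    return "use_cache"


fetched = datetime(2026, 4, 1)
print(next_action(datetime(2026, 4, 5), fetched, fetched))   # use_cache
print(next_action(datetime(2026, 4, 16), fetched, fetched))  # probe
print(next_action(datetime(2026, 5, 2), fetched, fetched))   # full_refetch
```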
Use Cases
Agentic AI — memory and tool use
Give an LLM a structured memory it can query. describe() emits a
compact XML schema that fits in a system prompt, and the bundled MCP
server exposes the whole graph as a Cypher tool — drop-in for Claude,
Cursor, or any MCP-capable agent.
xml = graph.describe() # schema for the agent's context
prompt = f"You have a knowledge graph:\n{xml}\nAnswer via graph.cypher()."
# Or: python examples/mcp_server.py path/to/graph.kgl
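Since the schema goes into a system prompt, it can help to guard against a schema that outgrows the prompt budget. The sketch below wraps the snippet above in a hypothetical build_agent_prompt() helper; the max_chars budget is an assumption, and StubGraph with its placeholder XML only stands in for a real KnowledgeGraph so the example runs without KGLite installed.

```python
def build_agent_prompt(graph, max_chars=8000):
    """Embed graph.describe() output in a system prompt, refusing oversized schemas."""
    schema = graph.describe()
    if len(schema) > max_chars:
        raise ValueError(f"schema is {len(schema)} chars, budget is {max_chars}")
    return f"You have a knowledge graph:\n{schema}\nAnswer via graph.cypher()."


class StubGraph:
    """Stand-in for a KnowledgeGraph; the XML is a placeholder, not real output."""

    def describe(self):
        return "<schema><node type='Person'/></schema>"


prompt = build_agent_prompt(StubGraph())
print(prompt)
```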
Codebase analysis
Parse Python, Rust, TypeScript, Go, Java, C#, and C++ into a graph of functions, classes, calls, and imports. Trace who-calls-what, find dead code, and review structure without leaving your editor. Pairs naturally with the MCP server so an agent can reason over your repo.
from kglite.code_tree import build
graph = build(".") # parse current directory
graph.cypher("""
MATCH (f:Function)-[:CALLS]->(g:Function)
RETURN g.name, count(f) AS callers
ORDER BY callers DESC LIMIT 10
""")
RAG retrieval
Store documents, chunks, and entities together as one graph. Combine
text_score() semantic similarity with Cypher structure — hybrid
retrieval in one query, no second vector DB.
graph.cypher("""
MATCH (c:Chunk)-[:IN_DOC]->(d:Document)
RETURN c.text, d.title,
text_score(c.embedding, $query_vec) AS score
ORDER BY score DESC LIMIT 5
""", params={"query_vec": query_embedding})
Data exploration and analysis
Load CSVs or DataFrames, walk relationships, run graph algorithms (shortest path, centrality, community detection), and export — all from a notebook.
graph.add_nodes(users_df, node_type="User", unique_id_field="user_id", node_title_field="name")
graph.cypher("""
MATCH path = shortestPath((a:User {name:'Alice'})-[*]-(b:User {name:'Eve'}))
RETURN path
""")
Structural validators — surface data-integrity gaps in one query
Built-in CALL procedures find the gaps that aren't visible
from normal queries: nodes with zero edges, missing-required-edge
violations, two-step cycles, duplicate titles, and more. They compose
with the rest of Cypher — feed the output into WHERE, ORDER BY,
or downstream aggregation in a single pass.
# Wellbores in our sodir graph that lack a production licence
graph.cypher("""
CALL missing_required_edge({type: 'Wellbore', edge: 'IN_LICENCE'}) YIELD node
RETURN node.id, node.title
""") # 502 violations on the Sodir April-2026 snapshot
# Cross-reference flagged IDs against any query result, in one Cypher pass
graph.cypher("""
MATCH (l:Licence {title: '057'})<-[:IN_LICENCE]-(w:Wellbore)
WITH collect(w.id) AS pl057
CALL missing_required_edge({type: 'Wellbore', edge: 'DRILLED_BY'}) YIELD node
WHERE node.id IN pl057
RETURN count(*) AS pl057_missing_drilled_by
""")
missing_required_edge and missing_inbound_edge validate the
(type, edge) direction against the graph's actual schema and
refuse to execute when misused. See
docs/guides/cypher.md
for the full procedure list.
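As a mental model, two of these checks can be written in a few lines over plain node and edge lists. The sketch assumes that orphan_node means "no edges in either direction" and that missing_required_edge looks at outgoing edges; both are readings of the names, not confirmed semantics. find_orphans() and find_missing_required_edge() are hypothetical helpers, and the toy ids and the Company type are made up.

```python
def find_orphans(nodes, edges, node_type):
    """Ids of node_type that appear in no edge, as source or target."""
    touched = {src for src, _, _ in edges} | {tgt for _, _, tgt in edges}
    return [nid for nid, ntype in nodes.items()
            if ntype == node_type and nid not in touched]


def find_missing_required_edge(nodes, edges, node_type, edge_type):
    """Ids of node_type with no outgoing edge of edge_type."""
    have = {src for src, etype, _ in edges if etype == edge_type}
    return [nid for nid, ntype in nodes.items()
            if ntype == node_type and nid not in have]


nodes = {"w1": "Wellbore", "w2": "Wellbore", "w3": "Wellbore",
         "l1": "Licence", "c1": "Company"}
edges = [("w1", "IN_LICENCE", "l1"), ("w2", "DRILLED_BY", "c1")]

print(find_orphans(nodes, edges, "Wellbore"))                              # ['w3']
print(find_missing_required_edge(nodes, edges, "Wellbore", "IN_LICENCE"))  # ['w2', 'w3']
```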
Examples
The examples/
directory has runnable, self-contained scripts covering each of the
use cases above:
- code_graph.py: build a code knowledge graph from a source directory via code_tree.build. Produces Function, Class, Module, File nodes with CALLS, DEFINES, IMPORTS edges.
- legal_graph.py: end-to-end add_nodes / add_connections from pandas DataFrames, covering laws, regulations, and court decisions with citation relationships. Good template for adapting to your own domain.
- mcp_server.py: drop-in MCP server that exposes any .kgl file to an LLM (Claude, Cursor, …) as a Cypher query tool, with schema disclosure and code-graph–aware helpers.
- spatial_graph.py: declarative CSV→graph loading via a JSON blueprint; regions, facilities, and sensors with lat/lon coordinates and pipeline-path traversal queries.
For Wikidata- and Sodir-scale builds, see the "Try it instantly:
ready-to-query datasets" section above: kglite.datasets.wikidata.open(...) and
kglite.datasets.sodir.open(...) cover those workflows in one call.
Benchmarks
KGLite builds and queries Wikidata-scale graphs on a laptop.
Measured with
bench/wiki_benchmark.py
on an M-series MacBook.
Ingest — full pipeline from compressed N-Triples to a queryable graph:
| dataset | triples | nodes | edges | ingest | throughput | peak RAM |
|---|---|---|---|---|---|---|
| wiki100m | 100 M | 938 K | 748 K | 29 s | 3.4 M triples/s | 1.3 GB |
| wiki500m | 500 M | 5.6 M | 6.7 M | 157 s | 3.2 M triples/s | 5.2 GB |
| wiki1000m | 1 B | 14.7 M | 15.4 M | 395 s | 2.5 M triples/s | 7.0 GB |
Reloading a saved 1 B-triple graph from disk (7 GB on-disk): 3.5 s.
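The throughput column is simply triples divided by ingest seconds. A quick check of the three rows, using only the figures from the table (throughput_m_per_s() is a hypothetical helper):

```python
def throughput_m_per_s(triples, seconds):
    """Ingest throughput in millions of triples per second, rounded to 1 decimal."""
    return round(triples / seconds / 1e6, 1)


runs = {
    "wiki100m": (100e6, 29),
    "wiki500m": (500e6, 157),
    "wiki1000m": (1e9, 395),
}
for name, (triples, seconds) in runs.items():
    print(f"{name}: {throughput_m_per_s(triples, seconds)} M triples/s")
# wiki100m: 3.4 M triples/s
# wiki500m: 3.2 M triples/s
# wiki1000m: 2.5 M triples/s
```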
Query latency on the 1 B-triple graph (mapped storage). Type names
match the labels Wikidata ships per language — with languages=["en"]
(the default), Q5 is renamed to human:
| Cypher | Shape | Wall time |
|---|---|---|
| MATCH (n)-[:P31]->(:human) RETURN count(n) | typed aggregation | 0.5 ms |
| MATCH (a)-[:P31]->(b)-[:P279]->(c) LIMIT 10 | 2-hop typed | 0.9 ms |
| MATCH (a)-[:P31]->(b {nid:'Q64'}) RETURN a LIMIT 20 | pivot | 1 ms |
| MATCH (a)-[:P31]->(:human) MATCH (a)-[:P27]->(c) LIMIT 10 | join | 44 ms |
Disk and mapped storage track within 1 % on build; mapped wins on query shapes backed by its in-memory inverted index, disk wins on unbounded typed traversals by staying on sorted-CSR mmap I/O.
No server, no tuning, same Python process as your code.
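If the expected graph size is known up front, the choice of storage mode can be wrapped in a small helper. This is a sketch under assumptions: the storage="mapped" and storage="disk", path=… keywords and the 100M-node figure come from the Quick Start comments, while storage_kwargs() itself and the 10M-node threshold for switching to mapped storage are made up for illustration.

```python
DISK_THRESHOLD = 100_000_000   # "100M+ nodes" per the Quick Start comments
MAPPED_THRESHOLD = 10_000_000  # assumed cut-over point, not from the docs


def storage_kwargs(expected_nodes, path=None):
    """Pick storage keyword arguments from an expected node count."""
    if expected_nodes >= DISK_THRESHOLD:
        if path is None:
            raise ValueError("disk storage needs a path")
        return {"storage": "disk", "path": path}
    if expected_nodes >= MAPPED_THRESHOLD:
        return {"storage": "mapped"}
    return {}  # default: in-memory


print(storage_kwargs(50_000))                   # {}
print(storage_kwargs(20_000_000))               # {'storage': 'mapped'}
print(storage_kwargs(150_000_000, "/data/kg"))  # {'storage': 'disk', 'path': '/data/kg'}
```

The result would then be passed on as kglite.KnowledgeGraph(**storage_kwargs(...)), assuming the constructor accepts these keywords in the way the Quick Start comments suggest.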
Key Features
| Feature | Description |
|---|---|
| Cypher queries | MATCH, CREATE, SET, DELETE, MERGE, UNION/INTERSECT/EXCEPT, aggregations (incl. median, percentile_cont, variance), reduce(), ORDER BY, LIMIT, SKIP |
| Semantic search | Vector embeddings + text_score() for similarity ranking |
| Text predicates | text_edit_distance, text_normalize, text_jaccard, text_ngrams, text_contains_any / text_starts_with_any for fuzzy match |
| Graph algorithms | Shortest path (BFS or Dijkstra via weight_property), centrality, community detection, clustering |
| Structural validators | 14 CALL procedures: orphan_node, missing_required_edge, cycle_2step, inverse_violation, transitivity_violation, cardinality_violation, parallel_edges, null_property, type_domain/range_violation, etc. — agent-discoverable integrity checks composable with normal Cypher |
| Spatial | Coordinates, WKT geometry, distance + containment, geometry primitives (geom_buffer, geom_convex_hull, geom_union/intersection/difference, geom_is_valid, geom_length), kg_knn k-nearest-neighbour |
| Timeseries | Time-indexed data with ts_*() Cypher functions |
| Bulk loading | Fluent API (add_nodes / add_connections) for DataFrames |
| Blueprints | Declarative CSV-to-graph loading via JSON config |
| Import/Export | Save/load snapshots, GraphML, CSV export |
| AI integration | describe() introspection, MCP server, agent prompts |
| Code analysis | Parse codebases via tree-sitter (kglite.code_tree) |
Documentation
Full docs at kglite.readthedocs.io:
- Getting Started: installation, first graph, core concepts
- Cypher Guide: queries, mutations, parameters
- Semantic Search: embeddings, vector search
- AI Agents: MCP server, describe(), agent prompts
- API Reference: full auto-generated reference
Requirements
Python 3.10+ (CPython) | macOS (ARM/Intel), Linux (x86_64/aarch64), Windows (x86_64) | pandas >= 1.5
License
MIT — see LICENSE for details.
File details
Details for the file kglite-0.8.20-cp310-abi3-win_amd64.whl.
File metadata
- Download URL: kglite-0.8.20-cp310-abi3-win_amd64.whl
- Upload date:
- Size: 6.4 MB
- Tags: CPython 3.10+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b734139ecc1e87fbd5db55a8dbedbe7473b282c7d201d4eaadf24cce8c617021 |
| MD5 | 2232d271ae62edb7fc47cdd42daf1ac3 |
| BLAKE2b-256 | 9e2dfb84d5b7fcb40b29d03144199e1dd0d48fb5e51bb3611f571fd8c88673db |
File details
Details for the file kglite-0.8.20-cp310-abi3-manylinux_2_39_x86_64.whl.
File metadata
- Download URL: kglite-0.8.20-cp310-abi3-manylinux_2_39_x86_64.whl
- Upload date:
- Size: 6.5 MB
- Tags: CPython 3.10+, manylinux: glibc 2.39+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ba64500dbf6b7cc320877508fd35b7ddf25e1de2e7dca55451cc1f6ab1d693c5 |
| MD5 | 1a15bd5098e017c870c2c29cb33e5f23 |
| BLAKE2b-256 | 60f79a89bfe044d3e30b339aad9446be5620842e3205fddaa28997fbca6884fb |
File details
Details for the file kglite-0.8.20-cp310-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: kglite-0.8.20-cp310-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 5.8 MB
- Tags: CPython 3.10+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 16d19ac67f4083d176dead9e120f86d69dafda43dfd7d35168151a77b2cdc2da |
| MD5 | bd4d314d353057ecaa83023c3e09b592 |
| BLAKE2b-256 | e07df14cf1b347d63b4c46e3f7fb06d9d5a33b988e36bc537f2cc4e35383e2d0 |
File details
Details for the file kglite-0.8.20-cp310-abi3-macosx_10_12_x86_64.whl.
File metadata
- Download URL: kglite-0.8.20-cp310-abi3-macosx_10_12_x86_64.whl
- Upload date:
- Size: 6.3 MB
- Tags: CPython 3.10+, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d8c18b56d6b4bf9567de853a70d5fd4fcdf3c38d690b4a1740fe7ea70e12918c |
| MD5 | a8ad624c6fe2fcaa134ffdf00fd55f89 |
| BLAKE2b-256 | 6887c3dc49004205998a4e7be95d804828c1d69ba18cddba00d3665b822e065e |