A high-performance graph database library with Python bindings written in Rust
KGLite — Knowledge graph for Python, built for LLM agents
KGLite is an embedded, Cypher-queryable knowledge graph for Python,
built so you can hand it to an LLM agent. pip install kglite and
point kglite.code_tree.build(".") at any source directory — your
first queryable graph in seconds. It ships with a bundled MCP server,
a describe() method that emits a system-prompt-shaped schema, and
structural validators that compose with Cypher.
Codebase → Claude
examples/codebase_to_claude_mcp.ipynb clones a GitHub repo, parses it into a code knowledge graph, and registers a workspace MCP server in Claude Desktop.
SEC filings → graph
from kglite.datasets.sec import SEC
g = SEC.fetch("./sec", "13F-HR", "TSLA", years=2, user_agent="Your Name your@email.com")
SEC.fetch downloads the named forms for the named companies and returns a Cypher-queryable graph: Form 4 insider transactions, 13F holdings, SC 13D stakes, DEF 14A board composition, 8-K events. → examples/sec_to_claude_mcp.ipynb · SEC guide.
Use cases
The same agent-facing surface works whether the graph holds legal precedents, a Wikidata slice, a SQL warehouse, a RAG corpus, or a parsed codebase.
- 🏦 SEC EDGAR. SEC.fetch(path, forms, companies, years=2) builds a US-public-company graph from the SEC's free data: insider transactions (Form 4), institutional holdings (13F), activist stakes (SC 13D), board composition (DEF 14A), and 8-K events, with XBRL financials and Exhibit 21 subsidiaries via SEC.open. → SEC guide.
- 🏛️ Domain knowledge for agents. Legal precedents + citations, regulatory rules, medical ontologies, manufacturing BOMs, scientific catalogues: anything with structure becomes a queryable graph an MCP-capable agent can reason over. See the legal-graph example for a Norwegian-Supreme-Court walk-through (laws + decisions + citation edges + judge metadata).
- 📊 Business data → queryable graph. Any tabular source (SQL, CSV, Parquet, REST API responses, pandas DataFrames) goes straight in via add_nodes(df, ...) and add_connections(df, ...). Layer a graph on top of your warehouse and the agent reasons over the relationships without you writing a server. → Data Loading guide.
- 🌐 Public datasets. wikidata.open(path) and sodir.open(path) handle the fetch + build + cache cycle. Mapped and disk storage let you query graphs that don't fit in RAM, such as a billion-edge Wikidata graph on a 16 GB laptop. → See Bundled datasets below.
- 📚 RAG with structure. Documents, chunks, entities, and the edges between them live in one graph. Combine text_score() vector similarity with Cypher traversal ("find court cases semantically similar to my fact pattern, then walk one hop to related precedents") for hybrid retrieval in one query, with no second vector DB. → Semantic Search guide.
- 📂 Codebase analysis. kglite.code_tree.build(".") parses 13 languages into Function / Class / Module / Route nodes with web-framework route detection (Flask, FastAPI, Django). See the notebook above for the full code → Claude Desktop workflow. → Code analysis guide.
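To make the Function / Class / Module shape concrete, here is a minimal sketch that pulls the same kind of records out of one Python source string. It swaps in the standard library's `ast` module for KGLite's own parser, so it covers Python only, and the record layout (`kind`, `name`, `line`) is an assumption for illustration, not the schema that `code_tree.build` emits.

```python
import ast

# A tiny in-memory source file to analyse (illustrative only).
SOURCE = '''
class Greeter:
    def hello(self, name):
        return f"hi {name}"

def main():
    return Greeter().hello("world")
'''

def extract_definitions(source: str, module_name: str = "example"):
    """Return one record per module, class, and function found in `source`."""
    tree = ast.parse(source)
    records = [{"kind": "Module", "name": module_name, "line": 1}]
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            records.append({"kind": "Class", "name": node.name, "line": node.lineno})
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append({"kind": "Function", "name": node.name, "line": node.lineno})
    return records

definitions = extract_definitions(SOURCE)
for record in definitions:
    print(record["kind"], record["name"], record["line"])
```

Records in this shape are tabular, so they could be placed in a DataFrame and loaded with add_nodes(df, ...) like any other tabular source.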
Why Cypher?
Questions over connected data — which insiders sold this stock, who sits on two boards, what cites this case — are pattern matches. In SQL they become multi-table joins; in Cypher the pattern is the query:
-- Insider sells, most recent first
MATCH (t:InsiderTransaction {direction: 'sale'})-[:BY_INSIDER]->(p:Person)
MATCH (t)-[:IN_COMPANY]->(c:Company)
RETURN p.title, c.title, t.shares, t.price_per_share
ORDER BY t.transaction_date DESC LIMIT 10
Cypher pays off most when the data has real structure and your questions traverse it.
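For comparison, the same question in relational form might look like the sketch below, which uses Python's built-in `sqlite3`. The table and column names are hypothetical, chosen to mirror the labels in the Cypher query, and the rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE company (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE insider_transaction (
    id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(id),
    company_id INTEGER REFERENCES company(id),
    direction TEXT,
    shares INTEGER,
    price_per_share REAL,
    transaction_date TEXT
);
INSERT INTO person VALUES (1, 'Jane Doe'), (2, 'John Roe');
INSERT INTO company VALUES (1, 'Example Corp');
INSERT INTO insider_transaction VALUES
    (1, 1, 1, 'sale', 100, 10.0, '2024-01-05'),
    (2, 2, 1, 'purchase', 50, 11.0, '2024-02-01'),
    (3, 2, 1, 'sale', 200, 12.5, '2024-03-10');
""")

# Each relationship in the Cypher pattern becomes an explicit JOIN on a key.
rows = conn.execute("""
    SELECT p.title, c.title, t.shares, t.price_per_share
    FROM insider_transaction AS t
    JOIN person AS p ON p.id = t.person_id
    JOIN company AS c ON c.id = t.company_id
    WHERE t.direction = 'sale'
    ORDER BY t.transaction_date DESC
    LIMIT 10
""").fetchall()

for row in rows:
    print(row)
```

Two hops need two joins here; every further hop in a traversal adds another join, whereas the Cypher pattern simply grows by one more arrow.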
How it compares
| | KGLite | Kuzu | NetworkX | rustworkx | Neo4j Embedded |
|---|---|---|---|---|---|
| Install | pip install kglite | pip install kuzu | pip install networkx | pip install rustworkx | JVM + Java deps |
| Query language | Cypher (subset) | Cypher (full) | Python API | Python API | Cypher (full) |
| Storage | in-mem · mmap · disk (1B+ edges) | in-mem · disk (columnar) | in-mem | in-mem | in-mem · disk (JVM) |
| Bulk-load from pandas | one-liner | via Arrow | manual | manual | via driver |
| Bundled MCP server for LLM agents | ✅ | — | — | — | — |
| describe() schema for LLM prompts | ✅ | — | — | — | — |
| Codebase → graph parser | 13 languages, route detection | — | — | — | — |
| Bundled public datasets | SEC EDGAR, Wikidata, Sodir | — | toy graphs only | — | — |
| License | MIT | MIT | BSD-3 | Apache-2 | GPLv3 |
Pick KGLite when you want Cypher + Python ergonomics + LLM-agent plumbing in one wheel. Pick Kuzu if your workload is heavy analytical OLAP and you can accept that the project is no longer maintained (archived 2025). Pick NetworkX when you need its enormous graph-algorithm library and your data fits in RAM. Pick rustworkx when you want NetworkX's API in Rust with no query language. Pick Neo4j Embedded when you've standardised on server-mode Cypher and want the in-process driver for tests.
What's coming. The roadmap lays out where this is heading — Bolt protocol server first (drop-in for any Neo4j-aware client), then bindings beyond Python.
Quick Start
pip install kglite
import pandas as pd
import kglite
# Three storage modes — pick by graph size:
# default (in-memory) — small/medium graphs, fastest queries
# storage="mapped" — mmap columns, RAM-friendly as you grow
# storage="disk", path=… — 100M+ nodes, Wikidata-scale, loaded lazily
graph = kglite.KnowledgeGraph()
# Bulk-load nodes from a DataFrame.
people = pd.DataFrame({
    "id": ["alice", "bob", "eve"],
    "name": ["Alice", "Bob", "Eve"],
    "age": [28, 35, 41],
    "city": ["Oslo", "Bergen", "Trondheim"],
})
graph.add_nodes(people, node_type="Person", unique_id_field="id", node_title_field="name")
# Bulk-load relationships the same way.
knows = pd.DataFrame({"src": ["alice", "bob"], "tgt": ["bob", "eve"]})
graph.add_connections(knows, connection_type="KNOWS",
                      source_type="Person", source_id_field="src",
                      target_type="Person", target_id_field="tgt")
# Query — returns a ResultView (lazy; data stays in Rust until accessed).
for row in graph.cypher("""
    MATCH (p:Person) WHERE p.age > 30
    RETURN p.name AS name, p.city AS city
    ORDER BY p.age DESC
"""):
    print(row['name'], row['city'])
# Or get a pandas DataFrame directly.
df = graph.cypher("MATCH (p:Person) RETURN p.name, p.age ORDER BY p.age", to_df=True)
# Persist to disk and reload.
graph.save("my_graph.kgl")
loaded = kglite.load("my_graph.kgl")
→ Getting Started guide · Cypher reference · API reference.
Serve it to an agent
Three levels of effort, three levels of capability.
1. One command — any .kgl becomes an MCP server
kglite-mcp-server --graph path/to/graph.kgl
The server exposes cypher_query, graph_overview, schema
introspection, structural validators, and source-file tools over MCP
stdio. Drop it into Claude Desktop / Cursor / any MCP-capable client
and your graph is queryable. Works on every graph kglite can build —
your own, Wikidata, Sodir, code-tree.
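Registering the server with an MCP client usually means adding an entry to that client's configuration. The sketch below builds a `mcpServers` entry of the kind Claude Desktop reads from its `claude_desktop_config.json`. The server name `kglite` and the graph path are placeholders, and because the config file's location varies by platform, the sketch only prints the JSON rather than writing it anywhere.

```python
import json

def mcp_server_entry(graph_path: str, name: str = "kglite") -> dict:
    """Build a client config fragment that launches kglite-mcp-server over stdio."""
    return {
        "mcpServers": {
            name: {
                "command": "kglite-mcp-server",
                "args": ["--graph", graph_path],
            }
        }
    }

config = mcp_server_entry("path/to/graph.kgl")
print(json.dumps(config, indent=2))
```

If the client already has other servers configured, merge this entry into the existing `mcpServers` object instead of overwriting the file.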
2. Customise with a YAML manifest
Drop <basename>_mcp.yaml next to the graph (e.g. wikidata_mcp.yaml
beside wikidata.kgl) and the server auto-loads it at boot.
name: Wikidata Explorer
source_root: /path/to/related/source   # exposes read/grep/list
extensions:
  embedder: { kind: fastembed, model: bge-small }   # enables text_score()
  csv_http_server: true   # bulk CSV exports
tools:   # inline parameterised Cypher
  - name: who_invented
    cypher: |
      MATCH (i:Q5)-[:P61]->(t {label:$thing})
      RETURN i.label LIMIT 5
No fork required for most customisation. → MCP server guide.
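The naming rule for the manifest can be checked with a few lines of `pathlib`. This is a sketch of the `<basename>_mcp.yaml` convention only; `manifest_path_for` is a hypothetical helper, and how the server resolves the file internally is not shown here.

```python
from pathlib import Path

def manifest_path_for(graph_path: str) -> Path:
    """Return the sibling manifest path for a graph file, e.g. x.kgl -> x_mcp.yaml."""
    graph = Path(graph_path)
    return graph.with_name(f"{graph.stem}_mcp.yaml")

manifest = manifest_path_for("/data/wd/wikidata.kgl")
print(manifest.name)
```

A helper like this is handy in build scripts that write the graph and its manifest side by side.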
3. Teach the agent with bundled skills
Markdown skill files (<basename>.skills/*.md) ship methodology for
each tool. The agent reads cypher_query.md at session start to learn
your schema conventions, read_code_source.md to know when to drill
into source vs. query the graph, etc. Three layers compose:
kglite-bundled defaults + your project's .skills/ overrides +
operator-declared domain packs. Skills with applies_when: predicates
only activate when the graph contains the relevant node types — so a
non-code graph never sees read_code_source methodology.
Net effect: the agent comes pre-loaded with how to use your graph, rather than discovering it through trial-and-error. → AI Agents guide.
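The layering and gating described above can be sketched in plain Python. Everything here is an assumption made for illustration: the `Skill` record, the rule that later layers override earlier ones, and the reading of `applies_when` as "all listed node types must be present". It is not KGLite's loader.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    body: str
    # Node types that must exist in the graph; empty means "always active".
    applies_when: frozenset = frozenset()

def compose_skills(layers, graph_node_types):
    """Merge skill layers (later wins), then drop skills whose predicate fails."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    present = set(graph_node_types)
    return {name: skill for name, skill in merged.items()
            if skill.applies_when <= present}

bundled = {
    "cypher_query": Skill("default query guidance"),
    "read_code_source": Skill("when to open source files", frozenset({"Function"})),
}
project = {"cypher_query": Skill("project-specific schema conventions")}

legal_graph_skills = compose_skills([bundled, project], {"Law", "Decision"})
code_graph_skills = compose_skills([bundled, project], {"Function", "Class"})
print(sorted(legal_graph_skills), sorted(code_graph_skills))
```

In this sketch the legal graph never receives `read_code_source`, while both graphs get the project's override of `cypher_query`.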
Bundled datasets
Three wrappers turn well-known public sources into queryable graphs
without writing a loader. Each handles the fetch + build + cache
cycle, returns a KnowledgeGraph you can cypher() against, and
respects a per-dataset cooldown so re-running just reloads the cached
graph in seconds. KGLite is independent of the upstream
organisations — see each module docstring for non-affiliation notes.
→ Datasets guide.
SEC EDGAR
US-public-company knowledge graph from the SEC's free public data — all 14M historical filings + per-filing payload parsing for Form 4 (insider transactions), 13F-HR (institutional holdings), SC 13D (activist stakes), DEF 14A (board composition), XBRL company facts (financial metrics), 10-K Exhibit 21 (subsidiaries), 8-K cover pages (material event Item codes):
from kglite.datasets.sec import SEC
# SEC.fetch — name the forms, the companies, a span; get a graph back.
g = SEC.fetch("/data/sec", ["4", "8-K", "DEF 14A"], ["AAPL", "TSLA"],
              years=2, user_agent="Your Name your@email.com")
# SEC.open — full control: separate filing-index vs. payload spans,
# storage mode, and the include_* flags (XBRL financials, Exhibit 21
# subsidiaries).
g = SEC.open("/data/sec", years=10, detailed=2,
             user_agent="Your Name your@email.com")
# Full universe — drop `companies`; auto-escalates to mode="disk".
g = SEC.open("/data/sec", years="all", detailed=5,
             user_agent="Your Name your@email.com")
More than two dozen typed node types (Company, Person, Filing,
InsiderTransaction, Holding, InstitutionalHolding, CorporateEvent,
Compensation, Role, MetricFact, Subsidiary and more) are wired by typed
edges, and every fact node traces back to its source filing. The cache
has three tiers, raw / processed / graph/{mode}: raw is immutable,
processed regenerates only when its raw source changes, and
graph/{mode}/ is reused on reopen unless force_rebuild=True. SEC's
10 req/s fair-access policy is enforced by an internal token-bucket
rate limiter; the user_agent arg is mandatory (SEC returns 403 without
it).
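For readers unfamiliar with the technique, a token bucket can be sketched in a few lines of Python. This is the generic algorithm, not KGLite's internal limiter; the class name, the burst capacity of 10, and the injectable clock (used so the demo runs without sleeping) are assumptions.

```python
import time

class TokenBucket:
    """Allow up to `rate` operations per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last check, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self) -> bool:
        """Consume one token if available; return False when the caller must wait."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Deterministic demo with a fake clock instead of real sleeping.
fake_now = [0.0]
bucket = TokenBucket(rate=10, capacity=10, clock=lambda: fake_now[0])
burst = [bucket.try_acquire() for _ in range(11)]  # the 11th request is refused
fake_now[0] += 0.5                                 # half a second passes
after_wait = bucket.try_acquire()                  # tokens have been refilled
print(burst.count(True), after_wait)
```

A real client would sleep and retry when `try_acquire` returns False, which keeps the long-run request rate at or below the configured limit.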
Source data is public domain (US Govt work) — redistribute the built
.kgl however you like. →
SEC guide.
Wikidata
Single-stream latest-truthy.nt.bz2 from
dumps.wikimedia.org —
parallel-decoded with a bit-level block scanner, parsed, built into a
queryable graph in one call:
from kglite.datasets import wikidata
g = wikidata.open("/data/wd") # full graph
g = wikidata.open("/data/wd", entity_limit_millions=100) # 100M slice
g = wikidata.open("/data/wd", storage="memory", # in-memory, fast tests
                  entity_limit_millions=10)
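To show what the input looks like, the sketch below streams a bz2-compressed N-Triples sample and keeps only entity-to-entity triples. It is a plain sequential reader built on the standard library's `bz2` module, not the parallel block scanner described above, and the two sample lines are hand-written in N-Triples shape rather than taken from a dump.

```python
import bz2
import io

SAMPLE = (
    '<http://www.wikidata.org/entity/Q42> '
    '<http://www.wikidata.org/prop/direct/P31> '
    '<http://www.wikidata.org/entity/Q5> .\n'
    '<http://www.wikidata.org/entity/Q42> '
    '<http://www.w3.org/2000/01/rdf-schema#label> '
    '"Douglas Adams"@en .\n'
)
compressed = bz2.compress(SAMPLE.encode("utf-8"))

def local_name(iri: str) -> str:
    """Strip the angle brackets and namespace, leaving e.g. 'Q42' or 'P31'."""
    return iri.strip("<>").rsplit("/", 1)[-1]

def iter_entity_edges(stream):
    """Yield (subject, predicate, object) for triples whose object is an IRI."""
    for line in stream:
        line = line.strip()
        if not line.endswith(" ."):
            continue
        subject, predicate, obj = line[:-2].split(" ", 2)
        if obj.startswith("<"):  # skip literals such as labels
            yield local_name(subject), local_name(predicate), local_name(obj)

with bz2.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as stream:
    edges = list(iter_entity_edges(stream))
print(edges)
```

The literal triple is dropped here for brevity; a fuller loader would keep it as a node property instead.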
Sodir (Norwegian Offshore Directorate)
Petroleum-domain example dataset — sodir.open("/data/sodir") returns
a queryable graph of fields, wellbores, discoveries, licences,
stratigraphy and 28 more node types from the public ArcGIS REST
FeatureServer at factmaps.sodir.no.
Built in ~30 s on first run, cached after. Useful as a worked example
of complement_blueprint (extend a baseline schema without touching
the canonical types). → Datasets guide.
Recipes
Short patterns for the most-common shapes. Each is self-contained.
Hybrid semantic + structural retrieval
Combine vector similarity (text_score()) with Cypher pattern
matching in one query:
graph.cypher("""
MATCH (c:Chunk)-[:IN_DOC]->(d:Document)
RETURN c.text, d.title,
text_score(c.embedding, $query_vec) AS score
ORDER BY score DESC LIMIT 5
""", params={"query_vec": query_embedding})
Vector embeddings via pip install 'kglite[embed]' (adds fastembed +
onnxruntime). → Semantic Search guide.
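The ranking step can be illustrated without any dependencies. The sketch below assumes the similarity metric is cosine similarity, which this section does not state for text_score(), and uses made-up three-dimensional vectors in place of real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy chunks, each already linked to its parent document (the structural part).
chunks = [
    {"text": "contract breach damages", "doc": "Case A", "embedding": [0.9, 0.1, 0.0]},
    {"text": "offshore drilling permit", "doc": "Case B", "embedding": [0.0, 0.2, 0.9]},
    {"text": "breach of lease terms", "doc": "Case C", "embedding": [0.8, 0.3, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]

# Score every chunk, then keep the best two (ORDER BY score DESC LIMIT 2).
ranked = sorted(chunks,
                key=lambda c: cosine_similarity(c["embedding"], query_vec),
                reverse=True)
top_docs = [c["doc"] for c in ranked[:2]]
print(top_docs)
```

In the Cypher version the scoring and the traversal to the parent document happen in one query, so there is no separate lookup step.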
Structural validators — surface data-integrity gaps
Fourteen built-in CALL procedures find the gaps that aren't visible
from normal queries: orphan nodes, missing-required-edge violations,
two-step cycles, duplicate titles, parallel edges, cardinality
violations, more. They compose with the rest of Cypher.
# Wellbores in our sodir graph that lack a production licence
graph.cypher("""
CALL missing_required_edge({type: 'Wellbore', edge: 'IN_LICENCE'}) YIELD node
RETURN node.id, node.title
""")
missing_required_edge and missing_inbound_edge validate the
(type, edge) direction against the graph's actual schema and refuse
to execute when misused. → Full procedure list in the
Cypher reference.
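What a missing-required-edge check computes can be sketched over a plain node map and edge list. This is an illustration of the idea only, assuming the check looks at outgoing edges; the function below is a stand-in written for this example, not the built-in procedure, and the data is made up.

```python
def missing_required_edge(nodes, edges, node_type, edge_type):
    """Return ids of `node_type` nodes that have no outgoing `edge_type` edge."""
    have_edge = {src for src, etype, _dst in edges if etype == edge_type}
    return sorted(node_id for node_id, ntype in nodes.items()
                  if ntype == node_type and node_id not in have_edge)

# node id -> node type
nodes = {"w1": "Wellbore", "w2": "Wellbore", "l1": "Licence"}
# (source id, edge type, target id)
edges = [("w1", "IN_LICENCE", "l1")]

gaps = missing_required_edge(nodes, edges, "Wellbore", "IN_LICENCE")
print(gaps)
```

Because the built-in version is a CALL procedure, its results can be filtered, joined, or counted by the Cypher that follows the YIELD clause.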
Graph algorithms
Shortest path (BFS or Dijkstra), centrality, community detection, clustering — all in Cypher:
graph.cypher("""
MATCH path = shortestPath((a:User {name:'Alice'})-[*]-(b:User {name:'Eve'}))
RETURN path
""")
→ Graph algorithms guide · Traversal patterns · Recipes index.
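For intuition, an unweighted shortest path is a breadth-first search. The sketch below is a textbook BFS over an adjacency dict with made-up data; it is not KGLite's implementation, and it treats edges as undirected to match the `-[*]-` pattern above.

```python
from collections import deque

def bfs_shortest_path(adjacency, start, goal):
    """Return the shortest path from start to goal as a list of nodes, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = current
            if neighbour == goal:
                # Walk the predecessor links back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None

# Build an undirected adjacency dict from an edge list.
knows = [("Alice", "Bob"), ("Bob", "Eve"), ("Alice", "Carol")]
adjacency = {}
for a, b in knows:
    adjacency.setdefault(a, []).append(b)
    adjacency.setdefault(b, []).append(a)

path = bfs_shortest_path(adjacency, "Alice", "Eve")
print(path)
```

Weighted graphs need Dijkstra instead, since BFS only minimises the number of hops.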
Examples
The examples/
directory has runnable, self-contained artifacts:
- codebase_to_claude_mcp.ipynb — clone an open-source repo, parse it into a code knowledge graph, register a workspace MCP server in Claude Desktop.
- sec_to_claude_mcp.ipynb — build a graph of SEC filings with SEC.fetch, query it, register it as a Claude Desktop MCP server.
- open_source_workspace_mcp.yaml — annotated workspace-mode manifest for the github-clone-tracker pattern. Walked through in the workspace manifest example.
- legal_graph.py — end-to-end add_nodes / add_connections from pandas DataFrames, covering laws, regulations, court decisions with citation edges.
- code_graph.py — build a code knowledge graph from a source directory via code_tree.build.
- spatial_graph.py — declarative CSV→graph loading via a JSON blueprint; lat/lon coordinates and pipeline-path traversal queries.
- crates/kglite-mcp-server/ — Rust-native single-binary MCP server (built on rmcp + the mcp-methods framework). Reach for it when the manifest doesn't express what you need; the binary is the reference for layering domain-specific tools on top of the generic surface.
Benchmarks
KGLite builds and queries Wikidata-scale graphs on a laptop. Measured
with bench/wiki_benchmark.py
on an M-series MacBook.
Ingest — full pipeline from compressed N-Triples to a queryable graph:
| dataset | triples | nodes | edges | ingest | throughput | peak RAM |
|---|---|---|---|---|---|---|
| wiki100m | 100 M | 938 K | 748 K | 29 s | 3.4 M triples/s | 1.3 GB |
| wiki500m | 500 M | 5.6 M | 6.7 M | 157 s | 3.2 M triples/s | 5.2 GB |
| wiki1000m | 1 B | 14.7 M | 15.4 M | 395 s | 2.5 M triples/s | 7.0 GB |
Reloading a saved 1 B-triple graph from disk (7 GB on-disk): 3.5 s.
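The throughput column in the ingest table follows directly from the triples and ingest columns, which a short calculation confirms. The figures are copied from the table; only the dict layout is introduced here.

```python
# (triples, ingest seconds) per dataset, as reported in the ingest table.
runs = {
    "wiki100m": (100e6, 29),
    "wiki500m": (500e6, 157),
    "wiki1000m": (1e9, 395),
}

# Throughput in millions of triples per second, rounded to one decimal.
throughput = {
    name: round(triples / seconds / 1e6, 1)
    for name, (triples, seconds) in runs.items()
}
print(throughput)  # {'wiki100m': 3.4, 'wiki500m': 3.2, 'wiki1000m': 2.5}
```

Throughput drops by roughly a quarter between the 100 M and 1 B runs, so ingest time grows somewhat faster than linearly at this scale.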
Query latency on the 1 B-triple graph (mapped storage):
| Cypher | wall |
|---|---|
| MATCH (n)-[:P31]->(:human) RETURN count(n) — typed aggregation | 0.5 ms |
| MATCH (a)-[:P31]->(b)-[:P279]->(c) LIMIT 10 — 2-hop typed | 0.9 ms |
| MATCH (a)-[:P31]->(b {nid:'Q64'}) RETURN a LIMIT 20 — pivot | 1 ms |
| MATCH (a)-[:P31]->(:human) MATCH (a)-[:P27]->(c) LIMIT 10 — join | 44 ms |
Disk and mapped storage build at the same speed; mapped wins on small-result queries (in-memory inverted index), disk wins on unbounded typed traversals (sorted-CSR mmap I/O). No server, no tuning, same Python process as your code.
Key Features
Quick reference. Each links into the appropriate guide.
| Feature | Description |
|---|---|
| Cypher | MATCH, CREATE, SET, DELETE, MERGE, UNION/INTERSECT/EXCEPT, aggregations (incl. median, percentile_cont, variance), reduce(), ORDER BY, LIMIT, SKIP |
| Semantic search | Vector embeddings + text_score() for similarity ranking. Opt-in via pip install 'kglite[embed]'. |
| Text predicates | text_edit_distance, text_normalize, text_jaccard, text_ngrams, text_contains_any / text_starts_with_any |
| Graph algorithms | Shortest path (BFS or Dijkstra), centrality, community detection, clustering |
| Structural validators | 14 CALL procedures: orphan_node, missing_required_edge, cycle_2step, inverse_violation, cardinality_violation, parallel_edges, null_property, more — agent-discoverable integrity checks composable with Cypher |
| Spatial | Coordinates, WKT geometry, distance + containment, kg_knn k-nearest-neighbour. Pragmatic primitives, not a full GIS stack. |
| Timeseries | Time-indexed values with ts_*() Cypher functions. For graphs whose nodes carry value-over-time series. |
| Bulk loading | add_nodes / add_connections for DataFrames |
| Blueprints | Declarative CSV-to-graph loading via JSON config |
| Import/Export | Save/load snapshots (.kgl), GraphML, CSV export |
| AI integration | describe() introspection, MCP server, agent prompts |
| Code analysis | 14-language tree-sitter parser (kglite.code_tree) — functions, classes, calls, imports, web-framework routes |
Documentation
Full docs at kglite.readthedocs.io:
Getting started
- Getting Started — installation, first graph, core concepts
- Querying overview — Cypher vs fluent API, when to reach for which
- Recipes index — copy-paste patterns for common shapes
Querying
- Cypher Guide — MATCH, MERGE, mutations, parameters, validators
- Traversal & hierarchy — variable-length paths, tree walks
- Graph algorithms — shortest path, PageRank, community detection
- Semantic Search — embeddings, vector search, hybrid retrieval
Loading data
- Data Loading — DataFrames in, DataFrames out
- Blueprints — declarative CSV→graph via JSON config
- Datasets — Wikidata + Sodir wrappers
- Code analysis — code_tree.build, framework route detection
- Import / Export — .kgl snapshots, GraphML, CSV
Domain features
- Spatial — WKT geometry, lat/lon, k-nearest-neighbour
- Timeseries — time-indexed values, ts_*() functions
Agent integration
- AI Agents — MCP server, describe(), agent prompts
- MCP server config — manifests, skills, extensions
Reference
- API Reference — full auto-generated reference
Requirements
Python 3.10+ (CPython) | macOS (ARM), Linux (x86_64/aarch64), Windows (x86_64) | pandas >= 1.5
License
MIT — see LICENSE for details.