A high-performance graph database library with Python bindings written in Rust

KGLite — Knowledge graph for Python, built for LLM agents

KGLite is an embedded, Cypher-queryable knowledge graph for Python, built so you can hand it to an LLM agent. Run pip install kglite, point kglite.code_tree.build(".") at any source directory, and you have your first queryable graph in seconds. It ships with a bundled MCP server, a describe() method that emits a system-prompt-shaped schema, and structural validators that compose with Cypher.

kglite is a pure-Rust knowledge graph engine (crates/kglite) packaged for Python via pip install kglite. The Bolt-server and MCP-server binaries are sibling Rust crates wrapping the same engine. If you want kglite as a Rust library — without the Python wheel in your build — see Use from Rust below.

Codebase → Claude

examples/codebase_to_claude_mcp.ipynb clones a GitHub repo, parses it into a code knowledge graph, and registers a workspace MCP server in Claude Desktop.

SEC filings → graph

from kglite.datasets.sec import SEC
g = SEC.fetch("./sec", "13F-HR", "TSLA", years=2,
              user_agent="Your Name your@email.com")

SEC.fetch downloads the named forms for the named companies and returns a Cypher-queryable graph — Form 4 insider transactions, 13F holdings, SC 13D stakes, DEF 14A board composition, 8-K events. examples/sec_to_claude_mcp.ipynb · SEC guide.

Use cases

The same agent-facing surface works whether the graph holds legal precedents, a Wikidata slice, a SQL warehouse, a RAG corpus, or a parsed codebase.

  • 🏦 SEC EDGAR. SEC.fetch(path, forms, companies, years=2) builds a US-public-company graph from the SEC's free data: insider transactions (Form 4), institutional holdings (13F), activist stakes (SC 13D), board composition (DEF 14A), 8-K events — with XBRL financials and Exhibit 21 subsidiaries via SEC.open. SEC guide.
  • 🏛️ Domain knowledge for agents. Legal precedents + citations, regulatory rules, medical ontologies, manufacturing BOMs, scientific catalogues — anything with structure becomes a queryable graph an MCP-capable agent can reason over. See the legal-graph example for a Norwegian-Supreme-Court walk-through (laws + decisions + citation edges + judge metadata).
  • 📊 Business data → queryable graph. Any tabular source — SQL, CSV, Parquet, REST API responses, pandas DataFrames — goes straight in via add_nodes(df, ...) and add_connections(df, ...). Layer a graph on top of your warehouse and the agent reasons over the relationships without you writing a server. Data Loading guide.
  • 🌐 Public datasets. wikidata.open(path) and sodir.open(path) handle the fetch + build + cache cycle. Mapped and disk storage query graphs that don't fit in RAM — a billion-edge Wikidata graph on a 16 GB laptop. → See Bundled datasets below.
  • 📚 RAG with structure. Documents, chunks, entities, and the edges between them in one graph. Combine text_score() vector similarity with Cypher traversal — "find court cases semantically similar to my fact pattern, then walk one hop to related precedents" — hybrid retrieval in one query, no second vector DB. Semantic Search guide.
  • 📂 Codebase analysis. kglite.code_tree.build(".") parses 13 languages into Function / Class / Module / Route nodes with web-framework route detection (Flask, FastAPI, Django). See the notebook above for the full code → Claude Desktop workflow. Code analysis guide.
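
To make the "Function / Class nodes" idea concrete, here is a minimal sketch that pulls class and function names out of a Python snippet using the standard-library ast module. It is not how kglite.code_tree works (the feature table describes a tree-sitter parser), and extract_nodes and SOURCE are hypothetical names used only for illustration.

```python
import ast

# A tiny source file to analyse (illustrative only).
SOURCE = '''
class Greeter:
    def greet(self, name):
        return f"hello {name}"

def main():
    Greeter().greet("world")
'''


def extract_nodes(source: str) -> list[tuple[str, str]]:
    """Return sorted (node_type, name) pairs for every class and function."""
    found = []
    for item in ast.walk(ast.parse(source)):
        if isinstance(item, ast.ClassDef):
            found.append(("Class", item.name))
        elif isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found.append(("Function", item.name))
    return sorted(found)


code_nodes = extract_nodes(SOURCE)
print(code_nodes)
```

Each pair here would correspond to one node in a code graph; call and import edges would need a second pass over the same tree.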

Why Cypher?

Questions over connected data — which insiders sold this stock, who sits on two boards, what cites this case — are pattern matches. In SQL they become multi-table joins; in Cypher the pattern is the query:

-- Insider sells, most recent first
MATCH (t:InsiderTransaction {direction: 'sale'})-[:BY_INSIDER]->(p:Person)
MATCH (t)-[:IN_COMPANY]->(c:Company)
RETURN p.title, c.title, t.shares, t.price_per_share
ORDER BY t.transaction_date DESC LIMIT 10

Cypher pays off most when the data has real structure and your questions traverse it.
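
For contrast, here is a sketch of the "who sits on two boards" question in its relational form, using Python's built-in sqlite3 module. The table and column names (person, company, board_seat) are assumptions made up for this example and are not part of any kglite schema.

```python
import sqlite3

# In-memory relational model: two entity tables plus a link table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE board_seat (person_id INTEGER, company_id INTEGER);
    INSERT INTO person VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO company VALUES (10, 'Acme'), (20, 'Globex');
    INSERT INTO board_seat VALUES (1, 10), (1, 20), (2, 10);
""")

# The relationship has to be spelled out as two joins through the link table.
rows = conn.execute("""
    SELECT p.name, COUNT(DISTINCT c.id) AS boards
    FROM person AS p
    JOIN board_seat AS s ON s.person_id = p.id
    JOIN company AS c ON c.id = s.company_id
    GROUP BY p.id
    HAVING COUNT(DISTINCT c.id) >= 2
""").fetchall()
print(rows)
```

In a graph model the link table disappears: the seat is an edge, and the query names the edge directly instead of joining through it.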

How it compares

| | KGLite | Kuzu | NetworkX | rustworkx | Neo4j Embedded |
| --- | --- | --- | --- | --- | --- |
| Install | pip install kglite | pip install kuzu | pip install networkx | pip install rustworkx | JVM + Java deps |
| Query language | Cypher (subset) | Cypher (full) | Python API | Python API | Cypher (full) |
| Storage | in-mem · mmap · disk (1B+ edges) | in-mem · disk (columnar) | in-mem | in-mem | in-mem · disk (JVM) |
| Bulk-load from pandas | one-liner | via Arrow | manual | manual | via driver |
| Bundled MCP server for LLM agents | ✅ | | | | |
| describe() schema for LLM prompts | ✅ | | | | |
| Embeddable in Rust (no Python in build) | ✅ (crates/kglite) | | | | |
| Codebase → graph parser | 13 languages, route detection | | | | |
| Bundled public datasets | SEC EDGAR, Wikidata, Sodir | | toy graphs only | | |
| License | MIT | MIT | BSD-3 | Apache-2 | GPLv3 |

Pick KGLite when you want Cypher + Python ergonomics + LLM-agent plumbing in one wheel. Pick Kuzu if your workload is heavy analytical OLAP and you can accept that the project is no longer maintained (archived 2025). Pick NetworkX when you need its enormous graph-algorithm library and your data fits in RAM. Pick rustworkx when you want NetworkX's API in Rust with no query language. Pick Neo4j Embedded when you've standardised on server-mode Cypher and want the in-process driver for tests.

What's coming. The roadmap lays out where this is heading — Bolt protocol server first (drop-in for any Neo4j-aware client), then bindings beyond Python.

Quick Start

# Python (the headline distribution path)
pip install kglite

# Optional extras
pip install 'kglite[embed]'      # fastembed + onnxruntime for text_score()
pip install 'kglite[neo4j]'      # Neo4j Python driver for Bolt-server tests

import pandas as pd
import kglite

# Three storage modes — pick by graph size:
#   default (in-memory)   — small/medium graphs, fastest queries
#   storage="mapped"      — mmap columns, RAM-friendly as you grow
#   storage="disk", path=…  — 100M+ nodes, Wikidata-scale, loaded lazily
graph = kglite.KnowledgeGraph()

# Bulk-load nodes from a DataFrame.
people = pd.DataFrame({
    "id":   ["alice", "bob", "eve"],
    "name": ["Alice", "Bob", "Eve"],
    "age":  [28, 35, 41],
    "city": ["Oslo", "Bergen", "Trondheim"],
})
graph.add_nodes(people, node_type="Person", unique_id_field="id", node_title_field="name")

# Bulk-load relationships the same way.
knows = pd.DataFrame({"src": ["alice", "bob"], "tgt": ["bob", "eve"]})
graph.add_connections(knows, connection_type="KNOWS",
                      source_type="Person", source_id_field="src",
                      target_type="Person", target_id_field="tgt")

# Query — returns a ResultView (lazy; data stays in Rust until accessed).
for row in graph.cypher("""
    MATCH (p:Person) WHERE p.age > 30
    RETURN p.name AS name, p.city AS city
    ORDER BY p.age DESC
"""):
    print(row['name'], row['city'])

# Or get a pandas DataFrame directly.
df = graph.cypher("MATCH (p:Person) RETURN p.name, p.age ORDER BY p.age", to_df=True)

# Persist to disk and reload.
graph.save("my_graph.kgl")
loaded = kglite.load("my_graph.kgl")

Getting Started guide · Cypher reference · API reference.

Serve it to an agent

Three levels of effort, three levels of capability.

1. One command — any .kgl becomes an MCP server

kglite-mcp-server --graph path/to/graph.kgl

The server exposes cypher_query, graph_overview, schema introspection, structural validators, and source-file tools over MCP stdio. Drop it into Claude Desktop / Cursor / any MCP-capable client and your graph is queryable. Works on every graph kglite can build — your own, Wikidata, Sodir, code-tree.

2. Customise with a YAML manifest

Drop <basename>_mcp.yaml next to the graph (e.g. wikidata_mcp.yaml beside wikidata.kgl) and the server auto-loads it at boot.

name: Wikidata Explorer
source_root: /path/to/related/source        # exposes read/grep/list
extensions:
  embedder: { kind: fastembed, model: bge-small }   # enables text_score()
  csv_http_server: true                              # bulk CSV exports
tools:                                               # inline parameterised Cypher
  - name: who_invented
    cypher: |
      MATCH (i:Q5)-[:P61]->(t {label:$thing})
      RETURN i.label LIMIT 5

No fork required for most customisation. MCP server guide.
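
The naming convention for the manifest (<basename>_mcp.yaml beside the graph file) can be sketched with pathlib. The helper manifest_path_for is hypothetical and only illustrates the rule stated above; it does not describe how the server itself resolves the path.

```python
from pathlib import Path


def manifest_path_for(graph_path: str) -> Path:
    """Return the sibling '<basename>_mcp.yaml' path for a .kgl graph file."""
    graph = Path(graph_path)
    # Same directory, same stem, with '_mcp.yaml' replacing the '.kgl' suffix.
    return graph.with_name(f"{graph.stem}_mcp.yaml")


print(manifest_path_for("data/wikidata.kgl"))
```

A helper like this is handy in build scripts that write the graph and its manifest together, so the two files cannot drift apart in name.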

3. Teach the agent with bundled skills

Markdown skill files (<basename>.skills/*.md) ship methodology for each tool. The agent reads cypher_query.md at session start to learn your schema conventions, read_code_source.md to know when to drill into source vs. query the graph, etc. Three layers compose: kglite-bundled defaults + your project's .skills/ overrides + operator-declared domain packs. Skills with applies_when: predicates only activate when the graph contains the relevant node types — so a non-code graph never sees read_code_source methodology.

Net effect: the agent comes pre-loaded with how to use your graph, rather than discovering it through trial-and-error. AI Agents guide.
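
The layering and applies_when behaviour described above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the dict shape, the rule that later layers override earlier ones on a name clash, and the rule that a skill activates only when all of its listed node types are present. The source does not specify these details, and resolve_skills is a hypothetical name.

```python
def resolve_skills(layers, graph_node_types):
    """Merge skill layers in order, then keep skills whose predicate matches."""
    merged = {}
    for layer in layers:
        # Assumed precedence: a later layer replaces an earlier skill of the same name.
        merged.update(layer)

    present = set(graph_node_types)
    active = {}
    for name, skill in merged.items():
        required = set(skill.get("applies_when", ()))
        # Assumed predicate: every required node type must exist in the graph.
        if required <= present:
            active[name] = skill["body"]
    return active


bundled = {
    "cypher_query": {"body": "default Cypher tips"},
    "read_code_source": {"body": "when to read source", "applies_when": ["Function"]},
}
project = {"cypher_query": {"body": "project Cypher conventions"}}

sec_skills = resolve_skills([bundled, project], {"Company", "Person"})
code_skills = resolve_skills([bundled, project], {"Function", "Class"})
print(sorted(sec_skills), sorted(code_skills))
```

Under these assumptions a graph without Function nodes never receives the read_code_source skill, which matches the behaviour the paragraph describes.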

Bundled datasets

Three wrappers turn well-known public sources into queryable graphs without writing a loader. Each handles the fetch + build + cache cycle, returns a KnowledgeGraph you can cypher() against, and respects a per-dataset cooldown so re-running just reloads the cached graph in seconds. KGLite is independent of the upstream organisations — see each module docstring for non-affiliation notes. Datasets guide.

SEC EDGAR

US-public-company knowledge graph from the SEC's free public data — all 14M historical filings + per-filing payload parsing for Form 4 (insider transactions), 13F-HR (institutional holdings), SC 13D (activist stakes), DEF 14A (board composition), XBRL company facts (financial metrics), 10-K Exhibit 21 (subsidiaries), 8-K cover pages (material event Item codes):

from kglite.datasets.sec import SEC

# SEC.fetch — name the forms, the companies, a span; get a graph back.
g = SEC.fetch("/data/sec", ["4", "8-K", "DEF 14A"], ["AAPL", "TSLA"],
              years=2, user_agent="Your Name your@email.com")

# SEC.open — full control: separate filing-index vs. payload spans,
# storage mode, and the include_* flags (XBRL financials, Exhibit 21
# subsidiaries).
g = SEC.open("/data/sec", years=10, detailed=2,
             user_agent="Your Name your@email.com")

# Full universe — drop `companies`; auto-escalates to mode="disk".
g = SEC.open("/data/sec", years="all", detailed=5,
             user_agent="Your Name your@email.com")

More than two dozen typed node types (Company, Person, Filing, InsiderTransaction, Holding, InstitutionalHolding, CorporateEvent, Compensation, Role, MetricFact, Subsidiary and more) are wired by typed edges, and every fact node traces back to its source filing. The cache has three tiers, raw / processed / graph/{mode}: raw is immutable, processed regenerates only when its raw source changes, and graph/{mode}/ is reused on reopen unless force_rebuild=True. SEC's 10 req/s fair-access policy is enforced by an internal token-bucket rate limiter; the user_agent argument is mandatory (SEC returns 403 without it).
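
For readers unfamiliar with the technique, here is a generic token-bucket sketch sized for 10 requests per second. It is not kglite's internal limiter; the class name, the capacity of 10, and the injectable clock are assumptions chosen so the behaviour can be checked without sleeping.

```python
import time


class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        # Credit tokens for the time elapsed since the previous call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# A fake clock makes the example deterministic.
fake_now = [0.0]
bucket = TokenBucket(rate=10, capacity=10, clock=lambda: fake_now[0])

burst = sum(bucket.try_acquire() for _ in range(15))      # requests at t = 0.0 s
fake_now[0] = 0.5
refilled = sum(bucket.try_acquire() for _ in range(15))   # requests at t = 0.5 s
print(burst, refilled)
```

A real client would sleep until the next token is due rather than drop the request; the sketch only shows the accounting.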

Source data is public domain (US Govt work) — redistribute the built .kgl however you like. SEC guide.

Wikidata

Single-stream latest-truthy.nt.bz2 from dumps.wikimedia.org — parallel-decoded with a bit-level block scanner, parsed, built into a queryable graph in one call:

from kglite.datasets import wikidata

g = wikidata.open("/data/wd")                                    # full graph
g = wikidata.open("/data/wd", entity_limit_millions=100)         # 100M slice
g = wikidata.open("/data/wd", storage="memory",                  # in-memory, fast tests
                  entity_limit_millions=10)
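
As a rough picture of the input format, the sketch below reads a bz2-compressed N-Triples stream sequentially with the standard library and splits each line into subject, predicate, and object. It is not the parallel block-scanning decoder described above, and iter_triples, local_id, and SAMPLE are hypothetical names for this example.

```python
import bz2
import io

# Two lines in the shape used by the truthy N-Triples dump.
SAMPLE = (
    b"<http://www.wikidata.org/entity/Q42> "
    b"<http://www.wikidata.org/prop/direct/P31> "
    b"<http://www.wikidata.org/entity/Q5> .\n"
    b"<http://www.wikidata.org/entity/Q42> "
    b"<http://www.wikidata.org/prop/direct/P27> "
    b"<http://www.wikidata.org/entity/Q145> .\n"
)


def local_id(term: str) -> str:
    """Reduce '<http://.../Q42>' to 'Q42'."""
    return term.strip("<>").rsplit("/", 1)[-1]


def iter_triples(compressed: bytes):
    """Yield (subject, predicate, object) terms from bz2-compressed N-Triples."""
    with bz2.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            subject, predicate, rest = line.split(" ", 2)
            # Drop the statement terminator; the object may itself contain spaces.
            if rest.endswith("."):
                rest = rest[:-1]
            yield subject, predicate, rest.rstrip()


triples = [
    (local_id(s), local_id(p), local_id(o))
    for s, p, o in iter_triples(bz2.compress(SAMPLE))
]
print(triples)
```

local_id only makes sense for IRI objects; literal objects would need their own handling.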

Sodir (Norwegian Offshore Directorate)

Petroleum-domain example dataset — sodir.open("/data/sodir") returns a queryable graph of fields, wellbores, discoveries, licences, stratigraphy and 28 more node types from the public ArcGIS REST FeatureServer at factmaps.sodir.no. Built in ~30 s on first run, cached after. Useful as a worked example of complement_blueprint (extend a baseline schema without touching the canonical types) — Datasets guide.

Recipes

Short patterns for the most-common shapes. Each is self-contained.

Hybrid semantic + structural retrieval

Combine vector similarity (text_score()) with Cypher pattern matching in one query:

graph.cypher("""
    MATCH (c:Chunk)-[:IN_DOC]->(d:Document)
    RETURN c.text, d.title,
           text_score(c.embedding, $query_vec) AS score
    ORDER BY score DESC LIMIT 5
""", params={"query_vec": query_embedding})

Vector embeddings via pip install 'kglite[embed]' (adds fastembed + onnxruntime). Semantic Search guide.
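
To show what ranking by vector similarity means, here is a pure-Python sketch that orders chunks by cosine similarity to a query vector. The use of cosine similarity is an assumption; the source does not say which metric text_score() computes, and the vectors and chunk ids below are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy 2-dimensional embeddings keyed by chunk id.
chunks = {"c1": [1.0, 0.0], "c2": [0.6, 0.8], "c3": [0.0, 1.0]}
query_vec = [1.0, 0.0]

# Highest similarity first, mirroring ORDER BY score DESC.
ranked = sorted(chunks, key=lambda cid: cosine(chunks[cid], query_vec), reverse=True)
print(ranked)
```

The hybrid query in this recipe does the same ranking, but only over chunks that also satisfy the graph pattern.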

Structural validators — surface data-integrity gaps

Fourteen built-in CALL procedures find the gaps that aren't visible from normal queries: orphan nodes, missing-required-edge violations, two-step cycles, duplicate titles, parallel edges, cardinality violations, more. They compose with the rest of Cypher.

# Wellbores in our sodir graph that lack a production licence
graph.cypher("""
    CALL missing_required_edge({type: 'Wellbore', edge: 'IN_LICENCE'}) YIELD node
    RETURN node.id, node.title
""")

missing_required_edge and missing_inbound_edge validate the (type, edge) direction against the graph's actual schema and refuse to execute when misused. → Full procedure list in the Cypher reference.
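
The check itself is simple to picture in plain Python. The sketch below finds nodes of a given type that have no outgoing edge of a given type. It assumes that missing_required_edge checks the outgoing direction (with missing_inbound_edge covering the reverse); the function name and the toy data are hypothetical.

```python
# Toy graph: node id -> node type, plus (source, edge_type, target) triples.
nodes = {"w1": "Wellbore", "w2": "Wellbore", "l1": "Licence"}
edges = [("w1", "IN_LICENCE", "l1")]


def find_missing_outgoing(nodes, edges, node_type, edge_type):
    """Return ids of `node_type` nodes with no outgoing `edge_type` edge."""
    have_edge = {src for src, etype, _ in edges if etype == edge_type}
    return sorted(
        node_id
        for node_id, ntype in nodes.items()
        if ntype == node_type and node_id not in have_edge
    )


unlicensed = find_missing_outgoing(nodes, edges, "Wellbore", "IN_LICENCE")
print(unlicensed)
```

An ordinary MATCH on the edge pattern returns only nodes that have the edge, which is why these gaps do not show up in normal queries.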

Graph algorithms

Shortest path (BFS or Dijkstra), centrality, community detection, clustering — all in Cypher:

graph.cypher("""
    MATCH path = shortestPath((a:User {name:'Alice'})-[*]-(b:User {name:'Eve'}))
    RETURN path
""")

Graph algorithms guide · Traversal patterns · Recipes index.
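
For reference, unweighted shortest path is breadth-first search with parent tracking. The sketch below is a textbook version over an undirected adjacency list, not kglite's implementation; the names and the toy KNOWS edges are illustrative.

```python
from collections import deque


def bfs_shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal, or None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Build an undirected adjacency list, matching the undirected -[*]- pattern.
knows = [("Alice", "Bob"), ("Bob", "Eve"), ("Alice", "Carol")]
adjacency = {}
for a, b in knows:
    adjacency.setdefault(a, []).append(b)
    adjacency.setdefault(b, []).append(a)

path = bfs_shortest_path(adjacency, "Alice", "Eve")
print(path)
```

Dijkstra replaces the FIFO queue with a priority queue keyed on accumulated edge weight; with unit weights the two give the same path length.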

Use from Rust

The same engine is available as a pure-Rust crate — embed it in a Rust binary without the Python wheel in your build:

# Cargo.toml
[dependencies]
kglite = "0.10"

use kglite::api::{load_file, session, Value};
use std::collections::HashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let graph = load_file("my_graph.kgl")?;     // same .kgl as Python writes
    let params = HashMap::new();
    let opts = session::ExecuteOptions {
        params: &params, deadline: None, max_rows: None,
        lazy_eligible: false, disabled_passes: None, embedder: None,
    };
    let outcome = session::execute_read(
        &graph,
        "MATCH (p:Person) RETURN p.name LIMIT 5",
        &opts,
    )?;
    for row in &outcome.result.rows {
        if let Some(Value::String(name)) = row.first() {
            println!("{}", name);
        }
    }
    Ok(())
}

Zero PyO3 in the dependency tree: cargo tree -p your-crate | grep pyo3 → empty.

The Bolt server (crates/kglite-bolt-server) and the Rust MCP server (crates/kglite-mcp-server) are standalone binaries built on the same engine — see the Operators guide for deployment.

Examples

The examples/ directory has runnable, self-contained artifacts:

  • codebase_to_claude_mcp.ipynb — clone an open-source repo, parse it into a code knowledge graph, register a workspace MCP server in Claude Desktop.
  • sec_to_claude_mcp.ipynb — build a graph of SEC filings with SEC.fetch, query it, register it as a Claude Desktop MCP server.
  • open_source_workspace_mcp.yaml — annotated workspace-mode manifest for the github-clone-tracker pattern. Walked through in the workspace manifest example.
  • legal_graph.py — end-to-end add_nodes / add_connections from pandas DataFrames, covering laws, regulations, court decisions with citation edges.
  • code_graph.py — build a code knowledge graph from a source directory via code_tree.build.
  • spatial_graph.py — declarative CSV→graph loading via a JSON blueprint; lat/lon coordinates and pipeline-path traversal queries.
  • crates/kglite-mcp-server/ — Rust-native single-binary MCP server (built on rmcp + the mcp-methods framework). Reach for it when the manifest doesn't express what you need; the binary is the reference for layering domain-specific tools on top of the generic surface.

Benchmarks

KGLite builds and queries Wikidata-scale graphs on a laptop. Measured with bench/wiki_benchmark.py on an M-series MacBook.

Ingest — full pipeline from compressed N-Triples to a queryable graph:

| dataset | triples | nodes | edges | ingest | throughput | peak RAM |
| --- | --- | --- | --- | --- | --- | --- |
| wiki100m | 100 M | 938 K | 748 K | 29 s | 3.4 M triples/s | 1.3 GB |
| wiki500m | 500 M | 5.6 M | 6.7 M | 157 s | 3.2 M triples/s | 5.2 GB |
| wiki1000m | 1 B | 14.7 M | 15.4 M | 395 s | 2.5 M triples/s | 7.0 GB |

Reloading a saved 1 B-triple graph from disk (7 GB on-disk): 3.5 s.
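
The throughput column is triples divided by ingest time. A quick check of that arithmetic, using only the triple counts and ingest times reported in the table:

```python
# (triples, ingest seconds) per dataset, copied from the ingest table.
runs = {
    "wiki100m": (100e6, 29),
    "wiki500m": (500e6, 157),
    "wiki1000m": (1e9, 395),
}

# Throughput in millions of triples per second, rounded to one decimal.
throughput = {
    name: round(triples / seconds / 1e6, 1)
    for name, (triples, seconds) in runs.items()
}
print(throughput)
```

The figures agree with the reported 3.4, 3.2, and 2.5 M triples/s, so throughput drops by roughly a quarter between the 100 M and 1 B runs.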

Query latency on the 1 B-triple graph (mapped storage):

| Cypher | wall |
| --- | --- |
| MATCH (n)-[:P31]->(:human) RETURN count(n) (typed aggregation) | 0.5 ms |
| MATCH (a)-[:P31]->(b)-[:P279]->(c) LIMIT 10 (2-hop typed) | 0.9 ms |
| MATCH (a)-[:P31]->(b {nid:'Q64'}) RETURN a LIMIT 20 (pivot) | 1 ms |
| MATCH (a)-[:P31]->(:human) MATCH (a)-[:P27]->(c) LIMIT 10 (join) | 44 ms |

Disk and mapped storage build at the same speed; mapped wins on small-result queries (in-memory inverted index), disk wins on unbounded typed traversals (sorted-CSR mmap I/O). No server, no tuning, same Python process as your code.
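
For readers who have not met CSR (compressed sparse row), the sketch below builds the generic textbook layout: edges sorted by source, one offsets array, and one targets array. It illustrates the general idea only and is not kglite's on-disk format; build_csr and neighbours are hypothetical names.

```python
def build_csr(num_nodes, edges):
    """Build (offsets, targets) arrays from (source, target) integer pairs."""
    edges = sorted(edges)
    offsets = [0] * (num_nodes + 1)
    # Count the out-degree of each source node...
    for src, _ in edges:
        offsets[src + 1] += 1
    # ...then prefix-sum so offsets[n] is where node n's neighbours begin.
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]
    targets = [dst for _, dst in edges]
    return offsets, targets


def neighbours(offsets, targets, node):
    """All targets of `node` form one contiguous slice of `targets`."""
    return targets[offsets[node]:offsets[node + 1]]


offsets, targets = build_csr(4, [(2, 3), (0, 2), (0, 1), (1, 2)])
print(offsets, targets, neighbours(offsets, targets, 0))
```

Because each node's neighbours are contiguous, a traversal reads sequential ranges, which is the access pattern that suits memory-mapped files.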

Key Features

Quick reference. Each links into the appropriate guide.

| Feature | Description |
| --- | --- |
| Cypher | MATCH, CREATE, SET, DELETE, MERGE, UNION/INTERSECT/EXCEPT, aggregations (incl. median, percentile_cont, variance), reduce(), ORDER BY, LIMIT, SKIP |
| Semantic search | Vector embeddings + text_score() for similarity ranking. Opt-in via pip install 'kglite[embed]'. |
| Text predicates | text_edit_distance, text_normalize, text_jaccard, text_ngrams, text_contains_any / text_starts_with_any |
| Graph algorithms | Shortest path (BFS or Dijkstra), centrality, community detection, clustering |
| Structural validators | 14 CALL procedures: orphan_node, missing_required_edge, cycle_2step, inverse_violation, cardinality_violation, parallel_edges, null_property, more; agent-discoverable integrity checks composable with Cypher |
| Spatial | Coordinates, WKT geometry, distance + containment, kg_knn k-nearest-neighbour. Pragmatic primitives, not a full GIS stack. |
| Timeseries | Time-indexed values with ts_*() Cypher functions. For graphs whose nodes carry value-over-time series. |
| Bulk loading | add_nodes / add_connections for DataFrames |
| Blueprints | Declarative CSV-to-graph loading via JSON config |
| Import/Export | Save/load snapshots (.kgl), GraphML, CSV export |
| AI integration | describe() introspection, MCP server, agent prompts |
| Code analysis | 14-language tree-sitter parser (kglite.code_tree): functions, classes, calls, imports, web-framework routes |

Documentation

Full docs at kglite.readthedocs.io — five tracks by audience.

Python track: pip install kglite

Rust track: cargo add kglite

Operators — running the protocol servers

  • Bolt server — Neo4j wire compat for cluster-aware drivers

Reference — cross-binding

Concepts — architecture + contributor docs

Requirements

Python 3.10+ (CPython) | macOS (ARM), Linux (x86_64/aarch64), Windows (x86_64) | pandas >= 1.5

License

MIT — see LICENSE for details.
