A high-performance graph database library written in Rust, with Python bindings

KGLite — Knowledge graph for Python, built for LLM agents

KGLite is an embedded, Cypher-queryable knowledge graph for Python, built so you can hand it to an LLM agent. Run pip install kglite, point kglite.code_tree.build(".") at any source directory, and you have your first queryable graph in seconds. It ships with a bundled MCP server, a describe() method that emits a system-prompt-shaped schema, and structural validators that compose with Cypher.

Codebase → Claude

examples/codebase_to_claude_mcp.ipynb clones a GitHub repo, parses it into a code knowledge graph, and registers a workspace MCP server in Claude Desktop.

SEC filings → graph

from kglite.datasets.sec import SEC
g = SEC.fetch("./sec", "13F-HR", "TSLA", years=2,
              user_agent="Your Name your@email.com")

SEC.fetch downloads the named forms for the named companies and returns a Cypher-queryable graph — Form 4 insider transactions, 13F holdings, SC 13D stakes, DEF 14A board composition, 8-K events. examples/sec_to_claude_mcp.ipynb · SEC guide.

Use cases

The same agent-facing surface works whether the graph holds legal precedents, a Wikidata slice, a SQL warehouse, a RAG corpus, or a parsed codebase.

  • 🏦 SEC EDGAR. SEC.fetch(path, forms, companies, years=2) builds a US-public-company graph from the SEC's free data: insider transactions (Form 4), institutional holdings (13F), activist stakes (SC 13D), board composition (DEF 14A), 8-K events — with XBRL financials and Exhibit 21 subsidiaries via SEC.open. SEC guide.
  • 🏛️ Domain knowledge for agents. Legal precedents + citations, regulatory rules, medical ontologies, manufacturing BOMs, scientific catalogues — anything with structure becomes a queryable graph an MCP-capable agent can reason over. See the legal-graph example for a Norwegian-Supreme-Court walk-through (laws + decisions + citation edges + judge metadata).
  • 📊 Business data → queryable graph. Any tabular source — SQL, CSV, Parquet, REST API responses, pandas DataFrames — goes straight in via add_nodes(df, ...) and add_connections(df, ...). Layer a graph on top of your warehouse and the agent reasons over the relationships without you writing a server. Data Loading guide.
  • 🌐 Public datasets. wikidata.open(path) and sodir.open(path) handle the fetch + build + cache cycle. Mapped and disk storage query graphs that don't fit in RAM — a billion-edge Wikidata graph on a 16 GB laptop. → See Bundled datasets below.
  • 📚 RAG with structure. Documents, chunks, entities, and the edges between them in one graph. Combine text_score() vector similarity with Cypher traversal — "find court cases semantically similar to my fact pattern, then walk one hop to related precedents" — hybrid retrieval in one query, no second vector DB. Semantic Search guide.
  • 📂 Codebase analysis. kglite.code_tree.build(".") parses 13 languages into Function / Class / Module / Route nodes with web-framework route detection (Flask, FastAPI, Django). See the notebook above for the full code → Claude Desktop workflow. Code analysis guide.

Why Cypher?

Questions over connected data — which insiders sold this stock, who sits on two boards, what cites this case — are pattern matches. In SQL they become multi-table joins; in Cypher the pattern is the query:

-- Insider sells, most recent first
MATCH (t:InsiderTransaction {direction: 'sale'})-[:BY_INSIDER]->(p:Person)
MATCH (t)-[:IN_COMPANY]->(c:Company)
RETURN p.title, c.title, t.shares, t.price_per_share
ORDER BY t.transaction_date DESC LIMIT 10

Cypher pays off most when the data has real structure and your questions traverse it.
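
For contrast, the same question against a relational layout can be sketched with Python's built-in sqlite3 module. The table and column names below are hypothetical, chosen only to mirror the Cypher pattern above, and the rows are made-up sample data rather than real filings.

```python
import sqlite3

# Hypothetical relational layout mirroring the graph pattern above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE company (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE insider_transaction (
        id INTEGER PRIMARY KEY,
        person_id INTEGER REFERENCES person(id),
        company_id INTEGER REFERENCES company(id),
        direction TEXT,
        shares INTEGER,
        price_per_share REAL,
        transaction_date TEXT
    );
    INSERT INTO person VALUES (1, 'Jane Doe'), (2, 'John Roe');
    INSERT INTO company VALUES (1, 'Example Corp');
    INSERT INTO insider_transaction VALUES
        (1, 1, 1, 'sale',     100, 10.0, '2024-01-05'),
        (2, 2, 1, 'purchase',  50,  9.5, '2024-02-01'),
        (3, 2, 1, 'sale',     200, 11.0, '2024-03-10');
""")

# Each relationship hop in the Cypher pattern becomes one JOIN here.
rows = conn.execute("""
    SELECT p.title, c.title, t.shares, t.price_per_share
    FROM insider_transaction AS t
    JOIN person  AS p ON p.id = t.person_id
    JOIN company AS c ON c.id = t.company_id
    WHERE t.direction = 'sale'
    ORDER BY t.transaction_date DESC
    LIMIT 10
""").fetchall()

for row in rows:
    print(row)
```

Two hops need two joins; longer traversals add one join per hop, which is where the pattern syntax starts to pay for itself.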

How it compares

Feature | KGLite | Kuzu | NetworkX | rustworkx | Neo4j Embedded
Install | pip install kglite | pip install kuzu | pip install networkx | pip install rustworkx | JVM + Java deps
Query language | Cypher (subset) | Cypher (full) | Python API | Python API | Cypher (full)
Storage | in-mem · mmap · disk (1B+ edges) | in-mem · disk (columnar) | in-mem | in-mem | in-mem · disk (JVM)
Bulk-load from pandas | one-liner | via Arrow | manual | manual | via driver
Bundled MCP server for LLM agents
describe() schema for LLM prompts
Codebase → graph parser 13 languages, route detection
Bundled public datasets SEC EDGAR, Wikidata, Sodir toy graphs only
License | MIT | MIT | BSD-3 | Apache-2 | GPLv3

Pick KGLite when you want Cypher + Python ergonomics + LLM-agent plumbing in one wheel. Pick Kuzu for full openCypher coverage and analytical OLAP throughput. Pick NetworkX when you need its enormous graph-algorithm library and your data fits in RAM. Pick rustworkx when you want NetworkX's API in Rust with no query language. Pick Neo4j Embedded when you've standardised on server-mode Cypher and want the in-process driver for tests.

Quick Start

pip install kglite
import pandas as pd
import kglite

# Three storage modes — pick by graph size:
#   default (in-memory)   — small/medium graphs, fastest queries
#   storage="mapped"      — mmap columns, RAM-friendly as you grow
#   storage="disk", path=…  — 100M+ nodes, Wikidata-scale, loaded lazily
graph = kglite.KnowledgeGraph()

# Bulk-load nodes from a DataFrame.
people = pd.DataFrame({
    "id":   ["alice", "bob", "eve"],
    "name": ["Alice", "Bob", "Eve"],
    "age":  [28, 35, 41],
    "city": ["Oslo", "Bergen", "Trondheim"],
})
graph.add_nodes(people, node_type="Person", unique_id_field="id", node_title_field="name")

# Bulk-load relationships the same way.
knows = pd.DataFrame({"src": ["alice", "bob"], "tgt": ["bob", "eve"]})
graph.add_connections(knows, connection_type="KNOWS",
                      source_type="Person", source_id_field="src",
                      target_type="Person", target_id_field="tgt")

# Query — returns a ResultView (lazy; data stays in Rust until accessed).
for row in graph.cypher("""
    MATCH (p:Person) WHERE p.age > 30
    RETURN p.name AS name, p.city AS city
    ORDER BY p.age DESC
"""):
    print(row['name'], row['city'])

# Or get a pandas DataFrame directly.
df = graph.cypher("MATCH (p:Person) RETURN p.name, p.age ORDER BY p.age", to_df=True)

# Persist to disk and reload.
graph.save("my_graph.kgl")
loaded = kglite.load("my_graph.kgl")

Getting Started guide · Cypher reference · API reference.

Serve it to an agent

Three levels of effort, three levels of capability.

1. One command — any .kgl becomes an MCP server

kglite-mcp-server --graph path/to/graph.kgl

The server exposes cypher_query, graph_overview, schema introspection, structural validators, and source-file tools over MCP stdio. Drop it into Claude Desktop / Cursor / any MCP-capable client and your graph is queryable. Works on every graph kglite can build — your own, Wikidata, Sodir, code-tree.

2. Customise with a YAML manifest

Drop <basename>_mcp.yaml next to the graph (e.g. wikidata_mcp.yaml beside wikidata.kgl) and the server auto-loads it at boot.

name: Wikidata Explorer
source_root: /path/to/related/source        # exposes read/grep/list
extensions:
  embedder: { kind: fastembed, model: bge-small }   # enables text_score()
  csv_http_server: true                              # bulk CSV exports
tools:                                               # inline parameterised Cypher
  - name: who_invented
    cypher: |
      MATCH (i:Q5)-[:P61]->(t {label:$thing})
      RETURN i.label LIMIT 5

No fork required for most customisation. MCP server guide.
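
The naming convention for the auto-loaded manifest can be sketched in a few lines. The helper below is hypothetical and not part of kglite; it only derives the sibling file name that the convention above describes, using an example path.

```python
from pathlib import PurePosixPath


def manifest_path_for(graph_path: str) -> PurePosixPath:
    """Return the <basename>_mcp.yaml path that sits next to a graph file."""
    graph = PurePosixPath(graph_path)
    # wikidata.kgl -> wikidata_mcp.yaml, in the same directory
    return graph.with_name(f"{graph.stem}_mcp.yaml")


manifest = manifest_path_for("/data/graphs/wikidata.kgl")
print(manifest)
```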

3. Teach the agent with bundled skills

Markdown skill files (<basename>.skills/*.md) ship methodology for each tool. The agent reads cypher_query.md at session start to learn your schema conventions, read_code_source.md to know when to drill into source vs. query the graph, etc. Three layers compose: kglite-bundled defaults + your project's .skills/ overrides + operator-declared domain packs. Skills with applies_when: predicates only activate when the graph contains the relevant node types — so a non-code graph never sees read_code_source methodology.

Net effect: the agent comes pre-loaded with how to use your graph, rather than discovering it through trial-and-error. AI Agents guide.
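
As a rough illustration of the applies_when idea, the sketch below filters skills by the node types present in a graph. The Skill class, the set-based predicate, and the legal node type names are assumptions made for this example; the real predicate format is described in the AI Agents guide, not here.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    # Hypothetical predicate: node types that must all be present in the graph.
    # An empty set means the skill always applies.
    applies_when: frozenset = field(default_factory=frozenset)


def active_skills(skills, graph_node_types):
    """Return names of skills whose required node types exist in the graph."""
    present = set(graph_node_types)
    return [s.name for s in skills if s.applies_when <= present]


skills = [
    Skill("cypher_query"),
    Skill("read_code_source", frozenset({"Function", "Module"})),
]

legal_graph_types = {"Law", "Decision", "Judge"}             # hypothetical types
code_graph_types = {"Function", "Class", "Module", "Route"}  # from code_tree

active_for_legal = active_skills(skills, legal_graph_types)
active_for_code = active_skills(skills, code_graph_types)
print(active_for_legal, active_for_code)
```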

Bundled datasets

Three wrappers turn well-known public sources into queryable graphs without writing a loader. Each handles the fetch + build + cache cycle, returns a KnowledgeGraph you can cypher() against, and respects a per-dataset cooldown so re-running just reloads the cached graph in seconds. KGLite is independent of the upstream organisations — see each module docstring for non-affiliation notes. Datasets guide.

SEC EDGAR

US-public-company knowledge graph from the SEC's free public data — all 14M historical filings + per-filing payload parsing for Form 4 (insider transactions), 13F-HR (institutional holdings), SC 13D (activist stakes), DEF 14A (board composition), XBRL company facts (financial metrics), 10-K Exhibit 21 (subsidiaries), 8-K cover pages (material event Item codes):

from kglite.datasets.sec import SEC

# SEC.fetch — name the forms, the companies, a span; get a graph back.
g = SEC.fetch("/data/sec", ["4", "8-K", "DEF 14A"], ["AAPL", "TSLA"],
              years=2, user_agent="Your Name your@email.com")

# SEC.open — full control: separate filing-index vs. payload spans,
# storage mode, and the include_* flags (XBRL financials, Exhibit 21
# subsidiaries).
g = SEC.open("/data/sec", years=10, detailed=2,
             user_agent="Your Name your@email.com")

# Full universe — drop `companies`; auto-escalates to mode="disk".
g = SEC.open("/data/sec", years="all", detailed=5,
             user_agent="Your Name your@email.com")

More than two dozen typed node types (Company, Person, Filing, InsiderTransaction, Holding, InstitutionalHolding, CorporateEvent, Compensation, Role, MetricFact, Subsidiary and more) are wired by typed edges, and every fact node traces back to its source filing. The cache has three tiers, raw / processed / graph/{mode}: raw is immutable, processed regenerates only when its raw source changes, and graph/{mode}/ is reused on reopen unless force_rebuild=True. SEC's 10 req/s fair-access policy is enforced by an internal token-bucket rate limiter; the user_agent arg is mandatory (SEC returns 403 without it).
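
For readers unfamiliar with the technique, a token bucket can be sketched as follows. This is a generic, minimal version written for illustration, not KGLite's internal limiter; the class names and the injectable clock are assumptions made so the behaviour can be shown without real waiting.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available; return False when the caller must wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class FakeClock:
    """Manually advanced clock, used here instead of sleeping."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
bucket = TokenBucket(rate=10, capacity=10, clock=clock)

burst = [bucket.try_acquire() for _ in range(11)]      # 11th request is refused
clock.now += 0.5                                       # half a second refills 5 tokens
after_wait = [bucket.try_acquire() for _ in range(6)]  # 6th request is refused
print(burst.count(True), after_wait.count(True))
```

A real client would sleep and retry when try_acquire() returns False rather than dropping the request.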

Source data is public domain (US Govt work) — redistribute the built .kgl however you like. SEC guide.

Wikidata

Single-stream latest-truthy.nt.bz2 from dumps.wikimedia.org — parallel-decoded with a bit-level block scanner, parsed, built into a queryable graph in one call:

from kglite.datasets import wikidata

g = wikidata.open("/data/wd")                                    # full graph
g = wikidata.open("/data/wd", entity_limit_millions=100)         # 100M slice
g = wikidata.open("/data/wd", storage="memory",                  # in-memory, fast tests
                  entity_limit_millions=10)
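
To make the input format concrete, here is a small sequential sketch that reads bz2-compressed N-Triples with the standard library. It is a simple stand-in for illustration only: it does not reproduce the parallel, bit-level block scanner described above, and the iter_triples helper and regular expression are assumptions of this example.

```python
import bz2
import io
import re

# Subject and predicate are IRIs in angle brackets; the object is kept verbatim
# (IRI or literal) up to the terminating " ." of the line.
TRIPLE_RE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*$")


def iter_triples(stream):
    """Yield (subject, predicate, object) from a binary N-Triples stream."""
    for raw in stream:
        line = raw.decode("utf-8").strip()
        if not line or line.startswith("#"):
            continue
        match = TRIPLE_RE.match(line)
        if match:
            yield match.group(1), match.group(2), match.group(3)


sample = (
    b"<http://www.wikidata.org/entity/Q42> "
    b"<http://www.wikidata.org/prop/direct/P31> "
    b"<http://www.wikidata.org/entity/Q5> .\n"
    b"<http://www.wikidata.org/entity/Q42> "
    b"<http://www.w3.org/2000/01/rdf-schema#label> "
    b'"Douglas Adams"@en .\n'
)

# Compress the sample in memory so the example needs no download.
compressed = bz2.compress(sample)
with bz2.open(io.BytesIO(compressed), "rb") as fh:
    triples = list(iter_triples(fh))

for triple in triples:
    print(triple)
```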

Sodir (Norwegian Offshore Directorate)

Petroleum-domain graph from the public ArcGIS REST FeatureServer at factmaps.sodir.no — 33 baseline node types (Field, Wellbore, Discovery, Licence, Stratigraphy, …), ~480 k nodes, parallel-fetched and built in seconds:

from kglite.datasets import sodir

g = sodir.open("/data/sodir")  # in-memory by default; ~30s first run
g = sodir.open("/data/sodir", complement_blueprint="my_extras.json")  # extend baseline

Two-tier cooldown — cheap row-count probes every 14 days; full per-dataset re-fetch every 30 days. Add a complement blueprint to extend the baseline (new node types, custom edges) without touching the canonical schema.
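
The two-tier cooldown can be read as a small decision rule. The sketch below uses the 14-day and 30-day intervals stated above; the function name, the returned labels, and the inclusive boundary handling are assumptions for illustration, not the sodir module's actual code.

```python
from datetime import datetime, timedelta

PROBE_INTERVAL = timedelta(days=14)         # cheap row-count probe
FULL_REFETCH_INTERVAL = timedelta(days=30)  # full per-dataset re-fetch


def refresh_action(last_probe, last_full_fetch, now):
    """Decide what to do when a cached dataset is reopened."""
    if now - last_full_fetch >= FULL_REFETCH_INTERVAL:
        return "full_refetch"
    if now - last_probe >= PROBE_INTERVAL:
        return "probe_row_counts"
    return "use_cache"


fetched = datetime(2025, 1, 1)
day3 = refresh_action(fetched, fetched, fetched + timedelta(days=3))
day20 = refresh_action(fetched, fetched, fetched + timedelta(days=20))
day45 = refresh_action(fetched, fetched, fetched + timedelta(days=45))
print(day3, day20, day45)
```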

Recipes

Short patterns for the most-common shapes. Each is self-contained.

Hybrid semantic + structural retrieval

Combine vector similarity (text_score()) with Cypher pattern matching in one query:

graph.cypher("""
    MATCH (c:Chunk)-[:IN_DOC]->(d:Document)
    RETURN c.text, d.title,
           text_score(c.embedding, $query_vec) AS score
    ORDER BY score DESC LIMIT 5
""", params={"query_vec": query_embedding})

Vector embeddings via pip install 'kglite[embed]' (adds fastembed + onnxruntime). Semantic Search guide.
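
Conceptually, the query above scores every chunk against the query vector, joins each chunk to its document, and keeps the best matches. The plain-Python sketch below shows that shape with toy two-dimensional vectors. It assumes cosine similarity as the scoring function, which this README does not confirm for text_score(), and the sample chunks and titles are invented.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy data: each chunk points at its document, like (c:Chunk)-[:IN_DOC]->(d:Document).
documents = {"d1": "Tenancy Act", "d2": "Petroleum Act"}
chunks = [
    {"text": "tenant must give notice", "doc": "d1", "embedding": [1.0, 0.0]},
    {"text": "landlord repairs", "doc": "d1", "embedding": [0.6, 0.8]},
    {"text": "offshore licence terms", "doc": "d2", "embedding": [0.0, 1.0]},
]


def top_chunks(query_vec, limit=5):
    """Score each chunk, attach its document title, return the best `limit`."""
    scored = [
        (c["text"], documents[c["doc"]], cosine(c["embedding"], query_vec))
        for c in chunks
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)[:limit]


results = top_chunks([1.0, 0.0], limit=2)
for text, title, score in results:
    print(f"{score:.2f}  {title}: {text}")
```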

Structural validators — surface data-integrity gaps

Fourteen built-in CALL procedures find the gaps that aren't visible from normal queries: orphan nodes, missing-required-edge violations, two-step cycles, duplicate titles, parallel edges, cardinality violations, and more. They compose with the rest of Cypher.

# Wellbores in our sodir graph that lack a production licence
graph.cypher("""
    CALL missing_required_edge({type: 'Wellbore', edge: 'IN_LICENCE'}) YIELD node
    RETURN node.id, node.title
""")

missing_required_edge and missing_inbound_edge validate the (type, edge) direction against the graph's actual schema and refuse to execute when misused. → Full procedure list in the Cypher reference.
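
What such a check computes can be shown with plain sets. The function below is an illustrative re-implementation, not the built-in procedure: it assumes the check looks at outgoing edges (the separate missing_inbound_edge procedure suggests so), and the node ids are invented sample data.

```python
def missing_required_edge(nodes, edges, node_type, edge_type):
    """Return ids of `node_type` nodes that have no outgoing `edge_type` edge."""
    has_edge = {src for src, etype, _tgt in edges if etype == edge_type}
    return sorted(
        node_id
        for node_id, ntype in nodes.items()
        if ntype == node_type and node_id not in has_edge
    )


# node id -> node type
nodes = {"w1": "Wellbore", "w2": "Wellbore", "l1": "Licence"}
# (source id, edge type, target id)
edges = [("w1", "IN_LICENCE", "l1")]

gaps = missing_required_edge(nodes, edges, "Wellbore", "IN_LICENCE")
print(gaps)
```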

Graph algorithms

Shortest path (BFS or Dijkstra), centrality, community detection, clustering — all in Cypher:

graph.cypher("""
    MATCH path = shortestPath((a:User {name:'Alice'})-[*]-(b:User {name:'Eve'}))
    RETURN path
""")
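
For intuition about the BFS option, here is a minimal breadth-first shortest-path search in plain Python over the KNOWS edges from the Quick Start. It is a textbook sketch for unweighted, undirected graphs, not KGLite's implementation.

```python
from collections import deque


def bfs_shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Undirected adjacency built from the Quick Start edges.
knows = [("alice", "bob"), ("bob", "eve")]
adjacency = {}
for src, tgt in knows:
    adjacency.setdefault(src, []).append(tgt)
    adjacency.setdefault(tgt, []).append(src)

path = bfs_shortest_path(adjacency, "alice", "eve")
print(path)
```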

Graph algorithms guide · Traversal patterns · Recipes index.

Examples

The examples/ directory has runnable, self-contained artifacts:

  • codebase_to_claude_mcp.ipynb — clone an open-source repo, parse it into a code knowledge graph, register a workspace MCP server in Claude Desktop.
  • sec_to_claude_mcp.ipynb — build a graph of SEC filings with SEC.fetch, query it, register it as a Claude Desktop MCP server.
  • open_source_workspace_mcp.yaml — annotated workspace-mode manifest for the github-clone-tracker pattern. Walked through in the workspace manifest example.
  • legal_graph.py — end-to-end add_nodes / add_connections from pandas DataFrames, covering laws, regulations, court decisions with citation edges.
  • code_graph.py — build a code knowledge graph from a source directory via code_tree.build.
  • spatial_graph.py — declarative CSV→graph loading via a JSON blueprint; lat/lon coordinates and pipeline-path traversal queries.
  • crates/kglite-mcp-server/ — Rust-native single-binary MCP server (built on rmcp + the mcp-methods framework). Reach for it when the manifest doesn't express what you need; the binary is the reference for layering domain-specific tools on top of the generic surface.

Benchmarks

KGLite builds and queries Wikidata-scale graphs on a laptop. Measured with bench/wiki_benchmark.py on an M-series MacBook.

Ingest — full pipeline from compressed N-Triples to a queryable graph:

dataset | triples | nodes | edges | ingest | throughput | peak RAM
wiki100m | 100 M | 938 K | 748 K | 29 s | 3.4 M triples/s | 1.3 GB
wiki500m | 500 M | 5.6 M | 6.7 M | 157 s | 3.2 M triples/s | 5.2 GB
wiki1000m | 1 B | 14.7 M | 15.4 M | 395 s | 2.5 M triples/s | 7.0 GB

Reloading a saved 1 B-triple graph from disk (7 GB on-disk): 3.5 s.
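
The throughput figures for the ingest runs follow directly from triples divided by ingest time. The short check below recomputes them from the reported numbers; it reads nothing from the benchmark script itself.

```python
# (triples, ingest seconds) as reported for each benchmark run
runs = {
    "wiki100m": (100_000_000, 29),
    "wiki500m": (500_000_000, 157),
    "wiki1000m": (1_000_000_000, 395),
}

# Millions of triples per second, rounded to one decimal place.
throughput = {
    name: round(triples / seconds / 1e6, 1)
    for name, (triples, seconds) in runs.items()
}

for name, value in throughput.items():
    print(f"{name}: {value} M triples/s")
```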

Query latency on the 1 B-triple graph (mapped storage):

Cypher | wall
MATCH (n)-[:P31]->(:human) RETURN count(n) (typed aggregation) | 0.5 ms
MATCH (a)-[:P31]->(b)-[:P279]->(c) LIMIT 10 (2-hop typed) | 0.9 ms
MATCH (a)-[:P31]->(b {nid:'Q64'}) RETURN a LIMIT 20 (pivot) | 1 ms
MATCH (a)-[:P31]->(:human) MATCH (a)-[:P27]->(c) LIMIT 10 (join) | 44 ms

Disk and mapped storage build at the same speed; mapped wins on small-result queries (in-memory inverted index), disk wins on unbounded typed traversals (sorted-CSR mmap I/O). No server, no tuning, same Python process as your code.

Key Features

Quick reference. Each links into the appropriate guide.

Feature | Description
Cypher | MATCH, CREATE, SET, DELETE, MERGE, UNION/INTERSECT/EXCEPT, aggregations (incl. median, percentile_cont, variance), reduce(), ORDER BY, LIMIT, SKIP
Semantic search | Vector embeddings + text_score() for similarity ranking. Opt-in via pip install 'kglite[embed]'.
Text predicates | text_edit_distance, text_normalize, text_jaccard, text_ngrams, text_contains_any / text_starts_with_any
Graph algorithms | Shortest path (BFS or Dijkstra), centrality, community detection, clustering
Structural validators | 14 CALL procedures: orphan_node, missing_required_edge, cycle_2step, inverse_violation, cardinality_violation, parallel_edges, null_property, more; agent-discoverable integrity checks composable with Cypher
Spatial | Coordinates, WKT geometry, distance + containment, geometry primitives (geom_buffer, geom_convex_hull, geom_union/intersection/difference, geom_is_valid, geom_length), kg_knn k-nearest-neighbour
Timeseries | Time-indexed data with ts_*() Cypher functions
Bulk loading | add_nodes / add_connections for DataFrames
Blueprints | Declarative CSV-to-graph loading via JSON config
Import/Export | Save/load snapshots (.kgl), GraphML, CSV export
AI integration | describe() introspection, MCP server, agent prompts
Code analysis | 14-language tree-sitter parser (kglite.code_tree): functions, classes, calls, imports, web-framework routes

Documentation

Full docs at kglite.readthedocs.io:

Getting started

Querying

Loading data

Domain features

  • Spatial — WKT geometry, lat/lon, k-nearest-neighbour
  • Timeseries — time-indexed values, ts_*() functions

Agent integration

Reference

Requirements

Python 3.10+ (CPython) | macOS (ARM), Linux (x86_64/aarch64), Windows (x86_64) | pandas >= 1.5

License

MIT — see LICENSE for details.
