
A high-performance graph database library with Python bindings written in Rust


KGLite: Knowledge graph for Python, built for LLM agents


KGLite is an embedded, Cypher-queryable knowledge graph for Python, built so you can hand it to an LLM agent. pip install kglite and point kglite.code_tree.build(".") at any source directory to get your first queryable graph in seconds. It ships with a bundled MCP server, a describe() method that emits a system-prompt-shaped schema, and structural validators that compose with Cypher.

🚀 See it end-to-end: codebase → Claude in ~50 lines

examples/codebase_to_claude_mcp.ipynb clones a GitHub repo, parses it into a code knowledge graph, runs a few Cypher queries, then registers a workspace MCP server in Claude Desktop. It closes with a screenshot of Claude calling repo_management → graph_overview → cypher_query against the live graph.

🏦 Or: every US public company, queryable in one call

from kglite.datasets.sec import SEC
g = SEC.open("./sec", years=10, detailed=2,
             user_agent="Your Name your@email.com")

Pulls SEC EDGAR's full filing index (~14M filings since 1993) plus 2 years of deep payloads: Form 4 insider transactions, 13F institutional holdings, SC 13D activist stakes, DEF 14A board composition, FSNDS XBRL financials, Exhibit 21 subsidiaries, 8-K Item codes. 11 node types, 15 edge types, queryable with Cypher. Public-domain data (US Govt work). Scope with cik_list=[...] for an S&P-500-sized graph in ~10 minutes. → SEC guide.

Use cases

KGLite is shape-agnostic: the agent-facing surface is the same whether the graph holds your legal precedents, a Wikidata slice, your SQL warehouse, a RAG corpus, or a parsed codebase.

  • 🏦 SEC EDGAR in one call. SEC.open(path, years=10, detailed=2, user_agent="...") builds a US-public-company knowledge graph from the SEC's free data: companies, filings, insider transactions (Form 4), institutional holdings (13F), XBRL financial metrics, activist stakes (SC 13D), board composition (DEF 14A), subsidiary trees (Exhibit 21), 8-K material events. 11 node types, 15 edge types, three-tier raw / processed / graph cache that never re-fetches. Scope with cik_list=[...] for an S&P-500 graph in ~10 minutes. → SEC guide.
  • 🏛️ Domain knowledge for agents. Legal precedents + citations, regulatory rules, medical ontologies, manufacturing BOMs, scientific catalogues: anything with structure becomes a queryable graph an MCP-capable agent can reason over. See the legal-graph example for a Norwegian-Supreme-Court walk-through (laws + decisions + citation edges + judge metadata).
  • 📊 Business data → queryable graph. Any tabular source (SQL, CSV, Parquet, REST API responses, pandas DataFrames) goes straight in via add_nodes(df, ...) and add_connections(df, ...). Layer a graph on top of your warehouse and the agent reasons over the relationships without you writing a server. → Data Loading guide.
  • 🌐 Public datasets, one line. wikidata.open(path) and sodir.open(path) handle the fetch + build + cache cycle. Run Cypher queries on a billion-edge Wikidata graph from a 16 GB laptop; mapped/disk storage means you can operate and query datasets that won't fit in RAM. → See Bundled datasets below.
  • 📚 RAG with structure. Documents, chunks, entities, and the edges between them in one graph. Combine text_score() vector similarity with Cypher traversal ("find court cases semantically similar to my fact pattern, then walk one hop to related precedents") for hybrid retrieval in one query, no second vector DB. → Semantic Search guide.
  • 📂 Codebase analysis. kglite.code_tree.build(".") parses 13 languages into Function / Class / Module / Route nodes with web-framework route detection (Flask, FastAPI, Django). See the notebook above for the full code → Claude Desktop workflow. → Code analysis guide.

Why Cypher?

A question every investor asks: which insiders are selling, and at what price? Against raw SEC XML you parse thousands of Form 4 documents, join on issuer CIK, and filter by transaction code. Against a graph it's one query:

-- Insider sells at Apple (CIK 320193), most recent first
MATCH (c:Company {cik: 320193})-[:HAS_INSIDER]->(p:Person)
      <-[:OF_PERSON]-(t:Transaction {transaction_code: 'S'})
RETURN p.display_name, t.transaction_date, t.shares, t.price_per_share
ORDER BY t.transaction_date DESC LIMIT 10

Three node types (Company, Person, Transaction), two edge types (HAS_INSIDER, OF_PERSON), pattern-matched and joined in one expression. The same shape composes into harder questions: swap :HAS_INSIDER for :HOLDS and you're walking institutional positions; add :SERVES_ON_BOARD and you're checking who's an insider AND a director. Cypher pays off most when the data has real structure and your questions traverse it.
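For contrast, here is a rough sketch of the same join written by hand in plain Python. The records, field names, and the insider_sells helper are invented for illustration and are not KGLite's data model; the point is only that the join, filter, sort, and limit steps each have to be spelled out.

```python
# Toy records standing in for parsed Form 4 data (all values are made up).
insiders = [  # (company_cik, person_id) pairs, playing the role of HAS_INSIDER edges
    (320193, "p1"),
    (320193, "p2"),
]
people = {"p1": "Insider One", "p2": "Insider Two"}
transactions = [  # each row points at a person, playing the role of an OF_PERSON edge
    {"person": "p1", "code": "S", "date": "2024-02-01", "shares": 100, "price": 180.0},
    {"person": "p2", "code": "P", "date": "2024-03-01", "shares": 50, "price": 170.0},
    {"person": "p2", "code": "S", "date": "2024-04-01", "shares": 75, "price": 175.0},
]


def insider_sells(cik, limit=10):
    """Hand-written equivalent of the pattern: join, filter on 'S', sort, limit."""
    insider_ids = {pid for company, pid in insiders if company == cik}
    rows = [
        (people[t["person"]], t["date"], t["shares"], t["price"])
        for t in transactions
        if t["person"] in insider_ids and t["code"] == "S"
    ]
    rows.sort(key=lambda r: r[1], reverse=True)  # ISO dates sort lexicographically
    return rows[:limit]


sells = insider_sells(320193)
```

Every additional hop (holdings, board seats) would add another lookup table and another comprehension here, which is the cost the pattern syntax removes.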

How it compares

| | KGLite | Kuzu | NetworkX | rustworkx | Neo4j Embedded |
|---|---|---|---|---|---|
| Install | pip install kglite | pip install kuzu | pip install networkx | pip install rustworkx | JVM + Java deps |
| Query language | Cypher (subset) | Cypher (full) | Python API | Python API | Cypher (full) |
| Storage | in-mem · mmap · disk (1B+ edges) | in-mem · disk (columnar) | in-mem | in-mem | in-mem · disk (JVM) |
| Bulk-load from pandas | one-liner | via Arrow | manual | manual | via driver |
| Bundled MCP server for LLM agents | ✅ | - | - | - | - |
| describe() schema for LLM prompts | ✅ | - | - | - | - |
| Codebase → graph parser | 13 languages, route detection | - | - | - | - |
| Bundled public datasets | SEC EDGAR, Wikidata, Sodir | - | toy graphs only | - | - |
| License | MIT | MIT | BSD-3 | Apache-2 | GPLv3 |

Pick KGLite when you want Cypher + Python ergonomics + LLM-agent plumbing in one wheel. Pick Kuzu for full openCypher coverage and analytical OLAP throughput. Pick NetworkX when you need its enormous graph-algorithm library and your data fits in RAM. Pick rustworkx when you want NetworkX's API in Rust with no query language. Pick Neo4j Embedded when you've standardised on server-mode Cypher and want the in-process driver for tests.

Quick Start

pip install kglite

import pandas as pd
import kglite

# Three storage modes, pick by graph size:
#   default (in-memory)     - small/medium graphs, fastest queries
#   storage="mapped"        - mmap columns, RAM-friendly as you grow
#   storage="disk", path=…  - 100M+ nodes, Wikidata-scale, loaded lazily
graph = kglite.KnowledgeGraph()

# Bulk-load nodes from a DataFrame.
people = pd.DataFrame({
    "id":   ["alice", "bob", "eve"],
    "name": ["Alice", "Bob", "Eve"],
    "age":  [28, 35, 41],
    "city": ["Oslo", "Bergen", "Trondheim"],
})
graph.add_nodes(people, node_type="Person", unique_id_field="id", node_title_field="name")

# Bulk-load relationships the same way.
knows = pd.DataFrame({"src": ["alice", "bob"], "tgt": ["bob", "eve"]})
graph.add_connections(knows, connection_type="KNOWS",
                      source_type="Person", source_id_field="src",
                      target_type="Person", target_id_field="tgt")

# Query: returns a ResultView (lazy; data stays in Rust until accessed).
for row in graph.cypher("""
    MATCH (p:Person) WHERE p.age > 30
    RETURN p.name AS name, p.city AS city
    ORDER BY p.age DESC
"""):
    print(row['name'], row['city'])

# Or get a pandas DataFrame directly.
df = graph.cypher("MATCH (p:Person) RETURN p.name, p.age ORDER BY p.age", to_df=True)

# Persist to disk and reload.
graph.save("my_graph.kgl")
loaded = kglite.load("my_graph.kgl")

→ Getting Started guide · Cypher reference · API reference.

Serve it to an agent

Three levels of effort, three levels of capability.

1. One command: any .kgl becomes an MCP server

kglite-mcp-server --graph path/to/graph.kgl

The server exposes cypher_query, graph_overview, schema introspection, structural validators, and source-file tools over MCP stdio. Drop it into Claude Desktop / Cursor / any MCP-capable client and your graph is queryable. Works on every graph kglite can build: your own, Wikidata, Sodir, code-tree.

2. Customise with a YAML manifest

Drop <basename>_mcp.yaml next to the graph (e.g. wikidata_mcp.yaml beside wikidata.kgl) and the server auto-loads it at boot.

name: Wikidata Explorer
source_root: /path/to/related/source        # exposes read/grep/list
extensions:
  embedder: { kind: fastembed, model: bge-small }   # enables text_score()
  csv_http_server: true                              # bulk CSV exports
tools:                                               # inline parameterised Cypher
  - name: who_invented
    cypher: |
      MATCH (i:Q5)-[:P61]->(t {label:$thing})
      RETURN i.label LIMIT 5

No fork required for most customisation. → MCP server guide.
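The naming convention is easy to get wrong by one character, so a small helper can compute where the manifest should live. This is a sketch that assumes <basename> means the graph file name without its .kgl extension, as the wikidata example suggests; manifest_path_for is a hypothetical helper, not part of kglite.

```python
from pathlib import Path


def manifest_path_for(graph_path):
    """Return the sibling '<basename>_mcp.yaml' path for a .kgl graph file.

    Assumption: <basename> is the file name minus its extension.
    """
    p = Path(graph_path)
    return p.with_name(f"{p.stem}_mcp.yaml")


example = manifest_path_for("data/wikidata.kgl")
```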

3. Teach the agent with bundled skills

Markdown skill files (<basename>.skills/*.md) ship methodology for each tool. The agent reads cypher_query.md at session start to learn your schema conventions, read_code_source.md to know when to drill into source vs. query the graph, etc. Three layers compose: kglite-bundled defaults + your project's .skills/ overrides + operator-declared domain packs. Skills with applies_when: predicates only activate when the graph contains the relevant node types, so a non-code graph never sees read_code_source methodology.

Net effect: the agent comes pre-loaded with how to use your graph, rather than discovering it through trial-and-error. → AI Agents guide.
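One way to picture the layering is as a dictionary merge followed by a predicate filter. The sketch below is a guess at the shape, not the server's actual logic: it assumes later layers override earlier ones by skill name, and that applies_when is a list of node types that must all be present in the graph. The data structures and resolve_skills are hypothetical.

```python
def resolve_skills(layers, graph_node_types):
    """Merge skill layers, then keep only skills whose applies_when is satisfied.

    Assumptions (not confirmed by the README): later layers win on name
    clashes, and applies_when lists node types that must all exist.
    """
    merged = {}
    for layer in layers:
        merged.update(layer)  # later layers override earlier ones by name
    present = set(graph_node_types)
    return {
        name: skill["body"]
        for name, skill in merged.items()
        if set(skill.get("applies_when", ())) <= present
    }


bundled = {
    "cypher_query": {"body": "bundled default"},
    "read_code_source": {"body": "code methodology", "applies_when": ["Function"]},
}
project = {"cypher_query": {"body": "project override"}}

non_code = resolve_skills([bundled, project], {"Person"})
code = resolve_skills([bundled, project], {"Function", "Class"})
```

With these inputs, the non-code graph only receives the overridden cypher_query skill, while the code graph also receives read_code_source.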

Bundled datasets

Three wrappers turn well-known public sources into queryable graphs without writing a loader. Each handles the fetch + build + cache cycle, returns a KnowledgeGraph you can cypher() against, and respects a per-dataset cooldown so re-running just reloads the cached graph in seconds. KGLite is independent of the upstream organisations; see each module docstring for non-affiliation notes. → Datasets guide.

SEC EDGAR

US-public-company knowledge graph from the SEC's free public data: all 14M historical filings plus per-filing payload parsing for Form 4 (insider transactions), 13F-HR (institutional holdings), SC 13D (activist stakes), DEF 14A (board composition), FSNDS NUM.tsv (XBRL financial metrics), 10-K Exhibit 21 (subsidiaries), 8-K cover pages (material event Item codes):

from kglite.datasets.sec import SEC

# Default config: 10yr filing index + 2yr deep payload at mode="mapped"
g = SEC.open("/data/sec", user_agent="Your Name your@email.com")

# Watchlist scope: 500 CIKs build in ~10 minutes
g = SEC.open("/data/sec", cik_list=[320193, 789019, ...],
             user_agent="Your Name your@email.com")

# Full universe: auto-escalates to mode="disk" at predicted >16 GB
g = SEC.open("/data/sec", years="all", detailed=5,
             user_agent="Your Name your@email.com")

11 node types (Company, Filing, Person, Transaction, InstitutionalManager, Security, Subsidiary, MetricFact, Event, Stake, Director), 15 edge types. Three-tier raw / processed / graph/{mode} cache: raw is immutable, processed regenerates only when its raw source changes, graph/{mode}/ reuses on reopen unless force_rebuild=True. SEC's 10 req/s fair-access policy is enforced by an internal token-bucket rate limiter; the user_agent arg is mandatory (SEC returns 403 without it).
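The README does not show the limiter itself, so the following is a generic token-bucket sketch rather than KGLite's implementation. The TokenBucket class, its try_acquire method, and the injectable clock are all illustrative names; the only figure taken from the text is the 10 requests per second rate.

```python
import time


class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behaviour can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_acquire(self):
        """Take one token if available; return False when the caller must wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# 10 requests per second, with bursts of at most 10.
sec_limiter = TokenBucket(rate=10, capacity=10)
```

A fetch loop would call try_acquire() before each request and sleep briefly when it returns False, which keeps the sustained rate at or below the configured limit.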

Source data is public domain (US Govt work), so you can redistribute the built .kgl however you like. → SEC guide.

Wikidata

Single-stream latest-truthy.nt.bz2 from dumps.wikimedia.org, parallel-decoded with a bit-level block scanner, parsed, and built into a queryable graph in one call:

from kglite.datasets import wikidata

g = wikidata.open("/data/wd")                                    # full graph
g = wikidata.open("/data/wd", entity_limit_millions=100)         # 100M slice
g = wikidata.open("/data/wd", storage="memory",                  # in-memory, fast tests
                  entity_limit_millions=10)

Sodir (Norwegian Offshore Directorate)

Petroleum-domain graph from the public ArcGIS REST FeatureServer at factmaps.sodir.no: 33 baseline node types (Field, Wellbore, Discovery, Licence, Stratigraphy, …), ~480 k nodes, parallel-fetched and built in seconds:

from kglite.datasets import sodir

g = sodir.open("/data/sodir")  # in-memory by default; ~30s first run
g = sodir.open("/data/sodir", complement_blueprint="my_extras.json")  # extend baseline

Two-tier cooldown: cheap row-count probes every 14 days; full per-dataset re-fetch every 30 days. Add a complement blueprint to extend the baseline (new node types, custom edges) without touching the canonical schema.
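The two-tier cooldown can be read as a small decision function. The sketch below uses the 14-day and 30-day intervals from the text; everything else (the cooldown_action name, the return labels, treating the boundaries as inclusive) is an assumption, and what the wrapper does when a probe detects changed row counts is not modelled.

```python
from datetime import datetime, timedelta

PROBE_EVERY = timedelta(days=14)    # cheap row-count probe interval
REFETCH_EVERY = timedelta(days=30)  # full per-dataset re-fetch interval


def cooldown_action(last_probe, last_fetch, now):
    """Decide what to do on open: 'refetch', 'probe', or 'use_cache'."""
    if now - last_fetch >= REFETCH_EVERY:
        return "refetch"
    if now - last_probe >= PROBE_EVERY:
        return "probe"
    return "use_cache"


now = datetime(2025, 3, 1)
fresh = cooldown_action(now - timedelta(days=2), now - timedelta(days=2), now)
stale_probe = cooldown_action(now - timedelta(days=15), now - timedelta(days=15), now)
stale_fetch = cooldown_action(now - timedelta(days=1), now - timedelta(days=31), now)
```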

Recipes

Short patterns for the most-common shapes. Each is self-contained.

Hybrid semantic + structural retrieval

Combine vector similarity (text_score()) with Cypher pattern matching in one query:

graph.cypher("""
    MATCH (c:Chunk)-[:IN_DOC]->(d:Document)
    RETURN c.text, d.title,
           text_score(c.embedding, $query_vec) AS score
    ORDER BY score DESC LIMIT 5
""", params={"query_vec": query_embedding})

Vector embeddings via pip install 'kglite[embed]' (adds fastembed + onnxruntime). → Semantic Search guide.
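To make the ranking step concrete, here is a pure-Python sketch of scoring chunks against a query vector and keeping the best few. The README does not say which similarity metric text_score() uses; cosine similarity is assumed here purely for illustration, and the chunk data and top_chunks helper are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy chunks with 2-dimensional embeddings (real embeddings are much wider).
chunks = [
    {"text": "chunk a", "embedding": [1.0, 0.0]},
    {"text": "chunk b", "embedding": [0.0, 1.0]},
    {"text": "chunk c", "embedding": [1.0, 1.0]},
]


def top_chunks(query_vec, k=5):
    """Mirror of 'ORDER BY score DESC LIMIT k' over the toy chunks."""
    scored = [(cosine(c["embedding"], query_vec), c["text"]) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


ranked = top_chunks([1.0, 0.0], k=2)
```

In the graph version the MATCH clause narrows the candidate set structurally before scoring, which this flat list does not capture.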

Structural validators: surface data-integrity gaps

Fourteen built-in CALL procedures find the gaps that aren't visible from normal queries: orphan nodes, missing-required-edge violations, two-step cycles, duplicate titles, parallel edges, cardinality violations, more. They compose with the rest of Cypher.

# Wellbores in our sodir graph that lack a production licence
graph.cypher("""
    CALL missing_required_edge({type: 'Wellbore', edge: 'IN_LICENCE'}) YIELD node
    RETURN node.id, node.title
""")

missing_required_edge and missing_inbound_edge validate the (type, edge) direction against the graph's actual schema and refuse to execute when misused. → Full procedure list in the Cypher reference.
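Conceptually, a missing-required-edge check is a set difference: all nodes of a type, minus those that are the source of at least one edge of the required type. The sketch below shows that idea on toy data; nodes_missing_edge is a hypothetical helper, the outgoing direction is an assumption, and none of this reflects how the built-in procedure is implemented.

```python
# Toy graph: node id -> node type, and (source, edge_type, target) triples.
nodes = {"w1": "Wellbore", "w2": "Wellbore", "l1": "Licence"}
edges = [("w1", "IN_LICENCE", "l1")]


def nodes_missing_edge(nodes, edges, node_type, edge_type):
    """Return ids of `node_type` nodes with no outgoing `edge_type` edge."""
    have_edge = {src for src, etype, _ in edges if etype == edge_type}
    return sorted(
        node_id
        for node_id, ntype in nodes.items()
        if ntype == node_type and node_id not in have_edge
    )


unlicensed = nodes_missing_edge(nodes, edges, "Wellbore", "IN_LICENCE")
```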

Graph algorithms

Shortest path (BFS or Dijkstra), centrality, community detection, clustering, all in Cypher:

graph.cypher("""
    MATCH path = shortestPath((a:User {name:'Alice'})-[*]-(b:User {name:'Eve'}))
    RETURN path
""")
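
For readers who want to see what an unweighted shortest-path search does, here is a minimal breadth-first search in plain Python over the KNOWS edges from the Quick Start. It treats edges as undirected, matching the -[*]- pattern above; bfs_shortest_path is an illustrative helper, not KGLite code.

```python
from collections import deque


def bfs_shortest_path(adjacency, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = current
            if neighbour == goal:
                # Walk the predecessor chain back to the start.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


# Undirected adjacency built from the Quick Start KNOWS edges.
knows = [("alice", "bob"), ("bob", "eve")]
adjacency = {}
for src, tgt in knows:
    adjacency.setdefault(src, []).append(tgt)
    adjacency.setdefault(tgt, []).append(src)

path = bfs_shortest_path(adjacency, "alice", "eve")
```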

→ Graph algorithms guide · Traversal patterns · Recipes index.

Examples

The examples/ directory has runnable, self-contained artifacts:

  • codebase_to_claude_mcp.ipynb: clone a famous open-source repo, parse it into a code knowledge graph, register a workspace MCP server in Claude Desktop. End-to-end in ~50 lines.
  • open_source_workspace_mcp.yaml: annotated workspace-mode manifest for the github-clone-tracker pattern. Walked through in the workspace manifest example.
  • legal_graph.py: end-to-end add_nodes / add_connections from pandas DataFrames, covering laws, regulations, court decisions with citation edges.
  • code_graph.py: build a code knowledge graph from a source directory via code_tree.build.
  • spatial_graph.py: declarative CSV→graph loading via a JSON blueprint; lat/lon coordinates and pipeline-path traversal queries.
  • crates/kglite-mcp-server/: Rust-native single-binary MCP server (built on rmcp + the mcp-methods framework). Reach for it when the manifest doesn't express what you need; the binary is the reference for layering domain-specific tools on top of the generic surface.

Benchmarks

KGLite builds and queries Wikidata-scale graphs on a laptop. Measured with bench/wiki_benchmark.py on an M-series MacBook.

Ingest: full pipeline from compressed N-Triples to a queryable graph:

| dataset | triples | nodes | edges | ingest | throughput | peak RAM |
|---|---|---|---|---|---|---|
| wiki100m | 100 M | 938 K | 748 K | 29 s | 3.4 M triples/s | 1.3 GB |
| wiki500m | 500 M | 5.6 M | 6.7 M | 157 s | 3.2 M triples/s | 5.2 GB |
| wiki1000m | 1 B | 14.7 M | 15.4 M | 395 s | 2.5 M triples/s | 7.0 GB |
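
The throughput column is simply triples divided by ingest time. A quick arithmetic check, using only the figures from the ingest table, reproduces the published values to one decimal place:

```python
# (triples, ingest seconds) per dataset, copied from the ingest table.
runs = {
    "wiki100m": (100e6, 29),
    "wiki500m": (500e6, 157),
    "wiki1000m": (1e9, 395),
}

# Throughput in millions of triples per second, rounded to one decimal.
throughput = {
    name: round(triples / seconds / 1e6, 1)
    for name, (triples, seconds) in runs.items()
}
```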

Reloading a saved 1 B-triple graph from disk (7 GB on-disk): 3.5 s.

Query latency on the 1 B-triple graph (mapped storage):

| Cypher | Wall time |
|---|---|
| MATCH (n)-[:P31]->(:human) RETURN count(n) (typed aggregation) | 0.5 ms |
| MATCH (a)-[:P31]->(b)-[:P279]->(c) LIMIT 10 (2-hop typed) | 0.9 ms |
| MATCH (a)-[:P31]->(b {nid:'Q64'}) RETURN a LIMIT 20 (pivot) | 1 ms |
| MATCH (a)-[:P31]->(:human) MATCH (a)-[:P27]->(c) LIMIT 10 (join) | 44 ms |

Disk and mapped storage build at the same speed; mapped wins on small-result queries (in-memory inverted index), disk wins on unbounded typed traversals (sorted-CSR mmap I/O). No server, no tuning, same Python process as your code.

Key Features

Quick reference. Each links into the appropriate guide.

| Feature | Description |
|---|---|
| Cypher | MATCH, CREATE, SET, DELETE, MERGE, UNION/INTERSECT/EXCEPT, aggregations (incl. median, percentile_cont, variance), reduce(), ORDER BY, LIMIT, SKIP |
| Semantic search | Vector embeddings + text_score() for similarity ranking. Opt-in via pip install 'kglite[embed]'. |
| Text predicates | text_edit_distance, text_normalize, text_jaccard, text_ngrams, text_contains_any / text_starts_with_any |
| Graph algorithms | Shortest path (BFS or Dijkstra), centrality, community detection, clustering |
| Structural validators | 14 CALL procedures: orphan_node, missing_required_edge, cycle_2step, inverse_violation, cardinality_violation, parallel_edges, null_property, more; agent-discoverable integrity checks composable with Cypher |
| Spatial | Coordinates, WKT geometry, distance + containment, geometry primitives (geom_buffer, geom_convex_hull, geom_union/intersection/difference, geom_is_valid, geom_length), kg_knn k-nearest-neighbour |
| Timeseries | Time-indexed data with ts_*() Cypher functions |
| Bulk loading | add_nodes / add_connections for DataFrames |
| Blueprints | Declarative CSV-to-graph loading via JSON config |
| Import/Export | Save/load snapshots (.kgl), GraphML, CSV export |
| AI integration | describe() introspection, MCP server, agent prompts |
| Code analysis | 13-language tree-sitter parser (kglite.code_tree): functions, classes, calls, imports, web-framework routes |

Documentation

Full docs at kglite.readthedocs.io, organised into these sections:

  • Getting started
  • Querying
  • Loading data
  • Domain features
      • Spatial: WKT geometry, lat/lon, k-nearest-neighbour
      • Timeseries: time-indexed values, ts_*() functions
  • Agent integration
  • Reference

Requirements

Python 3.10+ (CPython) | macOS (ARM), Linux (x86_64/aarch64), Windows (x86_64) | pandas >= 1.5

License

MIT. See LICENSE for details.
