Knowledge graph compression engine: parallel-decodable subgraph capsules for fast, storage-efficient KG queries.
KGZip
A compression engine for knowledge graphs. KGZip takes a knowledge graph and splits it into small, independently-loadable pieces called capsules, so that when you ask a question about one part of the graph, you only read that part — not the whole thing. The result is a store that is smaller on disk and lets you query large graphs without loading them entirely into memory.
This README assumes no prior knowledge of knowledge graphs. If you already know the basics, jump to Quickstart or the API reference.
New here? Start with the concepts
What is a knowledge graph?
A knowledge graph (KG) is just data stored as things and the relationships between them.
- A node is a thing: a drug, a disease, a person, a movie.
- An edge is a relationship connecting two nodes: Aspirin treats Headache.
(Aspirin) --treats--> (Headache) --associated_with--> (GeneX)
Here Aspirin, Headache, and GeneX are nodes; treats and associated_with
are edges (also called relations). Each node can also carry attributes
(properties), e.g. Aspirin's { "formula": "C9H8O4" }.
That's the whole idea. Real KGs just have many more nodes and edges (thousands to billions), often describing a domain like medicine, finance, or social networks.
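To make the terms concrete, here is a minimal sketch of the example above as plain Python data. The names `nodes`, `edges`, and `neighbours` are illustrative assumptions, not KGZip's internal representation.

```python
# Illustrative only: a tiny knowledge graph as plain Python data.
# `nodes`, `edges` and `neighbours` are hypothetical names, not KGZip structures.

# Each node has an ID and optional attributes (properties).
nodes = {
    "Aspirin": {"formula": "C9H8O4"},
    "Headache": {},
    "GeneX": {},
}

# Each edge is a (source, relation, target) triple.
edges = [
    ("Aspirin", "treats", "Headache"),
    ("Headache", "associated_with", "GeneX"),
]

def neighbours(node_id):
    """Return the IDs of nodes one outgoing hop from `node_id`."""
    return [dst for src, _rel, dst in edges if src == node_id]

print(neighbours("Aspirin"))
```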
What problem does KGZip solve?
When a graph is large, two things get painful:
- Storage — keeping the whole graph around costs space.
- Querying — to answer "what is near node X?", naive tools load or scan the entire graph, even though you only care about a tiny neighbourhood.
KGZip pre-organises the graph into capsules (clusters of closely-related nodes) and writes a small manifest (an index). A query then loads only the capsules it needs. Think of it like a book with chapters and a table of contents: to read about one topic you open one chapter, not the entire book.
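The table-of-contents idea can be sketched in a few lines. This is a toy model under stated assumptions: the capsule IDs, the `ProteinY` node, and the `capsules_needed` helper are made up for illustration, and the real manifest format may differ.

```python
# Illustrative only: a toy "manifest" that maps node IDs to capsule IDs.
# All names here are hypothetical; KGZip's real manifest format may differ.

capsules = {
    "c0": {"Aspirin", "Headache"},   # one cluster of related nodes
    "c1": {"GeneX", "ProteinY"},     # another cluster
}

# The manifest is the "table of contents": node ID -> capsule ID.
manifest = {node: cid for cid, members in capsules.items() for node in members}

def capsules_needed(seed_ids):
    """Return the capsule IDs that must be read to answer a query on `seed_ids`."""
    return sorted({manifest[n] for n in seed_ids if n in manifest})

print(capsules_needed(["Aspirin"]))  # only one capsule is touched
```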
The golden rule: KGZip is a read replica
Your original graph (in a file, or in a database like Neo4j) is always the source of truth — the "master". KGZip builds a compressed copy from it for fast reads. KGZip never modifies your original data. If the KGZip store is ever lost or corrupted, you can always rebuild it from the master.
Is it lossless?
Yes. KGZip v1 is lossless: if you compress a graph and then ask for all of its nodes back, you get every node and every edge exactly as they were. Capsules store boundary-crossing edges (and a small "halo" of neighbouring nodes) precisely so that nothing is lost when the pieces are reassembled.
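The role of boundary-crossing edges can be illustrated with a small sketch. It assumes a hand-picked two-way partition and a hypothetical `build_capsule` helper; it is not KGZip's clustering or encoding, only a demonstration that storing crossing edges on both sides lets the union restore the original edge set.

```python
# Illustrative only: split a graph into two capsules, keep boundary-crossing
# edges in both, then check that the union restores the original edge set.
# The partition and helper names are assumptions, not KGZip's algorithm.

edges = {
    ("Aspirin", "treats", "Headache"),
    ("Headache", "associated_with", "GeneX"),
    ("GeneX", "encodes", "ProteinY"),  # hypothetical extra edge
}
partition = {"Aspirin": 0, "Headache": 0, "GeneX": 1, "ProteinY": 1}

def build_capsule(capsule_id):
    """Collect every edge that touches a node owned by this capsule."""
    return {
        (s, r, d) for (s, r, d) in edges
        if partition[s] == capsule_id or partition[d] == capsule_id
    }

capsule_0 = build_capsule(0)
capsule_1 = build_capsule(1)

# The edge Headache -> GeneX crosses the boundary, so it appears in both.
boundary = capsule_0 & capsule_1
reassembled = capsule_0 | capsule_1
print(boundary, reassembled == edges)
```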
Install
pip install kgzip
# optional: to read directly from a Neo4j database
pip install "kgzip[neo4j]"
From source (for development):
git clone <repo-url> && cd KGZip
pip install -e ".[dev]"
pytest # run the test suite
Requires Python ≥ 3.8. Works in plain scripts and in Jupyter notebooks.
Quickstart
A few lines to compress a graph and query it:
import networkx as nx
from kgzip import KGZipStore
store = KGZipStore("./my_store") # 1. where the compressed store lives
store.compress(nx.karate_club_graph()) # 2. build capsules from a graph
result = store.query(["0", "1"], depth=2) # 3. ask: what's around nodes 0 and 1?
print(result.subgraph.meta.node_count) # 4. how many nodes came back
What just happened, line by line:
- `KGZipStore("./my_store")`: open (or prepare to create) a store in that folder. Nothing is read or written yet.
- `compress(...)`: read the graph, cluster it into capsules, and write the capsule files plus a manifest into `./my_store`.
- `query(["0","1"], depth=2)`: find the capsules containing nodes `"0"` and `"1"`, expand outward `depth` hops, decode just those capsules, and merge them.
- The answer is a `QueryResult`; its `.subgraph` is a normal graph you can inspect.
Loading data from different sources
compress() accepts several common graph formats. You don't convert anything
yourself — KGZip detects the type and reads it.
store.compress(nx.karate_club_graph()) # a NetworkX graph object (in memory)
store.compress("graph.ttl") # RDF / Turtle file (.ttl, .n3, .nt)
store.compress("graph.jsonld") # JSON-LD file
store.compress("edges.csv") # CSV edge list (see format below)
store.compress("bolt://localhost:7687") # a live Neo4j database (see below)
CSV edge-list format
The simplest way to bring your own data. One row per edge:
src,dst,relation,weight
Aspirin,Headache,treats,1.0
Headache,GeneX,associated_with,1.0
- `src`, `dst`: required; the two node IDs the edge connects.
- `relation`: optional; the edge type (defaults to `related_to`).
- `weight`: optional; a number (defaults to `1.0`).
- `src_type`, `dst_type`: optional; node categories (default `unknown`).
- Any other columns are kept as edge attributes.
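As a rough sketch of how these column rules play out, the snippet below parses an edge list with the standard `csv` module and applies the documented defaults. `read_edges`, the `attrs` key, and the extra `source` column are hypothetical; this is not how KGZip's own reader is implemented.

```python
import csv
import io

# Illustrative only: read an edge list and apply the documented defaults.
# `read_edges` is a hypothetical helper, not part of the kgzip API.
CORE = {"src", "dst", "relation", "weight", "src_type", "dst_type"}

def read_edges(text):
    reader = csv.DictReader(io.StringIO(text))
    missing = {"src", "dst"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    edges = []
    for row in reader:
        edges.append({
            "src": row["src"],
            "dst": row["dst"],
            "relation": row.get("relation") or "related_to",
            "weight": float(row.get("weight") or 1.0),
            "src_type": row.get("src_type") or "unknown",
            "dst_type": row.get("dst_type") or "unknown",
            # Any extra columns are carried along as edge attributes.
            "attrs": {k: v for k, v in row.items() if k not in CORE},
        })
    return edges

sample = (
    "src,dst,relation,source\n"
    "Aspirin,Headache,treats,textbook\n"
    "Headache,GeneX,,paper\n"
)
for edge in read_edges(sample):
    print(edge)
```

The second row leaves `relation` empty and the file has no `weight` column, so both fall back to their defaults.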
Reading directly from Neo4j
Neo4j is a popular graph database. KGZip can read a full
snapshot of it over Bolt (Neo4j's network connection protocol — the bolt://
address is just "where the database is listening"). You supply the connection URL
and your login:
from kgzip import KGZipStore
# If your Neo4j has no authentication:
store = KGZipStore("./my_store")
store.compress("bolt://localhost:7687")
Most Neo4j databases require a username and password. Pass them via the store's
IngestionConfig:
from kgzip import KGZipStore
from kgzip.models import KGZipConfig, IngestionConfig
config = KGZipConfig(
    ingestion=IngestionConfig(
        neo4j_auth=("neo4j", "your-password"),  # (username, password)
        neo4j_database=None,                    # database name; None = server default
        neo4j_node_label=None,                  # only nodes with this label; None = all
    ),
)
store = KGZipStore("./my_store", config)
store.compress("bolt://localhost:7687") # one-time snapshot read + compress
KGZip reads every node (id(n) becomes the node ID, the first label becomes the
node type, properties become attributes) and every relationship. It only reads —
your Neo4j data is never changed.
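The node mapping described above can be written as a small pure function. This is a sketch only: `to_node` and its dict layout are hypothetical, converting the ID to a string follows the "unique string node IDs" rule stated elsewhere in this README, and the `"unknown"` fallback for unlabeled nodes is an assumption.

```python
# Illustrative only: the documented Neo4j -> KGZip node mapping as a pure
# function. `to_node` and its dict layout are hypothetical; the "unknown"
# fallback for unlabeled nodes is an assumption.

def to_node(internal_id, labels, properties):
    return {
        "id": str(internal_id),                      # id(n) becomes the node ID
        "type": labels[0] if labels else "unknown",  # first label becomes the type
        "attrs": dict(properties),                   # properties become attributes
    }

print(to_node(42, ["Drug", "Chemical"], {"formula": "C9H8O4"}))
```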
How a query works (the mental model)
compress() once:                 query() many times:

  master graph                   query(["X"], depth=2)
       │                                  │
       ▼                                  ▼
 ┌────────────┐                  find capsule holding "X"
 │  capsules  │ ◄─ reads only ── + its neighbour capsules
 │ + manifest │                           │
 └────────────┘                           ▼
 (on local disk)                 decode those capsules, merge
                                          │
                                          ▼
                                 QueryResult.subgraph
- `depth` controls how far out from your seed nodes to reach. `depth=1` is "the seed nodes and their immediate surroundings"; higher depth pulls in more.
- KGZip retrieves at capsule granularity: it returns whole clusters, so the result is a superset of the exact neighbourhood (great recall; some extra nodes). Asking for all nodes always returns the complete original graph (lossless).
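The difference between the exact depth-hop neighbourhood and a capsule-granularity result can be sketched with a breadth-first search. The graph, the capsule assignment, and both function names are hypothetical, and expanding to every capsule touched by the exact neighbourhood is one plausible reading of the behaviour, not a statement of KGZip's actual expansion rule.

```python
from collections import deque

# Illustrative only: exact depth-hop neighbourhood vs. capsule-granularity result.
# The graph, capsule assignment, and function names are hypothetical.
adjacency = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"],
}
capsule_of = {"A": "c0", "B": "c0", "C": "c0", "D": "c1", "E": "c1"}

def exact_neighbourhood(seeds, depth):
    """Breadth-first search: every node within `depth` hops of a seed."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, dist = frontier.popleft()
        if depth is not None and dist >= depth:
            continue  # do not expand past the requested depth
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen

def capsule_result(seeds, depth):
    """Return every node of every capsule touched by the exact neighbourhood."""
    touched = {capsule_of[n] for n in exact_neighbourhood(seeds, depth)}
    return {n for n, c in capsule_of.items() if c in touched}

exact = exact_neighbourhood(["A"], 1)
coarse = capsule_result(["A"], 1)
print(sorted(exact), sorted(coarse))
```

Here the coarse result includes node `C` because it shares a capsule with `A` and `B`, even though it is two hops from the seed.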
API reference
KGZipStore is the only class you need for everyday use. The config models and serialization helpers shown later are optional; everything else is internal.
Creating a store
KGZipStore(path, config=None)
- `path`: folder for the compressed store (created on first `compress()`).
- `config`: optional `KGZipConfig` to tune clustering/compression (see below).
- The manifest is loaded lazily (on your first `query()`), so creating a store is instant and does no I/O.
- Works as a context manager: `with KGZipStore(path) as store: ...`.
compress(graph, *, config=None) → CapsuleStoreRef
Builds the compressed store from a graph. Accepts any supported source (NetworkX
object, file path, or bolt:// URL). Steps it runs for you: ingest → cluster →
encode → write capsules → write manifest (written last, as the safe commit point).
- Returns a `CapsuleStoreRef` describing the new store: `manifest_path`, `capsule_count`, `total_bytes`, `gcs_summary`, `store_version`, `created_at`.
- Idempotent by default (`overwrite=False`): re-compressing the same graph skips capsules whose content hasn't changed.
- Thread/process-safe: takes a file lock so two compresses can't clobber each other.
ref = store.compress("edges.csv")
print(ref.capsule_count, ref.total_bytes)
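Two of the ideas above, skipping unchanged capsules and writing the manifest last as the commit point, can be sketched with the standard library. Everything here is an assumption made for illustration: the `write_store` helper, the `.bin` and `manifest.json` file names, and the use of SHA-256 digests to detect changes. File locking is left out to keep the sketch short.

```python
import hashlib
import json
import os
import tempfile

# Illustrative only: skip unchanged capsules by content hash and write the
# manifest last via an atomic rename. File names and layout are assumptions.

def write_store(store_dir, capsules):
    """`capsules` maps capsule ID -> bytes. Returns how many files were written."""
    os.makedirs(store_dir, exist_ok=True)
    manifest_path = os.path.join(store_dir, "manifest.json")
    old = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as fh:
            old = json.load(fh)

    written, new_manifest = 0, {}
    for cid, payload in capsules.items():
        digest = hashlib.sha256(payload).hexdigest()
        new_manifest[cid] = digest
        if old.get(cid) == digest:
            continue  # content unchanged: leave the existing file alone
        with open(os.path.join(store_dir, f"{cid}.bin"), "wb") as fh:
            fh.write(payload)
        written += 1

    # Commit point: write to a temp file, then atomically swap it into place.
    tmp_path = manifest_path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(new_manifest, fh)
    os.replace(tmp_path, manifest_path)
    return written

with tempfile.TemporaryDirectory() as d:
    first = write_store(d, {"c0": b"alpha", "c1": b"beta"})
    second = write_store(d, {"c0": b"alpha", "c1": b"BETA"})
    print(first, second)
```

Because the manifest is replaced only after all capsule files exist, a reader that sees the new manifest can rely on every file it points to being present.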
query(node_ids, depth=1, **kwargs) → QueryResult
Fetch the subgraph around one or more seed nodes.
- `node_ids`: list of node IDs to start from (must be non-empty).
- `depth`: how many hops to expand. `depth=1` is the seeds and their immediate surroundings; higher pulls in more. `depth=None` = unbounded (follow the graph until nothing new is reachable, i.e. the whole connected component).
- Optional keyword arguments:
  - `trim: bool = False`: token control. `False` returns the full capsule contents (a superset of the neighbourhood, so more context). `True` prunes the result down to the exact `depth`-hop neighbourhood of your seeds. Trimming is lossless relative to the query (it never drops anything within `depth` hops) and can cut output ~100× on large graphs. See Saving tokens.
  - `max_capsules: int = 50`: safety cap on how many capsules one query may load. Set higher, or `None`, to fetch large/complete subgraphs. If the cap limits a result, `QueryResult.truncated` is set to `True` (never a silent partial answer).
  - `relation_filter: list[str]`: keep only edges of these relation types.
  - `consistency: "eventual" | "strict"`: `"strict"` re-fetches stale parts from the master via `master_kg_fn` instead of serving possibly-stale capsule data.
  - `timeout_ms: int`: max time to wait for parallel decoding (default 5000).
  - `master_kg_fn: Callable`: required when `consistency="strict"`; you write a function `node_ids -> fresh subgraph` that fetches from your master.
Returns a QueryResult:
| Field | Meaning |
|---|---|
| `subgraph` | the merged result graph (a `NormalizedGraph`) |
| `capsules_loaded` | how many capsules were read |
| `latency_ms` | how long the query took |
| `stale_capsules` | IDs of capsules flagged stale |
| `fallback_used` | `True` if the master was consulted (strict mode) |
| `query_node_ids_not_found` | seed IDs that weren't in the store |
| `truncated` | `True` if `max_capsules` limited the result (it's incomplete) |
# Token-lean: exact 2-hop neighbourhood, only "treats" edges
res = store.query(["Aspirin"], depth=2, trim=True, relation_filter=["treats"])
# Agent escape hatch: not satisfied? fetch everything reachable, no caps
res = store.query(["Aspirin"], depth=None, max_capsules=None)
if res.truncated:
    print("result was capped; raise max_capsules")
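The "never a silent partial answer" rule for `max_capsules` can be sketched as a function that returns the flag alongside the selection. `plan_capsules` is a hypothetical helper written for this illustration, not part of the kgzip API, and keeping the first `max_capsules` candidates is an arbitrary choice.

```python
# Illustrative only: how a capsule cap can be applied without hiding the fact
# that it was hit. `plan_capsules` is a hypothetical helper, not the kgzip API.

def plan_capsules(candidates, max_capsules=50):
    """Return (capsules_to_load, truncated)."""
    candidates = list(candidates)
    if max_capsules is None or len(candidates) <= max_capsules:
        return candidates, False
    return candidates[:max_capsules], True

wanted = [f"c{i}" for i in range(60)]
loaded, truncated = plan_capsules(wanted)                # default cap of 50
print(len(loaded), truncated)
loaded_all, truncated_all = plan_capsules(wanted, None)  # no cap
print(len(loaded_all), truncated_all)
```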
Iterative deepening (for AI agents)
The defaults are safe: unless a result is flagged `truncated`, you never get less than the true neighbourhood. An agent can start cheap and widen only when needed:
# not_enough / still_not_enough are placeholders for your own checks
res = store.query(seeds, depth=1, trim=True)   # cheap, few tokens
if not_enough(res):
    res = store.query(seeds, depth=3, trim=True)   # go deeper
    if still_not_enough(res):
        res = store.query(seeds, depth=None, max_capsules=None)   # the whole reachable graph
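The widening pattern generalises to a small loop. The sketch below is self-contained: `fake_query` stands in for `store.query` so it runs without a store, and `widen_until`, `FAKE_SIZES`, and the `enough` predicate are all hypothetical names.

```python
# Illustrative only: a generic widening loop. `fake_query` stands in for
# store.query so the sketch runs on its own; `enough` is your own predicate.

def widen_until(query_fn, seeds, depths, enough):
    """Try each depth in order; return (depth, result) for the first good result."""
    result = None
    for depth in depths:
        result = query_fn(seeds, depth)
        if enough(result):
            return depth, result
    return depths[-1], result  # nothing satisfied `enough`; return the widest try

# Stand-in data: how many nodes come back at each depth (None = unbounded).
FAKE_SIZES = {1: 3, 3: 12, None: 40}

def fake_query(seeds, depth):
    return {"node_count": FAKE_SIZES[depth]}

depth, res = widen_until(
    fake_query, ["Aspirin"], [1, 3, None],
    enough=lambda r: r["node_count"] >= 10,
)
print(depth, res)
```

With a real store, `query_fn` could be a thin wrapper that passes `trim=True` for bounded depths and `max_capsules=None` for the unbounded step.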
sync(master_graph=None) → SyncReport
Keep the store consistent with a changed master.
- `sync()` with no argument → marks all capsules stale (they'll be treated as out-of-date until rebuilt).
- `sync(updated_graph)` → re-compresses the store from the updated graph.
- Returns a `SyncReport`: `stale_count`, `re_encoded_count`, `skipped_count`, `sync_duration_ms`.
status() → StoreStatus
A safe, never-raises health check.
- Returns `StoreStatus`: `exists`, `capsule_count`, `stale_count`, `total_bytes`, `store_version`, `last_encoded_at`.
- `exists=False` means nothing has been compressed yet.
if not store.status().exists:
    store.compress(my_graph)
Configuration
Tune how KGZip clusters and compresses. Defaults are sensible — change these only if you need to.
from kgzip.models import KGZipConfig, DecisionConfig, StorageConfig
config = KGZipConfig(
    decision=DecisionConfig(
        max_capsule_nodes=500,   # biggest a capsule may get (bigger ones are split)
        min_capsule_nodes=5,     # smallest; tiny clusters merge into a neighbour
        spectral_k=8,            # size of each capsule's structural "fingerprint"
        random_seed=42,          # makes clustering reproducible
    ),
    storage=StorageConfig(
        base_path="./my_store",
        compression="zstd",      # "zstd" (best) | "gzip" | "none"
        compression_level=3,     # 1–19 for zstd (higher = smaller, slower)
        overwrite=False,         # True = always re-encode, even if unchanged
    ),
)
store = KGZipStore("./my_store", config)
Errors
Every error KGZip raises is a subclass of kgzip.KGZipError and carries a
message plus a context dict for debugging. Common ones:
| Exception | When |
|---|---|
| `EmptyGraphError` | the input graph has no nodes |
| `SchemaError` | a CSV is missing required `src`/`dst` columns |
| `SoftDependencyError` | an optional library (e.g. `neo4j`) isn't installed |
| `ConnectionError` | a Neo4j database couldn't be reached |
| `StoreNotFoundError` | you queried before compressing |
| `CorruptionError` / `VersionError` | a capsule file is damaged or wrong version |
| `QueryError` | bad query input (e.g. empty `node_ids`) |
from kgzip import KGZipError
try:
    store.query([], depth=1)
except KGZipError as e:
    print(e.message, e.context)
Saving tokens
If you feed query results to an LLM/agent, the number of tokens matters. Two levers, both lossless (they remove waste, not information you asked for):
1. trim=True — return only the exact neighbourhood. Without it, a query returns
the seed's whole community capsule (lots of extra context). With it, you get exactly
the depth-hop neighbourhood.
2. Compact serialization — render the subgraph as terse triples instead of verbose JSON, with optional attribute projection:
from kgzip import to_triples, to_compact
res = store.query(["Aspirin"], depth=2, trim=True)
print(to_triples(res.subgraph))
# Aspirin --treats--> Headache
# Headache --associated_with--> GeneX
to_compact(res.subgraph) # ids + types only (leanest)
to_compact(res.subgraph, attrs=["name"], include_attrs=True) # keep only the 'name' attr
Measured on a 1,000-node medical KG, average tokens for a single depth-2 query
(chars/4 estimate):
| Strategy | Tokens | vs naive |
|---|---|---|
| Full-capsule result, verbose JSON | 54,114 | 1× |
| Full-capsule result, compact triples | 26,191 | 2.1× less |
| `trim=True` + compact triples (= exact neighbourhood) | 367 | ~147× less |
The trimmed output equals what a precise Neo4j neighbourhood query would return — so
you get targeted-query token cost plus KGZip's storage/offline benefits. If you need
more, just widen depth or set trim=False; nothing is lost, it's your choice.
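For a feel of the chars/4 estimate, the sketch below renders the same two edges as verbose JSON and as terse triples, then recomputes the ratios from the token counts in the table. `estimate_tokens` and the JSON field names are hypothetical; this is not how `to_triples` or `to_compact` are implemented.

```python
import json

# Illustrative only: the chars/4 token estimate applied to two renderings of
# the same tiny subgraph. Helper names are hypothetical, not the kgzip API.
edges = [
    ("Aspirin", "treats", "Headache"),
    ("Headache", "associated_with", "GeneX"),
]

def estimate_tokens(text):
    """Rough token count: one token per four characters."""
    return len(text) // 4

verbose = json.dumps(
    [{"source": s, "relation": r, "target": d, "weight": 1.0} for s, r, d in edges],
    indent=2,
)
compact = "\n".join(f"{s} --{r}--> {d}" for s, r, d in edges)

print(estimate_tokens(verbose), estimate_tokens(compact))

# The ratios in the table follow directly from its token counts.
print(round(54114 / 26191, 1), round(54114 / 367))
```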
When should I (not) use KGZip?
- ✅ Your graph is large and you mostly read local neighbourhoods.
- ✅ You want a smaller on-disk representation than raw JSON.
- ✅ Your graph doesn't fit comfortably in memory, so you must read from storage.
- ❌ Your graph is small and fits in RAM, and you traverse it repeatedly in-process — plain in-memory traversal (e.g. NetworkX) will be faster. KGZip's wins are storage size and avoiding full-graph loads, not beating RAM-speed traversal.
How it works (under the hood)
KGZip is built as five layers; you only ever touch the last one (KGZipStore).
- Ingestion (L1): any input → a clean, immutable `NormalizedGraph` with unique string node IDs.
- Decision (L2): analyse the graph, detect communities (clusters of densely connected nodes via the Louvain algorithm), and plan one capsule per community.
- Encoding (L3): write each capsule as a compact binary `.kgzc` file (magic bytes, version, header, SHA-256 checksum, compressed payload). The `manifest.kgz.json` index is written last, as the atomic commit point.
- Query (L4): use the manifest to find the right capsules, decode them in parallel, verify their checksums, and merge.
- Facade (L5): `KGZipStore` ties it together with locking and lazy loading.
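The encoding and query layers can be pictured with a made-up container format. The sketch below is not the real `.kgzc` layout: the `DEMO` magic bytes, the header fields, and the `encode`/`decode` helpers are assumptions, and `zlib` stands in for zstd so the example needs only the standard library. It shows checksum verification on read and independent capsules being decoded concurrently.

```python
import hashlib
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

# Illustrative only: a made-up container with magic bytes, a version, a
# SHA-256 checksum and a compressed payload. This is NOT the real .kgzc
# layout, and zlib stands in for zstd to keep the sketch stdlib-only.
MAGIC = b"DEMO"
VERSION = 1
HEADER = struct.Struct("<4sHI32s")  # magic, version, payload length, sha256

def encode(payload: bytes) -> bytes:
    body = zlib.compress(payload, 6)
    digest = hashlib.sha256(body).digest()
    return HEADER.pack(MAGIC, VERSION, len(body), digest) + body

def decode(blob: bytes) -> bytes:
    magic, version, length, digest = HEADER.unpack_from(blob)
    if magic != MAGIC or version != VERSION:
        raise ValueError("not a supported capsule file")
    body = blob[HEADER.size:HEADER.size + length]
    if hashlib.sha256(body).digest() != digest:
        raise ValueError("checksum mismatch: capsule is corrupt")
    return zlib.decompress(body)

blobs = [encode(f"capsule-{i}".encode() * 100) for i in range(4)]

# Capsules are independent, so they can be decoded concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    decoded = list(pool.map(decode, blobs))

print([len(d) for d in decoded])
```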
Glossary
- Node: a thing in the graph (has an ID, a type, and attributes).
- Edge / relation: a directed connection between two nodes (has a type and weight).
- Capsule: a cluster of related nodes, stored as one `.kgzc` file. The unit KGZip loads per query.
- Manifest: `manifest.kgz.json`, the index that maps nodes to capsules.
- Community: a group of nodes more densely connected to each other than to the rest of the graph; KGZip turns each into a capsule.
- Boundary / halo node: a node on the edge of a capsule that also connects to a neighbouring capsule; stored in both so no edge is lost.
- Depth: how many hops outward from your seed nodes a query reaches (`None` = unbounded).
- Trim: prune a query result to the exact depth-hop neighbourhood (token-lean, lossless w.r.t. the query). Opt-in via `trim=True`.
- Truncated: a query result flagged `truncated=True` because `max_capsules` capped it; the answer is incomplete and you should raise the cap.
- Master: your original source-of-truth graph. KGZip never writes to it.
- Lossless: compressing then querying everything returns the exact original graph.
- Bolt: Neo4j's network protocol; a `bolt://host:port` URL is the database address.