Knowledge graph compression engine: parallel-decodable subgraph capsules for fast, storage-efficient KG queries.

KGZip

A compression engine for knowledge graphs. KGZip takes a knowledge graph and splits it into small, independently-loadable pieces called capsules, so that when you ask a question about one part of the graph, you only read that part — not the whole thing. The result is a store that is smaller on disk and lets you query large graphs without loading them entirely into memory.

This README assumes no prior knowledge of knowledge graphs. If you already know the basics, jump to Quickstart or the API reference.

New here? Start with the concepts

What is a knowledge graph?

A knowledge graph (KG) is just data stored as things and the relationships between them.

A node is a thing: a drug, a disease, a person, a movie.
An edge is a relationship connecting two nodes: Aspirin treats Headache.

(Aspirin) --treats--> (Headache) --associated_with--> (GeneX)

Here Aspirin, Headache, and GeneX are nodes; treats and associated_with are edges (also called relations). Each node can also carry attributes (properties), e.g. Aspirin's { "formula": "C9H8O4" }.

That's the whole idea. Real KGs just have many more nodes and edges (thousands to billions), often describing a domain like medicine, finance, or social networks.
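As a minimal sketch in plain Python, the Aspirin example above can be written as a list of (source, relation, target) triples plus a dictionary of node attributes. This is only an illustration of the concept, not KGZip's internal representation; the variable and function names are made up for the example.

```python
# Edges as (source, relation, target) triples; names are illustrative only.
edges = [
    ("Aspirin", "treats", "Headache"),
    ("Headache", "associated_with", "GeneX"),
]

# Node attributes (properties), keyed by node ID.
attributes = {"Aspirin": {"formula": "C9H8O4"}}

# The node set is every ID that appears at either end of an edge.
nodes = sorted({n for src, _, dst in edges for n in (src, dst)})

def neighbours(node):
    """Outgoing neighbours of a node, with the relation that links them."""
    return [(rel, dst) for src, rel, dst in edges if src == node]

print(nodes)                  # ['Aspirin', 'GeneX', 'Headache']
print(neighbours("Aspirin"))  # [('treats', 'Headache')]
```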

What problem does KGZip solve?

When a graph is large, two things get painful:

Storage — keeping the whole graph around costs space.
Querying — to answer "what is near node X?", naive tools load or scan the entire graph, even though you only care about a tiny neighbourhood.

KGZip pre-organises the graph into capsules (clusters of closely-related nodes) and writes a small manifest (an index). A query then loads only the capsules it needs. Think of it like a book with chapters and a table of contents: to read about one topic you open one chapter, not the entire book.
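The chapter-and-table-of-contents idea can be sketched with the standard library alone. The toy below writes one JSON file per capsule and a manifest that maps each node to its capsule, then answers a lookup by opening only the files it needs. The file names, the JSON layout, and the `load_for` helper are assumptions made for illustration; KGZip's real on-disk format is different.

```python
import json
import tempfile
from pathlib import Path

# Toy clustering: two "capsules", each a group of closely related nodes.
capsules = {
    "c0": {"nodes": ["Aspirin", "Headache"], "edges": [["Aspirin", "treats", "Headache"]]},
    "c1": {"nodes": ["GeneX", "GeneY"], "edges": [["GeneX", "interacts_with", "GeneY"]]},
}

store_dir = Path(tempfile.mkdtemp())

# Write one file per capsule, then a manifest mapping each node to its capsule.
manifest = {}
for capsule_id, capsule in capsules.items():
    (store_dir / f"{capsule_id}.json").write_text(json.dumps(capsule))
    for node in capsule["nodes"]:
        manifest[node] = capsule_id
(store_dir / "manifest.json").write_text(json.dumps(manifest))

def load_for(seed_nodes):
    """Read the manifest, then open only the capsule files the seeds need."""
    index = json.loads((store_dir / "manifest.json").read_text())
    needed = sorted({index[n] for n in seed_nodes})
    return {cid: json.loads((store_dir / f"{cid}.json").read_text()) for cid in needed}

result = load_for(["Aspirin"])
print(list(result))  # ['c0']
```

Only `c0.json` is read for that lookup; `c1.json` stays untouched on disk.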

The golden rule: KGZip is a read replica

Your original graph (in a file, or in a database like Neo4j) is always the source of truth — the "master". KGZip builds a compressed copy from it for fast reads. KGZip never modifies your original data. If the KGZip store is ever lost or corrupted, you can always rebuild it from the master.

Is it lossless?

Yes. KGZip v1 is lossless: if you compress a graph and then ask for all of its nodes back, you get every node and every edge exactly as they were. Capsules store boundary-crossing edges (and a small "halo" of neighbouring nodes) precisely so that nothing is lost when the pieces are reassembled.
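To see why storing boundary-crossing edges keeps the split lossless, here is a small sketch under stated assumptions: a hand-written two-way partition, edges as tuples, and a boundary edge copied into both capsules it touches. The function names and the partition are hypothetical and only illustrate the principle, not KGZip's encoder.

```python
# Original graph: a set of (source, relation, target) edges.
original = {
    ("Aspirin", "treats", "Headache"),
    ("Headache", "associated_with", "GeneX"),
    ("GeneX", "interacts_with", "GeneY"),
}

# A hypothetical two-way partition of the nodes.
partition = {"Aspirin": "c0", "Headache": "c0", "GeneX": "c1", "GeneY": "c1"}

def build_capsules(edges, partition):
    """Assign each edge to the capsule(s) of its endpoints.

    A boundary-crossing edge is stored in both capsules, and the far
    endpoint is recorded as a halo node of each.
    """
    capsules = {cid: {"edges": set(), "halo": set()} for cid in set(partition.values())}
    for src, rel, dst in edges:
        for here, there in ((src, dst), (dst, src)):
            cid = partition[here]
            capsules[cid]["edges"].add((src, rel, dst))
            if partition[there] != cid:
                capsules[cid]["halo"].add(there)
    return capsules

def reassemble(capsules):
    """Union the edge sets; duplicated boundary edges collapse into one."""
    merged = set()
    for capsule in capsules.values():
        merged |= capsule["edges"]
    return merged

capsules = build_capsules(original, partition)
assert reassemble(capsules) == original   # nothing lost
print(sorted(capsules["c0"]["halo"]))     # ['GeneX']
```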

Install

pip install kgzip

# optional: to read directly from a Neo4j database
pip install "kgzip[neo4j]"

From source (for development):

git clone <repo-url> && cd KGZip
pip install -e ".[dev]"
pytest                       # run the test suite

Requires Python ≥ 3.8. Works in plain scripts and in Jupyter notebooks.

Quickstart

Four steps to compress a graph and query it:

import networkx as nx
from kgzip import KGZipStore

store = KGZipStore("./my_store")              # 1. where the compressed store lives
store.compress(nx.karate_club_graph())        # 2. build capsules from a graph
result = store.query(["0", "1"], depth=2)     # 3. ask: what's around nodes 0 and 1?
print(result.subgraph.meta.node_count)        # 4. how many nodes came back

What just happened, line by line:

KGZipStore("./my_store") — open (or prepare to create) a store in that folder. Nothing is read or written yet.
compress(...) — read the graph, cluster it into capsules, and write the capsule files plus a manifest into ./my_store.
query(["0","1"], depth=2) — find the capsules containing nodes "0" and "1", expand outward depth hops, decode just those capsules, and merge them.
The answer is a QueryResult; its .subgraph is a normal graph you can inspect.

Loading data from different sources

compress() accepts several common graph formats. You don't convert anything yourself — KGZip detects the type and reads it.

store.compress(nx.karate_club_graph())   # a NetworkX graph object (in memory)
store.compress("graph.ttl")              # RDF / Turtle file  (.ttl, .n3, .nt)
store.compress("graph.jsonld")           # JSON-LD file
store.compress("edges.csv")              # CSV edge list (see format below)
store.compress("bolt://localhost:7687")  # a live Neo4j database (see below)

CSV edge-list format

The simplest way to bring your own data. One row per edge:

src,dst,relation,weight
Aspirin,Headache,treats,1.0
Headache,GeneX,associated_with,1.0

src, dst — required: the two node IDs the edge connects.
relation — optional: the edge type (defaults to related_to).
weight — optional: a number (defaults to 1.0).
src_type, dst_type — optional: node categories (default unknown).
Any other columns are kept as edge attributes.
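The column rules above can be sketched with the standard `csv` module. This is not KGZip's reader: the `read_edges` helper is hypothetical, it raises a plain `ValueError` where KGZip would raise its own error, and treating an empty cell the same as a missing column is an assumption.

```python
import csv
import io

CSV_TEXT = """src,dst,relation,weight
Aspirin,Headache,treats,1.0
Headache,GeneX,,
"""

KNOWN = {"src", "dst", "relation", "weight", "src_type", "dst_type"}

def read_edges(text):
    """Parse an edge list, applying the defaults described above."""
    reader = csv.DictReader(io.StringIO(text))
    missing = {"src", "dst"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    edges = []
    for row in reader:
        edges.append({
            "src": row["src"],
            "dst": row["dst"],
            "relation": row.get("relation") or "related_to",
            "weight": float(row.get("weight") or 1.0),
            "src_type": row.get("src_type") or "unknown",
            "dst_type": row.get("dst_type") or "unknown",
            # Any other columns are kept as edge attributes.
            "attrs": {k: v for k, v in row.items() if k not in KNOWN},
        })
    return edges

edges = read_edges(CSV_TEXT)
print(edges[1]["relation"], edges[1]["weight"])  # related_to 1.0
```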

Reading directly from Neo4j

Neo4j is a popular graph database. KGZip can read a full snapshot of it over Bolt (Neo4j's network connection protocol — the bolt:// address is just "where the database is listening"). You supply the connection URL and your login:

from kgzip import KGZipStore

# If your Neo4j has no authentication:
store = KGZipStore("./my_store")
store.compress("bolt://localhost:7687")

Most Neo4j databases require a username and password. Pass them via the store's IngestionConfig:

from kgzip import KGZipStore
from kgzip.models import KGZipConfig, IngestionConfig

config = KGZipConfig(
    ingestion=IngestionConfig(
        neo4j_auth=("neo4j", "your-password"),  # (username, password)
        neo4j_database=None,                    # database name; None = server default
        neo4j_node_label=None,                  # only nodes with this label; None = all
    ),
)
store = KGZipStore("./my_store", config)
store.compress("bolt://localhost:7687")          # one-time snapshot read + compress

KGZip reads every node (id(n) becomes the node ID, the first label becomes the node type, properties become attributes) and every relationship. It only reads — your Neo4j data is never changed.

How a query works (the mental model)

   compress() once:                 query() many times:

   master graph                     query(["X"], depth=2)
        │                                  │
        ▼                                  ▼
   ┌────────────┐                   find capsule holding "X"
   │ capsules   │ ◄─── reads only ──── + its neighbour capsules
   │ + manifest │                        │
   └────────────┘                        ▼
   (on local disk)                  decode those capsules, merge
                                          │
                                          ▼
                                     QueryResult.subgraph

depth controls how far out from your seed nodes to reach. depth=1 is "the seed nodes and their immediate surroundings"; higher depth pulls in more.
KGZip retrieves at capsule granularity — it returns whole clusters, so the result is a superset of the exact neighbourhood (great recall; some extra nodes). Asking for all nodes always returns the complete original graph (lossless).
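The difference between the exact depth-hop neighbourhood and a capsule-granular result can be sketched with a breadth-first search over a toy graph. The adjacency map, the node-to-capsule assignment, and both function names are assumptions for illustration; how KGZip actually selects neighbour capsules is not shown here.

```python
from collections import deque

# Undirected toy graph as an adjacency map, plus a hypothetical node -> capsule map.
adjacency = {
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C", "E"], "E": ["D"],
}
capsule_of = {"A": "c0", "B": "c0", "C": "c1", "D": "c1", "E": "c1"}

def exact_neighbourhood(seeds, depth):
    """Breadth-first search: every node within `depth` hops of a seed."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand past the requested depth
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen

def capsule_granular(seeds, depth):
    """Return whole capsules touched by the exact neighbourhood (a superset)."""
    touched = {capsule_of[n] for n in exact_neighbourhood(seeds, depth)}
    return {n for n, cid in capsule_of.items() if cid in touched}

exact = exact_neighbourhood(["A"], depth=2)
coarse = capsule_granular(["A"], depth=2)
print(sorted(exact))   # ['A', 'B', 'C']
print(sorted(coarse))  # ['A', 'B', 'C', 'D', 'E']
```

The coarse result always contains the exact one; the extra nodes (`D` and `E` here) are the price of loading whole capsules.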

API reference

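KGZipStore is the only class you need for everyday use. The other public names used in this README are the config models in kgzip.models, the KGZipError family, and the to_triples / to_compact helpers; everything else is internal.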

Creating a store

KGZipStore(path, config=None)

path — folder for the compressed store (created on first compress()).
config — optional KGZipConfig to tune clustering/compression (see below).
The manifest is loaded lazily (on your first query()), so creating a store is instant and does no I/O.
Works as a context manager: with KGZipStore(path) as store: ....

`compress(graph, *, config=None) → CapsuleStoreRef`

Builds the compressed store from a graph. Accepts any supported source (NetworkX object, file path, or bolt:// URL). Steps it runs for you: ingest → cluster → encode → write capsules → write manifest (written last, as the safe commit point).

Returns a CapsuleStoreRef describing the new store: manifest_path, capsule_count, total_bytes, gcs_summary, store_version, created_at.
Idempotent by default (overwrite=False): re-compressing the same graph skips capsules whose content hasn't changed.
Thread/process-safe: takes a file lock so two compresses can't clobber each other.

ref = store.compress("edges.csv")
print(ref.capsule_count, ref.total_bytes)
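Writing the manifest last as a commit point is a common pattern, and one way to get it is to write data files first and then rename a finished temporary manifest into place. The sketch below shows that pattern with the standard library only; the `write_store` helper, the `.bin` extension, and the manifest layout are hypothetical and not taken from KGZip's source.

```python
import json
import os
import tempfile
from pathlib import Path

def write_store(store_dir, capsules):
    """Write capsule files first, then publish the manifest atomically."""
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)

    index = {}
    for capsule_id, payload in capsules.items():
        path = store_dir / f"{capsule_id}.bin"
        path.write_bytes(payload)
        index[capsule_id] = {"file": path.name, "bytes": len(payload)}

    # Write the manifest to a temporary file in the same directory, then
    # rename it into place. os.replace is atomic on the same filesystem, so
    # readers see either the old manifest or the complete new one.
    fd, tmp_name = tempfile.mkstemp(dir=store_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(index, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    manifest_path = store_dir / "manifest.json"
    os.replace(tmp_name, manifest_path)
    return manifest_path

demo_dir = tempfile.mkdtemp()
manifest_path = write_store(demo_dir, {"c0": b"abc", "c1": b"defgh"})
print(json.loads(manifest_path.read_text())["c1"]["bytes"])  # 5
```

If the process dies before the rename, the capsule files exist but no manifest points at them, so a reader never sees a half-written store.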

`query(node_ids, depth=1, **kwargs) → QueryResult`

Fetch the subgraph around one or more seed nodes.

node_ids — list of node IDs to start from (must be non-empty).
depth — how many hops to expand. depth=1 is the seeds and their immediate surroundings; higher pulls in more. depth=None = unbounded (follow the graph until nothing new is reachable — the whole connected subgraph).
Optional keyword arguments:
- trim: bool = False — token control. False returns the full capsule contents (a superset of the neighbourhood — more context). True prunes the result down to the exact depth-hop neighbourhood of your seeds. Trimming is lossless relative to the query (it never drops anything within depth hops) and can cut output ~100× on large graphs. See Saving tokens.
- max_capsules: int = 50 — safety cap on how many capsules one query may load. Set higher, or None, to fetch large/complete subgraphs. If the cap limits a result, QueryResult.truncated is set to True (never a silent partial answer).
- relation_filter: list[str] — keep only edges of these relation types.
- consistency: "eventual" | "strict" — "strict" re-fetches stale parts from the master via master_kg_fn instead of serving possibly-stale capsule data.
- timeout_ms: int — max time to wait for parallel decoding (default 5000).
- master_kg_fn: Callable — required when consistency="strict"; you write a function node_ids -> fresh subgraph that fetches from your master.

Returns a QueryResult:

Field	Meaning
`subgraph`	the merged result graph (a `NormalizedGraph`)
`capsules_loaded`	how many capsules were read
`latency_ms`	how long the query took
`stale_capsules`	IDs of capsules flagged stale
`fallback_used`	`True` if the master was consulted (strict mode)
`query_node_ids_not_found`	seed IDs that weren't in the store
`truncated`	`True` if `max_capsules` limited the result (it's incomplete)

# Token-lean: exact 2-hop neighbourhood, only "treats" edges
res = store.query(["Aspirin"], depth=2, trim=True, relation_filter=["treats"])

# Agent escape hatch: not satisfied? fetch everything reachable
res = store.query(["Aspirin"], depth=None)   # still subject to the default max_capsules=50
if res.truncated:
    # result was capped: retry with no cap
    res = store.query(["Aspirin"], depth=None, max_capsules=None)

Iterative deepening (for AI agents)

The defaults are safe: a query never silently returns less than the true neighbourhood, and a result limited by max_capsules is flagged with truncated=True. An agent can start cheap and widen only when needed:

# seeds, not_enough and still_not_enough are placeholders you define yourself
res = store.query(seeds, depth=1, trim=True)      # cheap, few tokens
if not_enough(res):
    res = store.query(seeds, depth=3, trim=True)   # go deeper
if still_not_enough(res):
    res = store.query(seeds, depth=None, max_capsules=None)  # the whole reachable graph
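The same widening strategy can be written as a loop. The sketch below runs without KGZip installed: `FakeResult` and `fake_query` are hypothetical stand-ins for a real query, and `deepen` is an illustrative helper, not part of the library. In real use you would pass a small wrapper around `store.query` instead of the stub.

```python
from dataclasses import dataclass

@dataclass
class FakeResult:
    """Stand-in for a query result; only the field this sketch needs."""
    node_count: int

def fake_query(seeds, depth):
    """Hypothetical stub: pretend deeper queries return more nodes."""
    sizes = {1: 3, 2: 12, 3: 40}
    return FakeResult(node_count=sizes.get(depth, 500))

def deepen(query, seeds, depths, enough):
    """Try each depth in order; stop at the first result that is enough."""
    result = None
    for depth in depths:
        result = query(seeds, depth)
        if enough(result):
            return depth, result
    return depths[-1], result

# `None` stands for the unbounded depth=None query.
depth_used, result = deepen(
    fake_query, ["Aspirin"], depths=[1, 3, None],
    enough=lambda r: r.node_count >= 10,
)
print(depth_used, result.node_count)  # 3 40
```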

`sync(master_graph=None) → SyncReport`

Keep the store consistent with a changed master.

sync() with no argument → marks all capsules stale (they'll be treated as out-of-date until rebuilt).
sync(updated_graph) → re-compresses the store from the updated graph.
Returns a SyncReport: stale_count, re_encoded_count, skipped_count, sync_duration_ms.

`status() → StoreStatus`

A safe, never-raises health check.

Returns StoreStatus: exists, capsule_count, stale_count, total_bytes, store_version, last_encoded_at.
exists=False means nothing has been compressed yet.

if not store.status().exists:
    store.compress(my_graph)

Configuration

Tune how KGZip clusters and compresses. Defaults are sensible — change these only if you need to.

from kgzip.models import KGZipConfig, DecisionConfig, StorageConfig

config = KGZipConfig(
    decision=DecisionConfig(
        max_capsule_nodes=500,   # biggest a capsule may get (bigger ones are split)
        min_capsule_nodes=5,     # smallest; tiny clusters merge into a neighbour
        spectral_k=8,            # size of each capsule's structural "fingerprint"
        random_seed=42,          # makes clustering reproducible
    ),
    storage=StorageConfig(
        base_path="./my_store",
        compression="zstd",      # "zstd" (best) | "gzip" | "none"
        compression_level=3,     # 1–19 for zstd (higher = smaller, slower)
        overwrite=False,         # True = always re-encode, even if unchanged
    ),
)
store = KGZipStore("./my_store", config)
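To get a feel for what a compression level trades off, the sketch below compresses a repetitive JSON payload at several levels using the standard library's `zlib` (the DEFLATE algorithm that gzip also uses). It does not use zstd or KGZip itself, and the payload is synthetic, so treat the numbers it prints as illustrative only.

```python
import json
import zlib

# A repetitive JSON payload, loosely shaped like a list of edges.
edges = [{"src": f"n{i}", "dst": f"n{i + 1}", "relation": "related_to"} for i in range(2000)]
raw = json.dumps(edges).encode("utf-8")

# Compress at a fast level, the usual default, and the strongest level (zlib levels run 1-9).
sizes = {level: len(zlib.compress(raw, level)) for level in (1, 6, 9)}

# Compression is lossless: decompressing returns the original bytes.
assert zlib.decompress(zlib.compress(raw, 9)) == raw

for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({size / len(raw):.1%} of raw)")
```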

Errors

Every error KGZip raises is a subclass of kgzip.KGZipError and carries a message plus a context dict for debugging. Common ones:

Exception	When
`EmptyGraphError`	the input graph has no nodes
`SchemaError`	a CSV is missing required `src`/`dst` columns
`SoftDependencyError`	an optional library (e.g. `neo4j`) isn't installed
`ConnectionError`	a Neo4j database couldn't be reached
`StoreNotFoundError`	you queried before compressing
`CorruptionError` / `VersionError`	a capsule file is damaged or wrong version
`QueryError`	bad query input (e.g. empty `node_ids`)

from kgzip import KGZipError
try:
    store.query([], depth=1)
except KGZipError as e:
    print(e.message, e.context)
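The "one base class, message plus context" pattern can be sketched without KGZip installed. The classes below are hypothetical stand-ins (`DemoError`, `DemoQueryError`) that mimic the shape described above; they are not KGZip's actual exception classes.

```python
class DemoError(Exception):
    """Hypothetical base class carrying a message and a context dict."""

    def __init__(self, message, **context):
        super().__init__(message)
        self.message = message
        self.context = context

class DemoQueryError(DemoError):
    """Hypothetical subclass for bad query input."""

def query(node_ids):
    """Reject empty input the way a query front end might."""
    if not node_ids:
        raise DemoQueryError("node_ids must be non-empty", received=node_ids)
    return node_ids

try:
    query([])
except DemoError as exc:              # one handler covers every subclass
    print(exc.message, exc.context)   # node_ids must be non-empty {'received': []}
```

Catching the base class keeps calling code short, while the context dict carries the details needed for debugging.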

Saving tokens

If you feed query results to an LLM/agent, the number of tokens matters. Two levers, both lossless (they remove waste, not information you asked for):

1. trim=True — return only the exact neighbourhood. Without it, a query returns the seed's whole community capsule (lots of extra context). With it, you get exactly the depth-hop neighbourhood.

2. Compact serialization — render the subgraph as terse triples instead of verbose JSON, with optional attribute projection:

from kgzip import to_triples, to_compact

res = store.query(["Aspirin"], depth=2, trim=True)

print(to_triples(res.subgraph))
# Aspirin --treats--> Headache
# Headache --associated_with--> GeneX

to_compact(res.subgraph)                                   # ids + types only (leanest)
to_compact(res.subgraph, attrs=["name"], include_attrs=True)  # keep only the 'name' attr

Measured on a 1,000-node medical KG, average tokens for a single depth-2 query (chars/4 estimate):

Strategy	Avg. tokens	vs naive baseline
Full-capsule result, verbose JSON	54,114	1×
Full-capsule result, compact triples	26,191	2.1× less
`trim=True` + compact triples (= exact neighbourhood)	367	~147× less

The trimmed output equals what a precise Neo4j neighbourhood query would return — so you get targeted-query token cost plus KGZip's storage/offline benefits. If you need more, just widen depth or set trim=False; nothing is lost, it's your choice.
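The chars/4 estimate used for the measurements above is easy to reproduce on your own data. The sketch below compares verbose JSON with the terse triple format for two edges; `estimate_tokens` is a hypothetical helper, and real tokenizers will give somewhat different counts.

```python
import json

edges = [
    {"src": "Aspirin", "relation": "treats", "dst": "Headache", "weight": 1.0},
    {"src": "Headache", "relation": "associated_with", "dst": "GeneX", "weight": 1.0},
]

def estimate_tokens(text):
    """Rough token estimate: about four characters per token."""
    return len(text) // 4

# Verbose rendering: indented JSON with every key repeated per edge.
verbose = json.dumps({"edges": edges}, indent=2)

# Compact rendering: one "src --relation--> dst" line per edge.
triples = "\n".join(f"{e['src']} --{e['relation']}--> {e['dst']}" for e in edges)

print(estimate_tokens(verbose), estimate_tokens(triples))
```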

When should I (not) use KGZip?

✅ Your graph is large and you mostly read local neighbourhoods.
✅ You want a smaller on-disk representation than raw JSON.
✅ Your graph doesn't fit comfortably in memory, so you must read from storage.
❌ Your graph is small and fits in RAM, and you traverse it repeatedly in-process — plain in-memory traversal (e.g. NetworkX) will be faster. KGZip's wins are storage size and avoiding full-graph loads, not beating RAM-speed traversal.

How it works (under the hood)

KGZip is built as five layers; you only ever touch the last one (KGZipStore).

Ingestion (L1) — any input → a clean, immutable NormalizedGraph with unique string node IDs.
Decision (L2) — analyse the graph, detect communities (clusters of densely connected nodes via the Louvain algorithm), and plan one capsule per community.
Encoding (L3) — write each capsule as a compact binary .kgzc file (magic bytes, version, header, SHA-256 checksum, compressed payload). The manifest.kgz.json index is written last, as the atomic commit point.
Query (L4) — use the manifest to find the right capsules, decode them in parallel, verify their checksums, and merge.
Facade (L5) — KGZipStore ties it together with locking and lazy loading.
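The capsule file layout named in the Encoding layer (magic bytes, version, header, checksum, compressed payload) can be illustrated with a toy format. Everything specific below is an assumption: the `DEMO` magic, the field order and sizes, and the use of `zlib` in place of zstd. It shows the general technique of a checksummed binary container, not the actual .kgzc layout.

```python
import hashlib
import struct
import zlib

MAGIC = b"DEMO"          # hypothetical magic bytes, not KGZip's
VERSION = 1
HEADER = struct.Struct(">4sHI32s")   # magic, version, payload length, SHA-256

def encode(payload: bytes) -> bytes:
    """Compress the payload and prefix it with a checksummed header."""
    body = zlib.compress(payload)
    digest = hashlib.sha256(body).digest()
    return HEADER.pack(MAGIC, VERSION, len(body), digest) + body

def decode(blob: bytes) -> bytes:
    """Validate magic, version, length and checksum, then decompress."""
    magic, version, length, digest = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad magic bytes")
    if version != VERSION:
        raise ValueError(f"unsupported version {version}")
    body = blob[HEADER.size:HEADER.size + length]
    if len(body) != length or hashlib.sha256(body).digest() != digest:
        raise ValueError("checksum mismatch: capsule is corrupt")
    return zlib.decompress(body)

blob = encode(b'{"nodes": ["Aspirin", "Headache"]}')
assert decode(blob) == b'{"nodes": ["Aspirin", "Headache"]}'

# Flip one byte of the body to simulate corruption.
corrupt = blob[:-1] + bytes([blob[-1] ^ 0xFF])
try:
    decode(corrupt)
except ValueError as exc:
    print(exc)  # checksum mismatch: capsule is corrupt
```

Verifying the checksum before decompressing means a damaged file is rejected with a clear error instead of producing a wrong graph.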

Glossary

Node — a thing in the graph (has an ID, a type, and attributes).
Edge / relation — a directed connection between two nodes (has a type and weight).
Capsule — a cluster of related nodes, stored as one .kgzc file. The unit KGZip loads per query.
Manifest — manifest.kgz.json, the index that maps nodes to capsules.
Community — a group of nodes more densely connected to each other than to the rest of the graph; KGZip turns each into a capsule.
Boundary / halo node — a node on the edge of a capsule that also connects to a neighbouring capsule; stored in both so no edge is lost.
Depth — how many hops outward from your seed nodes a query reaches (None = unbounded).
Trim — prune a query result to the exact depth-hop neighbourhood (token-lean, lossless w.r.t. the query). Opt-in via trim=True.
Truncated — a query result flagged truncated=True because max_capsules capped it; the answer is incomplete and you should raise the cap.
Master — your original source-of-truth graph. KGZip never writes to it.
Lossless — compressing then querying everything returns the exact original graph.
Bolt — Neo4j's network protocol; a bolt://host:port URL is the database address.

Project details

This version

0.1.0

Jun 27, 2026

Download files

Source Distribution

kgzip-0.1.0.tar.gz (76.2 kB)

Uploaded Jun 27, 2026 Source

Built Distribution

kgzip-0.1.0-py3-none-any.whl (52.5 kB)

Uploaded Jun 27, 2026 Python 3

File details

Details for the file kgzip-0.1.0.tar.gz.

File metadata

Download URL: kgzip-0.1.0.tar.gz
Upload date: Jun 27, 2026
Size: 76.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.3

File hashes

Hashes for kgzip-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8f6881b5e12da0b900b4f61e7fcba85f62fd92f4ec294f01c3068e2f6073d5fb`
MD5	`3e24f28c107c16bfc44be021ab91f310`
BLAKE2b-256	`130359d5f8511d903f897a58721542b046de87385ecb5f90b8dc3383edcbc10a`

File details

Details for the file kgzip-0.1.0-py3-none-any.whl.

File metadata

Download URL: kgzip-0.1.0-py3-none-any.whl
Upload date: Jun 27, 2026
Size: 52.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.8.3

File hashes

Hashes for kgzip-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4149576e2969efcbb20bbc34f42aa5f461d1001e24b3110fbdc4f770ef93bb8c`
MD5	`67f6cc9dcfe9ef842e761b2d3ca0f13d`
BLAKE2b-256	`b5187653a4f0c68571671af54f9f7f87946c9b1dcffe39c72711cb1b8b172daf`
