Knowledge graph compression engine: parallel-decodable subgraph capsules for fast, storage-efficient KG queries.

KGZip

A compression engine for knowledge graphs. KGZip takes a knowledge graph and splits it into small, independently-loadable pieces called capsules, so that when you ask a question about one part of the graph, you only read that part — not the whole thing. The result is a store that is smaller on disk and lets you query large graphs without loading them entirely into memory.

This README assumes no prior knowledge of knowledge graphs. If you already know the basics, jump to Quickstart or the API reference.


New here? Start with the concepts

What is a knowledge graph?

A knowledge graph (KG) is just data stored as things and the relationships between them.

  • A node is a thing: a drug, a disease, a person, a movie.
  • An edge is a relationship connecting two nodes: Aspirin treats Headache.
(Aspirin) --treats--> (Headache) --associated_with--> (GeneX)

Here Aspirin, Headache, and GeneX are nodes; treats and associated_with are edges (also called relations). Each node can also carry attributes (properties), e.g. Aspirin's { "formula": "C9H8O4" }.

That's the whole idea. Real KGs just have many more nodes and edges (thousands to billions), often describing a domain like medicine, finance, or social networks.
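
As a minimal sketch in plain Python (an illustration only, not KGZip's internal representation), the example graph above can be written as a list of (source, relation, target) triples plus a dictionary of node attributes. The names `edges`, `attributes`, and `neighbours` are made up for this example.

```python
# Edges as (source, relation, target) triples.
edges = [
    ("Aspirin", "treats", "Headache"),
    ("Headache", "associated_with", "GeneX"),
]

# Node attributes (properties), keyed by node ID.
attributes = {
    "Aspirin": {"formula": "C9H8O4"},
}

# The node set is every ID that appears at either end of an edge.
nodes = {n for src, _, dst in edges for n in (src, dst)}


def neighbours(node):
    """Return (relation, target) pairs for the edges leaving `node`."""
    return [(rel, dst) for src, rel, dst in edges if src == node]


print(sorted(nodes))          # ['Aspirin', 'GeneX', 'Headache']
print(neighbours("Aspirin"))  # [('treats', 'Headache')]
```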

What problem does KGZip solve?

When a graph is large, two things get painful:

  1. Storage — keeping the whole graph around costs space.
  2. Querying — to answer "what is near node X?", naive tools load or scan the entire graph, even though you only care about a tiny neighbourhood.

KGZip pre-organises the graph into capsules (clusters of closely-related nodes) and writes a small manifest (an index). A query then loads only the capsules it needs. Think of it like a book with chapters and a table of contents: to read about one topic you open one chapter, not the entire book.
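
The table-of-contents idea can be sketched with two dictionaries. This is a toy in-memory model, assuming the manifest is essentially a node-to-capsule map; the real on-disk layout is not shown here, and `load_capsule` and `lookup` are hypothetical helpers.

```python
# Toy "store": capsule ID -> edges held in that capsule.
capsules = {
    "c0": [("Aspirin", "treats", "Headache")],
    "c1": [("Ibuprofen", "treats", "Fever")],
    "c2": [("GeneX", "regulates", "GeneY")],
}

# Toy "manifest": node ID -> the capsule that holds it.
manifest = {
    "Aspirin": "c0", "Headache": "c0",
    "Ibuprofen": "c1", "Fever": "c1",
    "GeneX": "c2", "GeneY": "c2",
}

loaded = []  # records which capsules were actually read


def load_capsule(capsule_id):
    """Stand-in for reading and decoding one capsule file."""
    loaded.append(capsule_id)
    return capsules[capsule_id]


def lookup(seed_nodes):
    """Read only the capsules that contain the seed nodes."""
    needed = sorted({manifest[n] for n in seed_nodes})
    return [edge for cid in needed for edge in load_capsule(cid)]


result = lookup(["Aspirin"])
print(result)  # [('Aspirin', 'treats', 'Headache')]
print(loaded)  # ['c0']  (c1 and c2 were never touched)
```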

The golden rule: KGZip is a read replica

Your original graph (in a file, or in a database like Neo4j) is always the source of truth — the "master". KGZip builds a compressed copy from it for fast reads. KGZip never modifies your original data. If the KGZip store is ever lost or corrupted, you can always rebuild it from the master.

Is it lossless?

Yes. KGZip v1 is lossless: if you compress a graph and then ask for all of its nodes back, you get every node and every edge exactly as they were. Capsules store boundary-crossing edges (and a small "halo" of neighbouring nodes) precisely so that nothing is lost when the pieces are reassembled.
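
To see why storing boundary-crossing edges makes reassembly lossless, here is a small sketch. It assumes one simple scheme (each capsule keeps every edge that touches one of its members, and remembers the outside endpoints as its halo); KGZip's actual layout may differ, and all names are illustrative.

```python
edges = {
    ("A", "knows", "B"),
    ("B", "knows", "C"),   # crosses the capsule boundary
    ("C", "knows", "D"),
}
nodes = {"A", "B", "C", "D", "E"}   # "E" is an isolated node
partition = {"c0": {"A", "B", "E"}, "c1": {"C", "D"}}


def build_capsule(members):
    """Keep every edge touching a member, plus the halo of outside endpoints."""
    kept = {e for e in edges if e[0] in members or e[2] in members}
    halo = {n for s, _, d in kept for n in (s, d)} - members
    return {"members": set(members), "halo": halo, "edges": kept}


def reassemble(capsule_list):
    """Union the pieces back together; duplicated boundary edges collapse."""
    all_nodes = set().union(*(c["members"] for c in capsule_list))
    all_edges = set().union(*(c["edges"] for c in capsule_list))
    return all_nodes, all_edges


built = {cid: build_capsule(m) for cid, m in partition.items()}
restored_nodes, restored_edges = reassemble(list(built.values()))

print(built["c0"]["halo"])                               # {'C'}
print(restored_nodes == nodes, restored_edges == edges)  # True True
```

The boundary edge B → C is stored in both capsules, so neither capsule needs the other to describe its own surroundings, and the duplicate disappears when the sets are merged.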


Install

pip install kgzip

# optional: to read directly from a Neo4j database
pip install "kgzip[neo4j]"

From source (for development):

git clone <repo-url> && cd KGZip
pip install -e ".[dev]"
pytest                       # run the test suite

Requires Python ≥ 3.8. Works in plain scripts and in Jupyter notebooks.


Quickstart

A few lines are enough to compress a graph and query it:

import networkx as nx
from kgzip import KGZipStore

store = KGZipStore("./my_store")              # 1. where the compressed store lives
store.compress(nx.karate_club_graph())        # 2. build capsules from a graph
result = store.query(["0", "1"], depth=2)     # 3. ask: what's around nodes 0 and 1?
print(result.subgraph.meta.node_count)        # 4. how many nodes came back

What just happened, line by line:

  1. KGZipStore("./my_store") — open (or prepare to create) a store in that folder. Nothing is read or written yet.
  2. compress(...) — read the graph, cluster it into capsules, and write the capsule files plus a manifest into ./my_store.
  3. query(["0","1"], depth=2) — find the capsules containing nodes "0" and "1", expand outward depth hops, decode just those capsules, and merge them.
  4. The answer is a QueryResult; its .subgraph is a normal graph you can inspect.

Loading data from different sources

compress() accepts several common graph formats. You don't convert anything yourself — KGZip detects the type and reads it.

store.compress(nx.karate_club_graph())   # a NetworkX graph object (in memory)
store.compress("graph.ttl")              # RDF / Turtle file  (.ttl, .n3, .nt)
store.compress("graph.jsonld")           # JSON-LD file
store.compress("edges.csv")              # CSV edge list (see format below)
store.compress("bolt://localhost:7687")  # a live Neo4j database (see below)

CSV edge-list format

The simplest way to bring your own data. One row per edge:

src,dst,relation,weight
Aspirin,Headache,treats,1.0
Headache,GeneX,associated_with,1.0
  • src, dst — required: the two node IDs the edge connects.
  • relation — optional: the edge type (defaults to related_to).
  • weight — optional: a number (defaults to 1.0).
  • src_type, dst_type — optional: node categories (default unknown).
  • Any other columns are kept as edge attributes.
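
The column rules above can be sketched with the standard csv module. This is not KGZip's own reader: the function name `read_edge_list`, the extra `note` column, and the choice to treat an empty cell like a missing one are all assumptions made for the example.

```python
import csv
import io

SAMPLE = """src,dst,relation,weight,note
Aspirin,Headache,treats,0.9,example
Headache,GeneX,,,
"""

KNOWN = {"src", "dst", "relation", "weight", "src_type", "dst_type"}


def read_edge_list(text):
    """Parse a CSV edge list, applying the documented defaults."""
    reader = csv.DictReader(io.StringIO(text))
    missing = {"src", "dst"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    edges = []
    for row in reader:
        edges.append({
            "src": row["src"],
            "dst": row["dst"],
            "relation": row.get("relation") or "related_to",
            "weight": float(row.get("weight") or 1.0),
            "src_type": row.get("src_type") or "unknown",
            "dst_type": row.get("dst_type") or "unknown",
            # any other non-empty column is kept as an edge attribute
            "attrs": {k: v for k, v in row.items() if k not in KNOWN and v},
        })
    return edges


rows = read_edge_list(SAMPLE)
print(rows[0]["weight"], rows[0]["attrs"])     # 0.9 {'note': 'example'}
print(rows[1]["relation"], rows[1]["weight"])  # related_to 1.0
```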

Reading directly from Neo4j

Neo4j is a popular graph database. KGZip can read a full snapshot of it over Bolt (Neo4j's network connection protocol — the bolt:// address is just "where the database is listening"). You supply the connection URL and your login:

from kgzip import KGZipStore

# If your Neo4j has no authentication:
store = KGZipStore("./my_store")
store.compress("bolt://localhost:7687")

Most Neo4j databases require a username and password. Pass them via the store's IngestionConfig:

from kgzip import KGZipStore
from kgzip.models import KGZipConfig, IngestionConfig

config = KGZipConfig(
    ingestion=IngestionConfig(
        neo4j_auth=("neo4j", "your-password"),  # (username, password)
        neo4j_database=None,                    # database name; None = server default
        neo4j_node_label=None,                  # only nodes with this label; None = all
    ),
)
store = KGZipStore("./my_store", config)
store.compress("bolt://localhost:7687")          # one-time snapshot read + compress

KGZip reads every node (id(n) becomes the node ID, the first label becomes the node type, properties become attributes) and every relationship. It only reads — your Neo4j data is never changed.


How a query works (the mental model)

   compress() once:                 query() many times:

   master graph                     query(["X"], depth=2)
        │                                  │
        ▼                                  ▼
   ┌──────────┐                     find capsule holding "X"
   │ capsules │  ◄──── reads only ──── + its neighbour capsules
   │ + manifest                          │
   └──────────┘                          ▼
   (on local disk)                  decode those capsules, merge
                                          │
                                          ▼
                                     QueryResult.subgraph
  • depth controls how far out from your seed nodes to reach. depth=1 is "the seed nodes and their immediate surroundings"; higher depth pulls in more.
  • KGZip retrieves at capsule granularity — it returns whole clusters, so the result is a superset of the exact neighbourhood (great recall; some extra nodes). Asking for all nodes always returns the complete original graph (lossless).
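
The difference between the exact neighbourhood and a capsule-granular result can be shown on a five-node toy graph. This sketch assumes one plausible reading of the behaviour described above (every capsule touched by the depth-hop neighbourhood is returned whole); the function names and the graph are illustrative.

```python
from collections import deque

# Undirected toy graph as an adjacency map: X - A - B - C - D.
adjacency = {
    "X": {"A"}, "A": {"X", "B"}, "B": {"A", "C"},
    "C": {"B", "D"}, "D": {"C"},
}
# Two capsules: node -> capsule ID.
capsule_of = {"X": "c0", "A": "c0", "B": "c0", "C": "c1", "D": "c1"}


def exact_neighbourhood(seeds, depth):
    """Breadth-first search: every node within `depth` hops of a seed."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue
        for nxt in adjacency[node] - seen:
            seen.add(nxt)
            frontier.append((nxt, dist + 1))
    return seen


def capsule_neighbourhood(seeds, depth):
    """Return all nodes of every capsule touched by the exact neighbourhood."""
    touched = {capsule_of[n] for n in exact_neighbourhood(seeds, depth)}
    return {n for n, cid in capsule_of.items() if cid in touched}


exact = exact_neighbourhood(["X"], depth=1)
coarse = capsule_neighbourhood(["X"], depth=1)
print(sorted(exact))    # ['A', 'X']
print(sorted(coarse))   # ['A', 'B', 'X']
print(exact <= coarse)  # True
```

Node B is two hops from X, so it is outside the exact depth-1 neighbourhood, but it comes back anyway because it shares a capsule with X and A.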

API reference

KGZipStore is the only class you need. Everything else is internal.

Creating a store

KGZipStore(path, config=None)
  • path — folder for the compressed store (created on first compress()).
  • config — optional KGZipConfig to tune clustering/compression (see below).
  • The manifest is loaded lazily (on your first query()), so creating a store is instant and does no I/O.
  • Works as a context manager: with KGZipStore(path) as store: ....

compress(graph, *, config=None) → CapsuleStoreRef

Builds the compressed store from a graph. Accepts any supported source (NetworkX object, file path, or bolt:// URL). Steps it runs for you: ingest → cluster → encode → write capsules → write manifest (written last, as the safe commit point).
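
Writing the manifest last is a common way to get a safe commit point. The sketch below shows the general pattern with the standard library (write data files, then publish the index with an atomic rename). It is an assumption about how such a commit can be done, not a copy of KGZip's code; the file names and `write_store` are hypothetical.

```python
import json
import os
import tempfile


def write_store(folder, capsules):
    """Write capsule files first, then publish the manifest atomically."""
    os.makedirs(folder, exist_ok=True)
    index = {}
    for capsule_id, payload in capsules.items():
        name = f"{capsule_id}.bin"   # hypothetical file name
        with open(os.path.join(folder, name), "wb") as fh:
            fh.write(payload)
        index[capsule_id] = name

    # Write the manifest to a temp file, then rename it into place.
    # os.replace is atomic on the same filesystem, so a reader sees either
    # the old manifest or the complete new one, never a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"capsules": index}, fh)
        fh.flush()
        os.fsync(fh.fileno())
    final_path = os.path.join(folder, "manifest.json")
    os.replace(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as folder:
    path = write_store(folder, {"c0": b"alpha", "c1": b"beta"})
    with open(path, encoding="utf-8") as fh:
        manifest = json.load(fh)
    print(sorted(manifest["capsules"]))  # ['c0', 'c1']
```

If the process dies before the rename, the capsule files exist but no manifest points at them, so the store still looks like its previous state.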

  • Returns a CapsuleStoreRef describing the new store: manifest_path, capsule_count, total_bytes, gcs_summary, store_version, created_at.
  • Idempotent by default (overwrite=False): re-compressing the same graph skips capsules whose content hasn't changed.
  • Thread/process-safe: takes a file lock so two compresses can't clobber each other.
ref = store.compress("edges.csv")
print(ref.capsule_count, ref.total_bytes)
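
Skipping unchanged capsules can be sketched with a content hash. This assumes the check compares a SHA-256 digest of each capsule's canonical content against the digest recorded last time; the helper names and the JSON canonicalisation are illustrative, not KGZip's actual mechanism.

```python
import hashlib
import json


def content_hash(capsule):
    """Hash a canonical JSON form so equal content gives an equal digest."""
    canonical = json.dumps(capsule, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_writes(new_capsules, known_hashes, overwrite=False):
    """Split capsule IDs into those to (re)write and those to skip."""
    to_write, skipped = [], []
    for capsule_id, capsule in new_capsules.items():
        unchanged = known_hashes.get(capsule_id) == content_hash(capsule)
        if unchanged and not overwrite:
            skipped.append(capsule_id)
        else:
            to_write.append(capsule_id)
    return to_write, skipped


first = {"c0": {"edges": [["A", "B"]]}, "c1": {"edges": [["C", "D"]]}}
known = {cid: content_hash(c) for cid, c in first.items()}

second = {"c0": {"edges": [["A", "B"]]}, "c1": {"edges": [["C", "E"]]}}
print(plan_writes(second, known))                  # (['c1'], ['c0'])
print(plan_writes(second, known, overwrite=True))  # (['c0', 'c1'], [])
```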

query(node_ids, depth=1, **kwargs) → QueryResult

Fetch the subgraph around one or more seed nodes.

  • node_ids — list of node IDs to start from (must be non-empty).
  • depth — how many hops to expand. depth=1 is the seeds and their immediate surroundings; higher pulls in more. depth=None = unbounded (follow the graph until nothing new is reachable — the whole connected subgraph).
  • Optional keyword arguments:
    • trim: bool = False — token control. False returns the full capsule contents (a superset of the neighbourhood — more context). True prunes the result down to the exact depth-hop neighbourhood of your seeds. Trimming is lossless relative to the query (it never drops anything within depth hops) and can cut output ~100× on large graphs. See Saving tokens.
    • max_capsules: int = 50 — safety cap on how many capsules one query may load. Set higher, or None, to fetch large/complete subgraphs. If the cap limits a result, QueryResult.truncated is set to True (never a silent partial answer).
    • relation_filter: list[str] — keep only edges of these relation types.
    • consistency: "eventual" | "strict" — "strict" re-fetches stale parts from the master via master_kg_fn instead of serving possibly-stale capsule data.
    • timeout_ms: int — max time to wait for parallel decoding (default 5000).
    • master_kg_fn: Callable — required when consistency="strict"; you write a function node_ids -> fresh subgraph that fetches from your master.
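
The max_capsules contract (cap the work, but never hide that the cap was hit) can be sketched as follows. `ToyResult` and `select_capsules` are hypothetical stand-ins, and which capsules are kept when the cap applies is an assumption of this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResult:
    capsule_ids: list = field(default_factory=list)
    truncated: bool = False


def select_capsules(candidates, max_capsules=50):
    """Apply the cap and record whether anything was left out."""
    if max_capsules is None or len(candidates) <= max_capsules:
        return ToyResult(list(candidates), truncated=False)
    return ToyResult(list(candidates[:max_capsules]), truncated=True)


wanted = [f"c{i}" for i in range(60)]
capped = select_capsules(wanted)                   # default cap of 50
full = select_capsules(wanted, max_capsules=None)  # no cap
print(len(capped.capsule_ids), capped.truncated)   # 50 True
print(len(full.capsule_ids), full.truncated)       # 60 False
```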

Returns a QueryResult:

| Field | Meaning |
|---|---|
| subgraph | the merged result graph (a NormalizedGraph) |
| capsules_loaded | how many capsules were read |
| latency_ms | how long the query took |
| stale_capsules | IDs of capsules flagged stale |
| fallback_used | True if the master was consulted (strict mode) |
| query_node_ids_not_found | seed IDs that weren't in the store |
| truncated | True if max_capsules limited the result (it's incomplete) |

# Token-lean: exact 2-hop neighbourhood, only "treats" edges
res = store.query(["Aspirin"], depth=2, trim=True, relation_filter=["treats"])

# Agent escape hatch: not satisfied? fetch everything reachable, no caps
res = store.query(["Aspirin"], depth=None, max_capsules=None)
if res.truncated:
    print("result was capped — raise max_capsules")

Iterative deepening (for AI agents)

The defaults are safe: unless the max_capsules cap is hit (which sets truncated=True), you never get less than the true neighbourhood. An agent can start cheap and widen only when needed:

res = store.query(seeds, depth=1, trim=True)      # cheap, few tokens
if not_enough(res):
    res = store.query(seeds, depth=3, trim=True)   # go deeper
if still_not_enough(res):
    res = store.query(seeds, depth=None, max_capsules=None)  # the whole reachable graph

sync(master_graph=None) → SyncReport

Keep the store consistent with a changed master.

  • sync() with no argument → marks all capsules stale (they'll be treated as out-of-date until rebuilt).
  • sync(updated_graph) → re-compresses the store from the updated graph.
  • Returns a SyncReport: stale_count, re_encoded_count, skipped_count, sync_duration_ms.

status() → StoreStatus

A safe, never-raises health check.

  • Returns StoreStatus: exists, capsule_count, stale_count, total_bytes, store_version, last_encoded_at.
  • exists=False means nothing has been compressed yet.
if not store.status().exists:
    store.compress(my_graph)

Configuration

Tune how KGZip clusters and compresses. Defaults are sensible — change these only if you need to.

from kgzip.models import KGZipConfig, DecisionConfig, StorageConfig

config = KGZipConfig(
    decision=DecisionConfig(
        max_capsule_nodes=500,   # biggest a capsule may get (bigger ones are split)
        min_capsule_nodes=5,     # smallest; tiny clusters merge into a neighbour
        spectral_k=8,            # size of each capsule's structural "fingerprint"
        random_seed=42,          # makes clustering reproducible
    ),
    storage=StorageConfig(
        base_path="./my_store",
        compression="zstd",      # "zstd" (best) | "gzip" | "none"
        compression_level=3,     # 1–19 for zstd (higher = smaller, slower)
        overwrite=False,         # True = always re-encode, even if unchanged
    ),
)
store = KGZipStore("./my_store", config)

Errors

Every error KGZip raises is a subclass of kgzip.KGZipError and carries a message plus a context dict for debugging. Common ones:

| Exception | When |
|---|---|
| EmptyGraphError | the input graph has no nodes |
| SchemaError | a CSV is missing required src/dst columns |
| SoftDependencyError | an optional library (e.g. neo4j) isn't installed |
| ConnectionError | a Neo4j database couldn't be reached |
| StoreNotFoundError | you queried before compressing |
| CorruptionError / VersionError | a capsule file is damaged or wrong version |
| QueryError | bad query input (e.g. empty node_ids) |

from kgzip import KGZipError
try:
    store.query([], depth=1)
except KGZipError as e:
    print(e.message, e.context)
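
For readers new to exception hierarchies, this is roughly what "a base class with a message and a context dict" looks like. The classes below are hypothetical stand-ins written for this example, not KGZip's real ones.

```python
class ToyStoreError(Exception):
    """Hypothetical base class: a message plus a context dict."""

    def __init__(self, message, **context):
        super().__init__(message)
        self.message = message
        self.context = context


class ToyQueryError(ToyStoreError):
    """Raised for bad query input, such as an empty seed list."""


def check_seeds(node_ids):
    """Reject an empty seed list, attaching the offending value as context."""
    if not node_ids:
        raise ToyQueryError("node_ids must be non-empty", received=node_ids)
    return list(node_ids)


try:
    check_seeds([])
except ToyStoreError as err:   # catching the base class catches every subclass
    caught = (type(err).__name__, err.message, err.context)
    print(caught)
    # ('ToyQueryError', 'node_ids must be non-empty', {'received': []})
```

Catching the base class is the convenient default; catch a specific subclass only when you want to handle that case differently.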

Saving tokens

If you feed query results to an LLM/agent, the number of tokens matters. Two levers, both lossless (they remove waste, not information you asked for):

1. trim=True — return only the exact neighbourhood. Without it, a query returns the seed's whole community capsule (lots of extra context). With it, you get exactly the depth-hop neighbourhood.

2. Compact serialization — render the subgraph as terse triples instead of verbose JSON, with optional attribute projection:

from kgzip import to_triples, to_compact

res = store.query(["Aspirin"], depth=2, trim=True)

print(to_triples(res.subgraph))
# Aspirin --treats--> Headache
# Headache --associated_with--> GeneX

to_compact(res.subgraph)                                   # ids + types only (leanest)
to_compact(res.subgraph, attrs=["name"], include_attrs=True)  # keep only the 'name' attr

Measured on a 1,000-node medical KG, average tokens for a single depth-2 query (chars/4 estimate):

| Strategy | tokens | vs naive |
|---|---|---|
| Full-capsule result, verbose JSON | 54,114 | (baseline) |
| Full-capsule result, compact triples | 26,191 | 2.1× less |
| trim=True + compact triples (= exact neighbourhood) | 367 | ~147× less |
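
The chars/4 estimate is easy to reproduce on your own data. The sketch below compares indented JSON with arrow-style triples for the two example edges; it builds the strings by hand rather than calling KGZip's serializers, and `estimate_tokens` is a helper defined only for this example.

```python
import json

edges = [
    {"source": "Aspirin", "relation": "treats",
     "target": "Headache", "weight": 1.0},
    {"source": "Headache", "relation": "associated_with",
     "target": "GeneX", "weight": 1.0},
]


def estimate_tokens(text):
    """Rough estimate: about four characters per token."""
    return len(text) // 4


verbose = json.dumps({"edges": edges}, indent=2)
compact = "\n".join(
    f"{e['source']} --{e['relation']}--> {e['target']}" for e in edges
)

print(compact)
print(estimate_tokens(verbose), estimate_tokens(compact))
```

The estimate is crude (real tokenizers vary), but it is enough to compare two renderings of the same subgraph.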

The trimmed output equals what a precise Neo4j neighbourhood query would return — so you get targeted-query token cost plus KGZip's storage/offline benefits. If you need more, just widen depth or set trim=False; nothing is lost, it's your choice.


When should I (not) use KGZip?

  • ✅ Your graph is large and you mostly read local neighbourhoods.
  • ✅ You want a smaller on-disk representation than raw JSON.
  • ✅ Your graph doesn't fit comfortably in memory, so you must read from storage.
  • ❌ Your graph is small and fits in RAM, and you traverse it repeatedly in-process — plain in-memory traversal (e.g. NetworkX) will be faster. KGZip's wins are storage size and avoiding full-graph loads, not beating RAM-speed traversal.

How it works (under the hood)

KGZip is built as five layers; you only ever touch the last one (KGZipStore).

  1. Ingestion (L1) — any input → a clean, immutable NormalizedGraph with unique string node IDs.
  2. Decision (L2) — analyse the graph, detect communities (clusters of densely connected nodes via the Louvain algorithm), and plan one capsule per community.
  3. Encoding (L3) — write each capsule as a compact binary .kgzc file (magic bytes, version, header, SHA-256 checksum, compressed payload). The manifest.kgz.json index is written last, as the atomic commit point.
  4. Query (L4) — use the manifest to find the right capsules, decode them in parallel, verify their checksums, and merge.
  5. Facade (L5) — KGZipStore ties it together with locking and lazy loading.
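
The encoding and query layers can be sketched end to end with the standard library. Everything specific here is an assumption: the magic bytes, the header field order, and the JSON payload are invented for the example, and zlib stands in for zstd because zstd is not in the standard library. It shows the general shape (framed payload, checksum check, parallel decode, merge), not the real .kgzc format.

```python
import hashlib
import json
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

MAGIC = b"TOYC"                      # hypothetical magic bytes
VERSION = 1
HEADER = struct.Struct("<4sHI32s")   # magic, version, payload length, SHA-256


def encode_capsule(edges):
    """Frame a compressed payload with magic, version, length and checksum."""
    payload = zlib.compress(json.dumps(edges).encode("utf-8"))
    digest = hashlib.sha256(payload).digest()
    return HEADER.pack(MAGIC, VERSION, len(payload), digest) + payload


def decode_capsule(blob):
    """Validate the header and checksum, then decompress the payload."""
    magic, version, length, digest = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:HEADER.size + length]
    if magic != MAGIC or version != VERSION:
        raise ValueError("not a capsule of the expected version")
    if len(payload) != length or hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: capsule is corrupt")
    return json.loads(zlib.decompress(payload))


def decode_many(blobs, timeout_s=5.0):
    """Decode capsules in parallel, then merge edges without duplicates."""
    with ThreadPoolExecutor() as pool:
        decoded = list(pool.map(decode_capsule, blobs, timeout=timeout_s))
    merged = []
    for edges in decoded:
        for edge in edges:
            if edge not in merged:
                merged.append(edge)
    return merged


blob_a = encode_capsule([["A", "knows", "B"], ["B", "knows", "C"]])
blob_b = encode_capsule([["B", "knows", "C"], ["C", "knows", "D"]])
print(decode_many([blob_a, blob_b]))
# [['A', 'knows', 'B'], ['B', 'knows', 'C'], ['C', 'knows', 'D']]

# Flip the last payload byte to simulate on-disk corruption.
tampered = blob_a[:-1] + bytes([blob_a[-1] ^ 0xFF])
try:
    decode_capsule(tampered)
except ValueError as err:
    print(err)  # checksum mismatch: capsule is corrupt
```

Verifying the checksum before decompressing means a damaged file is rejected up front instead of producing a silently wrong subgraph.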

Glossary

  • Node — a thing in the graph (has an ID, a type, and attributes).
  • Edge / relation — a directed connection between two nodes (has a type and weight).
  • Capsule — a cluster of related nodes, stored as one .kgzc file. The unit KGZip loads per query.
  • Manifest — manifest.kgz.json, the index that maps nodes to capsules.
  • Community — a group of nodes more densely connected to each other than to the rest of the graph; KGZip turns each into a capsule.
  • Boundary / halo node — a node on the edge of a capsule that also connects to a neighbouring capsule; stored in both so no edge is lost.
  • Depth — how many hops outward from your seed nodes a query reaches (None = unbounded).
  • Trim — prune a query result to the exact depth-hop neighbourhood (token-lean, lossless w.r.t. the query). Opt-in via trim=True.
  • Truncated — a query result flagged truncated=True because max_capsules capped it; the answer is incomplete and you should raise the cap.
  • Master — your original source-of-truth graph. KGZip never writes to it.
  • Lossless — compressing then querying everything returns the exact original graph.
  • Bolt — Neo4j's network protocol; a bolt://host:port URL is the database address.
