Skip to main content

Python bindings for ProllyTree - a probabilistic tree for efficient storage and retrieval

Project description

ProllyTree Python Bindings

Documentation PyPI

Python bindings to the Rust ProllyTree crate — a probabilistic B-tree with Merkle properties: a content-addressed, Git-versioned key-value store with branching, three-way merge, cryptographic proofs, optional SQL, and an optional vector / text-search index.

A prolly tree's shape is a deterministic function of its contents, so two replicas holding the same key-value set converge to the same root hash regardless of insertion order. That property is what makes the rest — Git-style versioning, efficient diff/sync between replicas, verifiable subtree sharing across history — fall out for free.

Quick Start

Installation

pip install prollytree

PyPI wheels ship git, sql, rocksdb_storage, proximity, and proximity_text enabled by default — text search and the bundled MiniLM embedder are available out of the box.

Basic tree

from prollytree import ProllyTree

tree = ProllyTree()
tree.insert(b"hello", b"world")
tree.find(b"hello")                            # b"world"

proof = tree.generate_proof(b"hello")
tree.verify_proof(proof, b"hello", b"world")   # True

Versioned KV store (one key space)

from prollytree import VersionedKvStore

store = VersionedKvStore("./data")
store.insert(b"config:theme", b"light")
store.commit("seed config")

store.create_branch("experiment")
store.update(b"config:theme", b"dark")
store.commit("dark mode")
store.checkout("main")                          # back to light

Namespaced KV store + optional text search

from prollytree import NamespacedKvStore, MiniLmEmbedder

store = NamespacedKvStore("./data")
store.text_index_open("docs", "by_body", MiniLmEmbedder())
store.set_cascade("docs", ["by_body"])         # primary writes auto-index

store.ns_insert("docs", b"doc:1", b"the quick brown fox")
store.commit("seed corpus")

for doc_id, dist in store.text_index_search("docs", "by_body", "vulpine animal", k=3):
    print(doc_id, dist, store.ns_get("docs", doc_id))

Documentation

Complete Documentation

The full documentation includes:

Features

  • Probabilistic B-tree with Merkle properties — O(log n) ops, cryptographic inclusion proofs
  • Git-versioned KV store — branch / commit / diff / three-way merge on raw key-value state
  • Namespaced KV store — many isolated prolly trees in one Git repo, atomic across namespaces
  • Optional text / vector search — versioned ANN index inside any namespace; bundled MiniLM, hash, and Python-callable embedders
  • Cascade + drift management — atomic dual-write of primary + index, audit + repair APIs
  • Large-value externalization — values above a threshold land in content-addressed blobs
  • Multiple storage backends — In-memory, File, RocksDB, Git-backed
  • SQL interface — query the tree as relational tables via GlueSQL

Good fits

  • Auditable application state — config systems, feature flags, policy rules: real Git history with diff, blame, rollback, and proofs for free.
  • Distributed / multi-replica data — convergent root hashes + subtree sharing make peer-to-peer sync O(changes).
  • AI agent memory — per-agent namespaces, branchable scratch spaces, semantic recall in one transaction. See the text-search guide.
  • Versioned analytical datasets — SQL over a Git-tracked KV store; checkout a historical commit and run the same query.
  • Content-addressed indexes — verifiable logs, proof systems, gossip-friendly indexes.

Key Use Cases

Versioned Storage

from prollytree import VersionedKvStore, StorageBackend

# Default Git backend (recommended for full version control)
store = VersionedKvStore("./data")

# Or explicitly choose a storage backend
store = VersionedKvStore("./data", StorageBackend.Git)      # Full git versioning
store = VersionedKvStore("./data", StorageBackend.File)     # File-based storage
store = VersionedKvStore("./data", StorageBackend.InMemory) # In-memory (volatile)
store = VersionedKvStore("./data", StorageBackend.RocksDB)  # RocksDB (requires rocksdb_storage feature)

# Basic operations
store.insert(b"config", b"production_settings")
commit_id = store.commit("Add production config")

# Branch and experiment
store.create_branch("experiment")
store.insert(b"feature", b"experimental_data")
store.commit("Add experimental feature")

# Merge branches (Git backend only)
store.checkout("main")
store.merge("experiment")

# Diff between branches (Git backend only)
diffs = store.diff("main", "experiment")
for diff in diffs:
    print(f"Key: {diff.key}, Operation: {diff.operation}")

# Cryptographic verification on versioned data
proof = store.generate_proof(b"config")
is_valid = store.verify_proof(proof, b"config", b"production_settings")

Namespaced Storage

NamespacedKvStore is the multi-tree counterpart of VersionedKvStore. Each namespace owns its own prolly tree, but every namespace shares a single git history — commit, branch, and checkout move every namespace together.

from prollytree import NamespacedKvStore

store = NamespacedKvStore("./data")

# Per-namespace primary KV writes. Each namespace owns its own key space —
# the same key in two namespaces resolves independently.
store.ns_insert("users",    b"u:alice", b"Alice")
store.ns_insert("settings", b"theme",   b"dark")
store.commit("seed users + settings")        # one commit, both namespaces

store.branch("experiment")                   # create + switch
store.ns_insert("settings", b"theme", b"light")
store.commit("flip theme on experiment")

store.checkout("main")
store.ns_get("settings", b"theme")           # b"dark" again
store.list_namespaces()                      # ['users', 'settings', ...]

Migrating from VersionedKvStore is mostly mechanical — store.insert(k, v) becomes store.ns_insert(namespace, k, v). The branching API is store.branch

  • store.checkout (note: current_branch is a property, not a method). See python/examples/namespaced_example.py for a complete walkthrough.

Vector / Text Search

Any namespace can own zero or more text sub-indexes. A text index turns documents into vectors via a configurable embedder and gives you top-k similarity search that is versioned alongside the primary tree — branching and merging cover both the primary tree and every sub-index atomically.

The primary KV tree is the source of truth; the text index stores only (id, vector) pairs. Always write the document body into the primary tree too — either explicitly or by enabling cascade — so you can resolve search hits back to text and reindex if the embedder ever changes.

from prollytree import NamespacedKvStore, MiniLmEmbedder

store = NamespacedKvStore("./data")
emb = MiniLmEmbedder()                       # bundled Candle + all-MiniLM-L6-v2

# text_index_open creates or re-opens the index. The embedder's id + version
# are persisted; opening with a mismatched embedder raises a clear error.
store.text_index_open("docs", "by_body", emb)

# Dual write: primary tree (source of truth) + text index (pointer).
docs = {
    b"doc:1": "the quick brown fox",
    b"doc:2": "lazy dog asleep on the mat",
}
for doc_id, text in docs.items():
    store.ns_insert("docs", doc_id, text.encode())
    store.text_index_insert("docs", "by_body", doc_id, text)
store.commit("seed corpus")

# Search returns (id_bytes, distance); resolve back to text via the primary.
for doc_id, score in store.text_index_search("docs", "by_body", "vulpine animal", k=5):
    body = store.ns_get("docs", doc_id).decode()
    print(f"{doc_id} (d={score:.3f}): {body}")

Three embedder options are bundled:

from prollytree import HashEmbedder, MiniLmEmbedder, CallableEmbedder

HashEmbedder(dim=384, seed=0)                # deterministic, ML-free; tests / demos
MiniLmEmbedder()                             # bundled Candle + MiniLM-L6-v2 (semantic)
CallableEmbedder(                            # wrap any Python function
    id="openai:text-embedding-3-small",
    version="2024-01",
    dim=1536,
    embed_fn=my_openai_embed,
)

Cascade mode replaces the dual-write with a single ns_insert — the registered text indexes auto-mirror every primary write (and primary delete):

store.text_index_open("docs", "by_body", emb)
store.set_cascade("docs", ["by_body"])       # opt-in, per namespace

# One call now writes to both the primary tree AND the text index.
store.ns_insert("docs", b"doc:3", b"branching is a first-class operation")
store.commit("cascade-driven indexing")

Other knobs:

  • chunker="line" splits each document on \n and indexes per-line; search dedups results back to the document id.
  • audit_text_index(ns, idx) returns {orphans_in_index, missing_from_index, is_in_sync} to detect drift; purge_text_index_orphans(ns, idx) repairs it.
  • set_externalize_threshold(n) + gc_blobs() push large values into a blob store and garbage-collect unreferenced blobs (File / RocksDB backends).

Feature-availability flags let callers fall back gracefully:

import prollytree as p
if p.proximity_text_available:
    emb = p.MiniLmEmbedder()
elif p.proximity_available:
    emb = p.HashEmbedder(384, 0)
else:
    raise RuntimeError("wheel built without proximity features")

See python/examples/text_index_example.py for a runnable walkthrough covering cascade, multi-chunk indexing, drift repair, and every embedder.

SQL Queries

from prollytree import ProllySQLStore

sql_store = ProllySQLStore("./database")
sql_store.execute("CREATE TABLE users (id INT, name TEXT)")
sql_store.execute("INSERT INTO users VALUES (1, 'Alice')")
results = sql_store.execute("SELECT * FROM users WHERE name = 'Alice'")

Probabilistic Trees (raw building block)

When you need the verifiable B-tree without the versioning layer.

from prollytree import ProllyTree

tree = ProllyTree()
tree.insert(b"user:123", b"Alice")
tree.insert(b"user:456", b"Bob")

# Cryptographic verification
proof = tree.generate_proof(b"user:123")
is_valid = tree.verify_proof(proof, b"user:123", b"Alice")

Development

Building from Source

git clone https://github.com/zhangfengcdt/prollytree
cd prollytree
./python/build_python.sh --all-features --install

Running Tests

cd python/tests
python test_prollytree.py

License

Licensed under the Apache License, Version 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

prollytree-0.4.0.tar.gz (450.3 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

prollytree-0.4.0-cp38-abi3-win_amd64.whl (10.2 MB view details)

Uploaded CPython 3.8+Windows x86-64

prollytree-0.4.0-cp38-abi3-manylinux_2_28_x86_64.whl (13.2 MB view details)

Uploaded CPython 3.8+manylinux: glibc 2.28+ x86-64

prollytree-0.4.0-cp38-abi3-manylinux_2_28_aarch64.whl (8.6 MB view details)

Uploaded CPython 3.8+manylinux: glibc 2.28+ ARM64

prollytree-0.4.0-cp38-abi3-macosx_11_0_arm64.whl (10.5 MB view details)

Uploaded CPython 3.8+macOS 11.0+ ARM64

File details

Details for the file prollytree-0.4.0.tar.gz.

File metadata

  • Download URL: prollytree-0.4.0.tar.gz
  • Upload date:
  • Size: 450.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for prollytree-0.4.0.tar.gz
Algorithm Hash digest
SHA256 dae96148628c279aa989163fe5e98dbe1bf6491470df74e3eb6ca2c6f5e7141b
MD5 6552b9a4a4956536ac9b7ae0f2d23c07
BLAKE2b-256 a993d1ba15f9dfea0b2c1e5bf191c5b4dc3e22f5a072366ffb19e67193d50811

See more details on using hashes here.

Provenance

The following attestation bundles were made for prollytree-0.4.0.tar.gz:

Publisher: release.yml on zhangfengcdt/prollytree

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file prollytree-0.4.0-cp38-abi3-win_amd64.whl.

File metadata

  • Download URL: prollytree-0.4.0-cp38-abi3-win_amd64.whl
  • Upload date:
  • Size: 10.2 MB
  • Tags: CPython 3.8+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for prollytree-0.4.0-cp38-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 95ea67e404a40dcba3ba7e5f76bc93be3d0a381c422244e190af486184aa8de3
MD5 b365ac2df61d41cb2e325237e437449f
BLAKE2b-256 4a24b760fe9c0f89c652807f92c053a7e5028ae8b7cdd0affaae716372b9f0ad

See more details on using hashes here.

Provenance

The following attestation bundles were made for prollytree-0.4.0-cp38-abi3-win_amd64.whl:

Publisher: release.yml on zhangfengcdt/prollytree

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file prollytree-0.4.0-cp38-abi3-manylinux_2_28_x86_64.whl.

File metadata

File hashes

Hashes for prollytree-0.4.0-cp38-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 5c99d26fd9fa9d0eb7a521868e7c9c5bb1b70799fadbb8551733696a556c32f0
MD5 fdfbd2a56d722d801f29fae67b21d713
BLAKE2b-256 147ad661155287162d290c87d897c103748c4b0c1ac980b9c46b4ccc2bffc8f5

See more details on using hashes here.

Provenance

The following attestation bundles were made for prollytree-0.4.0-cp38-abi3-manylinux_2_28_x86_64.whl:

Publisher: release.yml on zhangfengcdt/prollytree

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file prollytree-0.4.0-cp38-abi3-manylinux_2_28_aarch64.whl.

File metadata

File hashes

Hashes for prollytree-0.4.0-cp38-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 86b47ee643379d9dbe15c8c9cb34b41aca96bb8d13cec3a0b64f418c98bc07a6
MD5 72125fb1e984aaf08fa814547ed2e1ab
BLAKE2b-256 61d694842ff47849296a8b63f5b1862c6539bffdfb9f537d06c7fcfe548f43ac

See more details on using hashes here.

Provenance

The following attestation bundles were made for prollytree-0.4.0-cp38-abi3-manylinux_2_28_aarch64.whl:

Publisher: release.yml on zhangfengcdt/prollytree

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file prollytree-0.4.0-cp38-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for prollytree-0.4.0-cp38-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 01f131e675b5f3a1ffeb25dbd38a14f62d80c39494f2868ade4f467f13e65f82
MD5 782de74816f48f76f54b02c8a9fb361d
BLAKE2b-256 ded72fc47bcd98352863a42fb3006a84dbd724c3906fcffd68c1a3b231c200f3

See more details on using hashes here.

Provenance

The following attestation bundles were made for prollytree-0.4.0-cp38-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on zhangfengcdt/prollytree

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page