GPU-accelerated version-controlled vector database

These details have not been verified by PyPI

Project links

Project description

GitDB

GPU-accelerated version-controlled vector database. No server. No Docker. No config. pip install gitdb and go.

Why GitDB

Version control is native — git log, git diff, git branch, git merge for your vectors. Nobody else has this.
Time-travel queries — db.query_text("revenue", at="v1.0") searches an old snapshot. Chroma can't do this.
CEPH CRUSH placement — deterministic data routing without a coordinator. Scales horizontally by adding peers.
P2P distributed — no central server. Peers sync over SSH like git remotes. Each node is a full replica.
FoundationDB features — hooks, transactions, watches, secondary indexes, schema enforcement. Enterprise-grade.
Spreading activation — background semantic priming makes frequent queries instant. Multi-hop association chains emerge naturally.
CLI-first — gitdb init && gitdb add --text "doc" && gitdb commit -m "init". Works from terminal like git.
Embedded — no server, no Docker, no config. pip install gitdb and go.

Git for tensors. SQLite for metadata. Arctic/NV-EmbedQA for semantics. CEPH CRUSH placement. P2P distributed. FoundationDB-inspired hooks, transactions, watches, indexes, snapshots, schema enforcement. Native backup/restore. Universal ingest (SQLite, MongoDB, CSV, Parquet, PDF, text).

Install

pip install -e .                    # Core (torch + zstandard)
pip install -e ".[embed]"           # + Arctic Embed, NV-EmbedQA
pip install -e ".[dev]"             # + pytest + embedding models

Quick Start

# Create a store
gitdb init my_store --dim 1024

# Add text — auto-embeds with Arctic Embed L
gitdb add --text "quarterly revenue report" "employee handbook" "NDA template"
gitdb commit -m "Initial documents"

# Semantic search
gitdb query --text "financial documents" -k 5

# Structured query (SQLite-style)
gitdb select --where '{"category": "finance"}' --format table

# Time-travel
gitdb query --text "revenue" --at HEAD~3

# Branch, edit, merge
gitdb branch experiment
gitdb switch experiment
gitdb add --text "experimental data"
gitdb commit -m "Experiment"
gitdb switch main
gitdb merge experiment

Python API

from gitdb import GitDB

db = GitDB("my_store", dim=1024, device="mps")  # mps / cuda / cpu

# ── Add vectors ──────────────────────────────────────────
# From text (auto-embeds with Arctic)
db.add(texts=["quarterly revenue report", "employee handbook"])

# From tensors with metadata
db.add(
    embeddings=tensor,               # [N, 1024]
    documents=["doc1", "doc2"],
    metadata=[{"category": "finance", "score": 0.9}, {"category": "hr"}],
)

# From JSONL
db.import_jsonl("data.jsonl", embed_texts=True)

db.commit("Add documents")

# ── Query ────────────────────────────────────────────────
# Semantic search
results = db.query_text("financial projections", k=10)

# Hybrid: semantic + structured filter
results = db.query_text(
    "high value contracts",
    k=10,
    where={"value": {"$gt": 1000000}, "region": {"$in": ["UAE", "US"]}},
)

# Time-travel query
results = db.query_text("revenue", at="v1.0")

# ── Structured Queries (SQLite-style) ────────────────────
# SELECT with operators
rows = db.select(
    fields=["document", "category", "score"],
    where={"score": {"$gt": 0.8}},
    order_by="score",
    reverse=True,
    limit=20,
)

# GROUP BY
counts = db.group_by("category")
# → {"finance": 15, "legal": 8, "hr": 3}

averages = db.group_by("category", agg_field="score", agg_fn="avg")
# → {"finance": 0.85, "legal": 0.72}

# ── Git Operations ───────────────────────────────────────
db.commit("Add Q2 data")
db.branch("experiment")
db.switch("experiment")
db.merge("experiment", strategy="union")  # union / ours / theirs
db.cherry_pick("abc123")
db.revert("abc123")
db.stash("WIP")
db.stash_pop()
db.tag("v1.0")
db.log(limit=20)
db.diff("main", "feature")
db.blame()
db.bisect(test_fn)
db.rebase("main")
db.purge(where={"author": "intern"}, reason="cleanup")
db.reflog()
db.gc()

# ── Remotes ──────────────────────────────────────────────
db.remote_add("origin", "/path/to/remote")
db.push("origin", "main")
db.pull("origin", "main")
db.fetch("origin", "main")

# Pull requests
pr = db.pr_create("Add feature vectors", source_branch="feature")
db.pr_comment(pr.id, "Alice", "Looks good")
db.pr_merge(pr.id)

# ── Export / Import ──────────────────────────────────────
db.export_jsonl("backup.jsonl", include_embeddings=True)
db.import_jsonl("data.jsonl", embed_texts=True)

# ── FoundationDB-Inspired Features ─────────────────────
# Hooks
db.hook("pre-commit", lambda **ctx: True)
db.hook("post-commit", lambda **ctx: print(ctx["commit_hash"]))

# Transactions
with db.transaction() as tx:
    tx.add(texts=["doc1", "doc2"])
    tx.remove(where={"old": True})

# Watches
db.watches.watch_branch("main", lambda event, ctx: notify(ctx))
db.watches.watch({"category": "finance"}, lambda e, c: audit(c))

# Secondary indexes
db.create_index("category")
db.create_index("score", index_type="range")

# Schema enforcement
db.set_schema({"required": ["category"], "properties": {"category": {"type": "string"}}})

# Snapshots
snap = db.snapshot("v1")
snap.query(vector, k=10)

# ── Backup & Restore ───────────────────────────────────
db.backup("store.gitdb-backup")
db.backup_incremental("store_incr.gitdb-incr")
db.backup_verify()
db.backup_list()

# ── P2P Distributed ────────────────────────────────────
from gitdb.distributed import DistributedGitDB, Peer
ddb = DistributedGitDB(db, self_name="laptop")
ddb.add_peer(Peer("spark", "user@spark", shard_path="~/shard", weight=2.0))
ddb.add(texts=["doc1"], replicas=2)        # CRUSH-routed
ddb.query_distributed(vector, k=10)        # Scatter/gather
ddb.sync()                                 # Push/pull all peers

# ── Emirati AC (Spreading Activation) ───────────────────
db.ac.start()                        # Engine on, AC running
results = db.ac.primed(10)           # Pre-ranked vectors (instant)
drift = db.ac.drift()                # Semantic drift detection
stats = db.ac.stats()                # Activation stats
db.ac.stop()

Query Operators

Full structured query engine on metadata. Works standalone or combined with vector search.

Operator	Example	Description
equality	`{"field": "value"}`	Exact match
`$gt`	`{"score": {"$gt": 0.8}}`	Greater than
`$gte`	`{"score": {"$gte": 0.8}}`	Greater than or equal
`$lt`	`{"score": {"$lt": 0.5}}`	Less than
`$lte`	`{"score": {"$lte": 0.5}}`	Less than or equal
`$ne`	`{"status": {"$ne": "draft"}}`	Not equal
`$in`	`{"region": {"$in": ["UAE", "US"]}}`	In list
`$nin`	`{"type": {"$nin": ["test"]}}`	Not in list
`$contains`	`{"tags": {"$contains": "urgent"}}`	String/list contains
`$regex`	`{"document": {"$regex": "^Revenue"}}`	Regex match
`$exists`	`{"score": {"$exists": true}}`	Field exists
`$and`	`{"$and": [{...}, {...}]}`	Logical AND
`$or`	`{"$or": [{...}, {...}]}`	Logical OR
`$not`	`{"$not": {"status": "draft"}}`	Logical NOT

Dotted paths ("nested.field.deep") and array indexing ("tags[0]") are supported.

Embedding Models

Model	Params	Dimensions	Matryoshka	Use Case
Arctic Embed L (default)	335M	1024	No	Fast, lightweight, strong retrieval
NV-EmbedQA-1B-v2	1.2B	384-2048	Yes	Max quality, configurable dimensions

# Use Arctic (default)
gitdb add --text "document"

# Use NV-EmbedQA with Matryoshka
gitdb add --text "document" --embed-model nv-embed-qa

# Re-embed entire store with different model
gitdb re-embed --model nv-embed-qa -c

# List available models
gitdb embed list

Matryoshka Dimensions (NV-EmbedQA)

Same model, different vector sizes. Trade storage for quality:

Dim	Size per vector	1M vectors	Quality loss
384	1.5 KB	1.5 GB	~5%
512	2 KB	2 GB	~3%
768	3 KB	3 GB	~1.5%
1024	4 KB	4 GB	~0.5%
2048	8 KB	8 GB	Baseline

Emirati AC (Spreading Activation)

Named after how Emiratis leave their car AC running so it's pre-chilled when they walk out. The GPU continuously pattern-matches your operations against the vector store. Vectors activate neighboring vectors through multi-hop association chains.

db.ac.start()  # Engine on

# As you browse, AC silently activates related vectors:
db.select(where={"type": "nda"})           # AC notices: NDA interest
# → activates confidentiality vectors      # 1 hop
# → activates IP protection vectors        # 2 hops
# → activates trade secret policies        # 3 hops

# Query — results already warm
results = db.ac.primed(10)  # Instant. Zero search time.

# Or boost normal queries with activation levels
results = db.query_text("data protection")  # AC-boosted ranking

# Drift detection
db.add(texts=["pizza menu", "catering prices"])
db.ac.drift()
# → {"magnitude": 0.42, "severity": "medium"}
# Recent additions diverging from branch centroid

Tuning knobs: POLL_INTERVAL, ACTIVATION_DECAY, SPREAD_FACTOR, NEIGHBOR_K, HOT_CACHE_SIZE — all configurable on db.ac.

Hooks (FoundationDB-inspired)

Pre/post event hooks for commit, merge, push, and drift. Pre-hooks can reject operations.

# Reject commits without metadata
def require_metadata(**ctx):
    db = ctx["db"]
    for meta in db.tree.metadata:
        if not meta.metadata:
            return False
    return True

db.hook("pre-commit", require_metadata)
db.hook("post-commit", lambda **ctx: print(f"Committed: {ctx['commit_hash'][:8]}"))
db.hook("pre-merge", lambda **ctx: ctx["branch"] != "protected")
db.hook("on-drift", lambda **ctx: alert(ctx["magnitude"]))

# List / clear
db.hooks.list_hooks()
db.hooks.clear("pre-commit")

Events: pre-commit, post-commit, pre-merge, post-merge, pre-push, post-push, on-drift

Transactions

Atomic multi-operation batches. All succeed or all roll back.

with db.transaction() as tx:
    tx.add(texts=["new document"])
    tx.remove(where={"status": "deprecated"})
    tx.update_embeddings([0, 1], new_embeddings)
    # If anything raises, everything rolls back

Watches (Subscriptions)

Subscribe to changes on branches or metadata patterns. Callbacks fire on commit, push, or pull.

# Watch a branch
wid = db.watches.watch_branch("production", lambda event, ctx: notify(ctx))

# Watch metadata patterns
wid = db.watches.watch(
    {"category": "finance"},
    lambda event, ctx: audit_log(ctx)
)

# List / unwatch
db.watches.list_watches()
db.watches.unwatch(wid)

Secondary Indexes

Hash indexes for exact lookups, range indexes for min/max queries. Rebuilt on load, updated incrementally.

# Create indexes
db.create_index("category")                  # Hash index (default)
db.create_index("score", index_type="range") # Range index

# Fast lookups (no scan)
results = db.indexes.lookup("category", "finance")    # O(1) exact match
results = db.indexes.range_lookup("score", 0.8, 1.0)  # O(log n) range

# List / drop
db.list_indexes()
db.drop_index("category")

Schema Enforcement

JSON Schema validation on metadata. Rejects bad data at add() time.

db.set_schema({
    "required": ["category", "source"],
    "properties": {
        "category": {"type": "string", "enum": ["finance", "legal", "hr"]},
        "score": {"type": "number", "minimum": 0, "maximum": 1},
        "source": {"type": "string", "minLength": 1},
        "tags": {"type": "array"},
    },
    "additionalProperties": False,
})

# This works:
db.add(texts=["doc"], metadata=[{"category": "finance", "source": "bloomberg", "score": 0.9}])

# This raises SchemaError:
db.add(texts=["doc"], metadata=[{"category": "invalid"}])

# Validate existing data
db.get_schema()
db.set_schema(None)  # Clear

Supports: type (string/number/integer/boolean/array/object/null), required, enum, minimum/maximum, minLength/maxLength, pattern, additionalProperties.

Snapshots

Cheap read-only frozen views. Cloned tensor, referenced metadata. In-memory only.

snap = db.snapshot("before_experiment")

# Mutate the database freely...
db.add(texts=["experimental data"])
db.remove(where={"old": True})

# Query the snapshot — unchanged
results = snap.query(vector, k=10)
results = snap.query_text("revenue", k=5)
rows = snap.select(where={"category": "finance"})

# List snapshots
db.snapshots()

Native Backup & Restore

Full and incremental backups. tar.zst format. No external tools needed.

# Full backup
manifest = db.backup("my_store.gitdb-backup")

# Incremental backup (only new objects since last backup)
manifest = db.backup_incremental("my_store_incr.gitdb-incr")

# List backups
db.backup_list()

# Verify store integrity
result = db.backup_verify()
# → {"valid": True, "commits_verified": 42, "total_objects_on_disk": 156}

# Restore
from gitdb.backup import restore
restore("my_store.gitdb-backup", "/path/to/dest")

gitdb backup full -o my_store.gitdb-backup
gitdb backup incremental -o my_store_incr.gitdb-incr
gitdb backup list
gitdb backup verify
gitdb restore my_store.gitdb-backup /path/to/dest

Backup format: .gitdb-backup (full) / .gitdb-incr (incremental) — tar archives compressed with zstd. Each backup includes a JSON manifest sidecar with checksums, object counts, branch refs, and timestamps.

CRUSH Algorithm (CEPH-style Placement)

Deterministic data placement across shards. No central lookup table — any node computes where any vector lives.

from gitdb.crush import CRUSHMap, CRUSHRouter

# Define cluster topology
crush = CRUSHMap()
crush.add_device("ssd_1", weight=2.0, location={"host": "node1", "rack": "rack1"})
crush.add_device("ssd_2", weight=2.0, location={"host": "node1", "rack": "rack1"})
crush.add_device("ssd_3", weight=1.0, location={"host": "node2", "rack": "rack2"})
crush.add_device("ssd_4", weight=1.0, location={"host": "node2", "rack": "rack2"})

router = CRUSHRouter(crush)

# Where should this vector go? Deterministic — same answer every time.
router.place("vector_abc123", replicas=2)
# → ["ssd_1", "ssd_3"]  (different racks for fault tolerance)

# Add a node — only ~1/N data moves
crush.add_device("ssd_5", weight=1.0, location={"host": "node3", "rack": "rack3"})
plan = router.rebalance_plan(old_map)
# → {"moves": 200, "total": 1000, "percent": 20.0}

# Check distribution uniformity
router.distribution(sample_keys, replicas=2)
# → {"ssd_1": 510, "ssd_2": 498, "ssd_3": 251, "ssd_4": 241}  (weight-proportional)

Straw2 bucket selection — the same algorithm CEPH uses. Deterministic, weight-proportional, minimal disruption on topology changes.

P2P Distributed Mode

Every node is equal. No coordinator. CRUSH routes vectors, SSH transports them, git-style merge handles conflicts.

from gitdb import GitDB
from gitdb.distributed import DistributedGitDB, Peer

db = GitDB("my_shard", dim=1024, device="mps")
ddb = DistributedGitDB(db, self_name="laptop")

# ── Add peers (SSH transport, like git remotes) ────────
ddb.add_peer(Peer("spark", "xentureon@100.89.91.91",
                   shard_path="~/stores/vectors",
                   weight=2.0,  # DGX gets 2x the data
                   location={"host": "spark", "rack": "home"}))

ddb.add_peer(Peer("cloud", "ubuntu@ec2.example.com",
                   shard_path="/data/vectors",
                   weight=1.0,
                   location={"host": "cloud", "rack": "aws"}))

# ── Add vectors — CRUSH routes to correct shards ──────
ddb.add(texts=["quarterly report", "NDA draft"], replicas=2)
ddb.commit("Add documents")

# ── Query — scatter to all shards, merge top-k ────────
results = ddb.query_distributed(vector, k=10)

# ── Sync — push outbox, pull updates ──────────────────
ddb.sync()                  # All peers
ddb.sync("spark")           # Just Spark

# ── Rebalance — when topology changes ─────────────────
ddb.add_peer(Peer("gpu_box", "user@gpu.local", weight=3.0))
plan = ddb.rebalance(dry_run=True)
# → {"foreign_count": 150, "should_migrate": {"gpu_box": [...]}}
ddb.rebalance(dry_run=False)  # Actually queue migrations

# ── Gossip — peers discover each other ────────────────
other_topology = other_node.gossip_map()
ddb.apply_gossip(other_topology)  # Merges peer lists

# ── Status ────────────────────────────────────────────
ddb.status()
# → {"self": "laptop", "peers_total": 3, "peers_active": 2,
#    "local_vectors": 5000, "pending_outbox": 42}

Node A (laptop)          Node B (Spark)         Node C (cloud)
┌─────────────┐         ┌─────────────┐        ┌─────────────┐
│ GitDB shard │◄──SSH──►│ GitDB shard │◄──SSH──►│ GitDB shard │
│ CRUSH map   │  push/  │ CRUSH map   │  push/  │ CRUSH map   │
│ Peer list   │  pull   │ Peer list   │  pull   │ Peer list   │
└─────────────┘         └─────────────┘        └─────────────┘
     All nodes have the same CRUSH map.
     Any node can route any vector. No coordinator.

How it works:

Add: CRUSH computes placement → vectors stored locally + queued in outbox for remote peers
Query: scatter query to all active peers via SSH → merge + dedup results → return top-k
Sync: push outbox via existing git remotes → pull updates from peers
Join: new node gets CRUSH map → rebalance moves ~1/N data
Leave: mark node down → replicas re-placed to surviving nodes
Conflict: git-style branch/merge semantics handle divergence

Universal Ingest

One command to swallow entire databases. SQLite, MongoDB, CSV, Parquet, PDF, text — auto-detected, auto-chunked, auto-committed.

from gitdb import GitDB

db = GitDB("my_store", dim=1024, device="mps")

# Swallow an entire SQLite database
db.ingest("legacy.db")
# → Reads all tables, auto-detects text columns, embeds them, preserves all columns as metadata

# Import MongoDB dump (JSON, JSONL, or BSON)
db.ingest("users.jsonl")
# → Handles $oid, $date, $numberLong; flattens nested objects with dot notation

# CSV / TSV
db.ingest("data.csv")                          # Auto-detect delimiter and text column
db.ingest("data.tsv", text_column="content")   # Or specify it

# Parquet
db.ingest("warehouse.parquet")

# PDF — sentence-boundary-aware chunking
db.ingest("report.pdf", chunk_size=500, chunk_overlap=50)

# Plain text / Markdown
db.ingest("notes.md")

# Entire directory — batch ingest everything
db.ingest("./documents/", pattern="*.pdf", recursive=True)

# CLI
gitdb ingest legacy.db                       # SQLite
gitdb ingest dump.jsonl                      # MongoDB
gitdb ingest data.csv --text-column content  # CSV with explicit text column
gitdb ingest report.pdf --chunk-size 1000    # PDF with custom chunking
gitdb ingest ./docs/ --pattern "*.md"        # Directory batch
gitdb ingest data.tsv --no-commit            # Ingest without auto-commit

Format	Extensions	Features
SQLite	`.db`, `.sqlite`, `.sqlite3`	All tables, auto-detect text columns, preserve all metadata
MongoDB	`.json`, `.jsonl`, `.bson`	Extended JSON (`$oid`, `$date`), nested object flattening
CSV/TSV	`.csv`, `.tsv`, `.tab`	Auto-detect delimiter + text column, type inference
Parquet	`.parquet`, `.pq`	Via pyarrow, auto-detect text column
PDF	`.pdf`	Via pymupdf/PyPDF2, sentence-boundary chunking with overlap
Text	`.txt`, `.md`, `.rst`, `.log`	Chunk with configurable size and overlap

Semantic Git Operations

Git operations that understand meaning, not just hashes.

# Cherry-pick by concept (not by commit hash)
gitdb semantic-cherry-pick feature "authentication"
# → Finds and picks commits whose vectors match "authentication"

# Blame by concept
gitdb semantic-blame "toxic content" --threshold 0.5
# → Shows which commits introduced semantically similar vectors

# Re-embed on rebase (upgrade vectors when rebasing)
db.rebase("main")
db.re_embed(embed_model="nv-embed-qa")
db.commit("Rebased + re-embedded")

CLI Reference

gitdb init <name> [--dim N] [--device D]     Create store
gitdb status                                  Working tree status
gitdb add <file> | --text T | --text-file F   Add vectors
gitdb commit -m "message"                     Commit changes
gitdb log [-n N]                              Show history
gitdb diff <ref_a> <ref_b>                    Compare refs
gitdb query --text T | --vector F [-k N]      Search
gitdb select [--where JSON] [--fields F]      Structured query
gitdb group-by <field> [--agg-fn F]           Aggregation
gitdb branch [name]                           List/create branch
gitdb switch <branch>                         Switch branch
gitdb merge <branch> [--strategy S]           Merge
gitdb cherry-pick <ref>                       Cherry-pick commit
gitdb revert <ref>                            Revert commit
gitdb stash [save|pop|list]                   Stash changes
gitdb tag <name>                              Tag current HEAD
gitdb reset [ref]                             Reset to ref
gitdb blame                                   Vector provenance
gitdb reflog                                  HEAD movement log
gitdb rebase <onto>                           Rebase branch
gitdb gc [--keep N]                           Garbage collect
gitdb purge --where JSON --reason R           Purge from history
gitdb filter-branch <transform>               Transform all vectors
gitdb show [ref]                              Show commit details
gitdb amend [-m "msg"]                        Amend last commit
gitdb squash <N> [-m "msg"]                   Squash N commits
gitdb fork <dest>                             Fork/clone store
gitdb clone <dest>                            Clone store
gitdb note [-m "msg"]                         Notes on commits
gitdb comment [-m "msg"]                      Comments on commits
gitdb head                                    Show HEAD
gitdb remote [add|remove|list]                Manage remotes
gitdb push <remote> [branch]                  Push to remote
gitdb pull <remote> [branch]                  Pull from remote
gitdb fetch <remote> [branch]                 Fetch from remote
gitdb pr [create|list|show|merge|close]       Pull requests
gitdb export <file> [--embeddings]            Export JSONL
gitdb import <file> [--embed]                 Import JSONL
gitdb embed [list|text|info]                  Embedding tools
gitdb re-embed [--model M]                    Re-embed all vectors
gitdb semantic-blame <query>                  Blame by concept
gitdb semantic-cherry-pick <branch> <query>   Cherry-pick by meaning
gitdb ac [status|primed|drift|start|stop]     Emirati AC engine
gitdb backup [full|incremental|list|verify]   Backup operations
gitdb restore <backup> <dest> [--force]       Restore from backup
gitdb snapshot [create|list|query]            In-memory snapshots
gitdb hook [list|clear]                       Event hooks
gitdb watch [list|clear]                      Change watches
gitdb index [create|drop|list|lookup]         Secondary indexes
gitdb schema [show|set|clear|validate]        Schema enforcement
gitdb ingest <file|dir> [--text-column C]     Universal ingest

Architecture

13,150 lines of Python across 21 modules. 394 tests.

┌──────────────────────────────────────────────────────────────┐
│                    GitDB v0.6.0                               │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  P2P Layer ──→ Distributed Cluster   ← P2P Network          │
│       │              │                  gossip, sync         │
│       │         distributed.py          scatter/gather query │
│       ▼              │                                       │
│  CRUSH ────→ Vector Placement        ← CEPH Algorithm       │
│       │              │                  straw2 selection     │
│       │         crush.py                deterministic routing│
│       ▼              │                                       │
│  JSONL/JSON ──→ Metadata Store       ← Structured DB        │
│       │              │                  select, group-by     │
│       │         structured.py           $gt $in $regex       │
│       ▼              │                                       │
│  Schema ────→ Validation             ← Schema Enforcement   │
│       │              │                  JSON Schema subset   │
│       │         schema.py               required, types,    │
│       ▼              │                  enum, bounds, regex  │
│  Embedding ──→ GPU Tensor Store      ← Vector DB            │
│       │              │                  cosine sim           │
│       │         working_tree.py         top-k search         │
│       ▼              │                                       │
│  Indexes ───→ Fast Lookups           ← Secondary Indexes    │
│       │              │                  hash + range indexes │
│       │         indexes.py              O(1) exact, O(logn) │
│       ▼              │                                       │
│  Delta Chain ──→ Version History     ← Git Engine            │
│       │              │                  branch, merge        │
│       │         objects.py              blame, bisect        │
│       ▼              │                  purge, push/pull     │
│  Hooks ────→ Event Callbacks         ← FoundationDB-style   │
│       │              │                  pre/post commit/merge│
│       │         hooks.py                reject operations    │
│       ▼              │                                       │
│  Watches ──→ Subscriptions           ← Change Notifications │
│       │              │                  branch + metadata    │
│       │         watches.py              pattern matching     │
│       ▼              │                                       │
│  Emirati AC ──→ Hot Cache            ← Spreading Activation │
│       │              │                  multi-hop chains     │
│       │         ambient.py              drift detection      │
│       ▼              │                                       │
│  Snapshots ──→ Frozen Views          ← Read-only Snapshots  │
│       │              │                  zero-copy metadata   │
│       │         snapshots.py            query, select        │
│       ▼              │                                       │
│  Backup ───→ tar.zst Archives        ← Native Backup        │
│       │              │                  full + incremental   │
│       │         backup.py               verify, restore     │
│       ▼              │                                       │
│  Ingest ───→ Universal Import         ← Auto-detect         │
│       │              │                  SQLite, MongoDB      │
│       │         ingest.py               CSV, Parquet, PDF   │
│       ▼              │                                       │
│  QUERY ──→ Boosted Results                                   │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Modules:
  core.py          2,383 lines   Main interface (60+ methods)
  cli.py           1,440 lines   CLI (66+ commands)
  distributed.py     765 lines   P2P distributed layer
  crush.py           600 lines   CEPH CRUSH algorithm
  ambient.py         459 lines   Emirati AC engine
  backup.py          457 lines   Native backup & restore
  remote.py          387 lines   Push/pull/fetch (local + SSH)
  structured.py      322 lines   Query operators, aggregation
  delta.py           288 lines   Sparse tensor deltas + zstd
  objects.py         260 lines   Content-addressed object store
  working_tree.py    208 lines   GPU-resident tensor + search
  embed.py           200 lines   Arctic + NV-EmbedQA
  pullrequest.py     197 lines   PR store with comments
  indexes.py         138 lines   Secondary indexes (hash + range)
  types.py           127 lines   Data types
  schema.py          125 lines   JSON Schema validation
  watches.py         123 lines   Change subscriptions
  snapshots.py       118 lines   Read-only frozen views
  ingest.py          909 lines   Universal ingest (6 formats)
  hooks.py            78 lines   Event hook system

What Nothing Else Can Do

Capability	GitDB	Pinecone	Weaviate	Qdrant	ChromaDB	SQLite
Vector similarity search	Yes	Yes	Yes	Yes	Yes	No
Structured metadata queries	Yes	Limited	Yes	Yes	Yes	Yes
Version control (branch/merge)	Yes	No	No	No	No	No
Time-travel queries	Yes	No	No	No	No	No
Semantic blame/bisect	Yes	No	No	No	No	No
Cherry-pick by meaning	Yes	No	No	No	No	No
Purge from all history	Yes	No	No	No	No	No
Spreading activation	Yes	No	No	No	No	No
Drift detection	Yes	No	No	No	No	No
Pull requests on vectors	Yes	No	No	No	No	No
Auto-embed from text	Yes	No	Yes	No	Yes	No
Matryoshka dimensions	Yes	No	No	No	No	No
GPU-accelerated	Yes	Cloud	Cloud	Yes	No	No
Export/Import JSONL	Yes	No	Yes	Yes	Yes	No
Pre/post event hooks	Yes	No	No	No	No	No
Atomic transactions	Yes	No	No	No	No	Yes
Change watches/subscriptions	Yes	No	No	No	No	No
Secondary indexes	Yes	Cloud	Auto	Yes	No	Yes
Schema enforcement	Yes	No	Yes	No	No	Yes
Read-only snapshots	Yes	No	No	No	No	No
Native backup/restore	Yes	Cloud	Cloud	Yes	No	No
CRUSH placement algorithm	Yes	No	No	No	No	No
P2P distributed (no coordinator)	Yes	Cloud	Cloud	Cloud	No	No
Scatter/gather distributed query	Yes	Cloud	Cloud	Cloud	No	No
Gossip topology discovery	Yes	No	No	No	No	No
Universal ingest (6 formats)	Yes	No	No	No	No	No

LLM Integration

GitDB connects to LLMs via three pathways, depending on your stack.

1. MCP Server (Claude, Cursor, AI IDEs)

MCP (Model Context Protocol) is the standard for tool integration with Claude and AI-native IDEs. GitDB exposes its full API as MCP tools:

# gitdb_mcp_server.py — minimal MCP server
from gitdb import GitDB
import json

# MCP tools map directly to GitDB methods:
# - gitdb_query(text, k, where) → db.query_text()
# - gitdb_add(texts) → db.add(texts=)
# - gitdb_select(where, fields) → db.select()
# - gitdb_log(limit) → db.log()
# - gitdb_diff(ref_a, ref_b) → db.diff()
# - gitdb_commit(message) → db.commit()

2. REST API (Ollama, LMStudio, any HTTP client)

Run GitDB as an HTTP server with OpenAI-compatible endpoints:

# gitdb_server.py — REST API
from flask import Flask, request, jsonify
from gitdb import GitDB

app = Flask(__name__)
db = GitDB("my_store", dim=1024, device="mps")
db.ac.start()

@app.route("/v1/query", methods=["POST"])
def query():
    data = request.json
    results = db.query_text(data["text"], k=data.get("k", 10))
    return jsonify({
        "results": [{"document": d, "score": s, "metadata": m}
                     for d, s, m in zip(results.documents, results.scores, results.metadata)]
    })

@app.route("/v1/add", methods=["POST"])
def add():
    data = request.json
    indices = db.add(texts=data["texts"], metadata=data.get("metadata"))
    return jsonify({"added": len(indices)})

@app.route("/v1/select", methods=["POST"])
def select():
    data = request.json
    rows = db.select(where=data.get("where"), fields=data.get("fields"))
    return jsonify({"rows": rows})

# OpenAI-compatible embeddings endpoint
@app.route("/v1/embeddings", methods=["POST"])
def embeddings():
    from gitdb.embed import embed
    data = request.json
    vectors = embed(data["input"])
    return jsonify({
        "data": [{"embedding": v.tolist(), "index": i} for i, v in enumerate(vectors)]
    })

3. Python SDK (LangChain, LlamaIndex, direct)

# LangChain retriever
from gitdb import GitDB

class GitDBRetriever:
    def __init__(self, path, dim=1024):
        self.db = GitDB(path, dim=dim, device="mps")

    def get_relevant_documents(self, query, k=5):
        results = self.db.query_text(query, k=k)
        return [{"page_content": d, "metadata": m}
                for d, m in zip(results.documents, results.metadata)]

JSONL Format

GitDB uses JSONL (JSON Lines) for import/export. One JSON object per line.

Full format (with embeddings)

{"id": "a3f1b2c4", "document": "NDA with Acme Corp", "metadata": {"type": "nda", "value": 500000}, "embedding": [0.123, -0.456, ...]}
{"id": "b7e2d3f5", "document": "MSA with Globex", "metadata": {"type": "msa", "value": 2000000}, "embedding": [0.789, 0.012, ...]}

Text-only format (auto-embed on import)

{"document": "NDA with Acme Corp", "metadata": {"type": "nda", "value": 500000}}
{"document": "MSA with Globex", "metadata": {"type": "msa", "value": 2000000}}

Import text-only JSONL with auto-embedding:

gitdb import data.jsonl --embed -c

Fields

Field	Type	Required	Description
`id`	string	No	Vector ID (auto-generated if omitted)
`document`	string	No	Raw text (used for embedding and display)
`metadata`	object	No	Arbitrary JSON metadata
`embedding`	float[]	No	Pre-computed embedding vector

On-Disk Structure

my_store/
  .gitdb/
    config                    # Store config (dim, device, version)
    HEAD                      # Current commit hash
    refs/
      heads/
        main                  # Branch refs
        feature
      tags/
        v1.0                  # Tag refs
    objects/
      a3/f1b2c4...           # Content-addressed objects (commits + deltas)
      b7/e2d3f5...           # Two-char prefix directories (git-style)
    cache/
      tensors/
        <hash>.pt             # Cached reconstructed tensors
        <hash>.meta.json      # Cached metadata
    remotes.json              # Remote configurations
    prs/
      index.json              # Pull request store
    stash/
      0.pt, 0.meta.json      # Stashed states
    backups/
      history.json            # Backup history
    distributed/
      peers.json              # P2P peer registry
      crush_map.json          # CRUSH topology + rules
      outbox/                 # Queued vectors for remote peers
        spark.jsonl
        cloud.jsonl
    reflog                    # Append-only HEAD movement log
    schema.json               # Schema definition (if set)
    notes/                    # Commit notes
    comments/                 # Commit comments

Object Format

Commits and deltas are stored as content-addressed objects, hashed with SHA-256.

Commit object (JSON):

{
  "parent": "abc123...",
  "parent2": null,
  "delta_hash": "def456...",
  "message": "Add Q2 data",
  "timestamp": 1710460800.0,
  "stats": {"added": 50, "removed": 0, "modified": 3},
  "tensor_rows": 500
}

Delta object (binary):

GDBD (magic bytes) + version byte + packed indices + tensor bytes + JSON metadata
Compressed with zstd level 3

Deltas are sparse: they only store what changed (additions, deletions, modifications), not the full tensor. A commit adding 50 vectors to a 10,000-vector store only stores 50 new vectors, not 10,050.

Internals

Delta Chain Reconstruction

To reconstruct state at any commit, GitDB walks back to the root and replays deltas forward:

root → delta1 → delta2 → delta3 → ... → HEAD

Cached snapshots short-circuit this walk. gitdb gc creates periodic checkpoints.

Three-Way Merge

When merging branches that have diverged:

Find common ancestor (merge base)
Compute delta from ancestor → ours
Compute delta from ancestor → theirs
Apply merge strategy:
- union: Combine all additions from both sides
- ours: Keep our modifications, take their additions
- theirs: Keep their modifications, take our additions
Detect conflicts: vectors modified in both branches

Purge (History Rewriting)

gitdb purge rewrites every commit in history:

Walk every commit from root to HEAD
Filter every delta: remove target vectors from additions/modifications
Skip commits that become empty after filtering
Rehash everything (new content = new hash)
Update all branch refs and tags to new hashes
Invalidate all caches

After purge, the target vectors never existed in any historical state.

Spreading Activation (Emirati AC)

Background thread architecture:

Every 3 seconds:
  1. Embed recent context (last 10 operations) → single vector
  2. GPU matmul against all vectors → direct activation scores
  3. Decay existing activations by 10%
  4. Activate top-20 matches (direct)
  5. Spread: each active vector activates its top-5 neighbors
     (weighted by cosine similarity × spread factor)
  6. Update hot cache (top 50 by activation level)
  7. Check drift (recent additions vs branch centroid)

Multi-hop chains emerge naturally:

"auth tokens" (query)
  → session_management (0.65)    ← direct hit
    → security_audit (0.42)      ← 1 hop
      → vulnerability_scan (0.31) ← 2 hops

Configuration

Store Config

db = GitDB(
    "my_store",
    dim=1024,      # Embedding dimension
    device="mps",  # mps (Apple Silicon) / cuda (NVIDIA) / cpu
)

Emirati AC Tuning

db.ac.POLL_INTERVAL = 3.0       # Seconds between cycles
db.ac.ACTIVATION_DECAY = 0.90   # 10% decay per cycle
db.ac.SPREAD_FACTOR = 0.25      # Spread intensity
db.ac.SPREAD_MIN_SIM = 0.40     # Min similarity to spread
db.ac.NEIGHBOR_K = 5            # Neighbors per vector
db.ac.HOT_CACHE_SIZE = 50       # Hot cache capacity
db.ac.MIN_ACTIVATION = 0.05     # Deactivation threshold
db.ac.DIRECT_WEIGHT = 0.5       # Direct match weight
db.ac.REINFORCE_BONUS = 0.1     # Re-activation bonus

Testing

# Run all tests (fast, no model downloads)
python -m pytest tests/ -m "not slow" -q

# Run specific test suites
python -m pytest tests/test_core.py        # 28 tests — basic operations
python -m pytest tests/test_magic.py       # 55 tests — git commands
python -m pytest tests/test_remote.py      # 18 tests — push/pull/fetch/PRs
python -m pytest tests/test_embed.py       # 11 unit + 18 integration
python -m pytest tests/test_structured.py  # 45 tests — query engine
python -m pytest tests/test_ambient.py     # 20 tests — Emirati AC
python -m pytest tests/test_hooks.py       # 31 tests — hooks + integration
python -m pytest tests/test_watches.py     # 14 tests — change subscriptions
python -m pytest tests/test_schema.py      # 18 tests — schema enforcement
python -m pytest tests/test_indexes.py     # 23 tests — secondary indexes
python -m pytest tests/test_snapshots.py   # 17 tests — snapshots
python -m pytest tests/test_transactions.py # 12 tests — atomic transactions
python -m pytest tests/test_backup.py      # 9 tests — backup & restore
python -m pytest tests/test_crush.py       # 32 tests — CRUSH algorithm
python -m pytest tests/test_distributed.py # 33 tests — P2P distributed
python -m pytest tests/test_ingest.py      # 33 tests — universal ingest

# Run integration tests (requires model download)
python -m pytest tests/test_embed.py -m slow

Test Coverage

Suite	Tests	What's Covered
`test_core.py`	28	add, remove, query, commit, log, diff, checkout, reset, tags, delta serialization, persistence, time-travel
`test_magic.py`	55	cherry-pick, revert, branching, merge (5 strategies), stash, blame, bisect, rebase, GC, reflog, filter-branch, purge (7 tests), HEAD, show, amend, squash, fork, notes
`test_remote.py`	18	remote add/remove, push, pull, fetch, branch push, PRs (create/list/comment/merge/close/filter), commit comments
`test_embed.py`	29	model registry, similarity, embedding, normalization, semantic filter, Matryoshka, GitDB text add/query, re-embed, semantic blame
`test_structured.py`	45	field resolution, all 13 operators, $and/$or/$not, select, group-by (5 agg functions), JSONL export/import, hybrid vector+structured queries
`test_ambient.py`	20	lifecycle, direct activation, spreading, decay, drift detection, AC-boosted queries, hot cache
`test_hooks.py`	31	register/unregister/fire, pre-hook rejection, post-hook fire, idempotent register, clear, integration with commit/merge
`test_hooks_integration.py`	5	pre-commit reject/accept, post-commit fires, pre-merge reject, post-merge fires
`test_watches.py`	14	watch/unwatch, branch watches, where watches, fire on commit, list, ID increment
`test_schema.py`	18	required fields, all type checks, enum, bounds, length, pattern, additionalProperties, GitDB integration, persistence
`test_indexes.py`	23	hash index create/rebuild/lookup/update, range index lookup, drop, list, duplicate create, GitDB integration
`test_snapshots.py`	17	create, query, select, isolation, duplicate name, get/list, repr, tombstone handling
`test_transactions.py`	12	commit, rollback on error, remove, nested operations, edge cases
`test_backup.py`	9	full backup, manifest sidecar, history, incremental, restore, restore+query, overwrite, verify valid/corrupt
`test_crush.py`	32	CRUSH map CRUD, straw2 determinism/weight/uniformity, single/multi-replica placement, host separation, down devices, router place/batch/distribution/rebalance, 144-device cluster
`test_distributed.py`	33	peer registry CRUD/persistence, CRUSH routing (single/multi/deterministic/batch/uniform), distributed add with outbox, local + distributed query with dedup, rebalance planning, gossip discovery, sync
`test_ingest.py`	33	SQLite (single/multi-table, auto-detect text, all metadata), MongoDB (JSON/JSONL, extended JSON, nested flattening), CSV (auto-detect, explicit column, TSV), text chunking (overlap, sentence boundaries), universal auto-detect, directory batch, helpers

Requirements

Python >= 3.10
PyTorch >= 2.0
zstandard >= 0.21
sentence-transformers >= 2.2 (optional, for embedding)

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.13.3

Mar 15, 2026

0.13.2

Mar 15, 2026

0.13.1

Mar 15, 2026

0.13.0

Mar 15, 2026

0.12.0

Mar 15, 2026

0.11.0

Mar 15, 2026

0.10.0

Mar 15, 2026

0.8.0

Mar 14, 2026

0.7.0

Mar 14, 2026

This version

0.6.0

Mar 14, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gitdb_vectors-0.6.0.tar.gz (153.5 kB view details)

Uploaded Mar 14, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

gitdb_vectors-0.6.0-py3-none-any.whl (105.6 kB view details)

Uploaded Mar 14, 2026 Python 3

File details

Details for the file gitdb_vectors-0.6.0.tar.gz.

File metadata

Download URL: gitdb_vectors-0.6.0.tar.gz
Upload date: Mar 14, 2026
Size: 153.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for gitdb_vectors-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`33345e37c1929c14ca56af270162996839269505ab9703bcd2d7f2097f44620e`
MD5	`95d5fc49b93148aa49bd5d5d40f39df1`
BLAKE2b-256	`8d507e9acce66e83b5ef3a2dc6703904f657291070f0a9bf212047017eb5122e`

See more details on using hashes here.

File details

Details for the file gitdb_vectors-0.6.0-py3-none-any.whl.

File metadata

Download URL: gitdb_vectors-0.6.0-py3-none-any.whl
Upload date: Mar 14, 2026
Size: 105.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for gitdb_vectors-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b3990fc680c7648904ea10d929bb4618752478cab27aac4152cc13879867bdf9`
MD5	`39b82238ccf338ff7924cc7d02f19d91`
BLAKE2b-256	`6d336346b2022640aafab9db6a6225cb7d64db486cc0a0a2a6e9d3ed875517db`

See more details on using hashes here.

gitdb-vectors 0.6.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

GitDB

Why GitDB

Install

Quick Start

Python API

Query Operators

Embedding Models

Matryoshka Dimensions (NV-EmbedQA)

Emirati AC (Spreading Activation)

Hooks (FoundationDB-inspired)

Transactions

Watches (Subscriptions)

Secondary Indexes

Schema Enforcement

Snapshots

Native Backup & Restore

CRUSH Algorithm (CEPH-style Placement)

P2P Distributed Mode

Universal Ingest

Semantic Git Operations

CLI Reference

Architecture

What Nothing Else Can Do

LLM Integration

1. MCP Server (Claude, Cursor, AI IDEs)

2. REST API (Ollama, LMStudio, any HTTP client)

3. Python SDK (LangChain, LlamaIndex, direct)

JSONL Format

Full format (with embeddings)

Text-only format (auto-embed on import)

Fields

On-Disk Structure

Object Format

Internals

Delta Chain Reconstruction

Three-Way Merge

Purge (History Rewriting)

Spreading Activation (Emirati AC)

Configuration

Store Config

Emirati AC Tuning

Testing

Test Coverage

Requirements

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes