
The Merge Algebra Toolkit — composable, streaming, verified CRDT merge for datasets


crdt-merge

Conflict-free merge for structured data, AI model weights, and agent memory. Define strategies. Merge anything. Prove correctness. Audit every field. Stream at any scale. Zero dependencies.

PyPI · Python 3.9+ · License: BSL 1.1 · Tests


🆕 What's New

v0.8.2 — "The Adoption Release" (2026-03-29)

Context Memory System, Agentic AI State Merge, and MergeKit Migration CLI. crdt-merge now spans tabular data + model weights + agent memory under one algebraic framework.

🧠 Context Memory System (Category-Defining)

CRDT-merged memory for AI agents. Dedup, merge, and audit agent memories with the same strategies used for tabular data.

from crdt_merge.context import ContextMerge, ContextBloom

bloom = ContextBloom(expected_items=100_000, fp_rate=0.001)
merger = ContextMerge(bloom=bloom, strategy="lww")
result = merger.merge(agent_a_memories, agent_b_memories)
print(result.manifest.summary())  # "3 unique memories from 2 agents"
  • MemorySidecar — pre-computed metadata for O(1) memory filtering
  • ContextManifest — self-describing merge attestation with EU AI Act traceability
  • ContextBloom — 64-shard bloom filter for memory dedup (~10M checks/sec)
  • ContextConsolidator — bundles thousands of memories into indexed blocks
  • ContextMerge — quality-weighted, budget-aware context merge

🤖 Agentic AI State Merge

CRDT containers purpose-built for multi-agent orchestration (CrewAI, AutoGen, LangGraph).

from crdt_merge.agentic import AgentState, SharedKnowledge

researcher = AgentState(agent_id="researcher")
researcher.add_fact("revenue_q1", 4_200_000, confidence=0.9)

analyst = AgentState(agent_id="analyst")
analyst.add_fact("revenue_q1", 4_250_000, confidence=0.95)

shared = SharedKnowledge.merge(researcher, analyst)
print(shared.facts["revenue_q1"].confidence)  # 0.95 — higher confidence wins

🔧 MergeKit Migration CLI

crdt-merge migrate mergekit-config.yaml --output merge_pipeline.py

Zero-cost switching for MergeKit users. Supports linear, slerp, ties, dare, task_arithmetic methods.

📖 Full API Reference →


v0.8.1 — The CRDT Architecture Release (2026-03-29)

All 25 model merge strategies are now provably true CRDTs.

v0.8.0 introduced 25 model merge strategies, but a fundamental mathematical limitation meant strategies like SLERP, TIES, DARE, and Fisher could not satisfy CRDT laws (commutativity, associativity, idempotency) when applied directly to raw tensors.

v0.8.1 solves this with a two-layer architecture:

Layer Responsibility CRDT?
CRDTMergeState Collects models via set union ✅ Provably C+A+I
Strategy (unchanged) Computes merged model atomically Deterministic pure function

Set union is trivially commutative, associative, and idempotent. Strategies are applied once to the full set during resolve() — never pairwise.

from crdt_merge.model import CRDTMergeState

# Create CRDT states on different replicas
state_a = CRDTMergeState("slerp")
state_a.add(model_weights_1, model_id="llama-7b")

state_b = CRDTMergeState("slerp")
state_b.add(model_weights_2, model_id="mistral-7b")

# Merge in any order — always converges
merged = state_a.merge(state_b)  # Same result as state_b.merge(state_a)
result = merged.resolve()         # Deterministic merged weights
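
The law-compliance argument is easy to sanity-check in plain Python. The sketch below is illustrative only (not the library's implementation): it models the CRDT layer as a frozenset of (model_id, weights) pairs and the strategy layer as a deterministic average applied once at resolve time.

```python
# Conceptual sketch of the two-layer design (not crdt-merge's actual code):
# the CRDT layer is a plain set union; the strategy runs once at resolve().

def merge_states(state_a, state_b):
    """CRDT layer: set union (commutative, associative, idempotent)."""
    return state_a | state_b

def resolve(state):
    """Strategy layer: deterministic pure function over the full set.
    Here the 'strategy' is just element-wise averaging over model weights,
    applied to a canonically sorted view so every replica gets the same answer."""
    models = sorted(state)  # canonical ordering -> deterministic result
    n = len(models)
    return [sum(col) / n for col in zip(*(weights for _, weights in models))]

a = frozenset({("llama-7b", (1.0, 2.0))})
b = frozenset({("mistral-7b", (3.0, 4.0))})

# Merge order does not matter, and re-merging is a no-op:
assert merge_states(a, b) == merge_states(b, a)
assert merge_states(a, a) == a
print(resolve(merge_states(a, b)))  # [2.0, 3.0]
```

The point the sketch makes: once all conflict-prone computation is deferred to a single deterministic resolve() over a set-union state, the CRDT laws come for free.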

Key features of CRDTMergeState:

  • OR-Set add/remove semantics with tombstones (add-wins)
  • SHA-256 Merkle hashing for content-addressable provenance
  • Version vectors with configurable conflict resolution
  • Canonical hash-sorted ordering for deterministic cross-replica convergence
  • Wire serialization via to_dict() / from_dict()
  • Batch operations: add_batch(), merge_many()
  • 195 new tests — all 25 strategies × 3 CRDT laws verified

Or use the high-level API:

from crdt_merge.model import ModelMerge, ModelMergeSchema

schema = ModelMergeSchema(strategies={"default": "slerp"})
merger = ModelMerge(schema)
result = merger.crdt_merge(
    models=[model_a, model_b, model_c],
    model_ids=["llama", "mistral", "phi"],
)
assert result.metadata["crdt_guaranteed"] is True

📖 Full Architecture Analysis → 📊 Benchmark Results →

v0.8.0 — The Intelligence Release (2026-03-29)

25 model merge strategies across 8 categories, powered by CRDT-native architecture:

Category Strategies
Basic WeightAverage, SLERP, TaskArithmetic, LinearInterp
Subspace TIES, DARE, DELLA, DARE-TIES, ModelBreadcrumbs, EMR, STAR, SVDKnotTying, AdaRank
Weighted FisherMerge, RegMean, AdaMerging, DAM
Evolutionary EvolutionaryMerge, GeneticMerge
Unlearning NegMerge, SplitUnlearnMerge
Calibration WeightScopeAlignment, RepresentationSurgery
Safety SafeMerge, LEDMerge

25 model merge strategies. 1,923 tests. 44 modules. Zero breaking changes. crdt-merge enters the AI model merging space with CRDT guarantees, per-parameter provenance, and formal verification.

  • ModelCRDT — CRDT-native model weight merging with 25 strategies across 8 categories
  • 25 merge strategies — TIES, DARE, SLERP, Task Arithmetic, Fisher, AdaMerging, evolutionary, safety-aware, and more
  • LoRA merging — first-class adapter merging with rank harmonization
  • Merge pipelines — multi-stage merge orchestration
  • Per-parameter provenance 🦄 — full audit trail for every merged weight
  • Conflict heatmaps 🦄 — layer-level disagreement visualization
  • Safety analyzer — detect capability degradation from merging
  • Federated bridge — CRDT merge strategies for federated learning
  • MergeKit compatibility — import/export MergeKit YAML configs
  • GPU acceleration — torch-based acceleration for large models
  • 1,923 tests passing. Zero regressions. Zero breaking changes.

Plus:

  • Per-parameter provenance tracking
  • Conflict heatmaps with D3/Plotly export
  • LoRA adapter merging with rank harmonization
  • Multi-stage DAG merge pipelines
  • Continual merge with memory budget
  • Federated learning bridge (FedAvg + FedProx)
  • MergeKit compatibility (import/export YAML)
  • GPU acceleration with CUDA-aware chunking
  • 775 new model tests

pip install crdt-merge            # Zero deps, tabular + model core
pip install crdt-merge[model]     # + numpy for model strategies
pip install crdt-merge[gpu]       # + torch for GPU acceleration
pip install crdt-merge[fast]      # + polars for tabular speedup
from crdt_merge.model import ModelCRDT, ModelMergeSchema

# Define per-layer merge strategies
schema = ModelMergeSchema({
    "embed_tokens": "slerp",
    "layers.*.self_attn": "ties",
    "layers.*.mlp": "dare",
    "lm_head": "weight_average",
})

# Merge two models with CRDT guarantees
model = ModelCRDT(weights_a, weights_b, schema=schema)
merged = model.merge()

# Full provenance — which model contributed which parameters
provenance = model.provenance()

Previous: v0.7.1 — "The Polars Engine Release" (2026-03-28)

38.8× faster merges via Polars engine. Opt-in via pip install crdt-merge[fast].

Previous: v0.7.0 — "The Ecosystem Release" (2026-03-28)

MergeQL (SQL interface), 8 ecosystem accelerators (DuckDB, dbt, Polars, Arrow Flight, Airbyte, SQLite, Streamlit, DuckLake), self-merging Parquet, conflict visualization.

Previous: v0.6.0 — "The Architecture Release"

  • 7 new modules: clocks, schema evolution, merkle, arrow, gossip, async_merge, parallel
  • Arrow-native merge engine — 2.5× measured speedup on A100 (50M rows)
  • 720 tests, zero regressions
  • 720 tests, zero regressions

What is crdt-merge?

crdt-merge is a Python library for merging datasets with conflict resolution. Instead of "last write wins" or manual dedup scripts, you declare per-field merge strategies — and the library guarantees deterministic, order-independent results with full audit trails.

pip install crdt-merge

What it does

  • Merges datasets with configurable per-field strategies (max wins, min wins, union, priority, custom)
  • Streams merges at any scale — O(1) memory for sorted sources, tested to 100M rows
  • Proves correctness — @verified_merge decorator verifies commutativity, associativity, idempotency
  • Audits everything — per-field provenance trails show exactly which source won each field and why
  • Serializes for the wire — compact binary format for cross-language CRDT exchange
  • Speaks SQL — MergeQL lets you express merges as SQL statements
  • Plugs into everything — DuckDB, dbt, Polars, Arrow Flight, Airbyte, SQLite, Streamlit accelerators
  • Zero dependencies — pure Python core, embeds anywhere

What it is NOT

  • Not a real-time collaboration tool. For collaborative text editing, see Yjs or Loro.
  • Not a database. No persistence, no queries, no networking. It's a library.
  • Not a distributed system. Includes gossip state tracking and Merkle sync primitives (v0.6.0), but no built-in networking or consensus. It provides the building blocks that distributed systems can use.

Who is it for?

Anyone merging structured data from multiple sources: data pipelines, ETL, multi-node sync, offline-first apps, federated datasets. Your real alternative today is pandas.merge() (no conflict resolution) or hand-written dedup scripts.


Quick Start

1. Basic Merge

from crdt_merge import merge

# Two datasets with conflicting scores for the same person
dataset_a = [{"id": 1, "name": "Alice", "score": 100}]
dataset_b = [{"id": 1, "name": "Alice", "score": 150}]

result = merge(dataset_a, dataset_b, key="id")
# → [{"id": 1, "name": "Alice", "score": 150}]

By default, when two rows share the same key, the second dataset wins ties. You can control this with prefer="a", prefer="b", or prefer="latest" (uses timestamps).
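
Conceptually, the default tie-breaking behaves like a dict overwrite keyed on id. A plain-Python sketch of the semantics (not how the library is implemented):

```python
# Conceptual sketch of key-based merge with "b wins" tie-breaking
# (illustrates the semantics; not crdt-merge's implementation).

def naive_merge(dataset_a, dataset_b, key):
    merged = {}
    for row in dataset_a:          # A goes in first...
        merged[row[key]] = row
    for row in dataset_b:          # ...then B overwrites on key collisions
        merged[row[key]] = row     # prefer="a" would skip existing keys here
    return list(merged.values())

a = [{"id": 1, "name": "Alice", "score": 100}]
b = [{"id": 1, "name": "Alice", "score": 150}, {"id": 2, "name": "Bob"}]
print(naive_merge(a, b, "id"))
# [{'id': 1, 'name': 'Alice', 'score': 150}, {'id': 2, 'name': 'Bob'}]
```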

2. Per-Field Strategies with MergeSchema

The real power is declaring different strategies for different fields:

from crdt_merge import merge
from crdt_merge.strategies import MergeSchema, MaxWins, LWW, UnionSet

schema = MergeSchema(
    default=LWW(),           # Most fields: last-writer-wins
    score=MaxWins(),          # Scores: highest value wins
    tags=UnionSet()           # Tags: merge as set union
)

a = [{"id": 1, "score": 80, "tags": "python;data", "status": "draft"}]
b = [{"id": 1, "score": 95, "tags": "python;ml",   "status": "published"}]

result = merge(a, b, key="id", schema=schema)
# score: 95 (MaxWins), tags: "data;ml;python" (UnionSet), status: "published" (LWW)

8 built-in strategies: LWW, MaxWins, MinWins, UnionSet, Priority, Concat, LongestWins, Custom(fn)

3. Streaming Merge at Scale

from crdt_merge import merge_sorted_stream

# Merge two sorted sources with O(1) memory — works with generators from disk/network
def source_a():
    for i in range(10_000_000):
        yield {"id": i, "value": f"a_{i}"}

def source_b():
    for i in range(0, 10_000_000, 2):
        yield {"id": i, "value": f"b_{i}"}

for batch in merge_sorted_stream(source_a(), source_b(), key="id", batch_size=10000):
    process(batch)  # Memory never exceeds batch_size

4. CRDT Primitives

from crdt_merge import GCounter, PNCounter, LWWRegister, ORSet, LWWMap

# Grow-only counter — nodes increment independently, merge via max-per-node
a = GCounter("node_a")
a.increment("node_a", 10)
b = GCounter("node_b")
b.increment("node_b", 5)
merged = a.merge(b)
print(merged.value)  # 15 — guaranteed correct regardless of merge order

All primitives satisfy CRDT properties: commutative (a ⊔ b = b ⊔ a), associative ((a ⊔ b) ⊔ c = a ⊔ (b ⊔ c)), idempotent (a ⊔ a = a).
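
These laws are cheap to verify by hand. A self-contained sketch of the G-Counter's max-per-node merge (mirroring the semantics described above, not the library's internals):

```python
# Minimal G-Counter sketch: state is {node_id: count}; merge is per-node max.

def gc_merge(x, y):
    # dict | dict (Python 3.9+) gives the key union; max resolves each node.
    return {node: max(x.get(node, 0), y.get(node, 0)) for node in x | y}

def gc_value(state):
    return sum(state.values())

a, b, c = {"node_a": 10}, {"node_b": 5}, {"node_a": 3, "node_c": 7}

assert gc_merge(a, b) == gc_merge(b, a)                            # commutative
assert gc_merge(gc_merge(a, b), c) == gc_merge(a, gc_merge(b, c))  # associative
assert gc_merge(a, a) == a                                         # idempotent
print(gc_value(gc_merge(gc_merge(a, b), c)))  # 22
```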

5. MergeQL — SQL-Like Merges (v0.7.0)

from crdt_merge.mergeql import MergeQL

ql = MergeQL()
ql.register("nyc", [{"id": 1, "name": "Alice", "salary": 100000}])
ql.register("london", [{"id": 1, "name": "Alice", "salary": 120000}])

result = ql.execute("""
    MERGE nyc, london
    ON id
    STRATEGY salary='max', name='lww'
""")
# salary: 120000 (max wins), name: "Alice" (LWW)

6. Self-Merging Parquet (v0.7.0)

from crdt_merge.parquet import SelfMergingParquet
from crdt_merge.strategies import MergeSchema, LWW, MaxWins

schema = MergeSchema(default=LWW(), salary=MaxWins())
smf = SelfMergingParquet("customers", key="id", schema=schema)
smf.ingest([{"id": 1, "name": "Alice", "salary": 100}])
smf.ingest([{"id": 1, "name": "Alicia", "salary": 120}])
assert smf.read()[0]["salary"] == 120  # MaxWins applied automatically

Feature Matrix

Module Feature Since Description
core GCounter, PNCounter, LWWRegister, ORSet, LWWMap v0.1.0 CRDT primitives with merge, serialize, deserialize
dataframe merge(), diff() v0.1.0 Dataset merge with key matching, conflict resolution, schema support
dedup dedup_list(), dedup_records(), MinHashDedup v0.1.0 Exact, fuzzy (bigram), and MinHash deduplication
json_merge merge_dicts(), merge_json_lines() v0.1.0 Deep dict merge with LWW, None-as-missing semantics
strategies MergeSchema + 8 strategies v0.3.0 Per-field strategy DSL: LWW, MaxWins, MinWins, UnionSet, Priority, Concat, LongestWins, Custom
streaming merge_stream(), merge_sorted_stream() v0.3.0 O(batch_size) and O(1) memory streaming merge
delta Delta, DeltaStore, compute_delta(), compose_deltas() v0.3.0 Delta-state CRDT sync — compute, apply, compose deltas
provenance merge_with_provenance(), ProvenanceLog v0.4.0 Per-field audit trail: which source won, which strategy, why
verify @verified_merge decorator v0.4.0 Property-based testing of commutativity, associativity, idempotency
wire serialize(), deserialize(), serialize_batch() v0.5.0 Compact binary wire format for all CRDT types
probabilistic MergeableHLL, MergeableBloom, MergeableCMS v0.5.0 Probabilistic data structures with CRDT merge semantics
datasets_ext merge_datasets(), dedup_dataset() v0.1.0 HuggingFace Datasets integration (optional)
clocks HybridLogicalClock, HLC timestamps v0.6.0 Hybrid Logical Clocks for distributed CRDT ordering
schema_evolution Column mapping, type coercion v0.6.0 Automatic schema evolution for mismatched datasets
merkle MerkleHashTree, diff, sync v0.6.0 Merkle hash trees for efficient incremental sync
arrow Arrow-native merge engine v0.6.0 Apache Arrow-native merge path (2.5× measured on A100)
_polars_engine Polars merge kernel v0.7.1 Rust-compiled merge: 38.8× peak on A100 via pip install crdt-merge[fast]
gossip GossipState, anti-entropy protocol v0.6.0 Gossip protocol state tracking for convergence
async_merge async_merge(), async_stream() v0.6.0 Async/await wrappers for non-blocking merges
parallel parallel_merge(), multi-core execution v0.6.0 Parallel merge execution across multiple cores
mergeql MergeQL SQL interface v0.7.0 SQL-like CRDT merge: MERGE t1, t2 ON id STRATEGY score='max'
parquet SelfMergingParquet v0.7.0 Parquet files with embedded CRDT metadata that self-merge
viz ConflictTopology, heatmaps v0.7.0 Conflict analysis: heatmaps, temporal patterns, cluster detection

Ecosystem Accelerators (v0.7.0):

Accelerator Module Description
🦆 DuckDB UDF accelerators.duckdb_udf CRDT merge as native DuckDB SQL functions
🔧 dbt Package accelerators.dbt_package CRDT merge models for dbt-managed warehouses
🦆 DuckLake accelerators.ducklake Semantic conflict detection for DuckLake catalogs
🐻 Polars Plugin accelerators.polars_plugin Native Polars expressions for CRDT operations
✈️ Arrow Flight accelerators.flight_server Merge-as-a-service over Arrow Flight RPC
🔌 Airbyte accelerators.airbyte CRDT-aware Airbyte destination connector
💾 SQLite Extension accelerators.sqlite_ext CRDT merge as SQLite custom functions
📊 Streamlit UI accelerators.streamlit_ui Visual merge interface with conflict resolution

v0.8.2 additions:

Module Feature Since Description
context MemorySidecar, ContextManifest, ContextBloom, ContextConsolidator, ContextMerge v0.8.2 CRDT-merged agent memory with sidecar metadata, bloom dedup, manifest attestation
agentic AgentState, SharedKnowledge, Fact v0.8.2 CRDT containers for multi-agent orchestration
cli crdt-merge migrate v0.8.2 MergeKit YAML → crdt-merge Python migration CLI

27 core modules + 8 ecosystem accelerators, ~32,000 lines of source, zero required dependencies.


API Reference

merge() — The Main Entry Point

from crdt_merge import merge

result = merge(
    df_a,                    # First dataset (list of dicts, pandas DataFrame, or Polars DataFrame)
    df_b,                    # Second dataset
    key="id",                # Column to match rows on (raises KeyError if not found)
    prefer="latest",         # "a", "b", or "latest" — conflict resolution (raises ValueError if invalid)
    schema=None,             # Optional MergeSchema for per-field strategies (overrides prefer)
    timestamp_col=None,      # Column with timestamps for LWW resolution
    dedup=True,              # Remove exact duplicates in output
    fuzzy_dedup=False,       # Also remove near-duplicates
    fuzzy_threshold=0.85,    # Similarity threshold for fuzzy dedup
)
  • When key is provided: rows with matching keys are merged, unique rows from both sides are preserved.
  • When key is None: datasets are appended and deduplication is applied.
  • When schema is provided: per-field strategies override the prefer parameter for matched rows.
  • Input/output format matches: pass pandas in → get pandas out. Pass list of dicts → get list of dicts.

merge_with_provenance() — Merge + Full Audit Trail

from crdt_merge import merge_with_provenance

merged, log = merge_with_provenance(
    df_a, df_b,
    key="id",
    schema=my_schema,          # Optional MergeSchema
    timestamp_col=None,        # Optional timestamp column
    as_dataframe=False,        # Set True to get pandas DataFrame output
)

# Inspect the audit trail
print(log.summary())           # Human-readable summary
print(log.total_conflicts)     # Number of field-level conflicts resolved
for entry in log.entries:
    print(entry.field, entry.winner, entry.strategy)

# Export
from crdt_merge import export_provenance
json_str = export_provenance(log, format="json")  # Returns string
csv_str = export_provenance(log, format="csv")     # Returns string
log_dict = log.to_dict()                           # Returns dict

merge_stream() — Streaming Merge

from crdt_merge import merge_stream, StreamStats

stats = StreamStats()
for batch in merge_stream(
    source_a,                # Iterable of dicts (streamed)
    source_b,                # Iterable of dicts (loaded into memory)
    key="id",
    batch_size=5000,
    schema=my_schema,        # Optional MergeSchema
    prefer="b",              # Optional prefer shorthand
    timestamp_col=None,
    stats=stats,
):
    process(batch)

print(f"{stats.rows_per_second:.0f} rows/s, {stats.batch_count} batches")

Memory: O(|source_b| + batch_size). Loads source_b fully for key lookup, streams source_a in batches.

merge_sorted_stream() — True O(1) Memory Merge

from crdt_merge import merge_sorted_stream

for batch in merge_sorted_stream(
    sorted_source_a,          # MUST be sorted by key ascending
    sorted_source_b,          # MUST be sorted by key ascending
    key="id",
    batch_size=5000,
    schema=my_schema,
    verify_order=True,        # Raises ValueError if sources aren't sorted
):
    process(batch)

Memory: O(batch_size). Never loads more than 1 row from each source at a time. Tested to 100M rows at 10.8 MB.
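
Under the hood this is the classic two-pointer merge over sorted iterators. A simplified sketch of the core loop (conflict resolution reduced to "b wins"; the real function also applies schemas and batching):

```python
# Two-pointer merge of record streams sorted by key, holding at most
# one row from each source at a time ("b wins" on key collisions).

def sorted_merge(source_a, source_b, key):
    a, b = iter(source_a), iter(source_b)
    row_a, row_b = next(a, None), next(b, None)
    while row_a is not None or row_b is not None:
        if row_b is None or (row_a is not None and row_a[key] < row_b[key]):
            yield row_a                        # A is strictly behind: emit it
            row_a = next(a, None)
        elif row_a is None or row_b[key] < row_a[key]:
            yield row_b                        # B is strictly behind: emit it
            row_b = next(b, None)
        else:                                  # equal keys: resolve the conflict
            yield row_b
            row_a, row_b = next(a, None), next(b, None)

a = ({"id": i, "v": f"a_{i}"} for i in range(4))
b = ({"id": i, "v": f"b_{i}"} for i in range(0, 4, 2))
print([r["v"] for r in sorted_merge(a, b, "id")])
# ['b_0', 'a_1', 'b_2', 'a_3']
```

Because each branch advances only the iterator it consumed from, memory stays constant regardless of input size.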

Composable Strategies

from crdt_merge.strategies import (
    MergeSchema, LWW, MaxWins, MinWins, UnionSet,
    Priority, Concat, LongestWins, Custom
)

# Declare per-field strategies
schema = MergeSchema(
    default=LWW(),                                    # Fallback for unspecified fields
    score=MaxWins(),                                   # Highest value wins
    rating=MinWins(),                                  # Lowest value wins
    tags=UnionSet(delimiter=";"),                       # Set union of delimited values
    status=Priority(order=["draft", "review", "live"]), # Priority ranking
    notes=Concat(delimiter="\n"),                       # Concatenate with dedup
    title=LongestWins(),                               # Longer string wins
    custom_field=Custom(fn=my_merge_fn),               # Your own function
)

# Apply to any merge function
result = merge(df_a, df_b, key="id", schema=schema)
merged, log = merge_with_provenance(df_a, df_b, key="id", schema=schema)
for batch in merge_stream(src_a, src_b, key="id", schema=schema):
    ...

# Serialize for storage/transmission
d = schema.to_dict()    # → {"__default__": {"strategy": "LWW"}, "score": {"strategy": "MaxWins"}, ...}
restored = MergeSchema.from_dict(d)

Note: Custom(fn) strategies cannot be serialized — to_dict() stores {"strategy": "Custom"} and from_dict() raises ValueError to prevent silent behavior change.

Delta Sync

from crdt_merge.delta import compute_delta, apply_delta, compose_deltas, DeltaStore

# Compute what changed between two versions
delta = compute_delta(old_records, new_records, key="id")
print(delta.size)  # Number of changes

# Apply delta to bring a remote node up to date
updated = apply_delta(remote_records, delta, key="id")

# Compose multiple deltas: delta(v1→v2) ⊔ delta(v2→v3) = delta(v1→v3)
combined = compose_deltas(delta_1, delta_2, key="id")

# DeltaStore tracks versions automatically
store = DeltaStore(key="id", node_id="node_a")
delta = store.ingest(new_records)  # Returns delta from previous state
print(store.version, store.size)

Note: DeltaStore is in-memory. Use Delta.to_dict() / Delta.from_dict() to persist deltas externally.
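
External persistence amounts to serializing the change set. The sketch below uses hypothetical stand-in implementations of delta computation and application over keyed records to illustrate the round trip; it is not the library's Delta format.

```python
import json

# Conceptual delta-state sketch over keyed records (not crdt-merge's format):
# a delta is just the rows that are new or changed between two versions.

def sketch_compute_delta(old, new, key):
    old_by_key = {r[key]: r for r in old}
    return [r for r in new if old_by_key.get(r[key]) != r]

def sketch_apply_delta(records, delta, key):
    merged = {r[key]: r for r in records}
    for r in delta:
        merged[r[key]] = r               # delta rows overwrite stale state
    return sorted(merged.values(), key=lambda r: r[key])

v1 = [{"id": 1, "score": 80}]
v2 = [{"id": 1, "score": 95}, {"id": 2, "score": 70}]

delta = sketch_compute_delta(v1, v2, "id")
wire = json.dumps(delta)                 # persist to disk / send over the network
assert sketch_apply_delta(v1, json.loads(wire), "id") == v2
```

With the real library the same shape applies: serialize Delta.to_dict() at the producer, rebuild with Delta.from_dict() at the consumer, then apply.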

Binary Wire Format

from crdt_merge import serialize, deserialize, peek_type, wire_size, serialize_batch, deserialize_batch

# Serialize any CRDT type to compact binary
gc = GCounter("node1")
gc.increment("node1", 100)
wire_bytes = serialize(gc, compress=True)   # zlib compression
restored = deserialize(wire_bytes)          # → GCounter with value 100

# Inspect without deserializing
type_name = peek_type(wire_bytes)           # → "g_counter"
info = wire_size(wire_bytes)                # → {total_bytes, header_bytes, payload_bytes, ...}

# Batch serialize/deserialize
batch_bytes = serialize_batch([gc, pn, lww])
objects = deserialize_batch(batch_bytes)    # → [GCounter, PNCounter, LWWRegister]

Supported types: GCounter, PNCounter, LWWRegister, ORSet, LWWMap, MergeableHLL, MergeableBloom, MergeableCMS, Delta.

Wire format specification: Deterministic byte layout with 12-byte header (magic, version, type, flags, length). Any language implementation that speaks this format can interoperate.
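
The spec above names the header fields (magic, version, type, flags, length) and its 12-byte size, but not the field widths. The sketch below assumes a layout (4-byte magic, 1-byte version, 2-byte type, 1-byte flags, 4-byte length) purely for illustration; the real byte layout is defined by the wire format specification.

```python
import struct

# Illustrative 12-byte header pack/unpack. The field ORDER comes from the
# spec (magic, version, type, flags, length); the field WIDTHS here
# (4s B H B I = 12 bytes, big-endian) are an assumption for this sketch.
HEADER = struct.Struct(">4sBHBI")
MAGIC = b"CRDT"                      # hypothetical magic value

def pack_header(version, type_id, flags, payload_len):
    return HEADER.pack(MAGIC, version, type_id, flags, payload_len)

def unpack_header(buf):
    magic, version, type_id, flags, length = HEADER.unpack(buf[:HEADER.size])
    if magic != MAGIC:
        raise ValueError("not a crdt-merge wire frame")
    return {"version": version, "type": type_id, "flags": flags, "length": length}

hdr = pack_header(version=1, type_id=3, flags=0, payload_len=256)
assert HEADER.size == 12
print(unpack_header(hdr))
```

A fixed, deterministic header like this is what lets peek_type() inspect a frame without deserializing the payload, and what makes cross-language interop a matter of agreeing on byte offsets.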

Probabilistic CRDTs

from crdt_merge import MergeableHLL, MergeableBloom, MergeableCMS

# HyperLogLog — estimate unique counts across distributed nodes
hll_a = MergeableHLL(precision=14)
hll_a.add_all(user_ids_node_a)
hll_b = MergeableHLL(precision=14)
hll_b.add_all(user_ids_node_b)
merged = hll_a.merge(hll_b)       # register-max merge
print(f"~{merged.cardinality():.0f} unique users (±0.81%)")

# Bloom filter — distributed membership testing
bloom_a = MergeableBloom(capacity=1_000_000, fp_rate=0.01)
bloom_a.add("blocked_ip")
bloom_b = MergeableBloom(capacity=1_000_000, fp_rate=0.01)
bloom_b.add("another_ip")
merged = bloom_a.merge(bloom_b)    # bitwise-OR merge

# Count-Min Sketch — distributed frequency estimation
cms_a = MergeableCMS(width=1000, depth=5)
cms_a.add("event_type", count=3)
cms_b = MergeableCMS(width=1000, depth=5)
cms_b.add("event_type", count=2)
merged = cms_a.merge(cms_b)       # element-wise max merge

All three structures satisfy CRDT merge properties and can be serialized via the wire format.

Verified Merge

from crdt_merge import verified_merge

@verified_merge(samples=100, key="id")
def my_merge(a, b, key="id"):
    return merge(a, b, key=key)

# Calling my_merge() automatically verifies:
# - Commutativity: my_merge(a, b) == my_merge(b, a)
# - Associativity: my_merge(my_merge(a, b), c) == my_merge(a, my_merge(b, c))
# - Idempotency: my_merge(a, a) == a
# Raises CRDTVerificationError if any property fails.

JSON Deep Merge

from crdt_merge import merge_dicts, merge_json_lines

# Deep merge two dicts with LWW semantics
merged = merge_dicts(
    {"user": {"name": "Alice", "score": 80}},
    {"user": {"name": "Alice", "score": 95, "level": 5}},
)
# → {"user": {"name": "Alice", "score": 95, "level": 5}}

# None values in B are treated as missing (A's value preserved)
merged = merge_dicts({"x": 10}, {"x": None})
# → {"x": 10}

# Merge JSONL files line by line
merged_lines = merge_json_lines(jsonl_a, jsonl_b, key="id")

Deduplication

from crdt_merge import dedup, dedup_records, DedupIndex, MinHashDedup

# Exact dedup on strings
unique = dedup(["hello", "world", "hello"])

# Fuzzy dedup on records
unique_records = dedup_records(records, key="title", threshold=0.85)

# MinHash for large-scale approximate dedup
mh = MinHashDedup(num_hashes=128, threshold=0.5)
unique = mh.dedup(items, text_fn=lambda x: x["description"])

# DedupIndex — CRDT-mergeable dedup state
idx_a = DedupIndex("node_a")
idx_a.add_exact("item1")
idx_b = DedupIndex("node_b")
idx_b.add_exact("item2")
merged_idx = idx_a.merge(idx_b)

Benchmarks — A100 Stress Tests

All benchmarks run on NVIDIA A100-SXM4-40GB (89.6 GB RAM, 12 vCPUs) via Google Colab.

v0.7.1 A100 results: 28 code cells, all passed · notebooks/crdt_merge_v071_a100_stress_test.ipynb · Results & graphs

v0.6.0 A100 results: 78 benchmarks, all passed · benchmarks/a100_v060/

v0.6.0 — Throughput Ceilings (A100)

Operation Peak Throughput Scale Tested Scaling
GCounter increment 4.14M ops/s 10K → 500K ✅ Flat
VectorClock ops 1.06M ops/s 100K → 2M ✅ Flat
Streaming merge 594K rows/s 50K → 1M 🟡 17% degradation
JSON merge (dicts) 530K ops/s 10K → 500K 🟡 Graceful
Gossip updates 474K ops/s 10K → 200K 🟡 State growth
JSON lines merge 456K ops/s 10K → 200K ✅ Nearly flat
HyperLogLog add 433K ops/s 100K → 2M ✅ Flat
Schema evolution 443K ops/s 1K → 20K cols ✅ Stable
Dedup strings 333K ops/s 100K → 2M 🟡 18% degradation
Bloom filter add 178K ops/s 100K → 2M ✅ Flat
Wire serialize batch 170K ops/s 1K → 50K ✅ Flat
MergeSchema merge 149K rows/s 10K → 200K ✅ Improves at scale
Merkle tree build 138K records/s 50K → 1M 🟡 22% degradation
Provenance merge 81K rows/s 50K → 500K ✅ Improves at scale
DataFrame merge 77K rows/s 100K → 10M ✅ 2% degradation

Throughput Scaling Grid

Polars Engine vs Python Merge (v0.7.1 — A100 Measured)

Rows Polars Engine Python Engine Speedup
10,000 0.238s 0.046s 0.2× ⚠️
50,000 0.007s 0.242s 32.8×
100,000 0.012s 0.445s 37.0×
500,000 0.060s 2.3s 38.8× 🏆
1,000,000 0.127s 4.5s 35.2×
5,000,000 1.0s 22.4s 22.5×
10,000,000 2.1s 44.5s 21.4×

โš ๏ธ Below ~15K rows, Polars lazy plan compilation overhead makes Python engine faster. engine="auto" handles this automatically.

Why Polars is faster: The Python engine calls to_pylist() to extract keys for set intersection — that serialization dominates ~90% of runtime. Polars compiles the full outer join + strategy resolution + null coalescing into a single Rust execution plan. Python never touches the data. Zero-copy in/out via the Arrow C Data Interface.

Sweet spot: 50K–1M rows (33–39× speedup). Above 5M rows, speedup tapers to ~21× as memory bandwidth saturates. Below 15K rows, the Python engine is faster due to Polars compilation overhead.

Opt in: pip install crdt-merge[fast] — falls back to the Python engine automatically if Polars is not installed.

Arrow vs Pandas Merge Performance

Rows Arrow Pandas Speedup
500,000 2.1s 5.3s 2.53×
5,000,000 21.3s 54.1s 2.55×
50,000,000 222.5s 550.6s 2.47×

Consistent 2.5× speedup at all scales. Arrow optimizes columnar data movement; per-field strategy resolution remains in Python. End-to-end 38.8× peak achieved in v0.7.1 by pushing strategy resolution into Polars (Rust). See above.


Key Findings

  • GCounter and VectorClock are the fastest primitives — over 1M ops/s with zero degradation at scale
  • DataFrame merge is rock-solid: 77K→75K rows/s from 100K to 10M rows (2% degradation across 100× scale)
  • MergeSchema and Provenance actually improve at scale as fixed overhead amortizes
  • Wire protocol: All 14 CRDT types serialize/deserialize correctly (50 B to 8.2 KB per type)
  • CRDT verification: All 5 tested types pass commutativity, associativity, and idempotency laws
  • 100-node gossip: Converges in 1 round
  • Parallel merge note: Python GIL + multiprocessing overhead makes parallel slower than sequential for in-process merges. Parallel shines for I/O-bound and multi-source workloads.

v0.7.1 — Key Findings (A100)

  • Polars engine peak: 8.4M rows/s at 500K rows (38.8× over Python engine) — the entire merge runs in Rust
  • Python engine: 225K rows/s flat across all scales — 3× improvement over v0.6.0's 77K rows/s baseline
  • Streaming merge: 1.5M rows/s dead flat from 100K to 5M rows — O(1) memory proven at scale
  • CRDT primitives: 3.5M ops/s (GCounter) in pure Python — no FFI needed
  • Sweet spot: 50K–1M rows for Polars (33–39× speedup). Below ~15K rows, Python engine is faster due to Polars lazy plan compilation overhead
  • Speedup tapers above 5M rows to ~21× as memory bandwidth saturates — still massive but not linear
  • CRDT laws: 3/3 verified (commutativity, associativity, idempotency) across 500 trials

Previous Versions

v0.3.0 — Core merge 39–42K rows/s, merge_sorted_stream() 1.2M rows/s at O(1) memory (173 measurements). v0.4.0 — Provenance & verification (55 measurements). v0.5.0 — Wire format & probabilistic (50 measurements).

Notebooks: All available in notebooks/ for independent reproduction on Google Colab.


Cross-Language Ports

crdt-merge follows a reference + protocol architecture:

Language Package Version Status
Python (reference) crdt-merge v0.8.2 ✅ Full feature set + 25 model merge strategies + CRDT architecture + 8 accelerators + Polars engine + Context Memory + Agentic AI
TypeScript crdt-merge v0.2.0 Core CRDTs + merge
Rust crdt-merge v0.2.0 Core CRDTs + merge
Java crdt-merge v0.2.0 Source complete

Architecture: Python is the reference implementation where all innovation starts. The Rust crate will become a thin protocol engine (~1,000 lines) implementing wire format + merge semantics. FFI wrappers (Go, C#, Swift) will wrap the protocol engine. The golden rule: Python serialize → Rust deserialize → merge → serialize → Python deserialize must roundtrip perfectly.


Roadmap

Released

Version Name Highlights
v0.1.0 Initial Release Core CRDTs, merge, diff, dedup, JSON merge, HuggingFace integration
v0.2.0 License Update BSL-1.1 + patent protection
v0.3.0 The Schema Release MergeSchema DSL, 8 strategies, streaming merge, delta sync
v0.4.0 The Audit Release Provenance tracking, @verified_merge, streaming optimizations
v0.5.0 The Protocol Release Binary wire format, probabilistic CRDTs (HLL, Bloom, CMS)
v0.6.0 The Architecture Release HLC clocks, schema evolution, Merkle trees, Arrow-native merge, gossip protocol, async/parallel merge, multi-key composites
v0.7.0 The Ecosystem Release MergeQL, self-merging Parquet, conflict visualization, 8 ecosystem accelerators (DuckDB, dbt, Polars, Arrow Flight, Airbyte, SQLite, Streamlit, DuckLake)
v0.7.1 The Polars Engine Release Polars merge engine (38.8× on A100), shared _polars_engine.py, engine="auto" fallback, zero-dependency core preserved
v0.8.0 The Intelligence Release ModelCRDT — 25 model merge strategies (TIES/DARE/SLERP/LoRA), per-parameter provenance, conflict heatmaps, safety analyzer, federated bridge, GPU acceleration. 1,923 tests.
v0.8.1 The CRDT Architecture Release Two-layer CRDT architecture — CRDTMergeState with OR-Set semantics, all 25 strategies provably CRDT-compliant, SHA-256 Merkle hashing, version vectors, 195 new tests (2,118 total).
v0.8.2 The Adoption Release Context Memory System (manifests+sidecars+bloom), Agentic AI State Merge, MergeKit Migration CLI, comprehensive API docs.

Upcoming

Version Name Key Features
v0.9.0 The Enterprise Release UnmergeEngine, EU AI Act compliance report generator (⚠️ Aug 2026 deadline), model unmerging, encryption, RBAC, observability
v1.0.0 The Platform Release API freeze, formal spec, security audit, cross-language port sync, comprehensive docs

Full roadmap: docs/roadmap/roadmap_v2_0.md


Known Limitations

These are honest constraints of the current version:

| Limitation | Details | Planned Fix |
|---|---|---|
| Python dict merge path | `merge()` converts DataFrames to list-of-dicts internally. Slow for >1M rows. | ✅ Resolved in v0.6.0 — Arrow-native engine (2.5× measured on A100); MergeQL + DuckDB UDF in v0.7.0 for a SQL-native path |
| No type system | Strategies operate on `Any`. No type checking during merge. | ✅ Resolved in v0.6.0 — schema evolution with column mapping + type coercion |
| Single-threaded | All operations are synchronous, single-threaded Python. | ✅ Resolved in v0.6.0 — async wrappers + parallel merge execution |
| Single key column | `merge()` supports one key column only. Composite keys require manual concatenation. | ✅ Resolved in v0.6.0 — multi-key composite merges |
| No persistence | `DeltaStore` is in-memory; state is lost on process exit. Use `to_dict()`/`from_dict()` to serialize externally. | By design — persistence belongs in the application layer |
| No networking | The gossip protocol provides state tracking but no built-in transport. The wire format enables interop. | By design — transport belongs in the application layer |
| O(n²) fuzzy dedup | `_fuzzy_dedup_records` compares every record against all existing records. Unusable above ~10K records. | Algorithmic improvement planned |
| Wire format omits MergeSchema | CRDTs and Deltas can be serialized, but MergeSchema cannot be sent over the wire. | Planned for a future release |
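The external-persistence workaround for `DeltaStore` can be sketched as follows. `DeltaStore` and its `to_dict()`/`from_dict()` methods are named in the limitation above, but this toy class and its `record()` method are illustrative stand-ins, not the library's actual API:

```python
import json
import os
import tempfile

# Toy stand-in mimicking the to_dict()/from_dict() persistence pattern.
class ToyDeltaStore:
    def __init__(self, deltas=None):
        self.deltas = deltas or {}

    def record(self, key, value):
        self.deltas[key] = value

    def to_dict(self):
        return {"deltas": self.deltas}

    @classmethod
    def from_dict(cls, d):
        return cls(dict(d["deltas"]))

store = ToyDeltaStore()
store.record("user:1", {"name": "Ada"})

# Persist to disk on shutdown...
path = os.path.join(tempfile.mkdtemp(), "deltas.json")
with open(path, "w") as f:
    json.dump(store.to_dict(), f)

# ...and restore on startup.
with open(path) as f:
    restored = ToyDeltaStore.from_dict(json.load(f))
assert restored.deltas == store.deltas
```

Any durable key-value store works equally well as the backend, since the library deliberately leaves the storage choice to the application.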

Installation

```bash
# Core — zero dependencies
pip install crdt-merge

# With optional dependencies for heavy workloads
pip install crdt-merge[fast]       # Polars merge engine (38.8× on A100)
pip install crdt-merge[pandas]     # pandas DataFrame support
pip install crdt-merge[polars]     # Polars DataFrame support
pip install crdt-merge[datasets]   # HuggingFace Datasets
pip install crdt-merge[model]      # numpy for model merge strategies
pip install crdt-merge[gpu]        # torch for GPU-accelerated model merging
pip install crdt-merge[all]        # Everything
pip install crdt-merge[dev]        # pytest + hypothesis for development

# Ecosystem accelerators (optional — each wraps an external tool)
pip install crdt-merge[duckdb]     # DuckDB UDF + MergeQL
pip install crdt-merge[dbt]        # dbt CRDT models
pip install crdt-merge[flight]     # Arrow Flight merge server
pip install crdt-merge[airbyte]    # Airbyte destination connector
pip install crdt-merge[streamlit]  # Visual merge UI
pip install crdt-merge[sqlite]     # SQLite extension
```
Requirements: Python 3.9+. No required dependencies. Works on Linux, macOS, Windows.


Test Results

2,118 tests across 60+ test files, all passing. v0.8.1 added 195 new CRDT architecture tests verifying all 25 strategies × 3 CRDT laws.

| Test File | Tests | Status |
|---|---|---|
| test_core.py | 30 | ✅ |
| test_dataframe.py | 14 | ✅ |
| test_dedup.py | 15 | ✅ |
| test_json_merge.py | 10 | ✅ |
| test_strategies.py | 39 | ✅ |
| test_streaming.py | 19 | ✅ |
| test_provenance.py | 24 | ✅ |
| test_verified_merge.py | 10 | ✅ |
| test_wire.py | 40 | ✅ |
| test_probabilistic.py | 42 | ✅ |
| test_v050_integration.py | 157 | ✅ |
| test_longest_wins.py | 11 | ✅ |
| test_stress_v030.py | 8 | ✅ |
| test_benchmark.py | 6 | ✅ |
| test_clocks.py | 40 | ✅ |
| test_schema_evolution.py | 40 | ✅ |
| test_merkle.py | 40 | ✅ |
| test_arrow.py | 40 | ✅ |
| test_gossip.py | 40 | ✅ |
| test_async_merge.py | 40 | ✅ |
| test_parallel.py | 40 | ✅ |
| test_v060_integration.py | 18 | ✅ |
| test_architect_360_validation.py | 5 | ✅ |
| test_mergeql.py | 34 | ✅ |
| test_parquet.py | 32 | ✅ |
| test_viz.py | 16 | ✅ |
| test_wire_v070.py | 35 | ✅ |
| test_accelerator_duckdb.py | 34 | ✅ |
| test_accelerator_dbt.py | 42 | ✅ |
| test_accelerator_ducklake.py | 38 | ✅ |
| test_accelerator_polars.py | 36 | ✅ |
| test_accelerator_flight.py | 43 | ✅ |
| test_accelerator_airbyte.py | 47 | ✅ |
| test_accelerator_sqlite.py | 44 | ✅ |
| test_accelerator_streamlit.py | 38 | ✅ |
| test_multi_key.py | 8 | ✅ |
| test_polars_engine.py | 30 | ✅ |
| test_crdt_merge_state.py | 95 | ✅ |
| test_crdt_architecture.py | 100 | ✅ |

Version history:

| Version | Tests | Growth |
|---|---|---|
| v0.1.0 | 45 | — |
| v0.2.0 | 88 | +43 |
| v0.3.0 | 133 | +45 |
| v0.4.0 | 277 | +144 |
| v0.5.0 | 425 | +148 |
| v0.6.0 | 720 | +295 |
| v0.7.0 | 1,114 | +394 |
| v0.7.1 | 1,148 | +34 |
| v0.8.0 | 1,923 | +775 |
| v0.8.1 | 2,118 | +195 |

Full details: TEST_RESULTS.md


Project Stats

| Metric | Value |
|---|---|
| Core modules | 51 (24 tabular + 20 model + 2 CRDT architecture + 5 context + 1 agentic + 2 CLI) |
| Ecosystem accelerators | 8 |
| Model merge strategies | 25 |
| Source lines | ~34,000 |
| Test files | 60+ |
| Tests passing | 2,118+ |
| Dependencies | 0 (required) |
| Merge domains | Tabular data, Model weights, Agent memory, CRDT-verified model merging |
| Python versions | 3.9, 3.10, 3.11, 3.12 |
| License | BSL-1.1 |

๐Ÿ“š Documentation


Contributing

We welcome contributions. Please see CLA.md before submitting pull requests.

Commercial licensing & inquiries:


License

Copyright 2026 Ryan Gillespie

Licensed under the Business Source License 1.1 (BSL 1.1). See LICENSE for the full text.

Patent grant included โ€” see PATENTS.
