The Merge Algebra Toolkit — composable, streaming, verified CRDT merge for datasets
Project description
crdt-merge
Conflict-free merge for structured data. Define strategies. Merge datasets. Prove correctness. Audit every field. Stream at any scale. Zero dependencies.
What's New in v0.7.0
v0.7.0 — "The Ecosystem Release" (2026-03-28)
14 new modules. 8 ecosystem accelerators. 1,114 tests. Zero regressions.
- MergeQL — SQL-like interface for CRDT merges: MERGE t1, t2 ON id STRATEGY score='max'
- Self-Merging Parquet — Parquet files with embedded CRDT metadata that merge themselves
- Conflict Topology Visualization — heatmaps, temporal analysis, cluster detection, D3-compatible JSON export
- Wire Protocol v3 — support for all new v0.7.0 types
8 Ecosystem Accelerators
| # | Accelerator | What It Does |
|---|---|---|
| 🦆 | DuckDB UDF | Register CRDT merge as native DuckDB SQL functions |
| 🔧 | dbt Package | CRDT merge as dbt models — merge in your warehouse |
| 🦆 | DuckLake | Semantic conflict detection for DuckLake catalogs |
| 🐻 | Polars Plugin | Native Polars expression plugin for CRDT ops |
| ✈️ | Arrow Flight | Merge-as-a-service over Arrow Flight RPC |
| 🔌 | Airbyte | CRDT-aware Airbyte destination connector |
| 💾 | SQLite Extension | CRDT merge as SQLite custom functions |
| 📊 | Streamlit | Visual merge UI — drag, drop, resolve conflicts |
Previous: v0.6.0 — "The Architecture Release"
- 7 new modules: clocks, schema evolution, merkle, arrow, gossip, async_merge, parallel
- Arrow-native merge engine — 2.5× measured speedup on A100 (50M rows)
- 720 tests, zero regressions
What is crdt-merge?
crdt-merge is a Python library for merging datasets with conflict resolution. Instead of "last write wins" or manual dedup scripts, you declare per-field merge strategies — and the library guarantees deterministic, order-independent results with full audit trails.
pip install crdt-merge
What it does
- Merges datasets with configurable per-field strategies (max wins, min wins, union, priority, custom)
- Streams merges at any scale — O(1) memory for sorted sources, tested to 100M rows
- Proves correctness — @verified_merge decorator verifies commutativity, associativity, idempotency
- Audits everything — per-field provenance trails show exactly which source won each field and why
- Serializes for the wire — compact binary format for cross-language CRDT exchange
- Speaks SQL — MergeQL lets you express merges as SQL statements
- Plugs into everything — DuckDB, dbt, Polars, Arrow Flight, Airbyte, SQLite, Streamlit accelerators
- Zero dependencies — pure Python core, embeds anywhere
What it is NOT
- Not a real-time collaboration tool. For collaborative text editing, see Yjs or Loro.
- Not a database. No persistence, no queries, no networking. It's a library.
- Not a distributed system. Includes gossip state tracking and Merkle sync primitives (v0.6.0), but no built-in networking or consensus. It provides the building blocks that distributed systems can use.
Who is it for?
Anyone merging structured data from multiple sources: data pipelines, ETL, multi-node sync, offline-first apps, federated datasets. The realistic alternatives today are pandas.merge() (which has no conflict resolution) or hand-written dedup scripts.
Quick Start
1. Basic Merge
from crdt_merge import merge
# Two datasets with conflicting scores for the same person
dataset_a = [{"id": 1, "name": "Alice", "score": 100}]
dataset_b = [{"id": 1, "name": "Alice", "score": 150}]
result = merge(dataset_a, dataset_b, key="id")
# → [{"id": 1, "name": "Alice", "score": 150}]
By default, when two rows share the same key, the second dataset wins ties. You can control this with prefer="a", prefer="b", or prefer="latest" (uses timestamps).
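A quick sketch of the other prefer options, reusing the same call (the updated_at column name below is illustrative, not required):

```python
# prefer="a" keeps the first dataset's value on conflicts
result = merge(dataset_a, dataset_b, key="id", prefer="a")
# → [{"id": 1, "name": "Alice", "score": 100}]

# prefer="latest" resolves conflicts by timestamp via timestamp_col
rows_a = [{"id": 1, "score": 100, "updated_at": "2026-01-02"}]
rows_b = [{"id": 1, "score": 150, "updated_at": "2026-01-01"}]
result = merge(rows_a, rows_b, key="id", prefer="latest", timestamp_col="updated_at")
# → rows_a wins the score field because its timestamp is newer
```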
2. Per-Field Strategies with MergeSchema
The real power is declaring different strategies for different fields:
from crdt_merge import merge
from crdt_merge.strategies import MergeSchema, MaxWins, LWW, UnionSet
schema = MergeSchema(
default=LWW(), # Most fields: last-writer-wins
score=MaxWins(), # Scores: highest value wins
tags=UnionSet() # Tags: merge as set union
)
a = [{"id": 1, "score": 80, "tags": "python;data", "status": "draft"}]
b = [{"id": 1, "score": 95, "tags": "python;ml", "status": "published"}]
result = merge(a, b, key="id", schema=schema)
# score: 95 (MaxWins), tags: "data;ml;python" (UnionSet), status: "published" (LWW)
8 built-in strategies: LWW, MaxWins, MinWins, UnionSet, Priority, Concat, LongestWins, Custom(fn)
3. Streaming Merge at Scale
from crdt_merge import merge_sorted_stream
# Merge two sorted sources with O(1) memory — works with generators from disk/network
def source_a():
for i in range(10_000_000):
yield {"id": i, "value": f"a_{i}"}
def source_b():
for i in range(0, 10_000_000, 2):
yield {"id": i, "value": f"b_{i}"}
for batch in merge_sorted_stream(source_a(), source_b(), key="id", batch_size=10000):
process(batch) # Memory never exceeds batch_size
4. CRDT Primitives
from crdt_merge import GCounter, PNCounter, LWWRegister, ORSet, LWWMap
# Grow-only counter — nodes increment independently, merge via max-per-node
a = GCounter("node_a")
a.increment("node_a", 10)
b = GCounter("node_b")
b.increment("node_b", 5)
merged = a.merge(b)
print(merged.value) # 15 — guaranteed correct regardless of merge order
All primitives satisfy CRDT properties: commutative (a ⊔ b = b ⊔ a), associative ((a ⊔ b) ⊔ c = a ⊔ (b ⊔ c)), idempotent (a ⊔ a = a).
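A quick sanity check of those laws on GCounter, comparing merged values rather than object identity (a sketch using the same increment/merge calls shown above):

```python
from crdt_merge import GCounter

a, b, c = GCounter("a"), GCounter("b"), GCounter("c")
a.increment("a", 3); b.increment("b", 4); c.increment("c", 5)

assert a.merge(b).value == b.merge(a).value                    # commutative
assert a.merge(b).merge(c).value == a.merge(b.merge(c)).value  # associative
assert a.merge(a).value == a.value                             # idempotent
```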
5. MergeQL — SQL-Like Merges (v0.7.0)
from crdt_merge.mergeql import MergeQL
ql = MergeQL()
ql.register("nyc", [{"id": 1, "name": "Alice", "salary": 100000}])
ql.register("london", [{"id": 1, "name": "Alice", "salary": 120000}])
result = ql.execute("""
MERGE nyc, london
ON id
STRATEGY salary='max', name='lww'
""")
# salary: 120000 (max wins), name: "Alice" (LWW)
6. Self-Merging Parquet (v0.7.0)
from crdt_merge.parquet import SelfMergingParquet
from crdt_merge.strategies import MergeSchema, LWW, MaxWins
schema = MergeSchema(default=LWW(), salary=MaxWins())
smf = SelfMergingParquet("customers", key="id", schema=schema)
smf.ingest([{"id": 1, "name": "Alice", "salary": 100}])
smf.ingest([{"id": 1, "name": "Alicia", "salary": 120}])
assert smf.read()[0]["salary"] == 120 # MaxWins applied automatically
Feature Matrix
| Module | Feature | Since | Description |
|---|---|---|---|
| core | GCounter, PNCounter, LWWRegister, ORSet, LWWMap | v0.1.0 | CRDT primitives with merge, serialize, deserialize |
| dataframe | merge(), diff() | v0.1.0 | Dataset merge with key matching, conflict resolution, schema support |
| dedup | dedup_list(), dedup_records(), MinHashDedup | v0.1.0 | Exact, fuzzy (bigram), and MinHash deduplication |
| json_merge | merge_dicts(), merge_json_lines() | v0.1.0 | Deep dict merge with LWW, None-as-missing semantics |
| strategies | MergeSchema + 8 strategies | v0.3.0 | Per-field strategy DSL: LWW, MaxWins, MinWins, UnionSet, Priority, Concat, LongestWins, Custom |
| streaming | merge_stream(), merge_sorted_stream() | v0.3.0 | O(batch_size) and O(1) memory streaming merge |
| delta | Delta, DeltaStore, compute_delta(), compose_deltas() | v0.3.0 | Delta-state CRDT sync — compute, apply, compose deltas |
| provenance | merge_with_provenance(), ProvenanceLog | v0.4.0 | Per-field audit trail: which source won, which strategy, why |
| verify | @verified_merge decorator | v0.4.0 | Property-based testing of commutativity, associativity, idempotency |
| wire | serialize(), deserialize(), serialize_batch() | v0.5.0 | Compact binary wire format for all CRDT types |
| probabilistic | MergeableHLL, MergeableBloom, MergeableCMS | v0.5.0 | Probabilistic data structures with CRDT merge semantics |
| datasets_ext | merge_datasets(), dedup_dataset() | v0.1.0 | HuggingFace Datasets integration (optional) |
| clocks | HybridLogicalClock, HLC timestamps | v0.6.0 | Hybrid Logical Clocks for distributed CRDT ordering |
| schema_evolution | Column mapping, type coercion | v0.6.0 | Automatic schema evolution for mismatched datasets |
| merkle | MerkleHashTree, diff, sync | v0.6.0 | Merkle hash trees for efficient incremental sync |
| arrow | Arrow-native merge engine | v0.6.0 | Apache Arrow-native merge path (2.5× measured on A100) |
| gossip | GossipState, anti-entropy protocol | v0.6.0 | Gossip protocol state tracking for convergence |
| async_merge | async_merge(), async_stream() | v0.6.0 | Async/await wrappers for non-blocking merges |
| parallel | parallel_merge(), multi-core execution | v0.6.0 | Parallel merge execution across multiple cores |
| mergeql | MergeQL SQL interface | v0.7.0 | SQL-like CRDT merge: MERGE t1, t2 ON id STRATEGY score='max' |
| parquet | SelfMergingParquet | v0.7.0 | Parquet files with embedded CRDT metadata that self-merge |
| viz | ConflictTopology, heatmaps | v0.7.0 | Conflict analysis: heatmaps, temporal patterns, cluster detection |
Ecosystem Accelerators (v0.7.0):
| Accelerator | Module | Description |
|---|---|---|
| 🦆 DuckDB UDF | accelerators.duckdb_udf | CRDT merge as native DuckDB SQL functions |
| 🔧 dbt Package | accelerators.dbt_package | CRDT merge models for dbt-managed warehouses |
| 🦆 DuckLake | accelerators.ducklake | Semantic conflict detection for DuckLake catalogs |
| 🐻 Polars Plugin | accelerators.polars_plugin | Native Polars expressions for CRDT operations |
| ✈️ Arrow Flight | accelerators.flight_server | Merge-as-a-service over Arrow Flight RPC |
| 🔌 Airbyte | accelerators.airbyte | CRDT-aware Airbyte destination connector |
| 💾 SQLite Extension | accelerators.sqlite_ext | CRDT merge as SQLite custom functions |
| 📊 Streamlit UI | accelerators.streamlit_ui | Visual merge interface with conflict resolution |
23 core modules + 8 ecosystem accelerators, ~17,200 lines of source, zero required dependencies.
API Reference
merge() — The Main Entry Point
from crdt_merge import merge
result = merge(
df_a, # First dataset (list of dicts, pandas DataFrame, or Polars DataFrame)
df_b, # Second dataset
key="id", # Column to match rows on (raises KeyError if not found)
prefer="latest", # "a", "b", or "latest" — conflict resolution (raises ValueError if invalid)
schema=None, # Optional MergeSchema for per-field strategies (overrides prefer)
timestamp_col=None, # Column with timestamps for LWW resolution
dedup=True, # Remove exact duplicates in output
fuzzy_dedup=False, # Also remove near-duplicates
fuzzy_threshold=0.85, # Similarity threshold for fuzzy dedup
)
- When key is provided: rows with matching keys are merged, and unique rows from both sides are preserved.
- When key is None: the datasets are appended and deduplication is applied (see the sketch after this list).
- When schema is provided: per-field strategies override the prefer parameter for matched rows.
- Input/output format matches the input: pass pandas in → get pandas out; pass a list of dicts → get a list of dicts.
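A minimal sketch of the key=None path (append plus exact dedup, no per-row conflict resolution):

```python
from crdt_merge import merge

a = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
b = [{"id": 2, "name": "Bob"}, {"id": 3, "name": "Cara"}]

combined = merge(a, b, key=None)  # datasets appended, exact duplicates removed
# → three unique rows: Alice, Bob, Cara
```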
merge_with_provenance() — Merge + Full Audit Trail
from crdt_merge import merge_with_provenance
merged, log = merge_with_provenance(
df_a, df_b,
key="id",
schema=my_schema, # Optional MergeSchema
timestamp_col=None, # Optional timestamp column
as_dataframe=False, # Set True to get pandas DataFrame output
)
# Inspect the audit trail
print(log.summary()) # Human-readable summary
print(log.total_conflicts) # Number of field-level conflicts resolved
for entry in log.entries:
print(entry.field, entry.winner, entry.strategy)
# Export
from crdt_merge import export_provenance
json_str = export_provenance(log, format="json") # Returns string
csv_str = export_provenance(log, format="csv") # Returns string
log_dict = log.to_dict() # Returns dict
merge_stream() — Streaming Merge
from crdt_merge import merge_stream, StreamStats
stats = StreamStats()
for batch in merge_stream(
source_a, # Iterable of dicts (streamed)
source_b, # Iterable of dicts (loaded into memory)
key="id",
batch_size=5000,
schema=my_schema, # Optional MergeSchema
prefer="b", # Optional prefer shorthand
timestamp_col=None,
stats=stats,
):
process(batch)
print(f"{stats.rows_per_second:.0f} rows/s, {stats.batch_count} batches")
Memory: O(|source_b| + batch_size). Loads source_b fully for key lookup, streams source_a in batches.
merge_sorted_stream() — True O(1) Memory Merge
from crdt_merge import merge_sorted_stream
for batch in merge_sorted_stream(
sorted_source_a, # MUST be sorted by key ascending
sorted_source_b, # MUST be sorted by key ascending
key="id",
batch_size=5000,
schema=my_schema,
verify_order=True, # Raises ValueError if sources aren't sorted
):
process(batch)
Memory: O(batch_size). Never loads more than 1 row from each source at a time. Tested to 100M rows at 10.8 MB.
Composable Strategies
from crdt_merge.strategies import (
MergeSchema, LWW, MaxWins, MinWins, UnionSet,
Priority, Concat, LongestWins, Custom
)
# Declare per-field strategies
schema = MergeSchema(
default=LWW(), # Fallback for unspecified fields
score=MaxWins(), # Highest value wins
rating=MinWins(), # Lowest value wins
tags=UnionSet(delimiter=";"), # Set union of delimited values
status=Priority(order=["draft", "review", "live"]), # Priority ranking
notes=Concat(delimiter="\n"), # Concatenate with dedup
title=LongestWins(), # Longer string wins
custom_field=Custom(fn=my_merge_fn), # Your own function
)
# Apply to any merge function
result = merge(df_a, df_b, key="id", schema=schema)
merged, log = merge_with_provenance(df_a, df_b, key="id", schema=schema)
for batch in merge_stream(src_a, src_b, key="id", schema=schema):
...
# Serialize for storage/transmission
d = schema.to_dict() # → {"__default__": {"strategy": "LWW"}, "score": {"strategy": "MaxWins"}, ...}
restored = MergeSchema.from_dict(d)
Note: Custom(fn) strategies cannot be serialized — to_dict() stores {"strategy": "Custom"} and from_dict() raises ValueError to prevent silent behavior change.
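To make that concrete, a sketch assuming the custom function receives the two candidate values for the field:

```python
from crdt_merge.strategies import MergeSchema, LWW, Custom

def prefer_nonempty(a, b):
    # assumed signature: the two candidate field values
    return a if a not in (None, "") else b

schema = MergeSchema(default=LWW(), nickname=Custom(fn=prefer_nonempty))

d = schema.to_dict()        # the nickname entry records only {"strategy": "Custom"}
# MergeSchema.from_dict(d)  # raises ValueError — the function itself is never serialized
```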
Delta Sync
from crdt_merge.delta import compute_delta, apply_delta, compose_deltas, DeltaStore
# Compute what changed between two versions
delta = compute_delta(old_records, new_records, key="id")
print(delta.size) # Number of changes
# Apply delta to bring a remote node up to date
updated = apply_delta(remote_records, delta, key="id")
# Compose multiple deltas: delta(v1→v2) ⊔ delta(v2→v3) = delta(v1→v3)
combined = compose_deltas(delta_1, delta_2, key="id")
# DeltaStore tracks versions automatically
store = DeltaStore(key="id", node_id="node_a")
delta = store.ingest(new_records) # Returns delta from previous state
print(store.version, store.size)
Note: DeltaStore is in-memory. Use Delta.to_dict() / Delta.from_dict() to persist deltas externally.
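A minimal persistence sketch using that to_dict()/from_dict() round-trip, continuing from the example above (the file name is illustrative; Delta is assumed importable from crdt_merge.delta alongside the helpers):

```python
import json
from crdt_merge.delta import Delta, apply_delta

# Persist the delta computed earlier
with open("delta_v1_v2.json", "w") as f:
    json.dump(delta.to_dict(), f)

# Later, or on another node: reload and apply
with open("delta_v1_v2.json") as f:
    restored = Delta.from_dict(json.load(f))

updated = apply_delta(remote_records, restored, key="id")
```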
Binary Wire Format
from crdt_merge import serialize, deserialize, peek_type, wire_size, serialize_batch, deserialize_batch
# Serialize any CRDT type to compact binary
gc = GCounter("node1")
gc.increment("node1", 100)
wire_bytes = serialize(gc, compress=True) # zlib compression
restored = deserialize(wire_bytes) # → GCounter with value 100
# Inspect without deserializing
type_name = peek_type(wire_bytes) # → "g_counter"
info = wire_size(wire_bytes) # → {total_bytes, header_bytes, payload_bytes, ...}
# Batch serialize/deserialize
batch_bytes = serialize_batch([gc, pn, lww])
objects = deserialize_batch(batch_bytes) # → [GCounter, PNCounter, LWWRegister]
Supported types: GCounter, PNCounter, LWWRegister, ORSet, LWWMap, MergeableHLL, MergeableBloom, MergeableCMS, Delta.
Wire format specification: Deterministic byte layout with 12-byte header (magic, version, type, flags, length). Any language implementation that speaks this format can interoperate.
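For illustration, here is how a foreign implementation might split header from payload. The header is documented as 12 bytes (magic, version, type, flags, length), but the field widths and byte order below are assumptions, not the published layout:

```python
import struct

# Assumed layout: 4-byte magic, 1-byte version, 1-byte type, 2-byte flags, 4-byte payload length
magic, version, type_id, flags, length = struct.unpack("<4sBBHI", wire_bytes[:12])
payload = wire_bytes[12:12 + length]
```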
Probabilistic CRDTs
from crdt_merge import MergeableHLL, MergeableBloom, MergeableCMS
# HyperLogLog — estimate unique counts across distributed nodes
hll_a = MergeableHLL(precision=14)
hll_a.add_all(user_ids_node_a)
hll_b = MergeableHLL(precision=14)
hll_b.add_all(user_ids_node_b)
merged = hll_a.merge(hll_b) # register-max merge
print(f"~{merged.cardinality():.0f} unique users (±0.81%)")
# Bloom filter — distributed membership testing
bloom = MergeableBloom(capacity=1_000_000, fp_rate=0.01)
bloom.add("blocked_ip")
merged = bloom_a.merge(bloom_b) # bitwise-OR merge
# Count-Min Sketch — distributed frequency estimation
cms = MergeableCMS(width=1000, depth=5)
cms.add("event_type", count=3)
merged = cms_a.merge(cms_b) # element-wise max merge
All three structures satisfy CRDT merge properties and can be serialized via the wire format.
Verified Merge
from crdt_merge import verified_merge
@verified_merge(samples=100, key="id")
def my_merge(a, b, key="id"):
return merge(a, b, key=key)
# Calling my_merge() automatically verifies:
# - Commutativity: my_merge(a, b) == my_merge(b, a)
# - Associativity: my_merge(my_merge(a, b), c) == my_merge(a, my_merge(b, c))
# - Idempotency: my_merge(a, a) == a
# Raises CRDTVerificationError if any property fails.
JSON Deep Merge
from crdt_merge import merge_dicts, merge_json_lines
# Deep merge two dicts with LWW semantics
merged = merge_dicts(
{"user": {"name": "Alice", "score": 80}},
{"user": {"name": "Alice", "score": 95, "level": 5}},
)
# → {"user": {"name": "Alice", "score": 95, "level": 5}}
# None values in B are treated as missing (A's value preserved)
merged = merge_dicts({"x": 10}, {"x": None})
# → {"x": 10}
# Merge JSONL files line by line
merged_lines = merge_json_lines(jsonl_a, jsonl_b, key="id")
Deduplication
from crdt_merge import dedup, dedup_records, DedupIndex, MinHashDedup
# Exact dedup on strings
unique = dedup(["hello", "world", "hello"])
# Fuzzy dedup on records
unique_records = dedup_records(records, key="title", threshold=0.85)
# MinHash for large-scale approximate dedup
mh = MinHashDedup(num_hashes=128, threshold=0.5)
unique = mh.dedup(items, text_fn=lambda x: x["description"])
# DedupIndex — CRDT-mergeable dedup state
idx_a = DedupIndex("node_a")
idx_a.add_exact("item1")
idx_b = DedupIndex("node_b")
idx_b.add_exact("item2")
merged_idx = idx_a.merge(idx_b)
Benchmarks — A100 Stress Tests
All benchmarks run on NVIDIA A100-SXM4-40GB (83.5 GB RAM, 12 vCPUs) via Google Colab. 78 benchmarks, all passed.
Full data: benchmarks/a100_v060/ · Reproduction notebook: notebooks/v060_a100_stress_test.ipynb
v0.6.0 — Throughput Ceilings (A100)
| Operation | Peak Throughput | Scale Tested | Scaling |
|---|---|---|---|
| GCounter increment | 4.14M ops/s | 10K → 500K | ✅ Flat |
| VectorClock ops | 1.06M ops/s | 100K → 2M | ✅ Flat |
| Streaming merge | 594K rows/s | 50K → 1M | 🟡 17% degradation |
| JSON merge (dicts) | 530K ops/s | 10K → 500K | 🟡 Graceful |
| Gossip updates | 474K ops/s | 10K → 200K | 🟡 State growth |
| JSON lines merge | 456K ops/s | 10K → 200K | ✅ Nearly flat |
| HyperLogLog add | 433K ops/s | 100K → 2M | ✅ Flat |
| Schema evolution | 443K ops/s | 1K → 20K cols | ✅ Stable |
| Dedup strings | 333K ops/s | 100K → 2M | 🟡 18% degradation |
| Bloom filter add | 178K ops/s | 100K → 2M | ✅ Flat |
| Wire serialize batch | 170K ops/s | 1K → 50K | ✅ Flat |
| MergeSchema merge | 149K rows/s | 10K → 200K | ✅ Improves at scale |
| Merkle tree build | 138K records/s | 50K → 1M | 🟡 22% degradation |
| Provenance merge | 81K rows/s | 50K → 500K | ✅ Improves at scale |
| DataFrame merge | 77K rows/s | 100K → 10M | ✅ 2% degradation |
Arrow vs Pandas Merge Performance
| Rows | Arrow | Pandas | Speedup |
|---|---|---|---|
| 500,000 | 2.1s | 5.3s | 2.53× |
| 5,000,000 | 21.3s | 54.1s | 2.55× |
| 50,000,000 | 222.5s | 550.6s | 2.47× |
Consistent 2.5× speedup at all scales. Arrow optimizes columnar data movement; per-field strategy resolution remains in Python. End-to-end 10–50× requires pushing strategy resolution into native code (planned for Rust protocol engine).
Key Findings
- GCounter and VectorClock are the fastest primitives — over 1M ops/s with zero degradation at scale
- DataFrame merge is rock-solid: 77K→75K rows/s from 100K to 10M rows (2% degradation across 100× scale)
- MergeSchema and Provenance actually improve at scale as fixed overhead amortizes
- Wire protocol: All 14 CRDT types serialize/deserialize correctly (50 B to 8.2 KB per type)
- CRDT verification: All 5 tested types pass commutativity, associativity, and idempotency laws
- 100-node gossip: Converges in 1 round
- Parallel merge note: Python GIL + multiprocessing overhead makes parallel slower than sequential for in-process merges. Parallel shines for I/O-bound and multi-source workloads.
Previous Versions
v0.3.0 — Core merge 39–42K rows/s, merge_sorted_stream() 1.2M rows/s at O(1) memory (173 measurements).
v0.4.0 — Provenance & verification (55 measurements). v0.5.0 — Wire format & probabilistic (50 measurements).
Notebooks: All available in notebooks/ for independent reproduction on Google Colab.
Cross-Language Ports
crdt-merge follows a reference + protocol architecture:
| Language | Package | Version | Status |
|---|---|---|---|
| Python (reference) | crdt-merge | v0.7.0 | ✅ Full feature set + 8 accelerators |
| TypeScript | crdt-merge | v0.2.0 | Core CRDTs + merge |
| Rust | crdt-merge | v0.2.0 | Core CRDTs + merge |
| Java | crdt-merge | v0.2.0 | Source complete |
Architecture: Python is the reference implementation where all innovation starts. The Rust crate will become a thin protocol engine (~1,000 lines) implementing wire format + merge semantics. FFI wrappers (Go, C#, Swift) will wrap the protocol engine. The golden rule: Python serialize → Rust deserialize → merge → serialize → Python deserialize must roundtrip perfectly.
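A Python-side sketch of that roundtrip rule, with the foreign protocol engine stood in by a plain bytes pass-through:

```python
from crdt_merge import GCounter, serialize, deserialize

original = GCounter("node_a")
original.increment("node_a", 7)

wire_bytes = serialize(original)        # what a Rust/TypeScript/Java engine would receive
# ...a foreign engine would deserialize, merge, and re-serialize these bytes...
roundtripped = deserialize(wire_bytes)  # what Python gets back

assert roundtripped.value == original.value
```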
Roadmap
Released
| Version | Name | Highlights |
|---|---|---|
| v0.1.0 | Initial Release | Core CRDTs, merge, diff, dedup, JSON merge, HuggingFace integration |
| v0.2.0 | License Update | Apache-2.0 + patent protection |
| v0.3.0 | The Schema Release | MergeSchema DSL, 8 strategies, streaming merge, delta sync |
| v0.4.0 | The Audit Release | Provenance tracking, @verified_merge, streaming optimizations |
| v0.5.0 | The Protocol Release | Binary wire format, probabilistic CRDTs (HLL, Bloom, CMS) |
| v0.6.0 | The Architecture Release | HLC clocks, schema evolution, Merkle trees, Arrow-native merge, gossip protocol, async/parallel merge, multi-key composites |
| v0.7.0 | The Ecosystem Release | MergeQL, self-merging Parquet, conflict visualization, 8 ecosystem accelerators (DuckDB, dbt, Polars, Arrow Flight, Airbyte, SQLite, Streamlit, DuckLake) |
Upcoming
| Version | Name | Key Features |
|---|---|---|
| v0.8.0 | The AI Release | ModelCRDT — AI model merging with 25 strategies (TIES/DARE/SLERP/LoRA), provenance tracking, evolutionary merge |
| v0.9.0 | The Compliance Release | UnmergeEngine — reversible CRDT merge for GDPR erasure, parallel merge |
| v1.0.0 | The Platform Release | API freeze, cross-language port sync, full documentation, production certification |
Full roadmap: docs/roadmap/roadmap_v2_0.md
Known Limitations
Honest constraints of the current version, alongside earlier limitations and how they were resolved:
| Limitation | Details | Planned Fix |
|---|---|---|
| Python dict merge path | merge() converts DataFrames to list-of-dicts internally. Slow for >1M rows. | ✅ Resolved in v0.6.0 — Arrow-native engine (2.5× measured on A100); MergeQL + DuckDB UDF in v0.7.0 for a SQL-native path |
| No type system | Strategies operate on Any. No type checking during merge. | ✅ Resolved in v0.6.0 — Schema evolution with column mapping + type coercion |
| Single-threaded | All operations are synchronous, single-threaded Python. | ✅ Resolved in v0.6.0 — Async wrappers + parallel merge execution |
| Single key column | merge() supports one key column only. Composite keys require manual concatenation. | ✅ Resolved in v0.6.0 — Multi-key composite merges |
| No persistence | DeltaStore is in-memory. State lost on process exit. Use to_dict()/from_dict() to serialize externally. | By design — persistence belongs in the application layer |
| No networking | Gossip protocol provides state tracking but no built-in transport. Wire format enables interop. | By design — transport belongs in the application layer |
| O(n²) fuzzy dedup | _fuzzy_dedup_records compares every record against all existing records. Unusable above ~10K records. | Algorithmic improvement planned |
| Wire format doesn't include MergeSchema | CRDTs and Deltas serialize, but MergeSchema does not go over the wire. | Planned for a future release |
Installation
# Core — zero dependencies
pip install crdt-merge
# With optional dependencies for heavy workloads
pip install crdt-merge[fast] # orjson + xxhash
pip install crdt-merge[pandas] # pandas DataFrame support
pip install crdt-merge[polars] # Polars DataFrame support
pip install crdt-merge[datasets] # HuggingFace Datasets
pip install crdt-merge[all] # Everything
pip install crdt-merge[dev] # pytest + hypothesis for development
# Ecosystem accelerators (optional — each wraps an external tool)
pip install crdt-merge[duckdb] # DuckDB UDF + MergeQL
pip install crdt-merge[dbt] # dbt CRDT models
pip install crdt-merge[flight] # Arrow Flight merge server
pip install crdt-merge[airbyte] # Airbyte destination connector
pip install crdt-merge[streamlit] # Visual merge UI
pip install crdt-merge[sqlite] # SQLite extension
Requirements: Python 3.9+. No required dependencies. Works on Linux, macOS, Windows.
Test Results
1,114 tests passing across 37 test files, plus 3 expected failures (module count assertions) and 0 unexpected failures.
| Test File | Tests | Status |
|---|---|---|
| test_core.py | 30 | ✅ |
| test_dataframe.py | 14 | ✅ |
| test_dedup.py | 15 | ✅ |
| test_json_merge.py | 10 | ✅ |
| test_strategies.py | 39 | ✅ |
| test_streaming.py | 19 | ✅ |
| test_provenance.py | 24 | ✅ |
| test_verified_merge.py | 10 | ✅ |
| test_wire.py | 40 | ✅ |
| test_probabilistic.py | 42 | ✅ |
| test_v050_integration.py | 157 | ✅ |
| test_longest_wins.py | 11 | ✅ |
| test_stress_v030.py | 8 | ✅ |
| test_benchmark.py | 6 | ✅ |
| test_clocks.py | 40 | ✅ |
| test_schema_evolution.py | 40 | ✅ |
| test_merkle.py | 40 | ✅ |
| test_arrow.py | 40 | ✅ |
| test_gossip.py | 40 | ✅ |
| test_async_merge.py | 40 | ✅ |
| test_parallel.py | 40 | ✅ |
| test_v060_integration.py | 18 | ✅ |
| test_architect_360_validation.py | 5 | ✅ |
| test_mergeql.py | 34 | ✅ |
| test_parquet.py | 32 | ✅ |
| test_viz.py | 16 | ✅ |
| test_wire_v070.py | 35 | ✅ |
| test_accelerator_duckdb.py | 34 | ✅ |
| test_accelerator_dbt.py | 42 | ✅ |
| test_accelerator_ducklake.py | 38 | ✅ |
| test_accelerator_polars.py | 36 | ✅ |
| test_accelerator_flight.py | 43 | ✅ |
| test_accelerator_airbyte.py | 47 | ✅ |
| test_accelerator_sqlite.py | 44 | ✅ |
| test_accelerator_streamlit.py | 38 | ✅ |
| test_multi_key.py | 8 | ✅ |
Version history:
| Version | Tests | Growth |
|---|---|---|
| v0.1.0 | 45 | — |
| v0.2.0 | 88 | +43 |
| v0.3.0 | 133 | +45 |
| v0.4.0 | 277 | +144 |
| v0.5.0 | 425 | +148 |
| v0.6.0 | 720 | +295 |
| v0.7.0 | 1,114 | +394 |
Full details: TEST_RESULTS.md
Project Stats
| Metric | Value |
|---|---|
| Core modules | 23 |
| Ecosystem accelerators | 8 |
| Source lines | ~17,200 |
| Test files | 37 |
| Tests passing | 1,114 |
| Test lines | ~12,500 |
| Test:source ratio | ~0.73:1 |
| Dependencies | 0 (required) |
| Python versions | 3.9, 3.10, 3.11, 3.12 |
| License | Apache-2.0 |
Contributing
We welcome contributions. Please see CLA.md before submitting pull requests.
Commercial licensing & inquiries:
License
Copyright 2026 Ryan Gillespie
Licensed under the Apache License, Version 2.0. See LICENSE for the full text.
Patent grant included — see PATENTS.
Project details
Download files
File details
Details for the file crdt_merge-0.7.0.tar.gz.
File metadata
- Download URL: crdt_merge-0.7.0.tar.gz
- Upload date:
- Size: 224.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 42e0abd1e06107bfb38476187e396cc04b978d76b8e47e4eaacdb478cc537e49 |
| MD5 | a9c540abe4dbf3e443b00ec1c3b18517 |
| BLAKE2b-256 | 5bf398386f5f6dbdc6b8c3b2b391eefc01f6af9d0d928e1e53759ef4d405f710 |
File details
Details for the file crdt_merge-0.7.0-py3-none-any.whl.
File metadata
- Download URL: crdt_merge-0.7.0-py3-none-any.whl
- Upload date:
- Size: 144.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a01dbea613366f76a01b4585de2a475778cc8fbbe5f0c2069ce27928bc3fbe34 |
| MD5 | 42ff6b16dbe73d575d7f43a5fe82142f |
| BLAKE2b-256 | 87d0651c3b65be70e2a2bc6ddfe4cf60e23d46d9cd54a303e234357a2e9b18a3 |