The Merge Algebra Toolkit — composable, streaming, verified CRDT merge for datasets
🔀 crdt-merge
Conflict-free merge, dedup & sync for DataFrames, JSON and datasets — powered by CRDTs
Merge any two datasets in one function call. No conflicts. No coordination. No data loss.
Quick Start • Why CRDTs • Benchmarks • API Reference • All Languages
🌐 Available in Every Language
| Language | Package | Install | Repo |
|---|---|---|---|
| Python 🐍 | crdt-merge | `pip install crdt-merge` | You are here |
| TypeScript | crdt-merge | `npm install crdt-merge` | crdt-merge-ts |
| Rust 🦀 | crdt-merge | `cargo add crdt-merge` | crdt-merge-rs |
| Java ☕ | io.optitransfer:crdt-merge | Maven / Gradle | crdt-merge-java |
| CLI 🖥️ | included in Rust | `cargo install crdt-merge` | crdt-merge-rs |
🎯 The Problem
You have two versions of a dataset. Maybe two crawlers ran in parallel. Maybe two annotators edited the same file. Maybe you're merging community contributions.
Today: Write custom merge scripts, lose data, or block on a coordinator.
With crdt-merge: One function call. Zero conflicts. Mathematically guaranteed.
from crdt_merge import merge
merged = merge(df_a, df_b, key="id") # done.
⚡ Quick Start
pip install crdt-merge # zero dependencies (pure Python)
pip install crdt-merge[pandas] # with pandas support
pip install crdt-merge[datasets] # with HuggingFace Datasets support
pip install crdt-merge[all] # everything
Merge Two DataFrames
import pandas as pd
from crdt_merge import merge
# Two contributors edited the same dataset
df_a = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
df_b = pd.DataFrame({"id": [2, 3, 4], "name": ["Robert", "Charlie", "Diana"]})
merged = merge(df_a, df_b, key="id")
# id=1: Alice (only in A)
# id=2: Robert (B wins — latest)
# id=3: Charlie (same in both)
# id=4: Diana (only in B)
Merge Two HuggingFace Datasets
from crdt_merge import merge_datasets
merged = merge_datasets("user/dataset-v1", "user/dataset-v2", key="id")
Deduplicate Anything
from crdt_merge import dedup
texts = ["Hello world", "hello world", "HELLO WORLD", "Something else"]
unique, duplicate_indices = dedup(texts)
# unique = ["Hello world", "Something else"] (case/whitespace normalized)
Deep-Merge JSON/Configs
from crdt_merge import merge_dicts
config_a = {"model": {"name": "bert", "layers": 12}, "tags": ["nlp"]}
config_b = {"model": {"name": "bert-large", "dropout": 0.1}, "tags": ["qa"]}
merged = merge_dicts(config_a, config_b)
# {"model": {"name": "bert-large", "layers": 12, "dropout": 0.1}, "tags": ["nlp", "qa"]}
See What Changed
from crdt_merge import diff
changes = diff(df_old, df_new, key="id")
print(changes["summary"])
# "+5 added, -2 removed, ~3 modified, =990 unchanged"
🧠 Why CRDTs
CRDT = Conflict-free Replicated Data Type. A data structure with one mathematical superpower:
Any two copies can merge — in any order, at any time — and the result is always identical and always correct.
Three mathematical guarantees (proven, not hoped):
| Property | What it means |
|---|---|
| Commutative | merge(A, B) == merge(B, A) — order doesn't matter |
| Associative | merge(merge(A, B), C) == merge(A, merge(B, C)) — grouping doesn't matter |
| Idempotent | merge(A, A) == A — re-merging is safe |
This means: zero coordination, zero locks, zero conflicts. Two workers can independently edit a dataset and merge later — the result is mathematically guaranteed correct.
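The three laws are easy to verify for a grow-only counter, where each node owns a slot and merge takes the element-wise maximum. The sketch below is an illustration of the merge laws in plain Python, not crdt-merge's actual implementation:

```python
# Minimal G-Counter sketch: one slot per node, merge = element-wise max.
# Illustration only; not crdt-merge's internal representation.

def merge(a: dict, b: dict) -> dict:
    """Element-wise max over per-node slots."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

a = {"node1": 3, "node2": 1}   # node1 incremented 3x, node2 once
b = {"node2": 4, "node3": 2}   # concurrent edits on other nodes
c = {"node1": 5}

assert merge(a, b) == merge(b, a)                        # commutative
assert merge(merge(a, b), c) == merge(a, merge(b, c))    # associative
assert merge(a, a) == a                                  # idempotent

print(merge(a, b))  # {'node1': 3, 'node2': 4, 'node3': 2}
```

Because max is commutative, associative, and idempotent, the merged counter converges to the same state no matter how the replicas pair up.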
Built-in CRDT Types
| Type | Use Case | Example |
|---|---|---|
| GCounter | Grow-only counters | Download counts, page views |
| PNCounter | Increment + decrement | Stock levels, balances |
| LWWRegister | Single value (latest wins) | Name, email, status fields |
| ORSet | Add/remove set | Tags, memberships, dedup sets |
| LWWMap | Key-value map | Row merges, config objects |
📊 Benchmarks
Tested on real data (rotten_tomatoes dataset, 8,530 rows):
| Operation | Size | Time | Throughput |
|---|---|---|---|
| DataFrame merge | 1K + 1K → 1.5K | 3.6ms | 411K rows/sec |
| DataFrame merge | 10K + 10K → 15K | 42.6ms | 352K rows/sec |
| DataFrame merge | 50K + 50K → 75K | 234ms | 320K rows/sec |
| Exact dedup | 9K texts | 76ms | 118K texts/sec |
| GCounter ops | 100K increments | - | 8.6M ops/sec |
| OR-Set ops | 10K adds | - | 250K+ ops/sec |
Zero dependencies. Pure Python. Works offline. Works everywhere.
📖 API Reference
merge(df_a, df_b, key=None, timestamp_col=None, prefer="latest", dedup=True)
Merge two DataFrames (pandas, polars, or list of dicts).
- `key`: Column to match rows. `None` = append + dedup.
- `timestamp_col`: Column with timestamps for conflict resolution.
- `prefer`: `"latest"` (B wins), `"a"`, or `"b"`.
- `dedup`: Remove exact duplicate rows.
dedup(items, method="exact", threshold=0.85)
Deduplicate a list of strings. Returns (unique_items, duplicate_indices).
- `method`: `"exact"` or `"fuzzy"` (bigram similarity).
- `threshold`: Similarity threshold for fuzzy dedup (0.0–1.0).
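Bigram similarity is commonly computed as a Dice coefficient over character bigrams. A sketch of that metric, assuming this interpretation (the library's exact normalization may differ):

```python
# Dice coefficient over character bigrams, as one plausible reading
# of "bigram similarity". Case-folded; illustration only.

def bigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_similarity(a: str, b: str) -> float:
    """Similarity in [0.0, 1.0]; 1.0 means identical bigram sets."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

print(bigram_similarity("Hello world", "hello world!"))  # ~0.95, above 0.85
```

With `threshold=0.85`, near-identical strings like these would be collapsed, while unrelated strings score near 0.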
diff(df_a, df_b, key)
Show what changed between two DataFrames. Returns added, removed, modified, unchanged counts.
merge_dicts(a, b, timestamps_a=None, timestamps_b=None)
Deep-merge two dicts with CRDT semantics. Nested dicts recurse, lists concatenate + dedup.
merge_datasets(dataset_a, dataset_b, key=None, ...)
Merge two HuggingFace Dataset objects or dataset names. Requires pip install crdt-merge[datasets].
dedup_dataset(dataset, columns=None, method="exact", threshold=0.85)
Deduplicate a HuggingFace Dataset. Requires pip install crdt-merge[datasets].
DedupIndex(node_id)
Distributed dedup index backed by CRDT OR-Set. Multiple workers build indices independently, then merge.
MinHashDedup(num_hashes=128, threshold=0.5)
Locality-sensitive hashing for O(n) near-duplicate detection at scale.
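The OR-Set behind `DedupIndex` is what lets workers build indices independently and merge later: every add carries a unique tag, removes tombstone the tags they have seen, and merge is a union of both. A simplified add-wins sketch (an illustration, not the `DedupIndex` implementation):

```python
import uuid

# Simplified OR-Set sketch (add-wins semantics). Illustration only;
# not the internals of crdt_merge.DedupIndex.

class ORSet:
    def __init__(self):
        self.adds = {}        # element -> set of unique add-tags
        self.removed = set()  # tombstoned tags

    def add(self, elem):
        self.adds.setdefault(elem, set()).add(uuid.uuid4().hex)

    def remove(self, elem):
        # Tombstone only the tags this replica has seen, so a
        # concurrent add on another replica survives the merge.
        self.removed |= self.adds.get(elem, set())

    def merge(self, other):
        for elem, tags in other.adds.items():
            self.adds.setdefault(elem, set()).update(tags)
        self.removed |= other.removed

    def values(self):
        return {e for e, tags in self.adds.items() if tags - self.removed}

worker_a, worker_b = ORSet(), ORSet()
worker_a.add("doc-1"); worker_a.add("doc-2")
worker_b.add("doc-2"); worker_b.add("doc-3")   # overlap with worker_a
worker_a.merge(worker_b)
print(worker_a.values())  # {'doc-1', 'doc-2', 'doc-3'}
```

Both workers indexed `doc-2` independently, yet after a single merge the element appears once, which is exactly the property distributed dedup needs.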
🏗️ Use Cases
- Dataset curation: Multiple annotators edit simultaneously — merge without conflicts
- Parallel crawlers: Two crawlers produce overlapping data — merge + dedup automatically
- Model training: Merge training logs, configs, and metrics from distributed runs
- Community datasets: Accept contributions from multiple forks without merge conflicts
- Data pipelines: Incremental processing with automatic state reconciliation
- Offline-first apps: Sync data between devices that were offline for days
🤝 Contributing
PRs welcome! Run tests with:
pip install -e ".[dev]"
pytest tests/ -v
📄 License
Licensed under the Apache License, Version 2.0. See LICENSE for details.
Copyright 2026 Ryan Gillespie / Optitransfer. See NOTICE for attribution requirements.
For commercial licensing inquiries: leer@optitransfer.ch
Built with math, not hope. 🧬