The Merge Algebra Toolkit — composable, streaming, verified CRDT merge for datasets

🔀 crdt-merge

Conflict-free merge, dedup & sync for DataFrames, JSON and datasets — powered by CRDTs

Merge any two datasets in one function call. No conflicts. No coordination. No data loss.

Quick Start • Why CRDTs • Benchmarks • API Reference • All Languages

🌐 Available in Every Language

Language	Package	Install	Repo
Python 🐍	`crdt-merge`	`pip install crdt-merge`	You are here
TypeScript	`crdt-merge`	`npm install crdt-merge`	crdt-merge-ts
Rust 🦀	`crdt-merge`	`cargo add crdt-merge`	crdt-merge-rs
Java ☕	`io.optitransfer:crdt-merge`	Maven / Gradle	crdt-merge-java
CLI 🖥️	included in Rust	`cargo install crdt-merge`	crdt-merge-rs

🤗 Try it in the browser →

🎯 The Problem

You have two versions of a dataset. Maybe two crawlers ran in parallel. Maybe two annotators edited the same file. Maybe you're merging community contributions.

Today: Write custom merge scripts, lose data, or block on a coordinator.

With crdt-merge: One function call. Zero conflicts. Mathematically guaranteed.

from crdt_merge import merge

merged = merge(df_a, df_b, key="id")  # done.

⚡ Quick Start

pip install crdt-merge                   # zero dependencies (pure Python)
pip install "crdt-merge[pandas]"         # with pandas support
pip install "crdt-merge[datasets]"       # with HuggingFace Datasets support
pip install "crdt-merge[all]"            # everything (quotes keep shells like zsh from globbing the brackets)

Merge Two DataFrames

import pandas as pd
from crdt_merge import merge

# Two contributors edited the same dataset
df_a = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
df_b = pd.DataFrame({"id": [2, 3, 4], "name": ["Robert", "Charlie", "Diana"]})

merged = merge(df_a, df_b, key="id")
# id=1: Alice (only in A)
# id=2: Robert (B wins — latest)
# id=3: Charlie (same in both)
# id=4: Diana (only in B)
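
To make the four outcomes above concrete, here is a minimal pure-Python sketch of a keyed merge over lists of dicts. `merge_rows` is a hypothetical helper written for illustration, not part of crdt-merge, and it assumes the simplest rule described here: when both sides have a row for the same key, the values from B win.

```python
def merge_rows(rows_a, rows_b, key):
    """Keyed merge of two lists of dicts; on a key collision, B's values win.

    Illustrative only: this is not crdt-merge's implementation.
    """
    merged = {row[key]: dict(row) for row in rows_a}
    for row in rows_b:
        # Overlay B's fields on A's so columns present only in A survive.
        merged[row[key]] = {**merged.get(row[key], {}), **row}
    return [merged[k] for k in sorted(merged)]


rows_a = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Charlie"}]
rows_b = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Charlie"}, {"id": 4, "name": "Diana"}]

result = merge_rows(rows_a, rows_b, key="id")
print([row["name"] for row in result])  # ['Alice', 'Robert', 'Charlie', 'Diana']
```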

Merge Two HuggingFace Datasets

from crdt_merge import merge_datasets

merged = merge_datasets("user/dataset-v1", "user/dataset-v2", key="id")

Deduplicate Anything

from crdt_merge import dedup

texts = ["Hello world", "hello  world", "HELLO WORLD", "Something else"]
unique, duplicate_indices = dedup(texts)
# unique = ["Hello world", "Something else"]  (case/whitespace normalized)
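
The normalization step can be pictured with a short standalone sketch. `dedup_exact` is a hypothetical name, and both the normalization rule (case folding plus whitespace collapsing) and the meaning of `duplicate_indices` (positions of the dropped items) are assumptions made for this example rather than documented behavior of the library.

```python
def dedup_exact(items):
    """Return (unique_items, duplicate_indices) after normalizing case and whitespace.

    Illustrative only: the normalization rule is an assumption, not crdt-merge's code.
    """
    seen = set()
    unique, duplicate_indices = [], []
    for index, text in enumerate(items):
        # casefold() removes case differences; split()/join collapses whitespace runs.
        normalized = " ".join(text.casefold().split())
        if normalized in seen:
            duplicate_indices.append(index)
        else:
            seen.add(normalized)
            unique.append(text)
    return unique, duplicate_indices


texts = ["Hello world", "hello  world", "HELLO WORLD", "Something else"]
unique, duplicate_indices = dedup_exact(texts)
print(unique)             # ['Hello world', 'Something else']
print(duplicate_indices)  # [1, 2]
```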

Deep-Merge JSON/Configs

from crdt_merge import merge_dicts

config_a = {"model": {"name": "bert", "layers": 12}, "tags": ["nlp"]}
config_b = {"model": {"name": "bert-large", "dropout": 0.1}, "tags": ["qa"]}

merged = merge_dicts(config_a, config_b)
# {"model": {"name": "bert-large", "layers": 12, "dropout": 0.1}, "tags": ["nlp", "qa"]}
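
The result above follows from three rules: nested dicts recurse, lists are concatenated and deduplicated, and for any other value B wins. A minimal sketch of those rules, assuming no timestamps are supplied (`deep_merge` is a hypothetical helper, not the library's function):

```python
def deep_merge(a, b):
    """Recursively merge b into a copy of a.

    Illustrative only: dicts recurse, lists concatenate and dedup, scalars take b's value.
    """
    merged = dict(a)
    for key, b_value in b.items():
        a_value = merged.get(key)
        if isinstance(a_value, dict) and isinstance(b_value, dict):
            merged[key] = deep_merge(a_value, b_value)
        elif isinstance(a_value, list) and isinstance(b_value, list):
            # Concatenate, then drop repeats while keeping first-seen order.
            combined = []
            for item in a_value + b_value:
                if item not in combined:
                    combined.append(item)
            merged[key] = combined
        else:
            merged[key] = b_value
    return merged


config_a = {"model": {"name": "bert", "layers": 12}, "tags": ["nlp"]}
config_b = {"model": {"name": "bert-large", "dropout": 0.1}, "tags": ["qa"]}
merged_config = deep_merge(config_a, config_b)
print(merged_config["model"])  # {'name': 'bert-large', 'layers': 12, 'dropout': 0.1}
print(merged_config["tags"])   # ['nlp', 'qa']
```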

See What Changed

from crdt_merge import diff

changes = diff(df_old, df_new, key="id")
print(changes["summary"])
# "+5 added, -2 removed, ~3 modified, =990 unchanged"
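
For readers who want to see how such a summary could be derived, the sketch below classifies keys by comparing two lists of dicts. `diff_rows` and the keys of its returned dict (other than `summary`, which appears in the example above) are assumptions for illustration.

```python
def diff_rows(old_rows, new_rows, key):
    """Classify keys as added, removed, modified, or unchanged between two row lists.

    Illustrative only: the return shape is an assumption, not crdt-merge's contract.
    """
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    added = [k for k in new if k not in old]
    removed = [k for k in old if k not in new]
    modified = [k for k in new if k in old and new[k] != old[k]]
    unchanged = [k for k in new if k in old and new[k] == old[k]]
    summary = (
        f"+{len(added)} added, -{len(removed)} removed, "
        f"~{len(modified)} modified, ={len(unchanged)} unchanged"
    )
    return {"added": added, "removed": removed, "modified": modified,
            "unchanged": unchanged, "summary": summary}


rows_old = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Charlie"}]
rows_new = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Charlie"}, {"id": 4, "name": "Diana"}]
changes_sketch = diff_rows(rows_old, rows_new, key="id")
print(changes_sketch["summary"])  # +1 added, -1 removed, ~1 modified, =1 unchanged
```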

🧠 Why CRDTs

CRDT = Conflict-free Replicated Data Type. A data structure with one mathematical superpower:

Any two copies can merge — in any order, at any time — and the result is always identical and always correct.

Three mathematical guarantees (proven, not hoped):

Property	What it means
Commutative	`merge(A, B) == merge(B, A)` — order doesn't matter
Associative	`merge(merge(A, B), C) == merge(A, merge(B, C))` — grouping doesn't matter
Idempotent	`merge(A, A) == A` — re-merging is safe

This means: zero coordination, zero locks, zero conflicts. Two workers can independently edit a dataset and merge later — the result is mathematically guaranteed correct.
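
The three properties are easy to spot-check on a small state-based CRDT. The sketch below uses a per-key maximum as the merge function (the join used by grow-only counters) and tests every combination of a few sample states. `merge_state` is a hypothetical stand-in, not the library's `merge`, and checking samples is a sanity test, not a proof.

```python
from itertools import product


def merge_state(a, b):
    """Join of two grow-only states: take the per-key maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


states = [{"x": 1}, {"x": 3, "y": 2}, {"y": 5, "z": 1}, {}]

commutative = all(
    merge_state(a, b) == merge_state(b, a)
    for a, b in product(states, repeat=2)
)
associative = all(
    merge_state(merge_state(a, b), c) == merge_state(a, merge_state(b, c))
    for a, b, c in product(states, repeat=3)
)
idempotent = all(merge_state(a, a) == a for a in states)

print(commutative, associative, idempotent)  # True True True
```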

Built-in CRDT Types

Type	Use Case	Example
`GCounter`	Grow-only counters	Download counts, page views
`PNCounter`	Increment + decrement	Stock levels, balances
`LWWRegister`	Single value (latest wins)	Name, email, status fields
`ORSet`	Add/remove set	Tags, memberships, dedup sets
`LWWMap`	Key-value map	Row merges, config objects
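
As a rough picture of how two of these types behave, here is a textbook-style sketch of a grow-only counter and a last-writer-wins register. The class names carry a `Sketch` suffix to make clear they are illustrations; their constructors and methods are assumptions and may not match the library's `GCounter` and `LWWRegister`.

```python
from dataclasses import dataclass, field


@dataclass
class GCounterSketch:
    """Grow-only counter: one slot per node, merged by per-node maximum."""
    node_id: str
    counts: dict = field(default_factory=dict)

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        merged = GCounterSketch(self.node_id, dict(self.counts))
        for node, count in other.counts.items():
            merged.counts[node] = max(merged.counts.get(node, 0), count)
        return merged


@dataclass(frozen=True)
class LWWRegisterSketch:
    """Last-writer-wins register: the highest (timestamp, node_id) pair wins."""
    value: object
    timestamp: float
    node_id: str

    def merge(self, other):
        # node_id breaks timestamp ties so both merge orders agree.
        return max(self, other, key=lambda reg: (reg.timestamp, reg.node_id))


counter_a, counter_b = GCounterSketch("a"), GCounterSketch("b")
counter_a.increment(3)
counter_b.increment(2)
print(counter_a.merge(counter_b).value(), counter_b.merge(counter_a).value())  # 5 5

old = LWWRegisterSketch("Bob", timestamp=1.0, node_id="a")
new = LWWRegisterSketch("Robert", timestamp=2.0, node_id="b")
print(old.merge(new).value, new.merge(old).value)  # Robert Robert
```

A `PNCounter` can be built from two such grow-only counters, one for increments and one for decrements, with the value taken as their difference.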

📊 Benchmarks

Tested on real data (rotten_tomatoes dataset, 8,530 rows):

Operation	Size	Time	Throughput
DataFrame merge	1K + 1K → 1.5K	3.6ms	411K rows/sec
DataFrame merge	10K + 10K → 15K	42.6ms	352K rows/sec
DataFrame merge	50K + 50K → 75K	234ms	320K rows/sec
Exact dedup	9K texts	76ms	118K texts/sec
GCounter ops	100K increments	-	8.6M ops/sec
OR-Set ops	10K adds	-	250K+ ops/sec

Zero dependencies. Pure Python. Works offline. Works everywhere.

📖 API Reference

`merge(df_a, df_b, key=None, timestamp_col=None, prefer="latest", dedup=True)`

Merge two DataFrames (pandas, polars, or list of dicts).

key: Column to match rows. None = append + dedup.
timestamp_col: Column with timestamps for conflict resolution.
prefer: "latest" (B wins), "a", or "b".
dedup: Remove exact duplicate rows.
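
When a timestamp column is available, conflict resolution can be made independent of argument order: for each key, keep the row with the newest timestamp. The sketch below assumes that rule and adds a deterministic tie-break; `merge_by_timestamp` and the `updated` column are hypothetical names chosen for the example, not the library's internals.

```python
def rank(row, timestamp_col):
    # Ties on timestamp fall back to a deterministic ordering of the row contents.
    return (row[timestamp_col], repr(sorted(row.items())))


def merge_by_timestamp(rows_a, rows_b, key, timestamp_col):
    """Per key, keep the row with the highest timestamp, whichever side it came from."""
    merged = {}
    for row in list(rows_a) + list(rows_b):
        current = merged.get(row[key])
        if current is None or rank(row, timestamp_col) > rank(current, timestamp_col):
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda row: row[key])


ts_a = [{"id": 1, "name": "Alice", "updated": 10}, {"id": 2, "name": "Bob", "updated": 20}]
ts_b = [{"id": 2, "name": "Robert", "updated": 15}, {"id": 3, "name": "Diana", "updated": 5}]

ab = merge_by_timestamp(ts_a, ts_b, key="id", timestamp_col="updated")
ba = merge_by_timestamp(ts_b, ts_a, key="id", timestamp_col="updated")
print([row["name"] for row in ab])  # ['Alice', 'Bob', 'Diana']
print(ab == ba)                     # True
```

Here the row for `id=2` from A survives because its timestamp is newer, regardless of which argument it was passed in.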

`dedup(items, method="exact", threshold=0.85)`

Deduplicate a list of strings. Returns (unique_items, duplicate_indices).

method: "exact" or "fuzzy" (bigram similarity).
threshold: Similarity threshold for fuzzy dedup (0.0-1.0).
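
To give a feel for what a bigram similarity threshold means, the sketch below scores two strings with the Dice coefficient over character bigram sets and drops any item that scores at or above the threshold against an item already kept. The exact formula the library uses is not stated here, so treat the Dice coefficient, `bigram_similarity`, and `fuzzy_dedup` as assumptions.

```python
def bigrams(text):
    """Set of character bigrams of a case- and whitespace-normalized string."""
    normalized = " ".join(text.casefold().split())
    return {normalized[i:i + 2] for i in range(len(normalized) - 1)}


def bigram_similarity(a, b):
    """Dice coefficient over character bigram sets, in the range 0.0 to 1.0."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def fuzzy_dedup(items, threshold=0.85):
    """Quadratic-time fuzzy dedup: keep an item only if nothing kept so far is too similar."""
    unique, duplicate_indices = [], []
    for index, text in enumerate(items):
        if any(bigram_similarity(text, kept) >= threshold for kept in unique):
            duplicate_indices.append(index)
        else:
            unique.append(text)
    return unique, duplicate_indices


print(bigram_similarity("night", "nacht"))  # 0.25
fuzzy_unique, fuzzy_dupes = fuzzy_dedup(
    ["The quick brown fox", "The quick brown fox!", "Something else"]
)
print(fuzzy_unique)  # ['The quick brown fox', 'Something else']
print(fuzzy_dupes)   # [1]
```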

`diff(df_a, df_b, key)`

Show what changed between two DataFrames. Returns the added, removed, modified, and unchanged counts; the example under See What Changed also reads a `summary` string from the result.

`merge_dicts(a, b, timestamps_a=None, timestamps_b=None)`

Deep-merge two dicts with CRDT semantics. Nested dicts are merged recursively; lists are concatenated and deduplicated.

`merge_datasets(dataset_a, dataset_b, key=None, ...)`

Merge two HuggingFace Dataset objects or dataset names. Requires `pip install "crdt-merge[datasets]"`.

`dedup_dataset(dataset, columns=None, method="exact", threshold=0.85)`

Deduplicate a HuggingFace Dataset. Requires `pip install "crdt-merge[datasets]"`.

`DedupIndex(node_id)`

Distributed dedup index backed by CRDT OR-Set. Multiple workers build indices independently, then merge.
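
The idea of workers building indices independently and merging later can be sketched with a small observed-remove set that stores content fingerprints. Everything below (`ORSetSketch`, `fingerprint`, the use of UUID tags and SHA-256) is an assumption made for illustration; the real `DedupIndex` API beyond its `node_id` argument is not described here.

```python
import hashlib
import uuid


class ORSetSketch:
    """Observed-remove set: each add gets a unique tag; a remove tombstones the tags it has seen."""

    def __init__(self):
        self.adds = {}           # element -> set of unique tags
        self.tombstones = set()  # tags that have been removed

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        self.tombstones |= self.adds.get(element, set())

    def __contains__(self, element):
        return bool(self.adds.get(element, set()) - self.tombstones)

    def merge(self, other):
        merged = ORSetSketch()
        for source in (self, other):
            for element, tags in source.adds.items():
                merged.adds.setdefault(element, set()).update(tags)
        merged.tombstones = self.tombstones | other.tombstones
        return merged


def fingerprint(text):
    """Hash of the case- and whitespace-normalized text."""
    normalized = " ".join(text.casefold().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


worker_a, worker_b = ORSetSketch(), ORSetSketch()
worker_a.add(fingerprint("Hello world"))
worker_b.add(fingerprint("Something else"))

index = worker_a.merge(worker_b)
print(fingerprint("HELLO  WORLD") in index)  # True
print(fingerprint("never seen") in index)    # False
```

Because a remove only tombstones tags it has observed, an add made concurrently on another replica survives the merge.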

`MinHashDedup(num_hashes=128, threshold=0.5)`

Locality-sensitive hashing for O(n) near-duplicate detection at scale.
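
The core of MinHash is that the fraction of matching positions in two signatures estimates the Jaccard similarity of the underlying sets. The sketch below shows only that signature step, using salted BLAKE2b from the standard library as the hash family. The function names, the 3-character shingles, and the hashing scheme are assumptions, and the banding step that avoids comparing every pair of documents is not shown.

```python
import hashlib


def shingles(text, size=3):
    """Character shingles of a case- and whitespace-normalized string."""
    normalized = " ".join(text.casefold().split())
    return {normalized[i:i + size] for i in range(max(len(normalized) - size + 1, 1))}


def minhash_signature(items, num_hashes=128):
    """One minimum per salted hash function; similar sets share many minima."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


sig_near_a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
sig_near_b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
sig_far = minhash_signature(shingles("completely unrelated sentence"))

print(estimated_jaccard(sig_near_a, sig_near_a))  # 1.0
# A pair would be flagged as near-duplicate when its estimate meets the threshold.
is_near_duplicate = estimated_jaccard(sig_near_a, sig_near_b) >= 0.5
```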

🏗️ Use Cases

Dataset curation: Multiple annotators edit simultaneously — merge without conflicts
Parallel crawlers: Two crawlers produce overlapping data — merge + dedup automatically
Model training: Merge training logs, configs, and metrics from distributed runs
Community datasets: Accept contributions from multiple forks without merge conflicts
Data pipelines: Incremental processing with automatic state reconciliation
Offline-first apps: Sync data between devices that were offline for days

🤝 Contributing

PRs welcome! Run tests with:

pip install -e ".[dev]"
pytest tests/ -v

📄 License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

For commercial licensing inquiries: leer@optitransfer.ch

Built with math, not hope. 🧬

⭑ Star on GitHub • 🤗 Try on HuggingFace • 📦 PyPI

Release history

Version	Released
0.9.7	Apr 17, 2026
0.9.5	Apr 9, 2026
0.9.4	Apr 2, 2026
0.9.3	Apr 2, 2026
0.9.2	Mar 30, 2026
0.9.1	Mar 30, 2026
0.9.0	Mar 30, 2026
0.8.3	Mar 30, 2026
0.8.2	Mar 29, 2026
0.8.1	Mar 29, 2026
0.8.0	Mar 29, 2026
0.7.2	Mar 29, 2026
0.7.1	Mar 28, 2026
0.7.0	Mar 28, 2026
0.6.0	Mar 28, 2026
0.5.0	Mar 27, 2026
0.4.0	Mar 27, 2026
0.3.0 (this version)	Mar 26, 2026
0.2.0	Mar 26, 2026
0.1.0	Mar 25, 2026

Download files

Source Distribution

crdt_merge-0.3.0.tar.gz (38.9 kB view details)

Uploaded Mar 26, 2026 Source

Built Distribution

crdt_merge-0.3.0-py3-none-any.whl (32.9 kB view details)

Uploaded Mar 26, 2026 Python 3

File details

Details for the file crdt_merge-0.3.0.tar.gz.

File metadata

Download URL: crdt_merge-0.3.0.tar.gz
Upload date: Mar 26, 2026
Size: 38.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for crdt_merge-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`d97160fffa9b94ef9bb974a61961b56428785fa0b700242237d351503fbf7e83`
MD5	`9eb66b302e6074eadc133a514a863b5a`
BLAKE2b-256	`b9360b05b65012cd596e552a6d6074c9662fa305ee734fd978a06fecb92c7a27`

File details

Details for the file crdt_merge-0.3.0-py3-none-any.whl.

File metadata

Download URL: crdt_merge-0.3.0-py3-none-any.whl
Upload date: Mar 26, 2026
Size: 32.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for crdt_merge-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c29e07785652fbc8677e9c4f34d61e0de5eb7101d4b7fb0fbc678c483f9ad9d8`
MD5	`591933f1c838b7c8e80c04ec28898960`
BLAKE2b-256	`261aade02f2dcf48f97e141f06bb7cb32216ff078ad9d45603133902eecdce98`
