embedding-dedupe
Deduplicate near-identical embedding records by cosine similarity. Pure Python, zero runtime dependencies.
Python port of @mukundakatta/embedding-dedupe. Same algorithm, ergonomic Python API.
Install
pip install embedding-dedupe
# Optional: faster cosine via numpy
pip install "embedding-dedupe[numpy]"
Usage
from embedding_dedupe import dedupe

records = [
    {"id": "a", "embedding": [0.10, 0.20, 0.30], "text": "Cats are great."},
    {"id": "b", "embedding": [0.10, 0.20, 0.31], "text": "Cats are great pets."},  # near-dup of a
    {"id": "c", "embedding": [0.90, 0.10, 0.05], "text": "Stock prices fell."},
]

# Default: keep the lowest-id record from each cluster (deterministic).
dedupe(records, threshold=0.95)
# -> [
#   {"id": "a", "embedding": [0.10, 0.20, 0.30], "text": "Cats are great."},
#   {"id": "c", "embedding": [0.90, 0.10, 0.05], "text": "Stock prices fell."},
# ]

# Or: keep the record with the longest text (ties -> lowest id).
dedupe(records, threshold=0.95, keep="longest")
# -> [
#   {"id": "b", "embedding": [0.10, 0.20, 0.31], "text": "Cats are great pets."},
#   {"id": "c", "embedding": [0.90, 0.10, 0.05], "text": "Stock prices fell."},
# ]
API
dedupe(
    records,             # list[dict]
    threshold=0.95,      # cosine sim above which two records are duplicates
    key="id",            # record id field name
    vector="embedding",  # embedding field name
    keep="first",        # "first" (lowest id) or "longest" (longest "text" value)
) -> list[dict]
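To make the two keep strategies concrete, here is a minimal sketch of how one survivor could be picked from a cluster. The helper name pick_survivor is hypothetical and is not part of the package; it assumes "first" means the smallest id and that a record without a "text" field counts as empty text.

```python
def pick_survivor(cluster, keep="first", key="id"):
    """Return one record from a non-empty list of near-duplicate records.

    Hypothetical helper for illustration; not the package's implementation.
    """
    if keep == "first":
        # Assumption: "first" means the smallest id in the cluster.
        return min(cluster, key=lambda r: r[key])
    if keep == "longest":
        # Longest text wins; ties fall back to the lowest id.
        # min() over (-length, id) prefers longer text, then the smaller id.
        return min(cluster, key=lambda r: (-len(r.get("text", "")), r[key]))
    raise ValueError(f"unknown keep strategy: {keep!r}")


cluster = [
    {"id": "a", "text": "Cats are great."},
    {"id": "b", "text": "Cats are great pets."},
]
print(pick_survivor(cluster)["id"])                  # a
print(pick_survivor(cluster, keep="longest")["id"])  # b
```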
cosine(a, b) is also exported for ad-hoc use.
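For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a pure-Python illustration of that formula, not the package's own cosine; the name cosine_sketch and the choice to return 0.0 for a zero-length vector are assumptions.

```python
import math


def cosine_sketch(a, b):
    """Cosine similarity of two equal-length numeric sequences (illustrative)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Vectors taken from the Usage example.
sim_ab = cosine_sketch([0.10, 0.20, 0.30], [0.10, 0.20, 0.31])
sim_ac = cosine_sketch([0.10, 0.20, 0.30], [0.90, 0.10, 0.05])
print(sim_ab > 0.95, sim_ac > 0.95)  # True False
```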
Algorithm
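Greedy anchor-based clustering (sometimes called leader clustering): scan records in input order and place each one into the first existing cluster whose anchor vector has cosine similarity above threshold; otherwise start a new cluster. Each record is compared only against cluster anchors, not against every cluster member, so this is cheaper than true single-link clustering. From each cluster, return one survivor according to keep. Output order is the order in which clusters were created, which follows input order.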
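The scan described above can be sketched in a few lines. This is an illustration under stated assumptions, not the package's source: the names cluster_greedy and _cos are hypothetical, the anchor is assumed to be the first record placed in a cluster, and a similarity exactly equal to the threshold is assumed to count as a match.

```python
import math


def _cos(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_greedy(records, threshold=0.95, vector="embedding"):
    """Group records into clusters, comparing each record to cluster anchors."""
    clusters = []  # each cluster is a list; cluster[0] is its anchor (assumption)
    for rec in records:
        for cluster in clusters:
            # Assumption: the boundary is inclusive (>=).
            if _cos(cluster[0][vector], rec[vector]) >= threshold:
                cluster.append(rec)
                break
        else:
            # No anchor was close enough, so this record starts a new cluster.
            clusters.append([rec])
    return clusters


records = [
    {"id": "a", "embedding": [0.10, 0.20, 0.30]},
    {"id": "b", "embedding": [0.10, 0.20, 0.31]},
    {"id": "c", "embedding": [0.90, 0.10, 0.05]},
]
clusters = cluster_greedy(records, threshold=0.95)
print([[r["id"] for r in c] for c in clusters])  # [['a', 'b'], ['c']]
```

Because clusters are appended as they are created, iterating over the result yields clusters in input order.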
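O(n * k) cosine comparisons, where k is the cluster count -- fine up to ~100k records. Note that k approaches n when few records are duplicates, so the worst case is O(n^2). For larger sets, plug in an ANN index upstream and dedupe within candidates.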
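One way to read "dedupe within candidates" is to group records into candidate buckets first and run the dedupe step on each bucket separately. The sketch below shows only that wiring; the helper dedupe_within_candidates and its candidate_key and dedupe_fn parameters are hypothetical, and the bucketing function stands in for whatever an ANN index or blocking scheme would provide.

```python
from collections import defaultdict


def dedupe_within_candidates(records, candidate_key, dedupe_fn):
    """Bucket records by candidate_key(record), then dedupe each bucket.

    candidate_key: hypothetical function mapping a record to a hashable bucket key.
    dedupe_fn: function taking a list of records and returning the survivors.
    """
    buckets = defaultdict(list)
    for rec in records:
        buckets[candidate_key(rec)].append(rec)

    survivors = []
    # Dicts preserve insertion order, so buckets come out in first-seen order.
    for bucket in buckets.values():
        survivors.extend(dedupe_fn(bucket))
    return survivors


records = [
    {"id": "a", "topic": "pets"},
    {"id": "b", "topic": "pets"},
    {"id": "c", "topic": "finance"},
]
# Stand-in dedupe step for the demo: keep the first record of each bucket.
result = dedupe_within_candidates(records, lambda r: r["topic"], lambda b: b[:1])
print([r["id"] for r in result])  # ['a', 'c']
```

In real use, dedupe_fn could be something like lambda bucket: dedupe(bucket, threshold=0.95). Two caveats apply: near-duplicates that land in different buckets are never compared, and survivors come back grouped by bucket rather than in strict input order.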
API differences from the JS sibling
- JS: dedupeEmbeddings(items, { threshold }) returning { kept, duplicates }. Python: dedupe(records, threshold=0.95, key=, vector=, keep=) returning a flat list[dict] of survivors.
- The default threshold is 0.95 (was 0.98 in JS) -- tuned for typical OpenAI/Anthropic embedding noise.
- New keep="longest" strategy mirrors a common request -- prefer the most informative chunk in each cluster.
See the JS sibling for the original heuristics and design notes.
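If you are porting code that relied on the JS return shape, the dropped records can be recovered by comparing the input against the survivors. The helper below is a hypothetical sketch, not part of the package; it assumes ids are unique and hashable, and the exact structure of the JS duplicates field may differ from the plain list used here.

```python
def split_kept_duplicates(records, survivors, key="id"):
    """Rebuild a {kept, duplicates} style result from a flat survivor list."""
    kept_ids = {rec[key] for rec in survivors}
    return {
        "kept": [r for r in records if r[key] in kept_ids],
        "duplicates": [r for r in records if r[key] not in kept_ids],
    }


records = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
survivors = [{"id": "a"}, {"id": "c"}]  # e.g. the list returned by dedupe(...)
result = split_kept_duplicates(records, survivors)
print([r["id"] for r in result["kept"]])        # ['a', 'c']
print([r["id"] for r in result["duplicates"]])  # ['b']
```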
File details
Details for the file embedding_dedupe-0.1.0.tar.gz.
File metadata
- Download URL: embedding_dedupe-0.1.0.tar.gz
- Upload date:
- Size: 5.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d8525a9d2bd3a48386f5993ed2c5d034ea731432b53234a77531906d0ab4a249 |
| MD5 | 2ce7ee39b1e7abf4490e5d74610ced8f |
| BLAKE2b-256 | 2e28adf7b7c4ebc1416aadc2ad1ed14511cb1f7edb90d9bdd53d8ab1ee6eafff |
File details
Details for the file embedding_dedupe-0.1.0-py3-none-any.whl.
File metadata
- Download URL: embedding_dedupe-0.1.0-py3-none-any.whl
- Upload date:
- Size: 5.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 24cab99651b23ae5df617292cea0efc806cfd46472513a9bd759d2df0ffb14f1 |
| MD5 | ee688e7af6290521810a831a32b89d94 |
| BLAKE2b-256 | 0039c1b84676ee70f9498aea7d56d0a328990680c3fc0228b0f7ec561906af3f |