Project description

embedding-dedupe


Deduplicate near-identical embedding records by cosine similarity. Pure Python, zero runtime dependencies.

Python port of @mukundakatta/embedding-dedupe. Same algorithm, ergonomic Python API.

Install

pip install embedding-dedupe
# Optional: faster cosine via numpy
pip install "embedding-dedupe[numpy]"

Usage

from embedding_dedupe import dedupe

records = [
    {"id": "a", "embedding": [0.10, 0.20, 0.30], "text": "Cats are great."},
    {"id": "b", "embedding": [0.10, 0.20, 0.31], "text": "Cats are great pets."},   # near-dup of a
    {"id": "c", "embedding": [0.90, 0.10, 0.05], "text": "Stock prices fell."},
]

# Default: keep the lowest-id record from each cluster (deterministic).
dedupe(records, threshold=0.95)
# -> [
#      {"id": "a", "embedding": [0.10, 0.20, 0.30], "text": "Cats are great."},
#      {"id": "c", "embedding": [0.90, 0.10, 0.05], "text": "Stock prices fell."},
#    ]

# Or: keep the record with the longest text (ties -> lowest id).
dedupe(records, threshold=0.95, keep="longest")
# -> [
#      {"id": "b", "embedding": [0.10, 0.20, 0.31], "text": "Cats are great pets."},
#      {"id": "c", "embedding": [0.90, 0.10, 0.05], "text": "Stock prices fell."},
#    ]

API

dedupe(
    records,                  # list[dict]
    threshold=0.95,           # cosine sim above which two records are duplicates
    key="id",                 # record id field name
    vector="embedding",       # embedding field name
    keep="first",             # "first" (lowest-id) or "longest" (longest .text)
) -> list[dict]

cosine(a, b) is also exported for ad-hoc use.
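A pure-Python cosine helper compatible with this API is straightforward; the sketch below is illustrative and may differ from the package's actual implementation (e.g. its handling of zero vectors is an assumption here):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors, in [-1.0, 1.0]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch; the library may differ
    return dot / (norm_a * norm_b)

# The two "cat" embeddings from the Usage example are near-identical:
cosine([0.10, 0.20, 0.30], [0.10, 0.20, 0.31])  # > 0.999, well above threshold=0.95
```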

Algorithm

Greedy single-link clustering: scan records in input order, placing each into the first existing cluster whose anchor vector is within threshold; otherwise, start a new cluster. From each cluster, return one survivor according to keep. Cluster output order matches input order (i.e. the order in which clusters were created).
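The loop described above can be sketched in a few lines (a self-contained illustration, not the package's source; it assumes the anchor is each cluster's first member, and `_cosine` is a local helper):

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_sketch(records, threshold=0.95, key="id", vector="embedding", keep="first"):
    clusters = []  # lists of records, in creation order
    for rec in records:
        for cluster in clusters:
            # Compare against the cluster's anchor: its first member's vector.
            if _cosine(cluster[0][vector], rec[vector]) >= threshold:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])  # no cluster matched: start a new one
    survivors = []
    for cluster in clusters:
        if keep == "longest":
            # Longest .text wins; ties break toward the lowest id.
            survivors.append(min(cluster, key=lambda r: (-len(r.get("text", "")), r[key])))
        else:
            survivors.append(min(cluster, key=lambda r: r[key]))  # "first": lowest id
    return survivors
```

On the three records from the Usage section, this sketch reproduces the documented outputs.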

O(n * k) where k is the cluster count -- fine up to ~100k records. For larger sets, plug in an ANN index upstream and dedupe within candidates.

API differences from the JS sibling

  • JS exposes dedupeEmbeddings(items, { threshold }) and returns { kept, duplicates }; Python exposes dedupe(records, threshold=0.95, key=, vector=, keep=) and returns a flat list[dict] of survivors.
  • The default threshold is 0.95 (was 0.98 in JS) -- tuned for typical OpenAI/Anthropic embedding noise.
  • New keep="longest" strategy mirrors a common request -- prefer the most informative chunk in each cluster.
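Callers who want the JS-style { kept, duplicates } split back can derive it from the flat survivor list. A small sketch (a hypothetical helper, assuming id values are unique):

```python
def split_kept_duplicates(records, survivors, key="id"):
    """Partition records into a JS-style kept/duplicates dict using survivor ids."""
    kept_ids = {r[key] for r in survivors}
    return {
        "kept": [r for r in records if r[key] in kept_ids],
        "duplicates": [r for r in records if r[key] not in kept_ids],
    }

# Example with hypothetical survivors (ids "a" and "c" kept):
records = [{"id": i} for i in "abc"]
result = split_kept_duplicates(records, [{"id": "a"}, {"id": "c"}])
```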

See the JS sibling for the original heuristics and design notes.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

embedding_dedupe-0.1.0.tar.gz (5.7 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

embedding_dedupe-0.1.0-py3-none-any.whl (5.7 kB)

Uploaded Python 3

File details

Details for the file embedding_dedupe-0.1.0.tar.gz.

File metadata

  • Download URL: embedding_dedupe-0.1.0.tar.gz
  • Upload date:
  • Size: 5.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for embedding_dedupe-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d8525a9d2bd3a48386f5993ed2c5d034ea731432b53234a77531906d0ab4a249
MD5 2ce7ee39b1e7abf4490e5d74610ced8f
BLAKE2b-256 2e28adf7b7c4ebc1416aadc2ad1ed14511cb1f7edb90d9bdd53d8ab1ee6eafff

See more details on using hashes here.

File details

Details for the file embedding_dedupe-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for embedding_dedupe-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 24cab99651b23ae5df617292cea0efc806cfd46472513a9bd759d2df0ffb14f1
MD5 ee688e7af6290521810a831a32b89d94
BLAKE2b-256 0039c1b84676ee70f9498aea7d56d0a328990680c3fc0228b0f7ec561906af3f

See more details on using hashes here.
