Project description

embedding-dedupe


Deduplicate near-identical embedding records by cosine similarity. Pure Python, zero runtime dependencies.

Python port of @mukundakatta/embedding-dedupe. Same algorithm, ergonomic Python API.

Install

pip install embedding-dedupe
# Optional: faster cosine via numpy
pip install "embedding-dedupe[numpy]"

Usage

from embedding_dedupe import dedupe

records = [
    {"id": "a", "embedding": [0.10, 0.20, 0.30], "text": "Cats are great."},
    {"id": "b", "embedding": [0.10, 0.20, 0.31], "text": "Cats are great pets."},   # near-dup of a
    {"id": "c", "embedding": [0.90, 0.10, 0.05], "text": "Stock prices fell."},
]

# Default: keep the lowest-id record from each cluster (deterministic).
dedupe(records, threshold=0.95)
# -> [
#      {"id": "a", "embedding": [0.10, 0.20, 0.30], "text": "Cats are great."},
#      {"id": "c", "embedding": [0.90, 0.10, 0.05], "text": "Stock prices fell."},
#    ]

# Or: keep the record with the longest text (ties -> lowest id).
dedupe(records, threshold=0.95, keep="longest")
# -> [
#      {"id": "b", "embedding": [0.10, 0.20, 0.31], "text": "Cats are great pets."},
#      {"id": "c", "embedding": [0.90, 0.10, 0.05], "text": "Stock prices fell."},
#    ]

API

dedupe(
    records,                  # list[dict]
    threshold=0.95,           # cosine sim above which two records are duplicates
    key="id",                 # record id field name
    vector="embedding",       # embedding field name
    keep="first",             # "first" (lowest-id) or "longest" (longest .text)
) -> list[dict]

cosine(a, b) is also exported for ad-hoc use.
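A pure-Python cosine helper compatible with this API is straightforward; the sketch below is illustrative and may differ from the package's actual implementation (e.g. its handling of zero vectors is an assumption here):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors, in [-1.0, 1.0]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch; the library may differ
    return dot / (norm_a * norm_b)

# The two "cat" embeddings from the Usage example are near-identical:
cosine([0.10, 0.20, 0.30], [0.10, 0.20, 0.31])  # > 0.999, well above threshold=0.95
```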

Algorithm

Greedy single-link clustering: scan records in input order, placing each into the first existing cluster whose anchor vector is within threshold; otherwise, start a new cluster. From each cluster, return one survivor according to keep. Cluster output order matches input order (i.e. the order in which clusters were created).
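The loop described above can be sketched in a few lines (a self-contained illustration, not the package's source; it assumes the anchor is each cluster's first member, and `_cosine` is a local helper):

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_sketch(records, threshold=0.95, key="id", vector="embedding", keep="first"):
    clusters = []  # lists of records, in creation order
    for rec in records:
        for cluster in clusters:
            # Compare against the cluster's anchor: its first member's vector.
            if _cosine(cluster[0][vector], rec[vector]) >= threshold:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])  # no cluster matched: start a new one
    survivors = []
    for cluster in clusters:
        if keep == "longest":
            # Longest .text wins; ties break toward the lowest id.
            survivors.append(min(cluster, key=lambda r: (-len(r.get("text", "")), r[key])))
        else:
            survivors.append(min(cluster, key=lambda r: r[key]))  # "first": lowest id
    return survivors
```

On the three records from the Usage section, this sketch reproduces the documented outputs.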

O(n * k) where k is the cluster count -- fine up to ~100k records. For larger sets, plug in an ANN index upstream and dedupe within candidates.

API differences from the JS sibling

  • JS exposes dedupeEmbeddings(items, { threshold }) and returns { kept, duplicates }; Python exposes dedupe(records, threshold=0.95, key=, vector=, keep=) and returns a flat list[dict] of survivors.
  • The default threshold is 0.95 (was 0.98 in JS) -- tuned for typical OpenAI/Anthropic embedding noise.
  • New keep="longest" strategy mirrors a common request -- prefer the most informative chunk in each cluster.
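Callers who want the JS-style { kept, duplicates } split back can derive it from the flat survivor list. A small sketch (a hypothetical helper, assuming id values are unique):

```python
def split_kept_duplicates(records, survivors, key="id"):
    """Partition records into a JS-style kept/duplicates dict using survivor ids."""
    kept_ids = {r[key] for r in survivors}
    return {
        "kept": [r for r in records if r[key] in kept_ids],
        "duplicates": [r for r in records if r[key] not in kept_ids],
    }

# Example with hypothetical survivors (ids "a" and "c" kept):
records = [{"id": i} for i in "abc"]
result = split_kept_duplicates(records, [{"id": "a"}, {"id": "c"}])
```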

See the JS sibling for the original heuristics and design notes.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

embedding_dedupe-0.1.0.tar.gz (5.7 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

embedding_dedupe-0.1.0-py3-none-any.whl (5.7 kB)

Uploaded Python 3

File details

Details for the file embedding_dedupe-0.1.0.tar.gz.

File metadata

  • Download URL: embedding_dedupe-0.1.0.tar.gz
  • Upload date:
  • Size: 5.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.4

File hashes

Hashes for embedding_dedupe-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d8525a9d2bd3a48386f5993ed2c5d034ea731432b53234a77531906d0ab4a249
MD5 2ce7ee39b1e7abf4490e5d74610ced8f
BLAKE2b-256 2e28adf7b7c4ebc1416aadc2ad1ed14511cb1f7edb90d9bdd53d8ab1ee6eafff

See more details on using hashes here.

File details

Details for the file embedding_dedupe-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for embedding_dedupe-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 24cab99651b23ae5df617292cea0efc806cfd46472513a9bd759d2df0ffb14f1
MD5 ee688e7af6290521810a831a32b89d94
BLAKE2b-256 0039c1b84676ee70f9498aea7d56d0a328990680c3fc0228b0f7ec561906af3f

See more details on using hashes here.
