Text deduplication with fuzzy match and more

text-dedup

Features

  • State-of-the-art sentence embeddings with sentence-transformers
  • Fast de-duplication with Annoy (approximate nearest-neighbor search)
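To make the idea behind these two features concrete, here is a minimal sketch of grouping near-duplicates by embedding similarity. It is not the library's implementation: the toy 2-D vectors stand in for sentence embeddings, a brute-force cosine comparison stands in for Annoy, and the names `cosine`, `group_by_similarity`, and the `threshold` value are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_similarity(vectors, threshold=0.9):
    """Return one group id per vector; vectors linked by a similarity
    of at least `threshold` end up with the same id (union-find)."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute-force pairwise comparison; an approximate index would replace this loop.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)  # smallest index becomes the root
    return [find(i) for i in range(len(vectors))]


# Hypothetical embeddings: the first two point in nearly the same direction.
toy_vectors = [
    [1.0, 0.0],
    [0.98, 0.2],
    [0.0, 1.0],
]
groups = group_by_similarity(toy_vectors)
print(groups)
```

The pairwise loop is quadratic in the number of texts, which is the cost an approximate nearest-neighbor index is meant to avoid on larger datasets.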

Installation

pip install text-dedup

Usage

import pandas as pd
from text_dedup import SentenceTransformerDeduper

# Load a table that has a "text" column
df = pd.read_csv('...')

# Assign a group index to every text
deduper = SentenceTransformerDeduper("distilbert-base-nli-stsb-mean-tokens")
df["group"] = deduper.group(df["text"].values.tolist(), show_progress_bar=True)

# Deduplicate by keeping the first row of each group
df = df.drop_duplicates(["group"], keep="first")
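The final step keeps one representative per group. As a rough illustration of what that does, the sketch below applies the same keep-first rule to plain Python lists without pandas. The helper name `keep_first_per_group` and the sample texts and group indices are hypothetical; the actual values returned by `deduper.group` may look different.

```python
def keep_first_per_group(texts, groups):
    """Keep the first text seen for each group index, preserving input order."""
    seen = set()
    kept = []
    for text, group in zip(texts, groups):
        if group not in seen:
            seen.add(group)
            kept.append(text)
    return kept


# Hypothetical texts and group indices, for illustration only.
texts = [
    "How do I learn Python?",
    "What is the best way to learn Python?",
    "Where is Paris?",
]
groups = [0, 0, 2]
unique_texts = keep_first_per_group(texts, groups)
print(unique_texts)
```

Keeping the first occurrence means the result depends on row order, so sorting the data beforehand (for example by length or recency) is one way to control which duplicate survives.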

Benchmark (on a P100 GPU)

20k (5%) subset of QQP (Quora Question Pairs)

              precision    recall  f1-score   support

       False       0.75      0.89      0.81     12671
        True       0.73      0.50      0.60      7543

    accuracy                           0.75     20214
   macro avg       0.74      0.70      0.71     20214
weighted avg       0.74      0.75      0.73     20214
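For readers unfamiliar with the two average rows, the following sketch recomputes them from the per-class rows of the report above. The macro average is the unweighted mean over classes, and the weighted average weights each class by its support. Because the inputs are already rounded to two decimals, the results should only be expected to match the reported rows to within about 0.01. The helper names `macro` and `weighted` are illustrative.

```python
# Per-class scores copied from the report: (precision, recall, f1, support).
report = {
    "False": (0.75, 0.89, 0.81, 12671),
    "True": (0.73, 0.50, 0.60, 7543),
}

total_support = sum(row[3] for row in report.values())


def macro(idx):
    """Unweighted mean of one metric across classes."""
    return sum(row[idx] for row in report.values()) / len(report)


def weighted(idx):
    """Mean of one metric across classes, weighted by class support."""
    return sum(row[idx] * row[3] for row in report.values()) / total_support


for name, idx in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:>9}  macro={macro(idx):.3f}  weighted={weighted(idx):.3f}")
```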


--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s)         Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_scaling         89.9221  89.9221  89.9221  0.0000  89.9221  0.0000       0;0  0.0111       1          10
-------------------------------------------------------------------------------------------------------------
