Text deduplication with fuzzy match and more

text-dedup

Features

Installation

pip install text-dedup

Usage

  • Using Sentence Transformer
import pandas as pd
from text_dedup import SentenceTransformerDeduper

df = pd.read_csv('...')

deduper = SentenceTransformerDeduper("distilbert-base-nli-stsb-mean-tokens")
# one group index per row of df["text"]
df["group"] = deduper.group(df["text"].values.tolist(), show_progress_bar=True)

# dedup with group indices: keep the first row of each group
df = df.drop_duplicates(["group"], keep="first")
  • Using Suffix Array for exact match
import pandas as pd
from text_dedup import SuffixArray

df = pd.read_csv('...')

deduper = SuffixArray(k=50)
groups, duplicates = deduper.fit_transform(df["text"].values.tolist())

# groups has one row per input text and one column per detected duplicate
assert len(groups) == len(df), "Invalid number of rows"
assert len(duplicates) == groups.shape[1], "Invalid number of columns"
  • Using MinHash for fuzzy match
from text_dedup import MinHashDeduper

deduper = MinHashDeduper(ngram_size=5, threshold=0.3)
groups = deduper.fit_transform([
    "This is a sentence.",
    "This is another sentence.",
    "This is a question.",
    "hello world",
])
assert groups == [0, 0, 2, 3]
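
For readers unfamiliar with MinHash, the idea behind the fuzzy match can be sketched with the standard library alone. This is a minimal illustration, not the code text-dedup runs: the helper names (`char_ngrams`, `minhash_signature`, `estimated_jaccard`, `group_near_duplicates`), the use of character n-grams, the salted BLAKE2b hashing, and the reading of `threshold` as a minimum similarity are all assumptions made for the example.

```python
import hashlib
from typing import List, Set


def char_ngrams(text: str, n: int = 5) -> Set[str]:
    """Return the set of character n-grams of the lowercased text."""
    text = text.lower()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(tokens: Set[str], num_perm: int = 64) -> List[int]:
    """Keep the minimum hash value under each of num_perm salted hash functions."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(tok.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for tok in tokens
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def group_near_duplicates(texts: List[str], threshold: float = 0.5,
                          ngram_size: int = 5, num_perm: int = 64) -> List[int]:
    """Label each text with the group of the first earlier text it resembles."""
    sigs = [minhash_signature(char_ngrams(t, ngram_size), num_perm) for t in texts]
    groups = list(range(len(texts)))  # every text starts in its own group
    for i in range(len(texts)):
        for j in range(i):
            if estimated_jaccard(sigs[i], sigs[j]) >= threshold:
                groups[i] = groups[j]
                break
    return groups
```

The pairwise loop is quadratic in the number of texts, which is fine for a demonstration; larger collections usually bucket signatures with locality-sensitive hashing before comparing candidates.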

Benchmark (with a P100 GPU)

20k (5%) subset of QQP (Quora Question Pairs)

              precision    recall  f1-score   support

       False       0.75      0.89      0.81     12671
        True       0.73      0.50      0.60      7543

    accuracy                           0.75     20214
   macro avg       0.74      0.70      0.71     20214
weighted avg       0.74      0.75      0.73     20214
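
As a reading aid for the report above, the summary rows can be recomputed from the per-class rows. The sketch below hard-codes the rounded values from the table, so results can differ from the printed ones in the last digit; the names `rows`, `f1`, `macro_avg` and `weighted_avg` are introduced here for illustration only.

```python
# Per-class rows copied from the report: (precision, recall, f1-score, support)
rows = {
    "False": (0.75, 0.89, 0.81, 12671),
    "True": (0.73, 0.50, 0.60, 7543),
}

total_support = sum(r[3] for r in rows.values())


def f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_avg(col: int) -> float:
    """Unweighted mean of a column across classes."""
    return sum(r[col] for r in rows.values()) / len(rows)


def weighted_avg(col: int) -> float:
    """Mean of a column across classes, weighted by class support."""
    return sum(r[col] * r[3] for r in rows.values()) / total_support


for name, col in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:>9}: macro={macro_avg(col):.3f} weighted={weighted_avg(col):.3f}")
```

The macro average treats both classes equally, while the weighted average leans toward the larger `False` class, which is why the two rows differ.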


--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s)         Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_scaling         89.9221  89.9221  89.9221  0.0000  89.9221  0.0000       0;0  0.0111       1          10
-------------------------------------------------------------------------------------------------------------
