Text deduplication with fuzzy match and more
text-dedup
Features
- State-of-the-art sentence embeddings with sentence-transformers
- Fast deduplication with annoy (approximate nearest-neighbor search)
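The two features combine into a simple idea: embed each text, find texts whose embeddings are close, and put them in the same group. The sketch below illustrates that idea only and is not the library's implementation. It assumes a toy character-trigram embedding in place of a sentence-transformers model and a brute-force pairwise search in place of annoy; the function names and the 0.8 threshold are hypothetical.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy embedding: character n-gram counts (a stand-in, NOT a sentence-transformer)."""
    s = f" {text.lower()} "
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group(texts, threshold=0.8):
    """Return one group id per text; texts linked by similarity >= threshold share an id."""
    vectors = [embed(t) for t in texts]
    parent = list(range(len(texts)))  # union-find forest, one node per text

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute-force comparison of all pairs; an approximate nearest-neighbor
    # index such as annoy would replace this quadratic loop on large inputs.
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(len(texts))]


texts = [
    "How do I learn Python?",
    "How do I learn python",
    "What is the capital of France?",
]
groups = group(texts)  # -> [0, 0, 2]
```

The first two questions differ only in case and punctuation, so they land in the same group, while the third stays on its own.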
Installation
pip install text-dedup
Usage
import pandas as pd

from text_dedup import SentenceTransformerDeduper

# Load a dataset that has a "text" column
df = pd.read_csv('...')
deduper = SentenceTransformerDeduper("distilbert-base-nli-stsb-mean-tokens")
# Assign a group index to every text
df["group"] = deduper.group(df["text"].values.tolist(), show_progress_bar=True)
# dedup with group indices
df = df.drop_duplicates(["group"], keep="first")
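The final step keeps the first row of each group and drops the rest. A minimal pure-Python sketch of that behavior, with hypothetical rows and group ids, may make the effect of keep="first" easier to see:

```python
def keep_first_per_group(rows, groups):
    """Keep the first row seen for each group id, preserving input order."""
    seen = set()
    kept = []
    for row, group_id in zip(rows, groups):
        if group_id not in seen:
            seen.add(group_id)
            kept.append(row)
    return kept


rows = ["a", "a!", "b", "a?"]  # hypothetical texts
groups = [0, 0, 2, 0]          # hypothetical group ids, one per row
deduped = keep_first_per_group(rows, groups)  # -> ["a", "b"]
```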
Benchmark (on a P100 GPU)
20k (5%) subset of QQP (Quora Question Pairs)
              precision    recall  f1-score   support
       False       0.75      0.89      0.81     12671
        True       0.73      0.50      0.60      7543
    accuracy                           0.75     20214
   macro avg       0.74      0.70      0.71     20214
weighted avg       0.74      0.75      0.73     20214
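The report above appears to follow the usual classification-report layout, where F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean over classes, and the weighted average uses each class's support as its weight. The sketch below recomputes those figures from the rounded values in the table; because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the report above (already rounded to two decimals)
report = {
    "False": {"precision": 0.75, "recall": 0.89, "f1": 0.81, "support": 12671},
    "True": {"precision": 0.73, "recall": 0.50, "f1": 0.60, "support": 7543},
}

total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(row["f1"] * row["support"] for row in report.values()) / total_support

print(f"total support: {total_support}")
print(f"macro F1:      {macro_f1:.3f}")
print(f"weighted F1:   {weighted_f1:.3f}")
```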
--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------
test_scaling 89.9221 89.9221 89.9221 0.0000 89.9221 0.0000 0;0 0.0111 1 10
-------------------------------------------------------------------------------------------------------------
Download files
Source Distribution
text-dedup-0.0.9.tar.gz (5.4 kB)
Built Distribution
Hashes for text_dedup-0.0.9-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 701433960e798d483bb9d6571cdf180c90de7f6cdd88ff47363ffd8ce8b50b85
MD5 | 1bf19bc10bcb553ca7d13ffb05b9850d
BLAKE2b-256 | 80aaf0c398bc7830652c03ca1487bd4b9d9b0bd31fe8d3649cb27f95ae57ab79