Text deduplication with fuzzy match and more
text-dedup
Features
- State-of-the-art sentence embeddings with sentence-transformers
- Fast approximate nearest-neighbor de-duplication with annoy
- Suffix array de-duplication, following Deduplicating Training Data Makes Language Models Better
Installation
pip install text-dedup
Usage
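# Near-duplicate grouping with sentence embeddings
import pandas as pd
from text_dedup import SentenceTransformerDeduper

df = pd.read_csv('...')
deduper = SentenceTransformerDeduper("distilbert-base-nli-stsb-mean-tokens")
# assign a group index to every row; rows judged to be duplicates share an index
df["group"] = deduper.group(df["text"].values.tolist(), show_progress_bar=True)
# dedup with group indices, keeping the first row of each group
df = df.drop_duplicates(["group"], keep="first")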
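To make the idea of group indices concrete, here is a minimal standard-library sketch of grouping vectors by cosine similarity. It is not the library's implementation: the function names, the toy vectors, the 0.9 threshold, and the exhaustive pairwise comparison (in place of an annoy index) are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(vectors: Sequence[Sequence[float]], threshold: float = 0.9) -> List[int]:
    """Return one group index per vector; similar vectors share an index.

    Hypothetical helper: compares every pair (O(n^2)) and merges pairs whose
    similarity reaches the threshold, using a small union-find structure.
    """
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as the root so that each group
                    # is labelled by its first member.
                    parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(len(vectors))]


# Toy 2-d "embeddings": rows 0 and 1 are close, rows 2 and 3 are close.
toy_vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
groups = group_by_similarity(toy_vectors)

# Keep the first row of each group, mirroring drop_duplicates(keep="first").
seen = set()
keep = []
for row, group in enumerate(groups):
    if group not in seen:
        seen.add(group)
        keep.append(row)

print(groups)  # [0, 0, 2, 2]
print(keep)    # [0, 2]
```

The pairwise loop is only practical for small inputs; an approximate nearest-neighbor index avoids comparing every pair.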
# Duplicate detection with a suffix array
import pandas as pd
from text_dedup import SuffixArray

df = pd.read_csv('...')
deduper = SuffixArray(k=50)
groups, duplicates = deduper.fit_transform(df["text"].values.tolist())
# groups has one row per input text and one column per entry in duplicates
assert len(groups) == len(df), "Invalid number of rows"
assert len(duplicates) == groups.shape[1], "Invalid number of columns"
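The sketch below illustrates the general suffix-array idea on a single string: sort all suffixes, then compare neighbours to find substrings that occur more than once. It is a naive illustration rather than the library's code. The function name is hypothetical, the spans are in characters, and treating `k` as a minimum match length is an assumption about what the parameter means.

```python
from typing import List, Tuple


def repeated_spans(text: str, k: int) -> List[Tuple[int, int]]:
    """Return merged (start, end) spans covered by substrings of length >= k
    that occur at least twice in text.

    Hypothetical helper: builds a naive suffix array by sorting suffix start
    positions, which is fine for short strings but slow for large corpora.
    """
    suffixes = sorted(range(len(text)), key=lambda i: text[i:])

    spans = []
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the longest common prefix of two adjacent suffixes.
        lcp = 0
        while (
            a + lcp < len(text)
            and b + lcp < len(text)
            and text[a + lcp] == text[b + lcp]
        ):
            lcp += 1
        if lcp >= k:
            spans.append((a, a + lcp))
            spans.append((b, b + lcp))

    # Merge overlapping or touching spans into maximal ranges.
    spans.sort()
    merged: List[Tuple[int, int]] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# "abc" occurs twice, at offsets 0 and 4.
print(repeated_spans("abcXabcY", k=3))  # [(0, 3), (4, 7)]
# With a higher minimum length nothing qualifies.
print(repeated_spans("abcXabcY", k=4))  # []
```

Adjacent suffixes in sorted order share the longest prefixes, so checking neighbours is enough to find every repeated substring of length at least `k`.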
Benchmark (with a P100 GPU)
20k (5%) subset of QQP (Quora Question Pairs)
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| False | 0.75 | 0.89 | 0.81 | 12671 |
| True | 0.73 | 0.50 | 0.60 | 7543 |
| accuracy | | | 0.75 | 20214 |
| macro avg | 0.74 | 0.70 | 0.71 | 20214 |
| weighted avg | 0.74 | 0.75 | 0.73 | 20214 |
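As a reading aid for the report above, the following sketch recomputes the averaged rows from the two per-class rows. The input numbers are copied from the report; the helper names are made up for illustration. Because the inputs are already rounded to two decimals, the recomputed values agree with the reported ones only to within about 0.01.

```python
# Per-class rows copied from the report above.
rows = {
    "False": {"precision": 0.75, "recall": 0.89, "f1": 0.81, "support": 12671},
    "True": {"precision": 0.73, "recall": 0.50, "f1": 0.60, "support": 7543},
}

total_support = sum(r["support"] for r in rows.values())


def f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_avg(metric: str) -> float:
    """Unweighted mean over classes."""
    return sum(r[metric] for r in rows.values()) / len(rows)


def weighted_avg(metric: str) -> float:
    """Mean over classes, weighted by each class's support."""
    return sum(r[metric] * r["support"] for r in rows.values()) / total_support


print("total support:", total_support)
for name, r in rows.items():
    print(name, "f1 from precision/recall:", round(f1(r["precision"], r["recall"]), 3))
for metric in ("precision", "recall", "f1"):
    print(metric, "macro:", round(macro_avg(metric), 3), "weighted:", round(weighted_avg(metric), 3))
```

The weighted average of recall is the same quantity as overall accuracy, which is why both rows report 0.75 in the recall and f1-score columns respectively.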
--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------
test_scaling 89.9221 89.9221 89.9221 0.0000 89.9221 0.0000 0;0 0.0111 1 10
-------------------------------------------------------------------------------------------------------------