Text deduplication with fuzzy match and more
text-dedup
Features
- SOTA embeddings with sentence-transformers
- Fast de-duplication with annoy
- Suffix Array and MinHash deduplication, following Deduplicating Training Data Makes Language Models Better
Installation
pip install text-dedup
Usage
- Using Sentence Transformer
import pandas as pd
from text_dedup import SentenceTransformerDeduper

df = pd.read_csv('...')

deduper = SentenceTransformerDeduper("distilbert-base-nli-stsb-mean-tokens")
df["group"] = deduper.group(df["text"].values.tolist(), show_progress_bar=True)

# Deduplicate by keeping only the first row of each group.
df = df.drop_duplicates(["group"], keep="first")
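Rows that end up with the same group value were judged near-duplicates by the embedding search, so it can be worth eyeballing the clusters before dropping them. A minimal sketch (plain pandas, run before the drop_duplicates call above; nothing here is part of the text-dedup API):

# Sketch only: inspect the largest near-duplicate clusters before dropping them.
cluster_sizes = df.groupby("group").size().sort_values(ascending=False)
print(cluster_sizes.head())
print(df.loc[df["group"] == cluster_sizes.index[0], "text"].tolist())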
- Using Suffix Array for exact match
import pandas as pd
from text_dedup import SuffixArray

df = pd.read_csv('...')

deduper = SuffixArray(k=50)
groups, duplicates = deduper.fit_transform(df["text"].values.tolist())

# One row per input document, one column per duplicated substring found.
assert len(groups) == len(df), "Invalid number of rows"
assert len(duplicates) == groups.shape[1], "Invalid number of columns"
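Judging from the asserts, groups comes back with one row per document and one column per duplicated substring in duplicates. A hedged sketch of dropping every document that contains a duplicated span, assuming groups behaves like a boolean NumPy-style matrix (the snippet above does not guarantee this):

import numpy as np

# Assumption: a truthy entry at row i, column j means document i contains
# duplicated substring j. Keep only documents with no duplicated spans.
has_duplicated_span = np.asarray(groups, dtype=bool).any(axis=1)
df_clean = df[~has_duplicated_span]
print(f"Kept {len(df_clean)} of {len(df)} documents")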
- Using MinHash for fuzzy match
from text_dedup import MinHashDeduper

deduper = MinHashDeduper(ngram_size=5, threshold=0.3)
groups = deduper.fit_transform(["This is a sentence.", "This is another sentence.", "This is a question.", "hello world"])

# The first two sentences share group 0; the other two get their own groups.
assert groups == [0, 0, 2, 3]
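The returned group indices plug back into pandas exactly as in the sentence-transformer example; a small sketch (the DataFrame below is made up for illustration):

import pandas as pd

texts = ["This is a sentence.", "This is another sentence.", "This is a question.", "hello world"]
df = pd.DataFrame({"text": texts, "group": deduper.fit_transform(texts)})

# The first two sentences share group 0, so only one of them survives.
df = df.drop_duplicates(["group"], keep="first")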
Benchmark (on a P100 GPU)
20k (5%) QQP subset
              precision    recall  f1-score   support

       False       0.75      0.89      0.81     12671
        True       0.73      0.50      0.60      7543

    accuracy                           0.75     20214
   macro avg       0.74      0.70      0.71     20214
weighted avg       0.74      0.75      0.73     20214
-------------------------------------------------- benchmark: 1 tests --------------------------------------------------
Name (time in s)        Min        Max       Mean   StdDev     Median      IQR   Outliers      OPS   Rounds   Iterations
-------------------------------------------------------------------------------------------------------------------------
test_scaling        89.9221    89.9221    89.9221   0.0000    89.9221   0.0000        0;0   0.0111        1           10
-------------------------------------------------------------------------------------------------------------------------
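The classification report above is in the layout printed by scikit-learn's classification_report. For context, a hedged sketch of how a QQP-style evaluation could be wired up; the qqp DataFrame, its column names, and the pairing logic are assumptions for illustration, not part of text-dedup:

from sklearn.metrics import classification_report

# Assumption: qqp holds question1, question2 and a gold is_duplicate column,
# and deduper is any text-dedup deduper exposing fit_transform (e.g. MinHashDeduper above).
questions = qqp["question1"].tolist() + qqp["question2"].tolist()
groups = deduper.fit_transform(questions)

# Predict "duplicate" when both questions of a pair land in the same group.
n = len(qqp)
pred = [groups[i] == groups[n + i] for i in range(n)]
print(classification_report(qqp["is_duplicate"].astype(bool), pred))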