Text deduplication with fuzzy match and more
text-dedup
Features
- State-of-the-art embeddings with sentence-transformers
- Fast deduplication with annoy
- Suffix Array and MinHash, following "Deduplicating Training Data Makes Language Models Better"
Installation
pip install text-dedup
Usage
- Using Sentence Transformer
import pandas as pd
from text_dedup import SentenceTransformerDeduper
df = pd.read_csv('...')
deduper = SentenceTransformerDeduper("distilbert-base-nli-stsb-mean-tokens")
# assign a group index to every row; rows in the same group are considered duplicates
df["group"] = deduper.group(df["text"].values.tolist(), show_progress_bar=True)
# dedup with group indices, keeping the first row of each group
df = df.drop_duplicates(["group"], keep="first")
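To make the idea of group indices concrete, here is a minimal, library-independent sketch of grouping by embedding similarity. It is not how `SentenceTransformerDeduper` is implemented (the feature list mentions annoy for fast lookup); the function names, the brute-force pairwise comparison, and the `threshold` value of 0.9 are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_similarity(vectors: List[Sequence[float]], threshold: float = 0.9) -> List[int]:
    """Label each vector with the smallest index in its similarity group.

    Hypothetical helper: compares all pairs (O(n^2)) and merges pairs
    whose cosine similarity is at least `threshold` using union-find.
    """
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    # Keep the smaller index as the root so labels are stable.
                    parent[max(root_i, root_j)] = min(root_i, root_j)
    return [find(i) for i in range(len(vectors))]


# Toy 2-d "embeddings": rows 0, 1 and 3 point in nearly the same direction.
toy_vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [1.0, 0.01]]
toy_groups = group_by_similarity(toy_vectors)
```

With labels like these, keeping the first row per group (as `drop_duplicates` does above) retains one representative of each cluster.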
- Using Suffix Array for exact match
import pandas as pd
from text_dedup import SuffixArray
df = pd.read_csv('...')
deduper = SuffixArray(k=50)
groups, duplicates = deduper.fit_transform(df["text"].values.tolist())
# one row per input text, one column per duplicate found
assert len(groups) == len(df), "Invalid number of rows"
assert len(duplicates) == groups.shape[1], "Invalid number of columns"
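For intuition about exact matching with a suffix array, the sketch below marks every character of a string that lies inside a substring of length at least `k` occurring more than once. It is a naive illustration under stated assumptions, not the `SuffixArray` class above: the function names are hypothetical, it works on a single string, and it builds the suffix array by sorting full suffixes, which is far too slow for real corpora.

```python
from typing import List


def suffix_array(s: str) -> List[int]:
    """Start positions of all suffixes of `s`, in lexicographic order (naive)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of `a` and `b`."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def repeated_mask(s: str, k: int) -> List[bool]:
    """Mark characters covered by a substring of length >= k that repeats.

    Suffixes sharing a long prefix end up adjacent after sorting, so it is
    enough to compare each suffix with its neighbour in the suffix array.
    """
    sa = suffix_array(s)
    mask = [False] * len(s)
    for a, b in zip(sa, sa[1:]):
        lcp = common_prefix_len(s[a:], s[b:])
        if lcp >= k:
            for start in (a, b):
                for i in range(start, start + lcp):
                    mask[i] = True
    return mask


# "abc" occurs twice; with k=3 both occurrences are marked.
toy_mask = repeated_mask("abcXabcY", k=3)
```

Raising `k` (the usage above passes `k=50`) makes the match stricter, since shorter repeated substrings are then ignored.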
- Using MinHash for fuzzy match
from text_dedup import MinHashDeduper
deduper = MinHashDeduper(ngram_size=5, threshold=0.3)
groups = deduper.fit_transform(["This is a sentence.", "This is another sentence.", "This is a question.", "hello world"])
assert groups == [0, 0, 2, 3]
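The sketch below illustrates the general MinHash idea behind fuzzy matching: split each text into n-grams, keep the minimum hash value under several hash functions, and use the fraction of matching minima as an estimate of Jaccard similarity. It is an illustration only and not `MinHashDeduper`'s implementation; the character-level n-grams, lowercasing, salted BLAKE2b hashing, and `num_perm=64` are assumptions, and all function names are hypothetical.

```python
import hashlib
from typing import List, Set


def char_ngrams(text: str, n: int = 5) -> Set[str]:
    """Set of lowercase character n-grams (assumed tokenization)."""
    text = text.lower()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(shingles: Set[str], num_perm: int = 64) -> List[int]:
    """One minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature


def estimate_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of positions where two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


sig_1 = minhash_signature(char_ngrams("This is a sentence."))
sig_2 = minhash_signature(char_ngrams("This is another sentence."))
estimated = estimate_similarity(sig_1, sig_2)  # approximates the exact Jaccard value
```

Two texts would then fall into the same group when the estimate reaches the chosen threshold, which is the role `threshold=0.3` plays in the usage above.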
Benchmark (on a P100 GPU)
20k (5%) subset of QQP (Quora Question Pairs)
              precision    recall  f1-score   support
       False       0.75      0.89      0.81     12671
        True       0.73      0.50      0.60      7543
    accuracy                           0.75     20214
   macro avg       0.74      0.70      0.71     20214
weighted avg       0.74      0.75      0.73     20214
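As a reading aid for the report above, the following sketch recomputes the two average rows from the per-class rows: the macro average is the plain mean over classes, and the weighted average weights each class by its support. The numbers are copied from the table; the helper names are hypothetical. Because the per-class values are already rounded to two decimals, recomputed averages can differ from the reported ones by about 0.01.

```python
# Per-class rows copied from the report above.
rows = {
    "False": {"precision": 0.75, "recall": 0.89, "f1": 0.81, "support": 12671},
    "True": {"precision": 0.73, "recall": 0.50, "f1": 0.60, "support": 7543},
}


def macro_avg(rows: dict, metric: str) -> float:
    """Unweighted mean of a metric over classes."""
    return sum(r[metric] for r in rows.values()) / len(rows)


def weighted_avg(rows: dict, metric: str) -> float:
    """Mean of a metric over classes, weighted by class support."""
    total = sum(r["support"] for r in rows.values())
    return sum(r[metric] * r["support"] for r in rows.values()) / total


total_support = sum(r["support"] for r in rows.values())
for metric in ("precision", "recall", "f1"):
    print(f"{metric}: macro={macro_avg(rows, metric):.3f} "
          f"weighted={weighted_avg(rows, metric):.3f}")
```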
--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-------------------------------------------------------------------------------------------------------------
test_scaling 89.9221 89.9221 89.9221 0.0000 89.9221 0.0000 0;0 0.0111 1 10
-------------------------------------------------------------------------------------------------------------