All-in-one text deduplication
text-dedup
Features
- Hash-based methods (SimHash, and MinHash with LSH) for near-duplicate detection.
- Suffix-array-based method from "Deduplicating Training Data Makes Language Models Better" for exact substring deduplication.
- In-memory or Redis/KeyDB-cached index to handle larger-than-memory datasets.
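
To illustrate the idea behind the MinHash + LSH feature, here is a minimal sketch that uses only the Python standard library. It is not text-dedup's API: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `candidate_pairs`) and the parameters (`k`, `num_perm`, `bands`) are hypothetical and chosen for illustration only. Salted BLAKE2b hashes stand in for random permutations, which is an assumption of this sketch rather than a description of the library's internals.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of word k-grams (shingles) of a text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """Keep the minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salt per simulated permutation
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures: dict[str, list[int]], bands: int = 16) -> set[tuple[str, str]]:
    """LSH banding: documents sharing any identical band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bank",
    "c": "completely unrelated sentence about suffix arrays and hashing",
    "d": "the quick brown fox jumps over the lazy dog near the old river bank",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = candidate_pairs(signatures)
print(sorted(pairs))
print(estimated_jaccard(signatures["a"], signatures["d"]))
```

Only candidate pairs need an exact or estimated similarity check afterwards, which is what keeps the comparison cost below all-pairs. More bands with fewer rows each make the filter more permissive; fewer bands with more rows make it stricter.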
Documentation
Todos
- Memory benchmark for streaming processing
- Speed benchmark for in-memory processing
- Inter-dataset deduplication
- Rewrite suffix array in Python
Thanks
- seomoz/simhash-cpp
- datasketch
- google-research/deduplicate-text-datasets
- Developed with an OSS license from JetBrains
- This project is heavily influenced by the deduplication work at the BigScience workshop. The original code can be found at bigscience-workshop/data-preparation.
File details
Details for the file text-dedup-0.2.0.tar.gz.
File metadata
- Download URL: text-dedup-0.2.0.tar.gz
- Upload date:
- Size: 79.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.0 CPython/3.10.6 Darwin/22.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 36184e9b57ffaf3d434764b9f577cc0f4846ce0d0a913bc49ca610707d1a3d6a |
| MD5 | d23517cdf9e3a51420750a21facb44b5 |
| BLAKE2b-256 | 72d17ce41a5953fb019f89301d9562e10d2bfa74dfac35acff7b463349724e44 |
File details
Details for the file text_dedup-0.2.0-py3-none-any.whl.
File metadata
- Download URL: text_dedup-0.2.0-py3-none-any.whl
- Upload date:
- Size: 34.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.0 CPython/3.10.6 Darwin/22.1.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b47b1e177241425da8ea09fa7484c63de131b08f708f78e00f377582eb03b965 |
| MD5 | a839715af39f3c0db6ff0ab358f38454 |
| BLAKE2b-256 | a5dfe1b786f02446b587d7610257b642a3074484bdd6eb2ff177c1fdd6ba2f41 |