text-dedup

All-in-one text deduplication.
Features
- Hash-based methods such as SimHash and MinHash + LSH for near-deduplication.
- Suffix-array-based method from "Deduplicating Training Data Makes Language Models Better" for exact substring deduplication.
- In-memory or Redis/KeyDB-cached index to handle larger-than-memory datasets.
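The MinHash + LSH approach mentioned above can be illustrated with a small, self-contained Python sketch. This is not text-dedup's actual API; the shingle size, band layout, and sample documents are illustrative assumptions. Each document gets a signature of per-seed minimum hashes over its character shingles, and LSH banding groups documents that agree on any full band:

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64          # signature length (number of hash functions)
BANDS, ROWS = 16, 4    # LSH banding: 16 bands x 4 rows = 64 positions

def shingles(text, k=3):
    """Character k-gram shingles of a whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)} or {text}

def minhash(text):
    """Signature: for each seeded hash function, the minimum over all shingles."""
    def h(seed, s):
        return int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
    grams = shingles(text)
    return [min(h(seed, s) for s in grams) for seed in range(NUM_PERM)]

def lsh_candidates(signatures):
    """Return pairs of keys whose signatures agree on at least one full band."""
    buckets = defaultdict(list)
    for key, sig in signatures.items():
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(key)
    pairs = set()
    for keys in buckets.values():
        pairs.update(combinations(sorted(keys), 2))
    return pairs

# Hypothetical sample corpus: "a" and "b" are near-duplicates.
docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumps over the lazy dog today.",
    "c": "Suffix arrays enable exact substring deduplication.",
}
sigs = {k: minhash(t) for k, t in docs.items()}
pairs = lsh_candidates(sigs)
print(pairs)
```

Because two documents agree at each signature position with probability equal to their shingle Jaccard similarity, near-duplicates are very likely to collide in some band while unrelated documents almost never do; production implementations such as datasketch use the same scheme with faster hashing.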
Documentation
Todos
- Memory benchmark for streaming processing
- Speed benchmark for in-memory processing
- Inter-dataset deduplication
- Rewrite suffix array in Python
Thanks
- seomoz/simhash-cpp
- datasketch
- google-research/deduplicate-text-datasets
- Developed with an OSS license from JetBrains.
- This project is heavily influenced by the deduplication work at the BigScience workshop. The original code can be found at bigscience-workshop/data-preparation.
Download files
- Source distribution: text-dedup-0.2.0.tar.gz (79.8 kB)
- Built distribution: text_dedup-0.2.0-py3-none-any.whl (34.4 kB)
Hashes for text_dedup-0.2.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | b47b1e177241425da8ea09fa7484c63de131b08f708f78e00f377582eb03b965
MD5 | a839715af39f3c0db6ff0ab358f38454
BLAKE2b-256 | a5dfe1b786f02446b587d7610257b642a3074484bdd6eb2ff177c1fdd6ba2f41