text-dedup
All-in-one text deduplication
Features
- Hash-based methods such as SimHash and MinHash + LSH for near deduplication.
- Suffix-array-based method from "Deduplicating Training Data Makes Language Models Better" for exact substring deduplication.
- In-memory or Redis/KeyDB-cached index to handle larger-than-memory datasets.
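To illustrate the MinHash + LSH idea behind near deduplication, here is a self-contained sketch. The shingle size, permutation count, band layout, and salted-MD5 hash family are illustrative choices, not the package's actual API (text-dedup builds on more robust implementations such as datasketch):

```python
# Minimal sketch of MinHash + LSH near-duplicate detection.
# All parameters below are illustrative, not text-dedup's real defaults.
import hashlib
import itertools
from collections import defaultdict

NUM_PERM = 64  # hash permutations per MinHash signature
BANDS = 32     # LSH bands; rows per band = NUM_PERM // BANDS

def shingles(text: str, k: int = 3):
    """Break a document into overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(features, num_perm: int = NUM_PERM):
    """One min-hash per 'permutation', simulated by salting MD5 with i."""
    return [
        min(int(hashlib.md5(f"{i}-{f}".encode()).hexdigest(), 16) for f in features)
        for i in range(num_perm)
    ]

def lsh_buckets(signatures):
    """Documents sharing any band of their signature land in the same bucket."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].append(doc_id)
    return buckets

docs = {
    0: "the quick brown fox jumps over the lazy dog",
    1: "the quick brown fox jumps over the lazy dog today",
    2: "an entirely different sentence about deduplication",
}
sigs = {i: minhash(shingles(t)) for i, t in docs.items()}
candidates = {
    tuple(sorted(pair))
    for ids in lsh_buckets(sigs).values() if len(ids) > 1
    for pair in itertools.combinations(ids, 2)
}
print(candidates)
```

Documents 0 and 1 share most of their shingles, so at least one LSH band almost surely collides and they surface as a candidate pair; document 2 shares none and stays isolated.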
Documentation
Todos
- Memory benchmark for streaming processing
- Speed benchmark for in-memory processing
- Inter-dataset deduplication
- Rewrite suffix array in Python
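Relevant to the last todo, a naive pure-Python sketch of the suffix-array idea behind exact substring deduplication: sort all suffixes, then the longest common prefix between lexicographically adjacent suffixes exposes repeated substrings. This is a toy illustration, not the package's implementation (which follows the far more efficient approach of google-research/deduplicate-text-datasets):

```python
# Toy suffix-array sketch: find the longest substring that occurs twice.

def suffix_array(s: str):
    # Naive O(n^2 log n) construction -- fine for a sketch, not production.
    return sorted(range(len(s)), key=lambda i: s[i:])

def longest_repeated_substring(s: str) -> str:
    sa = suffix_array(s)
    best = ""
    for a, b in zip(sa, sa[1:]):
        # Length of the common prefix of two adjacent suffixes.
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        if n > len(best):
            best = s[a:a + n]
    return best

print(longest_repeated_substring("deduplicate text, deduplicate data"))
```

A deduplicator built on this idea would drop or trim one occurrence of any repeated substring above a length threshold instead of merely reporting it.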
Thanks
- seomoz/simhash-cpp
- datasketch
- google-research/deduplicate-text-datasets
- Developed with an OSS license from JetBrains.
- This project is heavily influenced by the deduplication work at the BigScience workshop. The original code can be found at bigscience-workshop/data-preparation.
Download files
Source Distribution
- text-dedup-0.2.1.tar.gz (24.8 kB)
Built Distribution
- text_dedup-0.2.1-py3-none-any.whl (34.8 kB)
Hashes for text_dedup-0.2.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 24184feae4ffc4160ed998dec48c8498d1368c329802bf4df3bb74a48c15358f
MD5 | de29463d073df9f5cb8d9cfff21f2686
BLAKE2b-256 | 2d469ddc6070843e8aae5bb988241d0f17a20c9d3718be8bb5c369796cb3cc3c