text-dedup
All-in-one text deduplication.
Features
- Hash-based methods such as SimHash, MinHash + LSH for near deduplication.
- Suffix-array-based method from "Deduplicating Training Data Makes Language Models Better" for exact substring deduplication.
- In-memory or Redis/KeyDB-cached index to handle larger-than-memory datasets.
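The near-deduplication idea behind the MinHash + LSH feature can be sketched from scratch: hash each document's character shingles into a fixed-length MinHash signature, then split signatures into bands so that documents sharing any band become candidate duplicates. This is a from-scratch illustration of the technique only, not text-dedup's actual API; all function names and parameters here are hypothetical.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Character k-grams of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def minhash_signature(features: set, num_perm: int = 64) -> list:
    """One min-hash per 'permutation', simulated by seeding blake2b."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode("utf8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for f in features
        ))
    return sig


def lsh_buckets(signatures: dict, bands: int = 16) -> dict:
    """Band each signature; docs sharing any band key are candidate pairs."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    # keep only buckets that actually propose a candidate pair
    return {k: v for k, v in buckets.items() if len(v) > 1}


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "a completely different sentence about deduplication",
}
sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
candidates = lsh_buckets(sigs)
```

With 64 hashes split into 16 bands of 4 rows, two documents with Jaccard similarity around 0.85 (like "a" and "b" above) almost certainly share at least one band, while dissimilar pairs almost never do; the real library additionally verifies candidates and can back the index with Redis/KeyDB instead of an in-memory dict.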
Documentation
Todos
- Memory benchmark for streaming processing
- Speed benchmark for in-memory processing
- Inter-dataset deduplication
- Rewrite suffix array in Python
Thanks
- seomoz/simhash-cpp
- datasketch
- google-research/deduplicate-text-datasets
- Developed with an OSS license from JetBrains
- This project is heavily influenced by the deduplication work at the BigScience workshop. The original code can be found at bigscience-workshop/data-preparation.
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
- text-dedup-0.1.1.tar.gz (19.5 kB)
Built Distribution
- text_dedup-0.1.1-py3-none-any.whl (25.0 kB)
Hashes for text_dedup-0.1.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 6db96cae2fbfe345c9f0f717c97133194faa5c9c426fbccb7c2c7e0d7a685711
MD5 | d9b4cb0aafa0f39b83d3219b617102d9
BLAKE2b-256 | 23f7dd7330cb662a4d3af3265b33ec3a74637c61c764f39533032272b6c6b7f6