All-in-one text deduplication
text-dedup
A collection of data deduplication scripts.
Features
- A single ready-to-use, easy-to-modify script for each method:
  - MinHash + MinHashLSH
  - SimHash
  - SuffixArray Substring
  - Bloom Filter
  - Exact Hash
Acknowledgements
- Datasketch (MIT)
- simhash-py and simhash-cpp (MIT)
- Deduplicating Training Data Makes Language Models Better (Apache 2.0)
- BigScience (Apache 2.0)
- BigCode (Apache 2.0)
Quick Examples
In this section, we are going to deduplicate one dataset: the gl subset of oscar-corpus/OSCAR-2201.
Suffix Array Substring Exact Deduplication
# input
python -m text_dedup.suffix_array \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output_dir "output/suffix_array" \
--index_name "lsh.pkl" \
--graph_name "graph.networkit" \
--dedup_name "oscar_gl_dedup" \
--column "text" \
--google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets"
# output
INFO All : 131.93s
INFO Loading : 4.36s
INFO Preprocessing : 4.81s
INFO Suffix Array : 101.79s
INFO Collect : 5.17s
INFO Restore : 0.27s
INFO Deduplicate : 13.00s
INFO Saving : 2.52s
INFO Before : 180332342 bytes (88803)
INFO After : 97646271 bytes (40404)
INFO Output : output/suffix_array/oscar_gl_dedup
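To give a feel for what substring-level deduplication does, here is a minimal toy sketch. It is not the script above, which delegates the suffix array work to the Google repo passed via --google_repo_path. The function names, the threshold `k`, the separator, and the "keep the first occurrence" policy are all assumptions made for illustration, and the naive suffix sort only suits tiny inputs.

```python
def suffix_array(s):
    # Naive construction: sort suffix start offsets by the suffix text.
    # Fine for a toy example, far too slow for a real corpus.
    return sorted(range(len(s)), key=lambda i: s[i:])


def substring_dedup(docs, k=8, sep="\x00"):
    """Drop later occurrences of any substring of length >= k (toy policy).

    Assumes `sep` never appears inside a document.
    """
    s = sep.join(docs)
    sa = suffix_array(s)
    drop = [False] * len(s)
    i = 0
    while i < len(sa):
        prefix = s[sa[i]:sa[i] + k]
        j = i + 1
        # Suffixes sharing the same k-character prefix are adjacent in the array.
        while j < len(sa) and s[sa[j]:sa[j] + k] == prefix:
            j += 1
        # Ignore windows that are too short or that cross a document boundary.
        if j - i > 1 and len(prefix) == k and sep not in prefix:
            first = min(sa[i:j])
            for p in sa[i:j]:
                if p != first:
                    drop[p:p + k] = [True] * k
        i = j
    kept = "".join(ch for ch, d in zip(s, drop) if not d)
    return kept.split(sep)


docs = ["the cat sat on the mat", "a dog; the cat sat on the rug"]
print(substring_dedup(docs, k=8))
# → ['the cat sat on the mat', 'a dog; rug']
```

Because overlapping windows are merged, a repeated span longer than `k` is removed as a whole. A substring that repeats by overlapping itself may be trimmed more aggressively than "keep the first copy" suggests.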
MinHash Near Deduplication
# input
python -m text_dedup.minhash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output_dir "output/minhash" \
--index_name "lsh.pkl" \
--graph_name "graph.networkit" \
--dedup_name "oscar_gl_dedup" \
--column "text" \
--ngram 1 \
--num_perm 128 \
--threshold 0.8 \
--seed 42
# output
INFO All : 52.73s
INFO Loading : 5.32s
INFO Minhash : 12.82s
INFO Index : 8.54s
INFO Save Index : 3.86s
INFO Query : 4.49s
INFO Clustering : 17.47s
INFO Deduplicate : 0.05s
INFO Save : 0.04s
INFO Before : 88803
INFO After : 43971
INFO Index : output/minhash/lsh.pkl
INFO Graph : output/minhash/graph.networkit
INFO Output : output/minhash/oscar_gl_dedup
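The idea behind the MinHash + LSH pipeline can be sketched in pure Python. This is a simplified illustration, not the script's implementation: the hashing scheme, the `bands` parameter, and the union-find clustering (standing in for the graph-based clustering the log mentions) are assumptions. The defaults mirror the --ngram, --num_perm, --threshold and --seed values used above.

```python
import hashlib
import random
from collections import defaultdict

MERSENNE_PRIME = (1 << 61) - 1
MAX_HASH = (1 << 32) - 1


def ngrams(text, n):
    # Word-level n-grams; an empty text yields a single empty gram.
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def minhash(text, perms, n):
    # Hash each n-gram once, then take the minimum under every permutation.
    hashes = [
        int.from_bytes(hashlib.sha1(g.encode("utf-8")).digest()[:4], "big")
        for g in ngrams(text, n)
    ]
    return tuple(
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashes)
        for a, b in perms
    )


def minhash_dedup(texts, ngram=1, num_perm=128, threshold=0.8, seed=42, bands=16):
    rng = random.Random(seed)
    perms = [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]
    sigs = [minhash(t, perms, ngram) for t in texts]
    rows = num_perm // bands
    parent = list(range(len(texts)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        for band in range(bands):
            # Documents that agree on a whole band become candidate pairs.
            key = (band, sig[band * rows:(band + 1) * rows])
            for other in buckets[key]:
                est = sum(x == y for x, y in zip(sig, sigs[other])) / num_perm
                if est >= threshold:
                    ra, rb = find(idx), find(other)
                    parent[max(ra, rb)] = min(ra, rb)
            buckets[key].append(idx)
    # Keep the earliest document of each cluster.
    return [t for i, t in enumerate(texts) if find(i) == i]


docs = ["the quick brown fox", "the quick brown fox", "completely different words here"]
print(minhash_dedup(docs))
```

The fraction of matching signature positions estimates the Jaccard similarity of the two n-gram sets, and candidate pairs are only merged when that estimate reaches the threshold.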
SimHash Near Deduplication
# input
python -m text_dedup.simhash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output_dir "output/simhash" \
--index_name "index.pkl" \
--graph_name "graph.networkit" \
--dedup_name "oscar_gl_dedup" \
--column "text" \
--ngram 6 \
--bit_diff 3 \
--num_bucket 4
# output
INFO All : 39.88s
INFO Loading : 4.45s
INFO Simhash : 1.91s
INFO Index : 5.23s
INFO Save Index : 1.44s
INFO Query : 6.57s
INFO Clustering : 16.42s
INFO Deduplicate : 0.72s
INFO Save : 3.11s
INFO Before : 88803
INFO After : 46659
INFO Index : output/simhash/index.pkl
INFO Graph : output/simhash/graph.networkit
INFO Output : output/simhash/oscar_gl_dedup
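As a rough illustration of how SimHash deduplication can work, the sketch below computes a 64-bit fingerprint from character n-grams and treats two documents as near duplicates when their fingerprints differ in at most `bit_diff` bits. Reading --num_bucket as the number of fingerprint blocks used for indexing is an assumption, as are the function names and the greedy "keep unless close to a kept document" policy; the script itself clusters matches instead.

```python
import hashlib
from collections import defaultdict


def simhash(text, ngram=6, bits=64):
    # Each n-gram votes +1/-1 on every bit; the sign of the tally is the fingerprint bit.
    grams = [text[i:i + ngram] for i in range(max(1, len(text) - ngram + 1))]
    counts = [0] * bits
    for g in grams:
        h = int.from_bytes(hashlib.sha1(g.encode("utf-8")).digest()[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def simhash_dedup(texts, ngram=6, bit_diff=3, num_bucket=4, bits=64):
    # With num_bucket > bit_diff, two fingerprints within bit_diff bits of each
    # other must agree exactly on at least one block (pigeonhole principle).
    block = bits // num_bucket
    mask = (1 << block) - 1
    buckets = defaultdict(list)
    kept, kept_prints = [], []
    for text in texts:
        fp = simhash(text, ngram, bits)
        keys = [(b, (fp >> (b * block)) & mask) for b in range(num_bucket)]
        candidates = {j for k in keys for j in buckets[k]}
        # Verify candidates with the actual Hamming distance.
        if any(bin(fp ^ kept_prints[j]).count("1") <= bit_diff for j in candidates):
            continue
        for k in keys:
            buckets[k].append(len(kept))
        kept.append(text)
        kept_prints.append(fp)
    return kept


docs = [
    "hello world, this is a test",
    "hello world, this is a test",
    "an entirely unrelated sentence about cats",
]
print(simhash_dedup(docs))
```

The block index avoids comparing every pair of fingerprints: only documents sharing at least one identical block are checked.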
Exact Hash Exact Deduplication
# input
python -m text_dedup.exact_hash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output_dir "output/exact_hash" \
--dedup_name "oscar_gl_dedup" \
--column "text"
# output
INFO All : 5.34s
INFO Loading : 4.48s
INFO Processing : 0.73s
INFO Filtering : 0.07s
INFO Saving : 0.05s
INFO Before : 88803
INFO After : 47049
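Exact hash deduplication is the simplest of the methods: hash every document and keep only the first document seen for each digest. A minimal sketch follows, assuming SHA-256 as the hash and a plain list of strings as input; the script's actual hash function is not stated here, and `exact_dedup` is a hypothetical name.

```python
import hashlib


def exact_dedup(texts):
    """Keep the first occurrence of each distinct text, preserving order."""
    seen = set()
    kept = []
    for text in texts:
        # Storing digests instead of full texts keeps memory use bounded per document.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


print(exact_dedup(["a b c", "a b c", "a b d"]))
# → ['a b c', 'a b d']
```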
Bloom Filter Exact Deduplication
# input
python -m text_dedup.bloom_filter \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output_dir "output/bloom_filter" \
--dedup_name "oscar_gl_dedup" \
--error_rate 1e-5 \
--column "text"
# output
INFO All : 10.69s
INFO Loading : 4.44s
INFO Processing : 6.13s
INFO Filtering : 0.07s
INFO Saving : 0.05s
INFO Before : 88803
INFO After : 47045
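A Bloom filter trades a small false-positive rate for a fixed memory budget: an unseen document can occasionally be reported as already seen and dropped. That would be consistent with this run keeping 47045 documents against 47049 for exact hashing, although the cause is not stated here. The sketch below is a self-contained toy filter, not the library the script uses; the class, the sizing formulas applied to `error_rate`, and the double-hashing scheme are assumptions for illustration.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, capacity, error_rate):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(1, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


def bloom_dedup(texts, error_rate=1e-5):
    # Size the filter for the number of input documents.
    bf = BloomFilter(capacity=max(1, len(texts)), error_rate=error_rate)
    return [t for t in texts if not bf.add(t)]


print(bloom_dedup(["a b c", "a b c", "a b d"]))
```

A lower `error_rate` needs more bits per document; a true duplicate is never kept, so errors only ever remove extra documents.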
Documentation
- TODO
Roadmap
- Memory benchmark for streaming processing
- Speed benchmark for in-memory processing
- Inter-dataset deduplication
- Rewrite suffix array in Python
- A collection of deduplication methods used in papers/datasets/projects
- SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching
FAQ
Why use scripts instead of OOD classes and functions?
Early versions of the code used an object-oriented design for hashing and indexing, which was very difficult because the different methods share little to no abstraction. To compile something useful, a lot of wrapper code was needed, and that actually increased the overhead of using this library. Additionally, deduplication is often a one-time step in a data preprocessing pipeline, so there isn't really a need for inline access.
Why license change?
Because the Google repo is licensed under Apache 2.0, I have to update from MIT. Until that part of the code is completely re-implemented, Apache 2.0 will be the license I use.
License
Built Distribution
Hashes for text_dedup-0.3.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 9e45256033c0ee3a728686b297139af2243bc76d412c857ea8840729930c9343
MD5 | d432b1b2613561adc2f09046f88c0c79
BLAKE2b-256 | 19ca28d3b5dd15154ebf24a8618523febc277b83b8834487175ca73ac0a40fc7