All-in-one text deduplication
Project description
text-dedup
A collection of data deduplication scripts.
Features
- A ready-to-use, easily modified, single script for each method:
- MinHash + MinHashLSH
- SimHash
- SuffixArray Substring
- Bloom Filter
- Exact Hash
Acknowledgements
- Datasketch (MIT)
- simhash-py and simhash-cpp (MIT)
- Deduplicating Training Data Makes Language Models Better (Apache 2.0)
- BigScience (Apache 2.0)
- BigCode (Apache 2.0)
Quick Examples
In this section, we deduplicate a single dataset: the gl (Galician) subset of oscar-corpus/OSCAR-2201.
Suffix Array Substring Exact Deduplication
# input
python -m text_dedup.suffix_array \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output_dir "output/suffix_array" \
--index_name "lsh.pkl" \
--graph_name "graph.networkit" \
--dedup_name "oscar_gl_dedup" \
--column "text" \
--google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets"
# output
INFO All : 131.93s
INFO Loading : 4.36s
INFO Preprocessing : 4.81s
INFO Suffix Array : 101.79s
INFO Collect : 5.17s
INFO Restore : 0.27s
INFO Deduplicate : 13.00s
INFO Saving : 2.52s
INFO Before : 180332342 bytes (88803)
INFO After : 97646271 bytes (40404)
INFO Output : output/suffix_array/oscar_gl_dedup
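Unlike the other methods, which drop whole duplicate documents, this one removes duplicated byte spans; the --google_repo_path flag must point to a local clone of Google's deduplicate-text-datasets repository, which performs the actual suffix-array work. The following is only a toy sketch of the underlying idea (sort all suffixes, then report long common prefixes of neighboring suffixes as repeated substrings); it is quadratic and assumes one small in-memory string, nothing like the real implementation:

# Toy illustration of suffix-array substring matching; NOT the script's real
# algorithm, which shells out to Google's deduplicate-text-datasets.
def repeated_substrings(text: str, min_len: int = 15) -> set[str]:
    # Sort all suffix start positions lexicographically (naive O(n^2 log n)).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    dups = set()
    # Two suffixes sharing a prefix of >= min_len characters end up adjacent
    # in sorted order, so checking neighbors finds every long repeated span.
    for a, b in zip(sa, sa[1:]):
        lcp = 0
        while (a + lcp < len(text) and b + lcp < len(text)
               and text[a + lcp] == text[b + lcp]):
            lcp += 1
        if lcp >= min_len:
            dups.add(text[a:a + lcp])
    return dups

corpus = "foo this sentence occurs twice verbatim bar this sentence occurs twice verbatim baz"
print(repeated_substrings(corpus))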
MinHash Near Deduplication
# input
python -m text_dedup.minhash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output_dir "output/minhash" \
--index_name "lsh.pkl" \
--graph_name "graph.networkit" \
--dedup_name "oscar_gl_dedup" \
--column "text" \
--ngram 1 \
--num_perm 128 \
--threshold 0.8 \
--seed 42
# output
INFO All : 52.73s
INFO Loading : 5.32s
INFO Minhash : 12.82s
INFO Index : 8.54s
INFO Save Index : 3.86s
INFO Query : 4.49s
INFO Clustering : 17.47s
INFO Deduplicate : 0.05s
INFO Save : 0.04s
INFO Before : 88803
INFO After : 43971
INFO Index : output/minhash/lsh.pkl
INFO Graph : output/minhash/graph.networkit
INFO Output : output/minhash/oscar_gl_dedup
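Under the hood, each document is shingled into word n-grams, compressed into a MinHash signature of --num_perm permutations, and candidate pairs above the Jaccard --threshold are found via locality-sensitive hashing; the script then builds a duplicate graph and clusters it (hence the graph.networkit output). The sketch below shows the same flow with datasketch (the dependency acknowledged above) but takes a simpler greedy keep-first pass instead of graph clustering:

from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128, ngram: int = 1) -> MinHash:
    # Hash each word n-gram into the MinHash signature.
    tokens = text.split()
    m = MinHash(num_perm=num_perm)
    for i in range(len(tokens) - ngram + 1):
        m.update(" ".join(tokens[i:i + ngram]).encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps right over the lazy dog",  # near-duplicate
    "completely unrelated text about something else",
]
kept = []
for i, doc in enumerate(docs):
    m = signature(doc)
    if not lsh.query(m):  # no similar document indexed yet
        lsh.insert(str(i), m)
        kept.append(doc)
print(kept)  # the near-duplicate is (very likely) dropped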
SimHash Near Deduplication
# input
python -m text_dedup.simhash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output_dir "output/simhash" \
--index_name "index.pkl" \
--graph_name "graph.networkit" \
--dedup_name "oscar_gl_dedup" \
--column "text" \
--ngram 6 \
--bit_diff 3 \
--num_bucket 4
# output
INFO All : 39.88s
INFO Loading : 4.45s
INFO Simhash : 1.91s
INFO Index : 5.23s
INFO Save Index : 1.44s
INFO Query : 6.57s
INFO Clustering : 16.42s
INFO Deduplicate : 0.72s
INFO Save : 3.11s
INFO Before : 88803
INFO After : 46659
INFO Index : output/simhash/index.pkl
INFO Graph : output/simhash/graph.networkit
INFO Output : output/simhash/oscar_gl_dedup
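SimHash compresses each document into a single fingerprint such that similar texts differ in only a few bits; two documents count as near-duplicates when their fingerprints are within --bit_diff bits of Hamming distance, and --num_bucket controls how the fingerprint is split into blocks for indexing. A minimal 64-bit fingerprint sketch over character n-grams (an assumption for illustration; the script's feature extraction may differ):

import hashlib

def simhash(text: str, ngram: int = 6, bits: int = 64) -> int:
    # Each character n-gram votes +1/-1 on every bit of its hash;
    # the fingerprint keeps the sign of the per-bit totals.
    counts = [0] * bits
    for i in range(max(1, len(text) - ngram + 1)):
        h = int.from_bytes(
            hashlib.md5(text[i:i + ngram].encode("utf-8")).digest()[:8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

def bit_diff(a: int, b: int) -> int:
    # Hamming distance between two fingerprints.
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
print(bit_diff(a, b))  # small for near-duplicates, ~32 for unrelated text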
Exact Hash Exact Deduplication
# input
python -m text_dedup.exact_hash \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output_dir "output/exact_hash" \
--dedup_name "oscar_gl_dedup" \
--column "text"
# output
INFO All : 5.34s
INFO Loading : 4.48s
INFO Processing : 0.73s
INFO Filtering : 0.07s
INFO Saving : 0.05s
INFO Before : 88803
INFO After : 47049
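Exact-hash deduplication keys each document by a cryptographic digest of its text and keeps only the first occurrence, so only byte-identical documents are removed. The core idea in a few lines (a sketch using MD5 as an example digest; the script's choice of hash function may differ):

import hashlib

def dedup_exact(docs: list[str]) -> list[str]:
    seen = set()
    kept = []
    for doc in docs:
        # Fixed-size digests keep memory bounded vs. storing full texts.
        digest = hashlib.md5(doc.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

print(dedup_exact(["a", "b", "a"]))  # ['a', 'b']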
Bloom Filter Exact Deduplication
# input
python -m text_dedup.bloom_filter \
--path "oscar-corpus/OSCAR-2201" \
--name "gl" \
--split "train" \
--cache_dir "./cache" \
--output_dir "output/bloom_filter" \
--dedup_name "oscar_gl_dedup" \
--error_rate 1e-5 \
--column "text"
# output
INFO All : 10.69s
INFO Loading : 4.44s
INFO Processing : 6.13s
INFO Filtering : 0.07s
INFO Saving : 0.05s
INFO Before : 88803
INFO After : 47045
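A Bloom filter replaces the exact-hash set with a fixed-size bit array, trading a small, tunable false-positive rate (--error_rate, i.e. the chance a unique document is wrongly flagged as seen) for a much lower memory footprint. A self-contained sketch with the standard sizing formulas and double hashing (an illustration, not the script's implementation):

import hashlib
import math

class BloomFilter:
    def __init__(self, capacity: int, error_rate: float):
        # Standard sizing: m = -n*ln(p)/ln(2)^2 bits, k = (m/n)*ln(2) hashes.
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def add(self, item: str) -> bool:
        # Returns True if the item was (probably) seen before.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # double hashing
        seen = True
        for i in range(self.k):
            byte, bit = divmod((h1 + i * h2) % self.m, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

bf = BloomFilter(capacity=100_000, error_rate=1e-5)
print([d for d in ["a", "b", "a"] if not bf.add(d)])  # ['a', 'b']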
Documentation
- TODO
Roadmap
- Memory benchmark for streaming processing
- Speed benchmark for in-memory processing
- Inter-dataset deduplication
- Rewrite suffix array in Python
- A collection of deduplication methods used in papers/datasets/projects:
- SuperMinHash, ProbMinHash, TreeMinHash, BagMinHash, Optimal Densification for Fast and Accurate Minwise Hashing, Fast Similarity Sketching
FAQ
Why use scripts instead of object-oriented classes and functions?
Early versions of the code used an object-oriented design for hashing and indexing, which proved difficult to maintain because the different methods share little to no abstraction. Piecing them together into something useful required a lot of wrapper code, which only added overhead for users of the library. Additionally, deduplication is usually a one-time step in a data preprocessing pipeline, so there is little need for programmatic, inline access.
Why the license change?
Because Google's deduplicate-text-datasets repository is licensed under Apache 2.0, the project had to move from MIT. Until that part of the code is completely re-implemented, Apache 2.0 will remain the license.
License
Apache 2.0 (see the FAQ above)
Project details
Release history
Download files
Download the file for your platform.
Source Distribution
- text_dedup-0.3.0.tar.gz (20.0 kB, see file details below)
Built Distribution
- text_dedup-0.3.0-py3-none-any.whl (23.3 kB, see file details below)
File details
Details for the file text_dedup-0.3.0.tar.gz.
File metadata
- Download URL: text_dedup-0.3.0.tar.gz
- Upload date:
- Size: 20.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.2 CPython/3.10.6 Darwin/22.2.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fb5b754382267f93adf3f31b046ab3108fd2628b3e797df6b979a962e152f575 |
| MD5 | ca34855fa9024d2ec5885c08cbda094e |
| BLAKE2b-256 | 6708989fe1238ff8baf38d136da37d18247888897e5de5c42c14b257e2af857c |
File details
Details for the file text_dedup-0.3.0-py3-none-any.whl.
File metadata
- Download URL: text_dedup-0.3.0-py3-none-any.whl
- Upload date:
- Size: 23.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.2.2 CPython/3.10.6 Darwin/22.2.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9e45256033c0ee3a728686b297139af2243bc76d412c857ea8840729930c9343 |
| MD5 | d432b1b2613561adc2f09046f88c0c79 |
| BLAKE2b-256 | 19ca28d3b5dd15154ebf24a8618523febc277b83b8834487175ca73ac0a40fc7 |