All-in-one text deduplication tools
Installation
```bash
git clone https://github.com/ChenghaoMou/text-dedup
cd text-dedup
uv sync
```
Documentation
Features
This repository contains a collection of text deduplication scripts that are ready to use out of the box or easy to modify for your needs:
- MinHash + MinHashLSH for near-duplicate detection
- 64- or 128-bit SimHash
- Suffix Array substring exact deduplication
- Bloom Filter exact deduplication
All algorithms use a config-based approach with TOML files for easy customization.
Quick Start
All deduplication scripts read from a config.toml file in the project root.
1. Configure your settings
Edit config.toml with your input data and algorithm settings:
MinHash Near Deduplication
```toml
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "minhash"
text_column = "text"
seed = 42
batch_size = 10000
num_perm = 240
threshold = 0.7
false_positive_weight = 0.5
false_negative_weight = 0.5
hash_bits = 64
ngram_size = 5
check_false_positive = true

[output]
output_dir = "output"
clean_cache = false
save_clusters = true

[debug]
enable_profiling = false
```
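With this configuration, `threshold`, `num_perm`, `false_positive_weight`, and `false_negative_weight` together determine how the MinHash signatures are split into LSH bands and rows. The sketch below shows the standard optimal-parameter search (the approach popularized by datasketch) purely for illustration; it approximates what the script does internally rather than copying it, and assumes `scipy` is available:

```python
# Sketch: how threshold, num_perm and the FP/FN weights map to LSH (bands, rows).
# Mirrors the standard optimal-parameter search popularized by datasketch;
# the repository's own implementation may differ in details.
from scipy.integrate import quad


def optimal_param(threshold, num_perm, fp_weight=0.5, fn_weight=0.5):
    def false_positive_area(th, b, r):
        # Probability that a pair below the threshold still collides in some band.
        proba = lambda s: 1 - (1 - s ** r) ** b
        area, _ = quad(proba, 0.0, th)
        return area

    def false_negative_area(th, b, r):
        # Probability that a pair above the threshold never collides.
        proba = lambda s: (1 - s ** r) ** b
        area, _ = quad(proba, th, 1.0)
        return area

    best, min_error = (1, 1), float("inf")
    for b in range(1, num_perm + 1):
        for r in range(1, num_perm // b + 1):
            error = (fp_weight * false_positive_area(threshold, b, r)
                     + fn_weight * false_negative_area(threshold, b, r))
            if error < min_error:
                min_error, best = error, (b, r)
    return best  # (num_bands, rows_per_band)


print(optimal_param(threshold=0.7, num_perm=240))
```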
SimHash Near Deduplication
```toml
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "simhash"
text_column = "text"
hash_bits = 64
ngram_size = 3
bit_diff = 3

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
```
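Here `hash_bits` sets the fingerprint width and `bit_diff` is the maximum Hamming distance at which two fingerprints are still treated as near-duplicates. A minimal sketch of that comparison (the fingerprinting below is simplified for illustration and is not the repository's exact implementation):

```python
# Sketch: what the bit_diff setting means. Two SimHash fingerprints count as
# near-duplicates when their Hamming distance is <= bit_diff.
import hashlib


def simhash(text: str, ngram_size: int = 3, hash_bits: int = 64) -> int:
    tokens = [text[i:i + ngram_size] for i in range(max(len(text) - ngram_size + 1, 1))]
    weights = [0] * hash_bits
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[: hash_bits // 8], "big")
        for bit in range(hash_bits):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(hash_bits) if weights[bit] > 0)


def is_near_duplicate(a: int, b: int, bit_diff: int = 3) -> bool:
    return bin(a ^ b).count("1") <= bit_diff


a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumps over the lazy dog!")
print(bin(a ^ b).count("1"), is_near_duplicate(a, b))
```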
Bloom Filter Exact Deduplication
```toml
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "bloom_filter"
text_column = "text"
error_rate = 1e-5
expected_elements = 100000

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
```
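`error_rate` and `expected_elements` size the Bloom filter via the standard formulas m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions. The toy filter below only illustrates that sizing and the add-and-check pattern used for exact deduplication; it is not the repository's implementation:

```python
# Sketch: sizing a Bloom filter from error_rate and expected_elements, then
# using add() to drop exact duplicates. Illustration only.
import hashlib
import math


class TinyBloomFilter:
    def __init__(self, expected_elements: int, error_rate: float):
        self.size = max(1, int(-expected_elements * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / expected_elements * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item: str) -> bool:
        """Add item; return True if it was (probably) already present."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


bf = TinyBloomFilter(expected_elements=100_000, error_rate=1e-5)
docs = ["a duplicate line", "a unique line", "a duplicate line"]
kept = [d for d in docs if not bf.add(d)]
print(kept)  # the second occurrence of the duplicate line is dropped
```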
Suffix Array Substring Exact Deduplication
```toml
[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "suffix_array"
text_column = "text"
google_repo_path = "third_party/deduplicate-text-datasets"
merge_strategy = "longest"
length_threshold = 100
cache_dir = ".cache"

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
```
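This method follows Deduplicating Training Data Makes Language Models Better: the pipeline shells out to the Google deduplicate-text-datasets checkout at `google_repo_path` and finds substrings of at least `length_threshold` bytes that occur more than once. The naive sketch below only illustrates the underlying idea, that a sorted suffix array exposes repeated substrings; it is far too slow for real corpora:

```python
# Sketch: how a suffix array exposes repeated substrings >= length_threshold.
# Naive O(n^2 log n) illustration; the real pipeline uses the Rust tooling in
# deduplicate-text-datasets.
def repeated_substrings(corpus: str, length_threshold: int) -> set[str]:
    suffixes = sorted(range(len(corpus)), key=lambda i: corpus[i:])  # the suffix array
    repeats = set()
    for a, b in zip(suffixes, suffixes[1:]):
        # Longest common prefix of adjacent suffixes in sorted order.
        lcp = 0
        while (a + lcp < len(corpus) and b + lcp < len(corpus)
               and corpus[a + lcp] == corpus[b + lcp]):
            lcp += 1
        if lcp >= length_threshold:
            repeats.add(corpus[a:a + lcp])
    return repeats


print(repeated_substrings("the cat sat on the mat. the cat sat on the rug.", 12))
```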
2. Run the deduplication
```bash
# MinHash
python -m text_dedup.minhash

# SimHash
python -m text_dedup.simhash

# Bloom Filter
python -m text_dedup.bloom_filter

# Suffix Array
python -m text_dedup.suffix_array
```
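Each script reads `config.toml` from the project root. Because it is plain TOML, you can also inspect a config programmatically before a run; a minimal sketch using Python's standard `tomllib` (Python 3.11+), with key names taken from the examples above:

```python
# Sketch: reading the same config.toml the scripts use. Key names follow the
# examples in this README; the scripts' internal loader may differ.
import tomllib

with open("config.toml", "rb") as f:
    config = tomllib.load(f)

algorithm = config["algorithm"]["algorithm_name"]  # e.g. "minhash"
text_column = config["algorithm"]["text_column"]   # e.g. "text"
output_dir = config["output"]["output_dir"]        # e.g. "output"
print(f"Running {algorithm} on column {text_column!r}, writing to {output_dir}/")
```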
Benchmarks
pinecone/core-2020-05-10-deduplication
| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non Duplicates) | Recall (Non Duplicates) | Macro F1 score | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| MinHash | 0.9587 | 0.9416 | 0.9450 | 0.9611 | 0.9518 | 0.9277 | 11.09s |
| SimHash | 0.9038 | 0.7323 | 0.7993 | 0.9318 | 0.8515 | 0.8375 | 626.11s |
| Exact Title Matching [^1] | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
| Simhash Matching [^1] | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
| Document Vector Similarity [^1] | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
| Hybrid Method [^1] | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
| LaBSE[^2] | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
| Multilingual USE[^2] | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
| Multilingual E5-Base[^2] | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
| MinHash + LSH[^2] | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
| RETSim Partial-Dup[^2] | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | 0.928 | - |
| RETSim Near-Dup[^2] | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | 0.926 | - |
NEWS-COPY
Adjusted Rand Index (ARI) on NEWS-COPY dataset:
| Model/Algorithm | ARI | Time |
|---|---|---|
| MinHash | 0.7293 | 3.01s |
| SimHash | 0.6463 | 140.03s |
| n-gram [^3] | 0.440 | - |
| SimHash[^2] | 0.695 | - |
| MinHash[^3] | 0.737 | - |
| MinHash[^2] | 0.783 | - |
| Multilingual USE[^2] | 0.730 | - |
| Multilingual E5-Base[^2] | 0.742 | - |
| S-BERT[^3] | 0.700 | - |
| RETSim Partial-Dup[^2] | 0.831 | - |
| RETSim Near-Dup[^2] | 0.704 | - |
| Re-ranking [^3] | 0.937 | - |
| Bi-encoder [^3] | 0.915 | - |
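For context, ARI measures the agreement between the predicted duplicate clusters and the gold clusters, where 1.0 means perfect agreement and values near 0 mean chance-level agreement. A minimal illustration with scikit-learn and made-up labels:

```python
# Sketch: how ARI compares predicted duplicate clusters against gold clusters.
# Labels are hypothetical; each document gets a cluster id.
from sklearn.metrics import adjusted_rand_score

gold_clusters = [0, 0, 1, 1, 2, 2, 2]       # ground-truth duplicate groups
predicted_clusters = [0, 0, 1, 2, 2, 2, 2]  # clusters produced by a dedup run

print(adjusted_rand_score(gold_clusters, predicted_clusters))  # 1.0 = perfect agreement
```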
Running Benchmarks
You can reproduce the benchmark results using the provided benchmark suite.
Quick Start with Just
```bash
# Run all benchmarks (both datasets, all algorithms)
just benchmark-all

# Run only CORE dataset benchmarks
just benchmark-core

# Run only NEWS-COPY dataset benchmarks
just benchmark-news

# Run specific algorithm on specific dataset
just benchmark-core-minhash
just benchmark-core-simhash
just benchmark-news-minhash
just benchmark-news-simhash
```
Configuration Files
Benchmark configuration files are located in `configs/`:
- `benchmark_core_minhash.toml` - MinHash on CORE dataset
- `benchmark_core_simhash.toml` - SimHash on CORE dataset
- `benchmark_news_minhash.toml` - MinHash on NEWS-COPY dataset
- `benchmark_news_simhash.toml` - SimHash on NEWS-COPY dataset
To customize benchmark parameters, edit the config files and adjust hyperparameters like `num_perm`, `threshold`, `ngram_size`, or `bit_diff`.
[^1]: Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings
[^2]: RETSim: Resilient and Efficient Text Similarity
[^3]: Noise-Robust De-Duplication at Scale
License
Citations
Generally, you can cite this repository as:
```bibtex
@software{chenghao_mou_2023_8364980,
  author    = {Chenghao Mou and
               Chris Ha and
               Kenneth Enevoldsen and
               Peiyuan Liu},
  title     = {ChenghaoMou/text-dedup: Reference Snapshot},
  month     = sep,
  year      = 2023,
  publisher = {Zenodo},
  version   = {2023.09.20},
  doi       = {10.5281/zenodo.8364980},
  url       = {https://doi.org/10.5281/zenodo.8364980}
}
```
Acknowledgements
This repository is inspired by the following projects and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!
- Datasketch (MIT)
- simhash-py and simhash-cpp (MIT)
- Deduplicating Training Data Makes Language Models Better (Apache 2.0)
- Gaoya (MIT)