
All-in-one text deduplication tools

Project description

Requires Python 3.12+

Installation

git clone https://github.com/ChenghaoMou/text-dedup
cd text-dedup
uv sync

Documentation

GitHub Pages

Features

This repository contains a collection of text deduplication scripts that are ready to use or to modify to fit your needs:

  • MinHash + MinHashLSH for near-duplicate detection
  • 64- or 128-bit SimHash for near-duplicate detection
  • Suffix Array substring exact deduplication
  • Bloom Filter exact deduplication

All algorithms use a config-based approach with TOML files for easy customization.
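
As an illustration of the near-duplicate idea, the sketch below builds MinHash signatures and buckets them with LSH using the third-party datasketch library. It is a toy example, not this repository's internal implementation; the corpus, keys, and word-level tokenization are made up for demonstration (the configs below shingle with n-grams instead).

from datasketch import MinHash, MinHashLSH

# Hypothetical toy corpus; real runs read documents from the configured input.
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over the lazy dog",
    "c": "an entirely unrelated sentence about deduplication",
}

def signature(text: str, num_perm: int = 128) -> MinHash:
    # Hash each token into the MinHash signature (word tokens for brevity).
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.7, num_perm=128)  # Jaccard threshold, cf. config.toml
sigs = {key: signature(text) for key, text in docs.items()}
for key, sig in sigs.items():
    lsh.insert(key, sig)

for key, sig in sigs.items():
    print(key, "->", [k for k in lsh.query(sig) if k != key])

Documents that land in the same bucket (in this toy example, "a" and "b" will typically collide) are candidate near-duplicates; a deduplication pass keeps one representative per cluster.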

Quick Start

All deduplication scripts read from a config.toml file in the project root.

1. Configure your settings

Edit config.toml with your input data and algorithm settings:

MinHash Near Deduplication

[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "minhash"
text_column = "text"
seed = 42
batch_size = 10000
num_perm = 240
threshold = 0.7
false_positive_weight = 0.5
false_negative_weight = 0.5
hash_bits = 64
ngram_size = 5
check_false_positive = true

[output]
output_dir = "output"
clean_cache = false
save_clusters = true

[debug]
enable_profiling = false
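
For MinHash, num_perm, threshold, false_positive_weight, and false_negative_weight together determine how signatures are split into LSH bands and rows. The sketch below shows the standard way such band/row parameters are derived; it is an assumption about how these settings interact, not a copy of this repository's code.

from scipy.integrate import quad

def optimal_param(threshold: float, num_perm: int,
                  fp_weight: float = 0.5, fn_weight: float = 0.5) -> tuple[int, int]:
    """Pick (bands, rows) minimizing the weighted false positive/negative area."""
    def collision_prob(s: float, b: int, r: int) -> float:
        # Probability that two documents with Jaccard similarity s share a band.
        return 1.0 - (1.0 - s ** r) ** b

    best, best_err = (1, num_perm), float("inf")
    for b in range(1, num_perm + 1):
        r = num_perm // b
        fp, _ = quad(lambda s: collision_prob(s, b, r), 0.0, threshold)
        fn, _ = quad(lambda s: 1.0 - collision_prob(s, b, r), threshold, 1.0)
        err = fp_weight * fp + fn_weight * fn
        if err < best_err:
            best, best_err = (b, r), err
    return best

print(optimal_param(threshold=0.7, num_perm=240))

Raising threshold or false_positive_weight pushes the search toward more rows per band (stricter buckets); raising false_negative_weight pushes it toward more bands (more lenient buckets).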
SimHash Near Deduplication

[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "simhash"
text_column = "text"
hash_bits = 64
ngram_size = 3
bit_diff = 3

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
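
As a rough illustration of what hash_bits, ngram_size, and bit_diff control, here is a self-contained SimHash sketch with a pairwise Hamming-distance check. The tokenization and hashing choices are simplified assumptions, and the all-pairs comparison is for illustration only.

import hashlib

def simhash(text: str, hash_bits: int = 64, ngram_size: int = 3) -> int:
    # Weighted bit-voting over word n-grams (simplified tokenization).
    tokens = text.split()
    ngrams = [" ".join(tokens[i:i + ngram_size])
              for i in range(max(1, len(tokens) - ngram_size + 1))]
    votes = [0] * hash_bits
    for gram in ngrams:
        h = int.from_bytes(hashlib.md5(gram.encode()).digest()[:hash_bits // 8], "big")
        for bit in range(hash_bits):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(hash_bits) if votes[bit] > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

x = simhash("the quick brown fox jumps over the lazy dog")
y = simhash("the quick brown fox jumped over the lazy dog")
print(hamming(x, y) <= 3)  # near-duplicate if fingerprints differ in at most bit_diff bits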
Bloom Filter Exact Deduplication

[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "bloom_filter"
text_column = "text"
error_rate = 1e-5
expected_elements = 100000

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
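
error_rate and expected_elements size the Bloom filter via the standard formulas for an optimal filter; the snippet below is a back-of-the-envelope sketch of that sizing, not the repository's internals.

import math

def bloom_parameters(expected_elements: int, error_rate: float) -> tuple[int, int]:
    # m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions
    m = math.ceil(-expected_elements * math.log(error_rate) / (math.log(2) ** 2))
    k = max(1, round((m / expected_elements) * math.log(2)))
    return m, k

m, k = bloom_parameters(expected_elements=100_000, error_rate=1e-5)
print(f"{m} bits (~{m / 8 / 1024:.0f} KiB), {k} hash functions")

Exact deduplication then hashes each document's text, drops the document if the filter already (probably) contains that hash, and inserts the hash otherwise; error_rate bounds the chance of wrongly dropping a unique document.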
Suffix Array Substring Exact Deduplication

[input]
input_type = "local_files"
file_type = "parquet"

[input.read_arguments]
path = "data/your_data"
split = "train"

[algorithm]
algorithm_name = "suffix_array"
text_column = "text"
google_repo_path = "third_party/deduplicate-text-datasets"
merge_strategy = "longest"
length_threshold = 100
cache_dir = ".cache"

[output]
output_dir = "output"
clean_cache = false

[debug]
enable_profiling = false
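
Suffix array deduplication targets repeated substrings rather than whole documents: any substring of at least length_threshold bytes that occurs more than once in the corpus is a candidate for removal. The heavy lifting is delegated to the Rust tooling under google_repo_path (Google's deduplicate-text-datasets); the toy sketch below only illustrates the underlying idea and is far too slow for real corpora.

def repeated_substrings(corpus: str, length_threshold: int = 15) -> set[str]:
    # Toy suffix array: sort all suffixes, then check longest common
    # prefixes of lexicographically adjacent suffixes.
    suffixes = sorted(range(len(corpus)), key=lambda i: corpus[i:])
    repeats = set()
    for a, b in zip(suffixes, suffixes[1:]):
        lcp = 0
        while (a + lcp < len(corpus) and b + lcp < len(corpus)
               and corpus[a + lcp] == corpus[b + lcp]):
            lcp += 1
        if lcp >= length_threshold:
            repeats.add(corpus[a:a + lcp])
    return repeats

corpus = "large language models love data. large language models love data, truly."
print(max(repeated_substrings(corpus), key=len))

In the config above, length_threshold plays the same role as here, and merge_strategy (as the name suggests) governs how overlapping duplicate spans are consolidated before removal.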

2. Run the deduplication

# MinHash
python -m text_dedup.minhash

# SimHash
python -m text_dedup.simhash

# Bloom Filter
python -m text_dedup.bloom_filter

# Suffix Array
python -m text_dedup.suffix_array

Benchmarks

pinecone/core-2020-05-10-deduplication
| Algorithm | Precision (Duplicates) | Recall (Duplicates) | Precision (Non-Duplicates) | Recall (Non-Duplicates) | Macro F1 | Accuracy | Time |
|---|---|---|---|---|---|---|---|
| MinHash | 0.9587 | 0.9416 | 0.9450 | 0.9611 | 0.9518 | 0.9277 | 11.09s |
| SimHash | 0.9038 | 0.7323 | 0.7993 | 0.9318 | 0.8515 | 0.8375 | 626.11s |
| Exact Title Matching [^1] | 0.830 | 0.50 | 0.709 | 0.992 | 0.757 | 0.746 | - |
| Simhash Matching [^1] | 0.697 | 0.247 | 0.598 | 0.985 | 0.631 | 0.616 | - |
| Document Vector Similarity [^1] | 0.912 | 0.779 | 0.861 | 0.986 | 0.885 | 0.883 | - |
| Hybrid Method [^1] | 0.908 | 0.828 | 0.899 | 0.979 | 0.904 | 0.903 | - |
| LaBSE [^2] | 0.937 | 0.923 | 0.930 | 0.943 | 0.933 | 0.919 | - |
| Multilingual USE [^2] | 0.917 | 0.907 | 0.918 | 0.927 | 0.917 | 0.909 | - |
| Multilingual E5-Base [^2] | 0.931 | 0.908 | 0.919 | 0.939 | 0.924 | 0.920 | - |
| MinHash + LSH [^2] | 0.929 | 0.902 | 0.915 | 0.938 | 0.921 | 0.918 | - |
| RETSim Partial-Dup [^2] | 0.945 | 0.941 | 0.945 | 0.949 | 0.945 | 0.928 | - |
| RETSim Near-Dup [^2] | 0.928 | 0.937 | 0.942 | 0.934 | 0.935 | 0.926 | - |

NEWS-COPY

Adjusted Rand Index (ARI) on NEWS-COPY dataset:

| Model/Algorithm | ARI | Time |
|---|---|---|
| MinHash | 0.7293 | 3.01s |
| SimHash | 0.6463 | 140.03s |
| n-gram [^3] | 0.440 | - |
| SimHash [^2] | 0.695 | - |
| MinHash [^3] | 0.737 | - |
| MinHash [^2] | 0.783 | - |
| Multilingual USE [^2] | 0.730 | - |
| Multilingual E5-Base [^2] | 0.742 | - |
| S-BERT [^3] | 0.700 | - |
| RETSim Partial-Dup [^2] | 0.831 | - |
| RETSim Near-Dup [^2] | 0.704 | - |
| Re-ranking [^3] | 0.937 | - |
| Bi-encoder [^3] | 0.915 | - |
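
For context, ARI compares the duplicate clusters produced by an algorithm against the gold clusters in NEWS-COPY; 1.0 means perfect agreement and values near 0 mean chance-level agreement. A minimal sketch of the metric with scikit-learn (hypothetical labels; the benchmark harness may compute it differently):

from sklearn.metrics import adjusted_rand_score

gold = [0, 0, 1, 1, 2]       # hypothetical gold cluster id per article
predicted = [0, 0, 1, 2, 2]  # hypothetical cluster ids from a dedup run
print(adjusted_rand_score(gold, predicted))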

Running Benchmarks

You can reproduce the benchmark results using the provided benchmark suite.

Quick Start with Just

# Run all benchmarks (both datasets, all algorithms)
just benchmark-all

# Run only CORE dataset benchmarks
just benchmark-core

# Run only NEWS-COPY dataset benchmarks
just benchmark-news

# Run specific algorithm on specific dataset
just benchmark-core-minhash
just benchmark-core-simhash
just benchmark-news-minhash
just benchmark-news-simhash

Configuration Files

Benchmark configuration files are located in configs/:

  • benchmark_core_minhash.toml - MinHash on CORE dataset
  • benchmark_core_simhash.toml - SimHash on CORE dataset
  • benchmark_news_minhash.toml - MinHash on NEWS-COPY dataset
  • benchmark_news_simhash.toml - SimHash on NEWS-COPY dataset

To customize benchmark parameters, edit the config files and adjust hyperparameters like num_perm, threshold, ngram_size, or bit_diff.

[^1]: Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings
[^2]: RETSim: Resilient and Efficient Text Similarity
[^3]: Noise-Robust De-Duplication at Scale

License

Apache 2.0

Citations

Generally, you can cite this repository as:

@software{chenghao_mou_2023_8364980,
  author       = {Chenghao Mou and
                  Chris Ha and
                  Kenneth Enevoldsen and
                  Peiyuan Liu},
  title        = {ChenghaoMou/text-dedup: Reference Snapshot},
  month        = sep,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {2023.09.20},
  doi          = {10.5281/zenodo.8364980},
  url          = {https://doi.org/10.5281/zenodo.8364980}
}

Acknowledgements

This repository is inspired by several existing projects and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!

Download files

Download the file for your platform.

Source Distribution

text_dedup-0.4.1.tar.gz (237.1 kB)

Uploaded Source

Built Distribution


text_dedup-0.4.1-py3-none-any.whl (39.7 kB)

Uploaded Python 3

File details

Details for the file text_dedup-0.4.1.tar.gz.

File metadata

  • Download URL: text_dedup-0.4.1.tar.gz
  • Upload date:
  • Size: 237.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.4

File hashes

Hashes for text_dedup-0.4.1.tar.gz
  • SHA256: 531a9f641e452e7e8427a98fa5ba8a739079de73da927a50997bcc71d4555d6a
  • MD5: f3cf6527a3938bb8b55bd95d4f181738
  • BLAKE2b-256: d47e8f1fa24d222c038f2bca063166c39b743c2225a0eafaefcbf2e6030bb1bf


File details

Details for the file text_dedup-0.4.1-py3-none-any.whl.

File metadata

  • Download URL: text_dedup-0.4.1-py3-none-any.whl
  • Upload date:
  • Size: 39.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.8.4

File hashes

Hashes for text_dedup-0.4.1-py3-none-any.whl
  • SHA256: 95553dd81b2850ee6c984097f01289dc4a229de4920cee304f76da133a0f5539
  • MD5: f0dc70011e7059d1887d11c9ce05e0b6
  • BLAKE2b-256: 9045431e3e169061d0d4ecd5693e6b762aadf4a3ee93e75b0bcfbdcb7410fb7f

