
All-in-one text deduplication



Features

This repository contains a collection of text deduplication scripts that are ready to use or to modify for your own needs:

  • MinHash + MinHashLSH, including a Spark implementation suitable for large (TB-scale) datasets
  • 64- or 128-bit SimHash
  • SuffixArray Substring
  • Bloom Filter
  • Exact Hash

I also have big plans for the future of this project.

However, I do not intend to build a general-purpose deduplication library, which was the goal of this repo early on, and I will gradually retire the PyPI package as well. The reason is that each use case can be wildly different and requires careful design and consideration. I sincerely encourage you to read the scripts first (they are relatively short) so you understand what is at stake when using them. You can use them to bootstrap your own script, or just use them as a reference.

Acknowledgements

This repository is inspired by several existing deduplication projects, and is heavily influenced by lessons learned from my own participation in BigScience (Apache 2.0) and BigCode (Apache 2.0). There is a blog post about the journey. Feedback is welcome!

Quick Examples

PySpark with DataProc

Not many people have access to enough compute resources, or the need, to deduplicate TB-scale datasets, but if you do, this is a good example of how to run the Spark script with GCP DataProc.

MODIFY text_dedup/minhash_spark.py FOR YOUR OWN PROJECT AND DATASET FIRST!

export CLUSTER_NAME=chenghao-temp
export PROJECT_ID=xx

gcloud dataproc clusters create $CLUSTER_NAME \
    --enable-component-gateway \
    --region us-central1 \
    --zone us-central1-a \
    --master-machine-type c2d-standard-16 \
    --master-boot-disk-size 500 \
    --num-workers 10 \
    --worker-machine-type c2d-standard-16 \
    --worker-boot-disk-size 500 \
    --image-version 2.0-debian10 \
    --project $PROJECT_ID

gcloud dataproc jobs submit pyspark --cluster ${CLUSTER_NAME} \
    --region us-central1 \
    --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    --driver-log-levels root=WARN \
    --properties="spark.executor.memory"="50g","spark.driver.memory"="8g","spark.executor.cores"="14" \
    minhash_spark.py -- \
    --table "huggingface-science-codeparrot.the_stack_java.java" \
    --output "gs://chenghao-data/dataproc_output/deduplicated"

For reference, the script finished deduplicating 42 million rows in less than 40 minutes with the above settings (160 cores, 640GB memory in total), while the Python version would take around 10 hours on an 80-core machine with 1.8TB of memory.
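
If you only want the intuition behind the Spark job, the core trick is MinHash banding: hash each document's n-gram shingles under many salted hash functions, keep the minimum per function, split the resulting signature into bands, and cluster documents that share an identical band. Below is a minimal, self-contained Python sketch of that idea; the shingle size, number of permutations, and band layout are illustrative and are not the values used by minhash_spark.py.

# Minimal sketch of MinHash banding (illustration only, not minhash_spark.py).
import hashlib
import re
from collections import defaultdict

NUM_PERM = 64   # number of hash "permutations" (illustrative value)
B, R = 32, 2    # bands x rows per band; B * R == NUM_PERM

def ngrams(text, n=3):
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def minhash_signature(text):
    # One salted hash function per "permutation"; keep the minimum over all shingles.
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{p}:{g}".encode()).digest()[:8], "big")
            for g in ngrams(text)
        )
        for p in range(NUM_PERM)
    ]

def band_keys(signature):
    # Split the signature into B bands of R rows; each band becomes a bucket key.
    for b in range(B):
        yield (b, tuple(signature[b * R:(b + 1) * R]))

docs = {
    0: "the quick brown fox jumps over the lazy dog",
    1: "the quick brown fox jumps over the lazy cat",
    2: "an entirely different paragraph about spark clusters and bigquery",
}

buckets = defaultdict(list)
for doc_id, text in docs.items():
    for key in band_keys(minhash_signature(text)):
        buckets[key].append(doc_id)

# Documents sharing at least one band bucket are near-duplicate candidates.
candidates = {tuple(sorted(ids)) for ids in buckets.values() if len(ids) > 1}
print(candidates)  # docs 0 and 1 should collide: {(0, 1)}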

In the following sections, we deduplicate one dataset: the gl subset of oscar-corpus/OSCAR-2201.

Suffix Array Substring Exact Deduplication

# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/suffix_array/oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets"

# output
INFO     Loading                       : 2.75 seconds
INFO     Preprocessing                 : 4.78 seconds
INFO     SuffixArray                   : 98.29 seconds
INFO     SelfSimilar                   : 4.24 seconds
INFO     Restore                       : 0.25 seconds
INFO     Deduplicate                   : 6.23 seconds
INFO     Saving                        : 8.91 seconds
INFO     Total                         : 125.45 seconds
INFO     Before                        : 180332342 bytes (88803)
INFO     After                         : 97646271 bytes (40404)
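
The suffix-array method targets exact substring duplication: it builds a suffix array over the byte-level corpus and removes long substrings that occur more than once, delegating the heavy lifting to the deduplicate-text-datasets checkout passed via --google_repo_path. The toy sketch below only illustrates the underlying idea, finding repeated substrings as long common prefixes of adjacent suffixes in a naive sorted suffix array; min_len is an illustrative threshold, not the one used by the script.

# Toy illustration of suffix-array substring detection (not the production pipeline).
def repeated_substrings(text, min_len=20):
    """Return substrings of length >= min_len that occur at least twice in text."""
    # Naive suffix array: sort all suffixes lexicographically (fine for toy inputs).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    found = set()
    # A repeated substring shows up as a common prefix of two adjacent sorted suffixes.
    for a, b in zip(sa, sa[1:]):
        lcp = 0
        while (a + lcp < len(text) and b + lcp < len(text)
               and text[a + lcp] == text[b + lcp]):
            lcp += 1
        if lcp >= min_len:
            found.add(text[a:a + lcp])
    return found

corpus = "hello world, this sentence repeats. hello world, this sentence repeats."
print(max(repeated_substrings(corpus), key=len))  # hello world, this sentence repeats.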

MinHash Near Deduplication

# input
python -m text_dedup.minhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/minhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000

# output
INFO     Loading                         : 2.62 seconds
INFO     MinHashing                      : 0.08 seconds
INFO     Clustering                      : 2.20 seconds
INFO     Filtering                       : 0.53 seconds
INFO     Saving                          : 9.86 seconds
INFO     Total                           : 15.29 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 44124 (49.69%)
INFO     Duplicate Number                : 44679 (50.31%)
INFO     🤗 Happy Deduplicating 🤗
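
To spot-check the result, the output directory can be loaded back as a Hugging Face dataset. This assumes your copy of the script saves with datasets' save_to_disk; adjust the loading code if you changed the saving logic.

# Quick sanity check of the deduplicated output (assumes a save_to_disk layout).
from datasets import load_from_disk

ds = load_from_disk("output/minhash/oscar_gl_dedup")
print(ds)                    # remaining rows and column names
print(ds[0]["text"][:200])   # peek at the first surviving document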

SimHash Near Deduplication

# input
python -m text_dedup.simhash \
  --path "oscar-corpus/OSCAR-2201" \
  --name "gl" \
  --split "train" \
  --cache_dir "./cache" \
  --output "output/simhash/oscar_gl_dedup" \
  --column "text" \
  --batch_size 10000

# output
INFO     Loading                         : 2.60 seconds
INFO     SimHashing                      : 0.04 seconds
INFO     Indexing                        : 28.88 seconds
INFO     Filtering                       : 0.88 seconds
INFO     Saving                          : 10.41 seconds
INFO     Total                           : 42.80 seconds
INFO     Data Number (before)            : 88803
INFO     Data Number (after)             : 46163 (51.98%)
INFO     Duplicate Number                : 42640 (48.02%)
INFO     🤗 Happy Deduplicating 🤗
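
As a reminder of what is being computed here: a 64-bit SimHash hashes each feature to 64 bits and takes a per-bit majority vote, so near-duplicate texts end up with fingerprints that differ in only a few bits (a small Hamming distance). A stripped-down sketch of that idea, using plain word tokens and ignoring feature weighting and whatever indexing the script uses for fast Hamming-distance lookup:

# Minimal 64-bit SimHash sketch (word tokens only, no weighting or indexing).
import hashlib
import re

def simhash64(text):
    votes = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the fingerprint if it won the majority vote.
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

a = simhash64("the quick brown fox jumps over the lazy dog")
b = simhash64("the quick brown fox jumps over the lazy cat")
c = simhash64("an entirely different paragraph about bloom filters")
print(hamming(a, b), hamming(a, c))  # the near-duplicate pair should be much closer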

Exact Hash Exact Deduplication

# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/exact_hash/oscar_gl_dedup" \
    --column "text" \
    --batch_size 1000

# output
INFO     Loading                       : 2.95s
INFO     Processing                    : 3.79s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.89s
INFO     Total                         : 9.72s
INFO     Before                        : 88803
INFO     After                         : 47049
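
Exact-hash deduplication simply keeps the first occurrence of every distinct text, comparing fixed-size digests instead of whole documents to save memory; the CLI additionally streams the data in batches (--batch_size above). A minimal sketch of the idea, with md5 as an illustrative digest choice rather than necessarily the one the script uses:

# Minimal exact-hash dedup sketch: keep the first row for each unseen digest.
import hashlib

def dedup_exact(texts):
    seen = set()
    kept = []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

rows = ["same document", "same document", "a different document"]
print(dedup_exact(rows))  # ['same document', 'a different document']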

Bloom Filter Exact Deduplication

# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output "output/bloom_filter/oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text" \
    --batch_size 1000

# output
INFO     Loading                       : 2.72s
INFO     Processing                    : 4.84s
INFO     Filtering                     : 0.10s
INFO     Saving                        : 2.88s
INFO     Total                         : 10.54s
INFO     Before                        : 88803
INFO     After                         : 47045
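
The Bloom filter variant performs the same first-occurrence filtering in bounded memory: membership is recorded in a bit array through k hash functions, and --error_rate bounds the probability that a unique document is mistakenly treated as already seen (hence the slightly lower row count above, 47045 versus 47049 for exact hashing). Below is a self-contained toy version of the idea; real runs should rely on a well-tested Bloom filter implementation.

# Toy Bloom filter dedup (illustration only; sizes follow the standard formulas).
import hashlib
import math

class BloomFilter:
    def __init__(self, capacity, error_rate):
        # m bits and k hash functions for the target capacity and error rate.
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        """Record item; return True if it was (probably) seen before."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen

bf = BloomFilter(capacity=10_000, error_rate=1e-5)
rows = ["same document", "same document", "a different document"]
print([t for t in rows if not bf.add(t)])  # ['same document', 'a different document']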

Benchmarks

A benchmark of the different methods can be found in benchmarks/wiki40.ipynb. A notebook evaluating MinHash on pinecone/core-2020-05-10-deduplication can be found in benchmarks/pinecone.ipynb.

For quick reference, here are the results:

Method                                   Precision  Recall  F1      Time
MinHash                                  0.9464     0.9446  0.9455  24s
SimHash*                                 0.9011     0.6959  0.7853  210s
SimHash (Gyawali et al., LREC 2020)      0.697      0.247   0.3647  -
Exact Title (my implementation)          0.8302     0.5521  0.6632  -
Exact Title (Gyawali et al., LREC 2020)  0.830      0.50    0.624   -

*Best SimHash result from benchmarks/hyperparameter.ipynb.

License

Apache 2.0

