
All-in-one text deduplication

Project description

text-dedup

A collection of data deduplication scripts.


Features

  • A single, ready-to-use (and easy-to-modify) script for each method:
    • MinHash + MinHashLSH
    • SimHash
    • SuffixArray Substring
    • Bloom Filter
    • Exact Hash

Acknowledgements

Quick Examples

In this section, we deduplicate one dataset: the gl subset of oscar-corpus/OSCAR-2201.

Suffix Array Substring Exact Deduplication

# input
python -m text_dedup.suffix_array \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output_dir "output/suffix_array" \
    --index_name "lsh.pkl" \
    --graph_name "graph.networkit" \
    --dedup_name "oscar_gl_dedup" \
    --column "text" \
    --google_repo_path "/Users/chenghao/Downloads/Projects/text-dedup/deduplicate-text-datasets"

# output
INFO     All                           : 131.93s
INFO     Loading                       : 4.36s
INFO     Preprocessing                 : 4.81s
INFO     Suffix Array                  : 101.79s
INFO     Collect                       : 5.17s
INFO     Restore                       : 0.27s
INFO     Deduplicate                   : 13.00s
INFO     Saving                        : 2.52s
INFO     Before                        : 180332342 bytes (88803)
INFO     After                         : 97646271 bytes (40404)
INFO     Output                        : output/suffix_array/oscar_gl_dedup
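Conceptually, the suffix-array method concatenates all documents, builds a suffix array over the concatenation, and removes long substrings that occur more than once; the script above relies on the deduplicate-text-datasets repository pointed to by --google_repo_path for the heavy lifting. The toy sketch below (a hypothetical pure-Python illustration, not the script's internals) shows the core idea: adjacent entries of the sorted suffix array share a long common prefix exactly when the corpus contains a long repeated substring.

# Toy sketch of suffix-array substring deduplication (illustration only).
# O(n^2 log n) construction: fine for a short string, far too slow for a real corpus.

def repeated_substrings(corpus: str, min_len: int = 15) -> set:
    sa = sorted(range(len(corpus)), key=lambda i: corpus[i:])  # suffix array via sorted suffixes
    found = set()
    for prev, cur in zip(sa, sa[1:]):
        # Longest common prefix of two adjacent suffixes.
        lcp = 0
        while (prev + lcp < len(corpus) and cur + lcp < len(corpus)
               and corpus[prev + lcp] == corpus[cur + lcp]):
            lcp += 1
        if lcp >= min_len:
            found.add(corpus[cur:cur + lcp])
    return found

corpus = "A Coruña é unha cidade. A Coruña é unha cidade galega."
print(repeated_substrings(corpus))  # contains 'A Coruña é unha cidade' plus shorter overlapping repeats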

MinHash Near Deduplication

# input
python -m text_dedup.minhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output_dir "output/minhash" \
    --index_name "lsh.pkl" \
    --graph_name "graph.networkit" \
    --dedup_name "oscar_gl_dedup" \
    --column "text" \
    --ngram 1 \
    --num_perm 128 \
    --threshold 0.8 \
    --seed 42

# output
INFO     All                           : 52.73s
INFO     Loading                       : 5.32s
INFO     Minhash                       : 12.82s
INFO     Index                         : 8.54s
INFO     Save Index                    : 3.86s
INFO     Query                         : 4.49s
INFO     Clustering                    : 17.47s
INFO     Deduplicate                   : 0.05s
INFO     Save                          : 0.04s
INFO     Before                        : 88803
INFO     After                         : 43971
INFO     Index                         : output/minhash/lsh.pkl
INFO     Graph                         : output/minhash/graph.networkit
INFO     Output                        : output/minhash/oscar_gl_dedup
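For intuition, the MinHash method shingles each document into word n-grams (--ngram), turns the shingle set into a fixed-size signature of --num_perm hash values, and buckets signatures with locality-sensitive hashing so that pairs whose estimated Jaccard similarity exceeds --threshold become duplicate candidates. Below is a minimal sketch using the datasketch library (a standalone illustration, not the internals of text_dedup.minhash).

# Conceptual sketch of MinHash + LSH near-duplicate detection (assumes `pip install datasketch`).
from datasketch import MinHash, MinHashLSH

def minhash_signature(text, ngram=1, num_perm=128):
    tokens = text.lower().split()
    shingles = {" ".join(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)}
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river",
    "c": "a completely different sentence about something else entirely",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
signatures = {key: minhash_signature(text) for key, text in docs.items()}
for key, sig in signatures.items():
    lsh.insert(key, sig)

# Documents whose estimated Jaccard similarity exceeds the threshold land in the same buckets.
print(lsh.query(signatures["a"]))  # likely ['a', 'b']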

SimHash Near Deduplication

# input
python -m text_dedup.simhash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output_dir "output/simhash" \
    --index_name "index.pkl" \
    --graph_name "graph.networkit" \
    --dedup_name "oscar_gl_dedup" \
    --column "text" \
    --ngram 6 \
    --bit_diff 3 \
    --num_bucket 4

# output
INFO     All                           : 39.88s
INFO     Loading                       : 4.45s
INFO     Simhash                       : 1.91s
INFO     Index                         : 5.23s
INFO     Save Index                    : 1.44s
INFO     Query                         : 6.57s
INFO     Clustering                    : 16.42s
INFO     Deduplicate                   : 0.72s
INFO     Save                          : 3.11s
INFO     Before                        : 88803
INFO     After                         : 46659
INFO     Index                         : output/simhash/index.pkl
INFO     Graph                         : output/simhash/graph.networkit
INFO     Output                        : output/simhash/oscar_gl_dedup
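SimHash takes a different route: each document is reduced to a single 64-bit fingerprint, and two documents count as near duplicates when their fingerprints differ in at most --bit_diff bits; --num_bucket controls how the fingerprint bits are split so candidates can be found without comparing every pair. A minimal pure-Python sketch of the fingerprint itself (an illustration, not the script's implementation):

# Conceptual 64-bit SimHash over character n-grams (illustration only).
import hashlib

def simhash(text, ngram=6, bits=64):
    # Each n-gram votes on every bit of the fingerprint.
    grams = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    weights = [0] * bits
    for g in grams:
        h = int.from_bytes(hashlib.md5(g.encode("utf-8")).digest()[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

a = simhash("O galego é unha lingua romance falada en Galicia")
b = simhash("O galego é unha lingua romance falada na Galicia")
print(hamming(a, b))  # a small distance suggests a near duplicate; identical texts give 0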

Exact Hash Exact Deduplication

# input
python -m text_dedup.exact_hash \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output_dir "output/exact_hash" \
    --dedup_name "oscar_gl_dedup" \
    --column "text"

# output
INFO     All                           : 5.34s
INFO     Loading                       : 4.48s
INFO     Processing                    : 0.73s
INFO     Filtering                     : 0.07s
INFO     Saving                        : 0.05s
INFO     Before                        : 88803
INFO     After                         : 47049
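Exact-hash deduplication is the simplest of the methods: hash the text column of every record and keep only the first record seen for each digest. A minimal sketch of the idea (hypothetical, not the script's internals):

# Exact-hash deduplication: keep the first document for each content digest.
import hashlib

def exact_dedup(docs):
    seen = set()
    kept = []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

docs = ["hola mundo", "hola mundo", "ola mundo"]
print(exact_dedup(docs))  # ['hola mundo', 'ola mundo']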

Bloom Filter Exact Deduplication

# input
python -m text_dedup.bloom_filter \
    --path "oscar-corpus/OSCAR-2201" \
    --name "gl" \
    --split "train" \
    --cache_dir "./cache" \
    --output_dir "output/bloom_filter" \
    --dedup_name "oscar_gl_dedup" \
    --error_rate 1e-5 \
    --column "text"

# output
INFO     All                           : 10.69s
INFO     Loading                       : 4.44s
INFO     Processing                    : 6.13s
INFO     Filtering                     : 0.07s
INFO     Saving                        : 0.05s
INFO     Before                        : 88803
INFO     After                         : 47045
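A Bloom filter gives the same exact-match semantics with a fixed memory budget: --error_rate bounds the probability that a new document is falsely flagged as a duplicate (and therefore dropped). The hand-rolled sketch below is an illustration of the technique, not the script's implementation; it sizes the filter with the standard formulas m = -n·ln(p)/(ln 2)^2 bits and k = (m/n)·ln 2 hash functions.

# Minimal Bloom-filter deduplication sketch (illustration only).
import hashlib
import math

class BloomFilter:
    def __init__(self, capacity, error_rate=1e-5):
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = 0  # Python int used as a bit set

    def _positions(self, item):
        # Double hashing: derive k positions from two 64-bit halves of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        """Add item; return True if it was (probably) already present."""
        positions = self._positions(item)
        present = all((self.bits >> p) & 1 for p in positions)
        for p in positions:
            self.bits |= 1 << p
        return present

bf = BloomFilter(capacity=100_000, error_rate=1e-5)
docs = ["hola mundo", "ola mundo", "hola mundo"]
kept = [text for text in docs if not bf.add(text)]
print(kept)  # ['hola mundo', 'ola mundo']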

Documentation

  • TODO

Roadmap

FAQ

Why use scripts instead of OOD classes and functions?

Early versions of the code used an object-oriented design for hashing and indexing, which proved difficult because the different methods share little to no abstraction. Getting them into a usable shape required a lot of wrapper code, which only increased the overhead of using this library. Additionally, deduplication is usually a one-off step in a data preprocessing pipeline, so there is little need for inline access.

Why the license change?

Because the Google repo is licensed under Apache 2.0, I had to switch from MIT. Until that part of the code is completely re-implemented, Apache 2.0 will be the license I use.

License

Apache 2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

text_dedup-0.3.0.tar.gz (20.0 kB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

text_dedup-0.3.0-py3-none-any.whl (23.3 kB)

Uploaded Python 3

File details

Details for the file text_dedup-0.3.0.tar.gz.

File metadata

  • Download URL: text_dedup-0.3.0.tar.gz
  • Upload date:
  • Size: 20.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.10.6 Darwin/22.2.0

File hashes

Hashes for text_dedup-0.3.0.tar.gz
Algorithm Hash digest
SHA256 fb5b754382267f93adf3f31b046ab3108fd2628b3e797df6b979a962e152f575
MD5 ca34855fa9024d2ec5885c08cbda094e
BLAKE2b-256 6708989fe1238ff8baf38d136da37d18247888897e5de5c42c14b257e2af857c

See more details on using hashes here.

File details

Details for the file text_dedup-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: text_dedup-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 23.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.2.2 CPython/3.10.6 Darwin/22.2.0

File hashes

Hashes for text_dedup-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9e45256033c0ee3a728686b297139af2243bc76d412c857ea8840729930c9343
MD5 d432b1b2613561adc2f09046f88c0c79
BLAKE2b-256 19ca28d3b5dd15154ebf24a8618523febc277b83b8834487175ca73ac0a40fc7

See more details on using hashes here.
