text-dedup

All-in-one text deduplication.
Features
- Hash-based methods such as SimHash and MinHash + LSH for near-deduplication.
- Suffix-array-based method from "Deduplicating Training Data Makes Language Models Better" for exact substring deduplication.
- In-memory or Redis/KeyDB-cached index to handle larger-than-memory datasets.
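The MinHash + LSH approach mentioned above can be illustrated with a small, self-contained Python sketch. This is not text-dedup's actual API; the shingle size, band layout, and sample documents are illustrative assumptions. Each document gets a signature of per-seed minimum hashes over its character shingles, and LSH banding groups documents that agree on any full band:

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64          # signature length (number of hash functions)
BANDS, ROWS = 16, 4    # LSH banding: 16 bands x 4 rows = 64 positions

def shingles(text, k=3):
    """Character k-gram shingles of a whitespace-normalized string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)} or {text}

def minhash(text):
    """Signature: for each seeded hash function, the minimum over all shingles."""
    def h(seed, s):
        return int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
    grams = shingles(text)
    return [min(h(seed, s) for s in grams) for seed in range(NUM_PERM)]

def lsh_candidates(signatures):
    """Return pairs of keys whose signatures agree on at least one full band."""
    buckets = defaultdict(list)
    for key, sig in signatures.items():
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(key)
    pairs = set()
    for keys in buckets.values():
        pairs.update(combinations(sorted(keys), 2))
    return pairs

# Hypothetical sample corpus: "a" and "b" are near-duplicates.
docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumps over the lazy dog today.",
    "c": "Suffix arrays enable exact substring deduplication.",
}
sigs = {k: minhash(t) for k, t in docs.items()}
pairs = lsh_candidates(sigs)
print(pairs)
```

Because two documents agree at each signature position with probability equal to their shingle Jaccard similarity, near-duplicates are very likely to collide in some band while unrelated documents almost never do; production implementations such as datasketch use the same scheme with faster hashing.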
Documentation
Todos
- Memory benchmark for streaming processing
- Speed benchmark for in-memory processing
- Inter-dataset deduplication
- Rewrite suffix array in Python
Thanks
- seomoz/simhash-cpp
- datasketch
- google-research/deduplicate-text-datasets
- Developed with an OSS license from JetBrains.
- This project is heavily influenced by the deduplication work at the BigScience workshop. The original code can be found at bigscience-workshop/data-preparation.
Download files
- Source distribution: text-dedup-0.2.0.tar.gz (79.8 kB)
- Built distribution: text_dedup-0.2.0-py3-none-any.whl (34.4 kB)
Hashes for text_dedup-0.2.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | b47b1e177241425da8ea09fa7484c63de131b08f708f78e00f377582eb03b965
MD5 | a839715af39f3c0db6ff0ab358f38454
BLAKE2b-256 | a5dfe1b786f02446b587d7610257b642a3074484bdd6eb2ff177c1fdd6ba2f41