Text deduplication with fuzzy match and more

Project description

text-dedup

Features

SOTA embeddings with sentence-transformers
Fast de-duplication with annoy
Suffix Array and MinHash, following "Deduplicating Training Data Makes Language Models Better"

Installation

pip install text-dedup

Usage

Using Sentence Transformer

import pandas as pd
from text_dedup import SentenceTransformerDeduper

df = pd.read_csv('...')

# Embed each text with the named sentence-transformers model and assign a group id per row
deduper = SentenceTransformerDeduper("distilbert-base-nli-stsb-mean-tokens")
df["group"] = deduper.group(df["text"].values.tolist(), show_progress_bar=True)

# Keep only the first row of each group
df = df.drop_duplicates(["group"], keep="first")
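
To illustrate the idea of grouping texts by embedding similarity, here is a minimal sketch that links rows whose cosine similarity passes a threshold and labels each group by its smallest row index. It is not the code behind `SentenceTransformerDeduper`: the function name `group_by_cosine`, the `threshold` parameter, and the toy 2-d vectors are assumptions made for this example, and the real library uses annoy for the neighbour search rather than a full similarity matrix.

```python
import numpy as np


def group_by_cosine(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Label each row with the smallest row index it is transitively linked to.

    Hypothetical helper for illustration; not part of text-dedup.
    """
    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T

    # Union-find over row indices.
    parent = list(range(len(unit)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Only the upper triangle is scanned, so each pair is visited once.
    for i, j in zip(*np.where(np.triu(sims, k=1) >= threshold)):
        ri, rj = find(int(i)), find(int(j))
        if ri != rj:
            # Keep the smaller index as the root so labels are stable.
            parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(len(unit))]


# Toy stand-ins for sentence embeddings (assumed values).
vectors = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]])
groups = group_by_cosine(vectors, threshold=0.95)
print(groups)  # [0, 0, 2, 2]
```

The full similarity matrix is quadratic in the number of rows, which is why an approximate nearest-neighbour index is the more practical choice at scale.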

Using Suffix Array for exact match

import pandas as pd
from text_dedup import SuffixArray

df = pd.read_csv('...')

deduper = SuffixArray(k=50)
groups, duplicates = deduper.fit_transform(df["text"].values.tolist())

# groups has one row per input text and one column per detected duplicate
assert len(groups) == len(df), "Invalid number of rows"
assert len(duplicates) == groups.shape[1], "Invalid number of columns"
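
For readers unfamiliar with the technique, the sketch below shows how a suffix array exposes exact repeated substrings: once suffixes are sorted, any substring that occurs twice shows up as a shared prefix of two neighbouring suffixes. It assumes `k` means the minimum duplicate length, which this page does not spell out, and `repeated_substrings` is a hypothetical helper, not the library's implementation. The naive construction is only suitable for short inputs.

```python
def repeated_substrings(text: str, k: int) -> set[str]:
    """Return every length-k substring (k >= 1) that occurs at least twice."""
    # Naive suffix array: start offsets ordered by the suffix they begin.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    found = set()
    for a, b in zip(suffix_array, suffix_array[1:]):
        # Neighbouring suffixes sharing a k-character prefix mean that
        # prefix appears at two different offsets in the text.
        if len(text) - a >= k and text[a:a + k] == text[b:b + k]:
            found.add(text[a:a + k])
    return found


print(repeated_substrings("the cat sat. the cat ran.", 8))  # {'the cat '}
```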

Using MinHash for fuzzy match

from text_dedup import MinHashDeduper

deduper = MinHashDeduper(ngram_size=5, threshold=0.3)
groups = deduper.fit_transform([
    "This is a sentence.",
    "This is another sentence.",
    "This is a question.",
    "hello world",
])
# groups[i] is the group label of the i-th text; the first two sentences share a label
assert groups == [0, 0, 2, 3]
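
As background on what MinHash estimates, here is a small standard-library sketch: texts are turned into n-gram sets, and the fraction of positions where two MinHash signatures agree approximates the Jaccard similarity of those sets. Character n-grams, the salted `blake2b` hash family, and the names `shingles`, `jaccard`, `signature`, and `estimate` are all assumptions for this example; the page does not say how `MinHashDeduper` tokenizes or hashes.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set[str]:
    """Character n-grams of text (the whole text if it is shorter than n)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def signature(items: set[str], num_perm: int = 64) -> list[int]:
    """MinHash signature: the minimum hash value under each salted hash."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(
                    s.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
                ).digest(),
                "big",
            )
            for s in items
        )
        for seed in range(num_perm)
    ]


def estimate(sig_a: list[int], sig_b: list[int]) -> float:
    """Share of signature positions that agree; approximates Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


a = shingles("This is a sentence.")
b = shingles("This is another sentence.")
print(f"exact={jaccard(a, b):.2f} estimated={estimate(signature(a), signature(b)):.2f}")
```

The estimate gets closer to the exact value as `num_perm` grows, at the cost of longer signatures.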

Benchmark (with a P100 GPU)

20k (5%) subset of QQP (Quora Question Pairs)

              precision    recall  f1-score   support

       False       0.75      0.89      0.81     12671
        True       0.73      0.50      0.60      7543

    accuracy                           0.75     20214
   macro avg       0.74      0.70      0.71     20214
weighted avg       0.74      0.75      0.73     20214
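
To make the summary rows easier to read, this short sketch recomputes the macro and weighted averages from the per-class rows of the report. The layout appears to match scikit-learn's `classification_report`, which is an assumption here. Because the inputs are already rounded to two decimals, recomputed values can differ from the printed ones in the last digit.

```python
# Per-class rows copied from the report: (precision, recall, support).
rows = {
    "False": (0.75, 0.89, 12671),
    "True": (0.73, 0.50, 7543),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(support for _, _, support in rows.values())

# Macro average: plain mean over classes, ignoring class size.
macro_precision = sum(p for p, _, _ in rows.values()) / len(rows)

# Weighted average: mean over classes, weighted by support.
weighted_precision = sum(p * s for p, _, s in rows.values()) / total_support

f1_false = f1(*rows["False"][:2])

print(total_support)
print(f"{macro_precision:.2f} {weighted_precision:.2f} {f1_false:.2f}")
```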


--------------------------------------------- benchmark: 1 tests --------------------------------------------
Name (time in s)         Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
-------------------------------------------------------------------------------------------------------------
test_scaling         89.9221  89.9221  89.9221  0.0000  89.9221  0.0000       0;0  0.0111       1          10
-------------------------------------------------------------------------------------------------------------

Release history

0.4.0 (Apr 17, 2024)
0.3.1 (Mar 25, 2023)
0.3.0 (Nov 5, 2022)
0.2.1 (Sep 29, 2022)
0.2.0 (Sep 24, 2022)
0.1.1 (Sep 4, 2022)
0.1.0 (Aug 27, 2022)
0.0.18 (Jun 20, 2022)
0.0.17 (Jun 15, 2022)
0.0.16 (Jun 14, 2022)
0.0.15 (May 29, 2022)
0.0.14 (May 29, 2022)
0.0.13 (Apr 2, 2022)
0.0.12 (Dec 25, 2021), this version
0.0.11 (Jul 24, 2021)
0.0.10 (Jul 24, 2021)
0.0.9 (Jun 5, 2021)
0.0.7 (Apr 14, 2021)
0.0.6 (Apr 11, 2021)
0.0.5 (Apr 3, 2021)
0.0.4 (Mar 29, 2021)
0.0.3 (Mar 14, 2021)
0.0.2 (Mar 14, 2021)
0.0.1 (Mar 14, 2021)

Download files

Source Distribution

text-dedup-0.0.12.tar.gz (8.2 kB)

Uploaded Dec 25, 2021 Source

Built Distribution

text_dedup-0.0.12-py3-none-any.whl (10.0 kB)

Uploaded Dec 25, 2021 Python 3

Hashes for text-dedup-0.0.12.tar.gz

Algorithm	Hash digest
SHA256	`4c95c832bd568ff6b97292ffcdfab1fc1d9432044477e1943eaf8f00e15738c6`
MD5	`88e6ead190612514271903f23757a8b9`
BLAKE2b-256	`c82bba684330b32f2bb528d1ad9062d4498985660dacb71cb2ee568fc44b6402`

Hashes for text_dedup-0.0.12-py3-none-any.whl

Algorithm	Hash digest
SHA256	`8527bfa25de84e49972af34a5376069e5a04c698e4a8e32efce9d48afe543d3d`
MD5	`107307b3383a2278019b9e0819ebdcda`
BLAKE2b-256	`8f8be8236e0916cf88aa86b47fa45f8d9ccb341a7e4ec83a0672d4659502a253`
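
A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch: the local path is a hypothetical download location, and `sha256_of` is a helper name introduced for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the wheel's hash table.
EXPECTED_WHEEL_SHA256 = "8527bfa25de84e49972af34a5376069e5a04c698e4a8e32efce9d48afe543d3d"

# Hypothetical local path; adjust to wherever the wheel was saved.
wheel = Path("text_dedup-0.0.12-py3-none-any.whl")
if wheel.exists():
    matches = sha256_of(wheel) == EXPECTED_WHEEL_SHA256
    print("SHA256 matches" if matches else "SHA256 mismatch")
```

pip can also enforce digests at install time through a requirements file with `--hash` entries and the `--require-hashes` option.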