Text deduplication with fuzzy match and more
Project description
text-dedup
Text deduplication with edit distance, LSH, or embeddings. (Work in progress.)
Usage
- Group near duplicates
import pandas as pd
from text_dedup.dedupers import EditDistanceSimilarityDeduper
from text_dedup import group_duplicates

df = pd.read_csv(...)
df_groups = group_duplicates(
    df,
    deduper=EditDistanceSimilarityDeduper(
        similarity_metric="cosine",
        threshold=0.8,
        k=3),
    column="text",
    target_column="__group_label__"
)
# Count how many rows fall into each duplicate group.
df_groups["__group_label__"].value_counts(dropna=False)
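To give a feel for what grouping by a similarity threshold involves, here is a minimal standard-library sketch. It is not text-dedup's implementation. It assumes that `k` is a character k-gram size and that "cosine" means cosine similarity over k-gram counts; the helper names (`kgram_counts`, `cosine_similarity`, `group_near_duplicates`) are hypothetical.

```python
from collections import Counter
from math import sqrt


def kgram_counts(text, k=3):
    """Count overlapping character k-grams (assumed meaning of `k`)."""
    return Counter(text[i:i + k] for i in range(max(len(text) - k + 1, 1)))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def group_near_duplicates(texts, threshold=0.8, k=3):
    """Label each text with the index of the first member of its group.

    Pairs at or above `threshold` are merged with a small union-find,
    so similarity is treated as transitive. Comparing all pairs is O(n^2).
    """
    vectors = [kgram_counts(t, k) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                # Keep the smaller index as the root so labels are stable.
                parent[max(root_i, root_j)] = min(root_i, root_j)
    return [find(i) for i in range(len(texts))]


texts = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog!",
    "An entirely different sentence about databases",
]
labels = group_near_duplicates(texts)
print(labels)  # the first two texts share a label, the third gets its own
```

The all-pairs loop is the reason this kind of approach gets slow on large inputs, which is where approximate methods such as LSH become attractive.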
- Remove near duplicates
import pandas as pd
from text_dedup.dedupers import EditDistanceSimilarityDeduper
from text_dedup import drop_duplicates

df = pd.read_csv(...)
df_dedup = drop_duplicates(
    df,
    deduper=EditDistanceSimilarityDeduper(
        similarity_metric="cosine",
        threshold=0.8,
        k=3),
    column="text"
)
# Holds only if at least one near duplicate was removed.
assert df.shape != df_dedup.shape
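Conceptually, dropping near duplicates means keeping one representative of each cluster of similar texts. The sketch below shows one simple way to do that with a greedy filter. It uses `difflib.SequenceMatcher.ratio()` from the standard library as a stand-in similarity score, which is not the same as the metric text-dedup uses, and the function name `drop_near_duplicates` is hypothetical.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(texts, threshold=0.8):
    """Greedy filter: keep a text only if it is not too similar to any kept text."""
    kept = []
    for text in texts:
        is_duplicate = any(
            SequenceMatcher(None, text, seen).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept


texts = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog!",
    "An entirely different sentence about databases",
]
kept = drop_near_duplicates(texts)
print(len(texts), "->", len(kept))
```

The first occurrence always wins here, so the result depends on input order; sorting or ranking rows beforehand is one way to control which copy survives.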
- Remove semantically similar duplicates
import pandas as pd
from text_dedup.dedupers import PretrainedBERTEmbeddingDeduper
from text_dedup import drop_duplicates

df = pd.read_csv(...)
# Define `threshold` (the similarity cutoff) before running this.
data_dedup = drop_duplicates(
    df,
    deduper=PretrainedBERTEmbeddingDeduper(
        model='paraphrase-distilroberta-base-v1',
        threshold=threshold,
    ),
    column="text"
)
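Semantic deduplication compares texts through their embedding vectors rather than their characters. As a rough illustration of the thresholding step only, the sketch below uses hand-written 3-dimensional vectors in place of real model embeddings. The vectors, the cutoff of 0.9, and the function names are all made up for the example and say nothing about how `PretrainedBERTEmbeddingDeduper` works internally.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def drop_similar(embeddings, threshold):
    """Return the indices of rows to keep.

    A row is dropped when its similarity to an earlier kept row
    reaches `threshold`.
    """
    keep = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in keep):
            keep.append(i)
    return keep


# Toy stand-ins for sentence embeddings (real ones have hundreds of dimensions).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],  # points almost the same way as row 0
    [0.0, 1.0, 0.0],  # orthogonal to row 0
]
kept_rows = drop_similar(toy_embeddings, threshold=0.9)
print(kept_rows)
```

A higher threshold keeps more rows; a suitable value depends on the embedding model and the data, so it is worth checking a sample of dropped rows by hand.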
Installation
pip install text-dedup
Benchmarks
LSH
------------------------------------------------ benchmark: 1 tests ------------------------------------------------
Name (time in ms)       Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------
test_performance3  767.0355  846.3728  803.1992  31.7007  798.3628  50.2480       2;0  1.2450       5           5
--------------------------------------------------------------------------------------------------------------------
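The usage section does not show the LSH deduper, so as background only, here is a minimal sketch of one common form of LSH for text: MinHash signatures with banding. Whether text-dedup uses this exact scheme is an assumption, and every name and parameter below (`shingles`, `minhash_signature`, `candidate_pairs`, `num_perm`, `bands`) is hypothetical.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Set of overlapping character k-grams."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_signature(items, num_perm=32):
    """For each seed, record the minimum salted hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def candidate_pairs(texts, num_perm=32, bands=8):
    """Return index pairs whose signatures agree on at least one full band."""
    rows = num_perm // bands
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        sig = minhash_signature(shingles(text), num_perm)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(idx)
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


docs = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog!",
    "An entirely different sentence about databases",
]
print(candidate_pairs(docs))
```

Only texts that land in the same bucket are compared further, which avoids the all-pairs comparison and is consistent with LSH being the fastest of the three benchmarks reported here.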
EditDistance
--------------------------------------------- benchmark: 1 tests ---------------------------------------------
Name (time in s)        Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
--------------------------------------------------------------------------------------------------------------
test_performance2   10.7813   11.7912   11.2641   0.3861   11.1549   0.5356       2;0  0.0888       5           5
--------------------------------------------------------------------------------------------------------------
BERT
-------------------------------------------- benchmark: 1 tests -------------------------------------------
Name (time in s)        Min       Max      Mean   StdDev    Median      IQR  Outliers     OPS  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------
test_performance1    8.0105   10.8614    9.4974   1.2967    9.1050   2.3446       3;0  0.1053       5           5
-----------------------------------------------------------------------------------------------------------
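When reading the tables above, note that the LSH times are in milliseconds while the other two are in seconds. The layout matches pytest-benchmark output, where OPS (operations per second) is the reciprocal of the mean time in seconds. The short check below recomputes OPS from the reported means; the dictionary name is just for illustration.

```python
# Mean times copied from the benchmark tables, converted to seconds.
mean_seconds = {
    "LSH": 803.1992 / 1000,  # reported in ms
    "EditDistance": 11.2641,
    "BERT": 9.4974,
}

# OPS = 1 / mean, rounded to four decimals as in the tables.
ops = {name: round(1 / mean, 4) for name, mean in mean_seconds.items()}
for name, value in ops.items():
    print(f"{name}: {value:.4f} ops/s")
```

The recomputed values agree with the OPS column, which confirms the unit difference between the tables.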
Download files
Source Distribution
text-dedup-0.0.6.tar.gz (5.6 kB)
Built Distribution
Hashes for text_dedup-0.0.6-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 470dd04d62c72a0acd00c0651511835c2abff340b314aa5601108354ad3c31d5
MD5 | 310451417577d8b33e11577e2bbe8adf
BLAKE2b-256 | df4c8075742d9ffbfeb69d64773604aa9431e078871f97efa87be1250cbbf8a7