Text deduplication with fuzzy match and more
text-dedup
Text deduplication with fuzzy match and more. (WIP)
Usage
- Group near duplicates
    import pandas as pd
    from text_dedup.dedupers import EditDistanceSimilarityDeduper
    from text_dedup import group_duplicates

    df = pd.read_csv(...)
    df_groups = group_duplicates(
        df,
        deduper=EditDistanceSimilarityDeduper(
            similarity_metric="cosine",
            threshold=0.8,
            k=3,
        ),
        column="text",
        target_column="__group_label__",
    )
    df["__group_label__"].value_counts(dropna=False)
- Remove near duplicates
    import pandas as pd
    from text_dedup.dedupers import EditDistanceSimilarityDeduper
    from text_dedup import drop_duplicates

    df = pd.read_csv(...)
    df_dedup = drop_duplicates(
        df,
        deduper=EditDistanceSimilarityDeduper(
            similarity_metric="cosine",
            threshold=0.8,
            k=3,
        ),
        column="text",
    )
    assert df.shape != df_dedup.shape
- Remove semantically similar duplicates
    import pandas as pd
    from text_dedup.dedupers import PretrainedBERTEmbeddingDeduper
    from text_dedup import drop_duplicates

    threshold = 0.8  # example similarity cutoff (an assumption); tune for your data

    df = pd.read_csv(...)
    df_dedup = drop_duplicates(
        df,
        deduper=PretrainedBERTEmbeddingDeduper(
            model="paraphrase-distilroberta-base-v1",
            threshold=threshold,
        ),
        column="text",
    )
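To make the idea behind grouping and dropping near duplicates concrete, here is a minimal standalone sketch. It is not how text-dedup is implemented: it swaps in the standard library's `difflib.SequenceMatcher` ratio as the similarity score, and the helper names (`similarity`, `group_near_duplicates`, `drop_near_duplicates`) are hypothetical. Texts whose pairwise similarity reaches the threshold are merged into one group, and dropping duplicates keeps the first text of each group.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings.

    Assumption: difflib's ratio stands in for the library's own metric.
    """
    return SequenceMatcher(None, a, b).ratio()


def group_near_duplicates(texts, threshold=0.8):
    """Return one group label per text; similar texts share a label.

    Uses a small union-find so that similarity chains (a~b, b~c) end up
    in the same group. The label is the smallest index in the group.
    """
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair (quadratic; fine for a small illustration).
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if similarity(texts[i], texts[j]) >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Attach the larger root to the smaller one.
                    parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(len(texts))]


def drop_near_duplicates(texts, threshold=0.8):
    """Keep only the first text from each near-duplicate group."""
    labels = group_near_duplicates(texts, threshold)
    seen = set()
    kept = []
    for text, label in zip(texts, labels):
        if label not in seen:
            seen.add(label)
            kept.append(text)
    return kept


texts = [
    "The quick brown fox jumps over the lazy dog",
    "The quick brown fox jumped over the lazy dog",
    "An entirely different sentence",
]
labels = group_near_duplicates(texts)
unique_texts = drop_near_duplicates(texts)
```

Here the first two sentences land in the same group and the third stays on its own, so `unique_texts` holds two entries. The all-pairs comparison is quadratic in the number of texts, so this sketch only suits small inputs.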
Installation
pip install text-dedup
Source Distribution
text-dedup-0.0.5.tar.gz (4.7 kB)
Built Distribution
Hashes for text_dedup-0.0.5-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 03c2ee5cee02a9e518a9cc22a2a421148c3b85ca71060b2f170c7725d83b39b6
MD5 | 5cf198ebfcd235e4fab83b39134536ed
BLAKE2b-256 | 479239858e613b6a32eae3925335ecf4fd835385ed938ae68132f94a98650250