Text deduplication with fuzzy match and more
text-dedup (WIP)
Usage
- Group near duplicates
import pandas as pd
from text_dedup.dedupers import EditDistanceSimilarityDeduper
from text_dedup import group_duplicates

# Load the data; replace ... with the path to your CSV file
df = pd.read_csv(...)

# Assign a group label to rows whose "text" values are near duplicates
df_groups = group_duplicates(
    df,
    deduper=EditDistanceSimilarityDeduper(
        similarity_metric="cosine",
        threshold=0.8,
        k=3,
    ),
    column="text",
    target_column="__group_label__",
)

# Count how many rows fall into each group
df["__group_label__"].value_counts(dropna=False)
- Remove near duplicates
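import pandas as pd
from text_dedup.dedupers import EditDistanceSimilarityDeduper
from text_dedup import drop_duplicates

# Load the data; replace ... with the path to your CSV file
df = pd.read_csv(...)

# Drop rows whose "text" values are near duplicates of another row
df_dedup = drop_duplicates(
    df,
    deduper=EditDistanceSimilarityDeduper(
        similarity_metric="cosine",
        threshold=0.8,
        k=3,
    ),
    column="text",
)

# Holds only if the data actually contained near duplicates
assert df.shape != df_dedup.shape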
- Remove semantically similar duplicates
import pandas as pd
from text_dedup.dedupers import PretrainedBERTEmbeddingDeduper
from text_dedup import drop_duplicates

# Load the data; replace ... with the path to your CSV file
df = pd.read_csv(...)

# `threshold` is a similarity cutoff; define it before this call
data_dedup = drop_duplicates(
    df,
    deduper=PretrainedBERTEmbeddingDeduper(
        model="paraphrase-distilroberta-base-v1",
        threshold=threshold,
    ),
    column="text",
)
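To give a feel for what a similarity threshold does in the examples above, here is a minimal standalone sketch. It is not the text-dedup implementation. It assumes that `k` is a character k-gram size and that "cosine" means cosine similarity over k-gram counts; both are guesses from the parameter names. The helper names (`kgram_counts`, `cosine`, `group_labels`, `drop_near_duplicates`) are hypothetical and use only the standard library.

```python
from collections import Counter
from math import sqrt


def kgram_counts(text, k=3):
    """Count overlapping character k-grams of a lowercased string."""
    text = text.lower()
    if len(text) < k:
        # Strings shorter than k are treated as a single gram
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def group_labels(texts, threshold=0.8, k=3):
    """Label each text with the index of the first earlier text it resembles.

    Greedy single pass: a text joins the group of the first representative
    whose similarity is >= threshold; otherwise it starts a new group.
    """
    vectors = [kgram_counts(t, k) for t in texts]
    representatives = []  # index of the first member of each group
    labels = []
    for i, vec in enumerate(vectors):
        for rep in representatives:
            if cosine(vec, vectors[rep]) >= threshold:
                labels.append(rep)
                break
        else:
            representatives.append(i)
            labels.append(i)
    return labels


def drop_near_duplicates(texts, threshold=0.8, k=3):
    """Keep only the first text of each group."""
    labels = group_labels(texts, threshold, k)
    return [t for i, t in enumerate(texts) if labels[i] == i]


texts = [
    "the quick brown fox",
    "the quick brown fox!",
    "completely different sentence",
]
print(group_labels(texts))          # [0, 0, 2]
print(drop_near_duplicates(texts))  # first and third strings
```

The greedy pass compares each text only against group representatives, so the result depends on input order. Raising `threshold` makes grouping stricter, lowering it merges more rows. The same loop could in principle run over embedding vectors instead of k-gram counts for semantic matching, though that is again an assumption about the approach rather than a description of the library.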
Installation
pip install text-dedup