Text deduplication with fuzzy match and more
text-dedup
Usage
- Group near duplicates
    import pandas as pd
    from text_dedup.dedupers import EditDistanceSimilarityDeduper
    from text_dedup import group_duplicates

    df = pd.read_csv(...)
    # Label each row with the group of near duplicates it belongs to.
    df_groups = group_duplicates(
        df,
        deduper=EditDistanceSimilarityDeduper(
            similarity_metric="cosine",
            threshold=0.8,
            k=3),
        column="text",
        target_column="__group_label__"
    )
    # Inspect group sizes on the returned frame.
    df_groups["__group_label__"].value_counts(dropna=False)
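To make the idea of grouping by a similarity threshold concrete, here is a minimal standard-library sketch. It is not text-dedup's implementation: `similarity` and `group_near_duplicates` are hypothetical names, `difflib.SequenceMatcher.ratio()` stands in for the edit-distance similarity, and the all-pairs comparison is only practical for small inputs.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in metric, an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def group_near_duplicates(texts: List[str], threshold: float = 0.8) -> List[int]:
    """Return one group label per text; texts linked by similarity share a label."""
    parent = list(range(len(texts)))

    def find(i: int) -> int:
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair and merge the groups of pairs above the threshold.
    for i, j in combinations(range(len(texts)), 2):
        if similarity(texts[i], texts[j]) >= threshold:
            root_i, root_j = find(i), find(j)
            # Keep the smaller index as the root so labels are stable.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    return [find(i) for i in range(len(texts))]


texts = [
    "the quick brown fox",
    "the quick brown fox!",
    "an unrelated sentence",
]
labels = group_near_duplicates(texts)
print(labels)
```

The first two strings differ by a single character and end up with the same label, while the third stays in its own group.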
- Remove near duplicates
    import pandas as pd
    from text_dedup.dedupers import EditDistanceSimilarityDeduper
    from text_dedup import drop_duplicates

    df = pd.read_csv(...)
    # Keep one row per group of near duplicates.
    df_dedup = drop_duplicates(
        df,
        deduper=EditDistanceSimilarityDeduper(
            similarity_metric="cosine",
            threshold=0.8,
            k=3),
        column="text"
    )
    # Holds only when the input contained at least one near duplicate.
    assert df.shape != df_dedup.shape
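For intuition about what removing near duplicates involves, the following sketch keeps the first occurrence of each text and skips later texts that are too similar to one already kept. It assumes a greedy strategy and uses `difflib.SequenceMatcher` as a stand-in similarity; `drop_near_duplicates` is a hypothetical helper, not part of text-dedup, and the library may behave differently.

```python
from difflib import SequenceMatcher
from typing import List


def drop_near_duplicates(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Greedily keep texts that are not near duplicates of an earlier kept text."""
    kept: List[str] = []
    for text in texts:
        # A text is a near duplicate if it is similar enough to any kept text.
        is_duplicate = any(
            SequenceMatcher(None, text, seen).ratio() >= threshold
            for seen in kept
        )
        if not is_duplicate:
            kept.append(text)
    return kept


texts = [
    "hello world",
    "hello world.",
    "something else entirely",
]
deduped = drop_near_duplicates(texts)
print(deduped)
```

Because the pass is greedy, the result depends on input order: the earliest member of each cluster of similar texts is the one that survives.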
Installation
pip install text-dedup