Generative deduplication
Generative Deduplication (Findings of EMNLP 2024)
Revisiting data deduplication, we propose a fresh paradigm for semantic deduplication.
Core Idea:
- Generative language models possess powerful language understanding capabilities. We use them for semantic deduplication.
- There are two crucial stages in generative deduplication:
  - Memory stage: The model learns the relationship between context and corresponding keywords. Semantically duplicate contexts establish stronger connections than non-duplicate ones in one-epoch training. $$g(y|context)$$
  - Inference stage: During inference, we use the trained generative model to generate keywords from the given context. If the generated keywords match the target keywords, we classify the data as duplicate. $$g(context) == y?$$
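The inference-stage decision rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the gen_dedup implementation: `predict_keywords` is a hypothetical stand-in for the trained generative model, `toy_model` is a lookup table rather than a real model, and comparing by exact match after lowercasing and whitespace normalization is an assumption.

```python
from typing import Callable, Dict, List, Tuple


def normalize(keywords: str) -> str:
    """Lowercase and collapse whitespace so the comparison ignores formatting."""
    return " ".join(keywords.lower().split())


def split_duplicates(
    records: List[Dict[str, str]],
    predict_keywords: Callable[[str], str],
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Apply the rule g(context) == y to every record.

    `predict_keywords` is a hypothetical stand-in for the trained model.
    Returns (duplicates, non_duplicates).
    """
    duplicates, non_duplicates = [], []
    for record in records:
        generated = predict_keywords(record["sentence"])
        if normalize(generated) == normalize(record["labels"]):
            duplicates.append(record)
        else:
            non_duplicates.append(record)
    return duplicates, non_duplicates


# Toy stand-in model: it has "memorized" only one context-to-keywords mapping,
# mimicking a connection strengthened by repeated (duplicate) contexts.
memorized = {"free tickets click here": "free tickets"}


def toy_model(context: str) -> str:
    return memorized.get(context, "")


records = [
    {"sentence": "free tickets click here", "labels": "free tickets"},
    {"sentence": "lovely weather in lisbon today", "labels": "weather lisbon"},
]
dups, uniques = split_duplicates(records, toy_model)
print(len(dups), len(uniques))
```

In this sketch, a record counts as a duplicate only when the model reproduces its target keywords; a model trained for a single epoch is expected to manage that mainly for contexts it has effectively seen more than once.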
Installation
python -m pip install gen-dedup
Usage
from datasets import load_dataset
from keybert import KeyBERT
from gen_dedup import GenDedup
# 1. Load dataset
ds = load_dataset('cardiffnlp/tweet_eval', 'hate', split='train')
ds = ds.select_columns(['text'])
ds = ds.rename_column('text', 'sentence')
# 2. Generate keywords with KeyBERT. Other keyword extraction models can also be used.
keybert = KeyBERT()
# Here, we generate two keywords.
max_label_words = 2
ds = ds.map(lambda x: {
    'labels': " ".join([k[0] for k in keybert.extract_keywords(x['sentence'].lower())[:max_label_words]]),
    'sentence': x['sentence'].lower()})
# 3. Fit the generative model to learn g(y|X)
gd = GenDedup('google/flan-t5-small')
gd.fit(ds, output_dir='./hate-dedup')
# 4. Inference as Deduplication. Check whether g(X) = y
gd.deduplicate('./hate-dedup', max_label_words=max_label_words)
The trained model, duplicate data, and non-duplicate data will be saved in the ./hate-dedup directory.
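Step 2 notes that other keyword extractors can be used in place of KeyBERT. As a rough sketch of what a substitute might look like, the function below builds the `labels` string from the longest non-stopword tokens. The name `simple_keywords`, the tiny stopword list, and the length-based ranking are illustrative assumptions and not part of gen_dedup or KeyBERT; how well such labels work for deduplication is untested here.

```python
import re

# Illustrative stopword list (an assumption; far from complete).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "this", "to"}


def simple_keywords(sentence: str, max_label_words: int = 2) -> str:
    """Return up to `max_label_words` keywords joined by spaces.

    Ranking heuristic (assumed for illustration): longer tokens first,
    ties broken by first appearance in the sentence.
    """
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    # Drop stopwords and repeated tokens while preserving order.
    candidates = list(dict.fromkeys(t for t in tokens if t not in STOPWORDS))
    # sorted() is stable, so equal-length tokens keep their original order.
    ranked = sorted(candidates, key=len, reverse=True)
    return " ".join(ranked[:max_label_words])


example = {"sentence": "The weather in Lisbon is lovely"}
example["labels"] = simple_keywords(example["sentence"])
print(example["labels"])
```

With the dataset from the usage example, such a function could replace the KeyBERT call inside `ds.map`, for instance `ds.map(lambda x: {'labels': simple_keywords(x['sentence'], max_label_words)})`.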
Citation
@article{li2024generative,
title={Generative Deduplication For Socia Media Data Selection},
author={Li, Xianming and Li, Jing},
journal={arXiv preprint arXiv:2401.05883},
year={2024}
}