Generative deduplication
Generative Deduplication (Findings of EMNLP 2024)
Revisiting data deduplication, we propose a fresh paradigm for semantic deduplication.
Core Idea:
- Generative language models possess powerful language understanding capabilities. We use them for semantic deduplication.
- Generative deduplication has two crucial stages:
  - Memory stage: The model learns the relationship between each context and its corresponding keywords $y$, i.e., $$g(y|context)$$. After one epoch of training, semantically duplicate contexts establish stronger connections to their keywords than non-duplicate ones.
  - Inference stage: The trained generative model generates keywords from the given context. If the generated keywords match the target keywords, the data is classified as duplicate, i.e., we check $$g(context) == y?$$
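The inference-stage check can be illustrated with a minimal sketch. Everything here is hypothetical and is not the gen_dedup implementation: `split_by_match` is an illustrative helper, and `make_toy_generator` is a toy stand-in for the trained model that only "remembers" a context whose exact (context, keywords) pair occurs at least twice, which loosely mimics the stronger connections formed by duplicates. A real generative model would also cover semantically similar, non-identical contexts.

```python
from collections import Counter
from typing import Callable, List, Tuple


def normalize(keywords: str) -> str:
    """Lowercase and collapse whitespace so the comparison ignores formatting."""
    return " ".join(keywords.lower().split())


def split_by_match(
    records: List[Tuple[str, str]],
    generate: Callable[[str], str],
) -> Tuple[List[str], List[str]]:
    """Check g(context) == y for every (context, y) pair.

    Contexts whose generated keywords match the target are flagged as duplicates.
    """
    duplicates, non_duplicates = [], []
    for context, target in records:
        if normalize(generate(context)) == normalize(target):
            duplicates.append(context)
        else:
            non_duplicates.append(context)
    return duplicates, non_duplicates


def make_toy_generator(
    records: List[Tuple[str, str]], min_count: int = 2
) -> Callable[[str], str]:
    """Toy stand-in for the memory stage (an assumption, not a language model).

    It recalls the keywords of a context only if the exact pair was seen
    at least `min_count` times; otherwise it returns an empty string.
    """
    counts = Counter((context, normalize(y)) for context, y in records)
    memory = {context: y for (context, y), n in counts.items() if n >= min_count}
    return lambda context: memory.get(context, "")


records = [
    ("good morning world", "morning world"),
    ("good morning world", "morning world"),
    ("rain again today", "rain today"),
]
generate = make_toy_generator(records)
duplicates, non_duplicates = split_by_match(records, generate)
print(duplicates)
print(non_duplicates)
```

The sketch only shows the matching rule; which of the flagged items to keep or drop afterwards is a separate decision that this example does not cover.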
Installation
python -m pip install gen-dedup
Usage
from datasets import load_dataset
from keybert import KeyBERT
from gen_dedup import GenDedup
# 1. Load dataset
ds = load_dataset('cardiffnlp/tweet_eval', 'hate', split='train')
ds = ds.select_columns(['text'])
ds = ds.rename_column('text', 'sentence')
# 2. Generate keywords with KeyBERT. Other keyword extraction models can also be used.
keybert = KeyBERT()
# Here, we generate two keywords.
max_label_words = 2
ds = ds.map(lambda x: {
    'labels': " ".join([k[0] for k in keybert.extract_keywords(x['sentence'].lower())[:max_label_words]]),
    'sentence': x['sentence'].lower()})
# 3. Fit the generative model to learn g(y|X)
gd = GenDedup('google/flan-t5-small')
gd.fit(ds, output_dir='./hate-dedup')
# 4. Inference as Deduplication. Check whether g(X) = y
gd.deduplicate('./hate-dedup', max_label_words=max_label_words)
The trained model, duplicate data, and non-duplicate data will be saved in the ./hate-dedup directory.
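To make step 2 easier to follow, here is a small sketch of how a label string is assembled. KeyBERT's `extract_keywords` returns a list of `(keyword, score)` tuples ordered from most to least relevant; the hand-written list below stands in for that output so the example runs without downloading a model. The helper name `build_label` and the scores are made up for illustration.

```python
from typing import List, Tuple


def build_label(scored_keywords: List[Tuple[str, float]], max_label_words: int) -> str:
    """Join the first `max_label_words` keywords into one space-separated label."""
    return " ".join(keyword for keyword, _score in scored_keywords[:max_label_words])


# Hypothetical keyword/score pairs in the same shape KeyBERT produces.
example_keywords = [("hate", 0.61), ("speech", 0.55), ("online", 0.32)]
label = build_label(example_keywords, max_label_words=2)
print(label)
```

With `max_label_words = 2`, as in the usage above, only the two highest-ranked keywords end up in the label that the model is trained to generate.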
Citation
@article{li2024generative,
title={Generative Deduplication For Socia Media Data Selection},
author={Li, Xianming and Li, Jing},
journal={arXiv preprint arXiv:2401.05883},
year={2024}
}