Generative deduplication

Generative Deduplication (Findings of EMNLP 2024)

Revisiting data deduplication, we propose a fresh paradigm for semantic deduplication.

Core Idea:

  • Generative language models possess powerful language understanding capabilities. We use them for semantic deduplication.
  • There are two crucial stages in generative deduplication:
    • Memory stage: The model learns the relationship between context and corresponding keywords. Semantically duplicate contexts establish stronger connections than non-duplicate ones in one-epoch training. $$g(y|context)$$
    • Inference stage: During inference, we use the trained generative model to generate keywords from the given context. If the generated keywords match the target keywords, we classify the data as duplicate. $$g(context) == y?$$
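
The inference-stage check g(context) == y can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the gen_dedup implementation: `split_duplicates`, `normalize`, and the `generate` callable are hypothetical names, and the whitespace- and case-insensitive exact match is an assumption about how "match" is decided.

```python
def normalize(keywords):
    """Lowercase and collapse whitespace (assumed matching rule)."""
    return " ".join(keywords.lower().split())


def split_duplicates(rows, generate):
    """Split rows into (duplicates, non_duplicates) with the g(context) == y test.

    rows:     iterable of dicts with 'sentence' (context) and 'labels' (target keywords y)
    generate: hypothetical callable standing in for the trained generative model;
              it maps a context string to a generated keyword string
    """
    duplicates, non_duplicates = [], []
    for row in rows:
        predicted = normalize(generate(row["sentence"]))
        if predicted == normalize(row["labels"]):
            # The model reproduced the target keywords: treat the row as a duplicate.
            duplicates.append(row)
        else:
            non_duplicates.append(row)
    return duplicates, non_duplicates


# Toy stand-in for a model that memorized one context during one-epoch training.
memory = {"free phone offer click now": "free phone"}
rows = [
    {"sentence": "free phone offer click now", "labels": "free phone"},
    {"sentence": "lovely weather in london today", "labels": "weather london"},
]
duplicates, kept = split_duplicates(rows, lambda s: memory.get(s, ""))
```

In this toy run the first row ends up in `duplicates` because the stand-in model reproduces its keywords, while the second row is kept as non-duplicate.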

Installation

python -m pip install gen-dedup

Usage

from datasets import load_dataset
from keybert import KeyBERT
from gen_dedup import GenDedup

# 1. Load dataset
ds = load_dataset('cardiffnlp/tweet_eval', 'hate', split='train')
ds = ds.select_columns(['text'])
ds = ds.rename_column('text', 'sentence')

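# 2. Generate keywords with KeyBERT. Other keyword extraction models can also be used.
keybert = KeyBERT()
# Here, we generate two keywords.
max_label_words = 2

def add_labels(x):
    # Lowercase once, then reuse the result for both the keywords and the stored sentence.
    sentence = x['sentence'].lower()
    # extract_keywords returns (keyword, score) tuples; keep only the keyword text.
    keywords = keybert.extract_keywords(sentence)[:max_label_words]
    return {'labels': " ".join(k for k, _ in keywords), 'sentence': sentence}

ds = ds.map(add_labels)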

# 3. Fit the generative model to learn g(y|X)
gd = GenDedup('google/flan-t5-small')
gd.fit(ds, output_dir='./hate-dedup')

# 4. Inference as Deduplication. Check whether g(X) = y
gd.deduplicate('./hate-dedup', max_label_words=max_label_words)

The trained model, duplicate data, and non-duplicate data will be saved in the ./hate-dedup directory.
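
Step 2 notes that other keyword extraction models can also be used. As a rough sketch of what a drop-in replacement might look like, the helper below builds the same space-joined label string from simple word frequency. The function name `simple_keywords` and the tiny `STOPWORDS` set are hypothetical and chosen for illustration; this is not part of gen_dedup or KeyBERT, and a frequency heuristic will usually pick weaker keywords than an embedding-based extractor.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and",
             "in", "on", "for", "it", "this", "that"}


def simple_keywords(sentence, max_label_words=2):
    """Return up to max_label_words keywords joined by spaces.

    Words are ranked by frequency; ties are broken by first appearance,
    so the result is deterministic.
    """
    tokens = [t for t in re.findall(r"[a-z0-9']+", sentence.lower())
              if t not in STOPWORDS]
    counts = Counter(tokens)
    first_seen = {}
    for index, token in enumerate(tokens):
        first_seen.setdefault(token, index)
    ranked = sorted(counts, key=lambda t: (-counts[t], first_seen[t]))
    return " ".join(ranked[:max_label_words])
```

To try it in the usage example above, the 'labels' field in the `ds.map` call would be set to `simple_keywords(x['sentence'], max_label_words)` in place of the KeyBERT call.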

Citation

@article{li2024generative,
  title={Generative Deduplication For Socia Media Data Selection},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2401.05883},
  year={2024}
}
