Generative deduplication

Generative Deduplication (Findings of EMNLP 2024)

Revisiting data deduplication, we propose a fresh paradigm for semantic deduplication.

Core Idea:

  • Generative language models possess powerful language understanding capabilities. We use them for semantic deduplication.
  • There are two crucial stages in generative deduplication:
    • Memory stage: The model learns the relationship between context and corresponding keywords. Semantically duplicate contexts establish stronger connections than non-duplicate ones in one-epoch training. $$g(y|context)$$
    • Inference stage: During inference, we use the trained generative model to generate keywords from the given context. If the generated keywords match the target keywords, we classify the data as duplicate. $$g(context) == y?$$
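
The inference-stage check can be illustrated with a toy sketch. It does not use gen-dedup or a real language model: `toy_generate` is a hypothetical stand-in for the trained model g, and `normalize` and `is_duplicate` are illustrative helper names that are not part of the package. The sketch only shows the decision rule g(context) == y, assuming keywords are compared as normalized strings.

```python
from typing import Callable, Dict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so the comparison ignores formatting."""
    return " ".join(text.lower().split())


def is_duplicate(generate: Callable[[str], str], context: str, target: str) -> bool:
    """Decision rule g(context) == y: flag a sample when the model reproduces its keywords."""
    return normalize(generate(context)) == normalize(target)


# Hypothetical stand-in for a model after the memory stage: it only "remembers"
# the context-to-keywords mapping it saw often enough to reproduce.
MEMORIZED: Dict[str, str] = {
    "free tickets click here now": "free tickets",
}


def toy_generate(context: str) -> str:
    """Return the memorized keywords for a context, or an empty string."""
    return MEMORIZED.get(normalize(context), "")


samples = [
    {"sentence": "Free tickets  click here now", "labels": "free tickets"},
    {"sentence": "lovely weather in lisbon today", "labels": "weather lisbon"},
]
flags = [is_duplicate(toy_generate, s["sentence"], s["labels"]) for s in samples]
```

In this toy run the first sample is flagged as a duplicate because the stand-in model reproduces its keywords, while the second is kept.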

Installation

python -m pip install gen-dedup

Usage

from datasets import load_dataset
from keybert import KeyBERT
from gen_dedup import GenDedup

# 1. Load dataset
ds = load_dataset('cardiffnlp/tweet_eval', 'hate', split='train')
ds = ds.select_columns(['text'])
ds = ds.rename_column('text', 'sentence')

# 2. Generate keywords with KeyBERT. Other keyword extraction models can also be used.
keybert = KeyBERT()
# Here, we generate two keywords.
max_label_words = 2
ds = ds.map(lambda x: {
    'labels': " ".join([k[0] for k in keybert.extract_keywords(x['sentence'].lower())[:max_label_words]]),
    'sentence': x['sentence'].lower()})

# 3. Fit the generative model to learn g(y|X)
gd = GenDedup('google/flan-t5-small')
gd.fit(ds, output_dir='./hate-dedup')

# 4. Inference as Deduplication. Check whether g(X) = y
gd.deduplicate('./hate-dedup', max_label_words=max_label_words)

The trained model, duplicate data, and non-duplicate data will be saved in the ./hate-dedup directory.
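
To make the label format from step 2 concrete, here is a minimal sketch of how the first `max_label_words` keywords are joined into one label string. It assumes the input is a list of (keyword, score) pairs, which is the shape the usage code indexes with `k[0]`. `build_label` is a hypothetical helper name, and the example pairs and scores are hand-written for illustration, not real KeyBERT output.

```python
from typing import List, Tuple


def build_label(scored_keywords: List[Tuple[str, float]], max_label_words: int = 2) -> str:
    """Join the first max_label_words keywords into one space-separated label."""
    return " ".join(keyword for keyword, _score in scored_keywords[:max_label_words])


# Hand-written pairs in (keyword, score) form; the scores are made up.
example = [("deduplication", 0.71), ("semantic", 0.55), ("tweets", 0.32)]
label = build_label(example, max_label_words=2)
```

The resulting string is what the model is trained to generate for that sentence, so `max_label_words` should be the same value passed to `deduplicate`.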

Citation

@article{li2024generative,
  title={Generative Deduplication For Socia Media Data Selection},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2401.05883},
  year={2024}
}
