Generative deduplication

Generative Deduplication (Findings of EMNLP 2024)

Revisiting data deduplication, we propose a fresh paradigm for semantic deduplication.

Core Idea:

  • Generative language models possess powerful language understanding capabilities. We use them for semantic deduplication.
  • There are two crucial stages in generative deduplication:
    • Memory stage: The model learns the relationship between context and corresponding keywords. Semantically duplicate contexts establish stronger connections than non-duplicate ones in one-epoch training. $$g(y|context)$$
    • Inference stage: During inference, we use the trained generative model to generate keywords from the given context. If the generated keywords match the target keywords, we classify the data as duplicate. $$g(context) == y?$$
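
The decision rule of the inference stage can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's implementation: `split_by_match`, `normalize`, and `toy_generate` are hypothetical names, and the dictionary lookup stands in for a generative model that has already been trained for one epoch.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def normalize(keywords: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(keywords.lower().split())


def split_by_match(
    records: List[Record], generate: Callable[[str], str]
) -> Tuple[List[Record], List[Record]]:
    """Apply the rule g(context) == y: matching records are treated as duplicates."""
    duplicates, non_duplicates = [], []
    for record in records:
        predicted = normalize(generate(record["sentence"]))
        target = normalize(record["labels"])
        if predicted == target:
            duplicates.append(record)
        else:
            non_duplicates.append(record)
    return duplicates, non_duplicates


# Hypothetical stand-in for a trained model: it can only reproduce keywords
# for contexts it has "memorized" strongly enough.
memorized = {"free tickets click here": "free tickets"}


def toy_generate(context: str) -> str:
    return memorized.get(context, "")


records = [
    {"sentence": "free tickets click here", "labels": "free tickets"},
    {"sentence": "lovely weather in leeds today", "labels": "weather leeds"},
]
duplicates, non_duplicates = split_by_match(records, toy_generate)
```

Here the first record is flagged as a duplicate because the stand-in model reproduces its keywords, while the second is kept because nothing was generated for it.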

Installation

python -m pip install gen-dedup

Usage

from datasets import load_dataset
from keybert import KeyBERT
from gen_dedup import GenDedup

# 1. Load dataset
ds = load_dataset('cardiffnlp/tweet_eval', 'hate', split='train')
ds = ds.select_columns(['text'])
ds = ds.rename_column('text', 'sentence')

# 2. Generate keywords with KeyBERT. Other keyword extraction models can also be used.
keybert = KeyBERT()
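# Here, we generate two keywords.
max_label_words = 2

def add_labels(x):
    # Lowercase once, then join the top keywords into a single label string.
    sentence = x['sentence'].lower()
    keywords = keybert.extract_keywords(sentence)[:max_label_words]
    return {'labels': " ".join(k[0] for k in keywords), 'sentence': sentence}

ds = ds.map(add_labels)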

# 3. Fit the generative model to learn g(y|X)
gd = GenDedup('google/flan-t5-small')
gd.fit(ds, output_dir='./hate-dedup')

# 4. Inference as Deduplication. Check whether g(X) = y
gd.deduplicate('./hate-dedup', max_label_words=max_label_words)

The trained model, duplicate data, and non-duplicate data will be saved in the ./hate-dedup directory.
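
Step 2 notes that other keyword extractors can be used in place of KeyBERT. As a rough sketch of what such a replacement might look like, the function below builds the same kind of space-joined label string from simple word frequencies. The name `simple_keywords`, the stopword list, and the ranking rule are all assumptions made for illustration; they are not part of gen-dedup or KeyBERT.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list; a real one would be larger.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
    "for", "on", "with", "this", "that",
}


def simple_keywords(sentence: str, max_label_words: int = 2) -> str:
    """Return up to max_label_words keywords joined by spaces.

    Words are ranked by frequency, with ties broken by first occurrence.
    """
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    candidates = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
    counts = Counter(candidates)
    first_pos = {}
    for i, token in enumerate(candidates):
        first_pos.setdefault(token, i)
    ranked = sorted(counts, key=lambda t: (-counts[t], first_pos[t]))
    return " ".join(ranked[:max_label_words])


label = simple_keywords("The cat chased the other cat and a dog")
```

With the `datasets` API used above, such a function could be applied as `ds.map(lambda x: {'labels': simple_keywords(x['sentence'])})`, as long as the same `max_label_words` value is later passed to `deduplicate`.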

Citation

@article{li2024generative,
  title={Generative Deduplication For Social Media Data Selection},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2401.05883},
  year={2024}
}
