PolyDeDupe: Multi-Lingual Data Deduplication

PolyDeDupe is a Python package for deduplicating text data across multiple languages. It supports over 100 languages and performs both syntactic and semantic deduplication, which helps clean datasets before they are used in NLP tasks.

Installation

PolyDeDupe can be installed using pip:

pip install polydedupe

Usage

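from datasets import load_dataset
from PolyDeDupe import deduplicate_dataset, display_dataset_entries

# build_dataset is not defined in the original snippet; this pass-through is a
# hypothetical placeholder for your own per-record preprocessing.
def build_dataset(example):
    return example

dataset = load_dataset("tatsu-lab/alpaca", split="train")
newdataset = dataset.map(build_dataset, num_proc=16, remove_columns=dataset.column_names)
# Cluster near-duplicates at a Jaccard threshold of 0.90, then print each cluster.
ds_dedup, duplicate_clusters = deduplicate_dataset(newdataset, jaccard_threshold=0.90)
display_dataset_entries(newdataset, duplicate_clusters)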
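
The jaccard_threshold=0.90 argument suggests that near-duplicates are detected by Jaccard similarity between records. The sketch below illustrates only that measure, computed on whitespace-separated tokens. The jaccard_similarity helper, the tokenization, and the sample sentences are assumptions made for illustration; they are not PolyDeDupe's internals, which this description does not specify.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase whitespace-token sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    # Size of the intersection divided by size of the union.
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical sample texts, not taken from the Alpaca dataset.
score = jaccard_similarity("the cat sat on the mat", "the cat sat on a mat")
threshold = 0.90  # same value as jaccard_threshold in the usage example

print(f"similarity = {score:.3f}")
print("near-duplicate" if score >= threshold else "distinct")
```

Under this measure, a higher threshold keeps more records because only very similar pairs are grouped into a cluster, while a lower threshold removes more aggressively.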

Output:

Original dataset size: 52002
Number of duplicate clusters: 40
Files in duplicate cluster: 82
Unique files in duplicate cluster: 52
Filtered dataset size: 51972
Cluster:
Base Index: 1482, Data: {'instruction': 'Find the five largest cities in France.', 'input': '', 'output': 'The five largest cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nFind the five largest cities in France.\n\n### Response:\nThe five largest cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.'}
Base Index: 1820, Data: {'instruction': 'Name five cities in France.', 'input': '', 'output': 'The five cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nName five cities in France.\n\n### Response:\nThe five cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.'}
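
The counts in the output above can be cross-checked with simple arithmetic. The reading used here is an assumption, not documented behavior: 82 records fall into duplicate clusters, 52 of those are kept as representatives, and the remaining 30 are dropped from the dataset.

```python
# Values copied from the output above.
original_size = 52002
files_in_clusters = 82
unique_files_in_clusters = 52
reported_filtered_size = 51972

# Assumed relationship: records removed = clustered records minus kept records.
removed = files_in_clusters - unique_files_in_clusters
filtered_size = original_size - removed

print(f"removed = {removed}, filtered size = {filtered_size}")
assert filtered_size == reported_filtered_size
```

The computed size matches the reported "Filtered dataset size", which is consistent with this reading of the counts.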

Citation

@misc{Bhatia,
    title={PolyDeDupe: Multi-Lingual Data Deduplication},
    url={https://github.com/gagan3012/PolyDeDupe},
    journal={GitHub},
    publisher={gagan3012},
    author={Bhatia, Gagan}
}

Download files

Download the file for your platform.

Source Distribution

PolyDeDupe-0.8.0.tar.gz (17.5 kB)

Built Distribution

PolyDeDupe-0.8.0-py3-none-any.whl (15.4 kB, Python 3)
