PolyDeDupe: Multi-Lingual Data Deduplication
PolyDeDupe is a Python package for deduplicating text datasets across multiple languages. It supports over 100 languages and performs both syntactic and semantic deduplication as a preprocessing step for NLP tasks.
Installation
PolyDeDupe can be installed using pip:
pip install polydedupe
Usage
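The example in this section passes a `build_dataset` function to `dataset.map`, but that function is neither imported nor defined. A minimal sketch of what it might look like follows. It is a hypothetical helper, not part of PolyDeDupe's documented API: it assumes each record should keep the `instruction`, `input`, `output`, and `text` fields seen in the sample output, and that a missing `text` is built from an Alpaca-style prompt. The with-input template is likewise an assumption.

```python
# Hypothetical helper: the name matches the usage example, but the body is an
# assumption about what the record-building step does.
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n"
    "### Response:\n{output}"
)


def build_dataset(example: dict) -> dict:
    """Return one record with instruction, input, output, and text fields."""
    instruction = example["instruction"]
    input_text = example.get("input", "")
    output = example["output"]

    # Reuse an existing text column if the record already has one.
    text = example.get("text")
    if not text:
        template = PROMPT_WITH_INPUT if input_text else PROMPT_NO_INPUT
        text = template.format(
            instruction=instruction, input=input_text, output=output
        )

    return {
        "instruction": instruction,
        "input": input_text,
        "output": output,
        "text": text,
    }
```

Because the function returns every column it wants to keep, it stays compatible with `remove_columns=dataset.column_names` in the `map` call.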
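from datasets import load_dataset
from PolyDeDupe import deduplicate_dataset, display_dataset_entries
# Load the training split of the Alpaca dataset.
dataset = load_dataset("tatsu-lab/alpaca", split="train")
# build_dataset is not defined in this snippet; supply your own per-record function before running.
newdataset = dataset.map(build_dataset, num_proc=16, remove_columns=dataset.column_names)
# Cluster near-duplicate records at a Jaccard threshold of 0.90; ds_dedup is the filtered dataset.
ds_dedup, duplicate_clusters = deduplicate_dataset(newdataset, jaccard_threshold=0.90)
# Print the records that belong to each duplicate cluster.
display_dataset_entries(newdataset, duplicate_clusters)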
Output:
Original dataset size: 52002
Number of duplicate clusters: 40
Files in duplicate cluster: 82
Unique files in duplicate cluster: 52
Filtered dataset size: 51972
Cluster:
Base Index: 1482, Data: {'instruction': 'Find the five largest cities in France.', 'input': '', 'output': 'The five largest cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nFind the five largest cities in France.\n\n### Response:\nThe five largest cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.'}
Base Index: 1820, Data: {'instruction': 'Name five cities in France.', 'input': '', 'output': 'The five cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nName five cities in France.\n\n### Response:\nThe five cities in France are Paris, Marseille, Lyon, Toulouse, and Nice.'}
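The `jaccard_threshold` argument suggests that records are compared by Jaccard similarity, the size of the intersection of two token sets divided by the size of their union. The sketch below illustrates the measure with exact set arithmetic over lowercase word tokens. How PolyDeDupe tokenizes text, and whether it approximates the score (for example with MinHash), is not stated here, so the helper names `tokens`, `jaccard`, and `is_near_duplicate` are illustrative and the scores need not match the library's.

```python
import re


def tokens(text: str) -> set[str]:
    """Split text into a set of lowercase word tokens (Unicode-aware)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    if not union:
        # Two empty texts are treated as identical.
        return 1.0
    return len(ta & tb) / len(union)


def is_near_duplicate(a: str, b: str, threshold: float = 0.90) -> bool:
    """Flag a pair whose similarity meets or exceeds the threshold."""
    return jaccard(a, b) >= threshold


first = "Find the five largest cities in France."
second = "Name five cities in France."
print(jaccard(first, second))  # 0.5 (4 shared tokens out of 8 distinct)
```

The two instructions alone share only half of their tokens. Comparing the full `text` field gives a much higher score, because both records share the same prompt preamble and nearly the same response.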
Citation:
@misc{Bhatia,
title={PolyDeDupe: Multi-Lingual Data Deduplication},
url={https://github.com/gagan3012/PolyDeDupe},
journal={GitHub},
publisher={gagan3012},
author={Bhatia, Gagan}
}
Source Distribution
PolyDeDupe-0.8.0.tar.gz (17.5 kB)
Built Distribution
PolyDeDupe-0.8.0-py3-none-any.whl (15.4 kB)
File details
Details for the file PolyDeDupe-0.8.0.tar.gz.
File metadata
- Download URL: PolyDeDupe-0.8.0.tar.gz
- Upload date:
- Size: 17.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0b3dc7b6dcea4eaf23eeb9a9e99439882c3f1ff8dc623f7f76bdc7645dcc6867
MD5 | f96ee019fb1631207cf87b22f2269663
BLAKE2b-256 | f594be273354ad5535167e265617d6a01f75f2a051efdaa62ca59c6d4fc2a6a7
File details
Details for the file PolyDeDupe-0.8.0-py3-none-any.whl.
File metadata
- Download URL: PolyDeDupe-0.8.0-py3-none-any.whl
- Upload date:
- Size: 15.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.13
File hashes
Algorithm | Hash digest
---|---
SHA256 | 59c81d91805e273b49dc48399f3faf356c9cf6c2467001e5fecb9a2a64775fa9
MD5 | fb07de861229152b25e6570ed9b7d09d
BLAKE2b-256 | eb7cf03f0d74cfc93fa2f7817c56a0e9ed141b7ad19b98215453823c197fe65b