Tools for semantic cleaning of a test dataset
semantic-cleaning
Install
pip install semantic-cleaning
How to use
from datasets import load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModel
from semantic_cleaning import preprocess_data, compute_embeddings, deduplicate_embeddings, deduplicate_dataset
Preprocess a dataset to merge each row into a single sentence, e.g. a question and its answer or a comment and its response:
data = load_dataset("0-hero/OIG-small-chip2")
_ = preprocess_data(data,schema = ":{user} :{chip2}")
data['train']['_merged'][0]
2
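To make the role of the schema concrete, here is a minimal sketch of how a template with named placeholders could be turned into one merged string per row. It relies on plain str.format and is not the library's implementation; the helper name merge_row and the sample row are hypothetical.

```python
def merge_row(row: dict, schema: str) -> str:
    """Fill the schema's {column} placeholders with the values of one row."""
    return schema.format(**row)

# Hypothetical row holding the two columns referenced by the schema above.
row = {"user": "What is the capital of France?", "chip2": "Paris."}
merged = merge_row(row, ":{user} :{chip2}")
print(merged)
```

Each placeholder name must match a column of the dataset, otherwise str.format raises a KeyError.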
Compute the embeddings for the sentences:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda')
embedding = compute_embeddings(data = data, embedding_model = model, tokenizer = tokenizer, batch_size = 64, num_workers = 16, dataset_feature = '_merged')
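A sentence embedding is commonly derived from a transformer by averaging the token embeddings under the attention mask (mean pooling). Whether compute_embeddings pools this way is an assumption; the sketch below only illustrates the idea, with NumPy arrays standing in for model output, and mean_pool is a hypothetical helper.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions.

    token_embeddings has shape (batch, seq_len, dim);
    attention_mask has shape (batch, seq_len) with 1 for real tokens, 0 for padding.
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts

# Toy batch: 2 sentences, 3 token slots, 2 dimensions.
# The last slot of the first sentence is padding and must not affect the mean.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
                   [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]])
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = mean_pool(tokens, mask)
print(pooled)
```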
We can get the indices of all the duplicated lines with the following command:
to_delete = deduplicate_embeddings(embedded = embedding, epsilon = 1e-2, batch_size = 20000)
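The effect of epsilon and batch_size can be sketched with NumPy. The sketch assumes that epsilon is a cosine-distance tolerance, so two rows count as duplicates when their cosine similarity exceeds 1 - epsilon, and that the first occurrence is the one kept. Both points are assumptions, and find_duplicates is a hypothetical helper rather than the library's function.

```python
import numpy as np

def find_duplicates(embedded, epsilon=1e-2, batch_size=20000):
    """Return sorted indices of rows that are near-duplicates of an earlier row."""
    embedded = np.asarray(embedded, dtype=float)
    norms = np.linalg.norm(embedded, axis=1, keepdims=True)
    unit = embedded / np.clip(norms, 1e-12, None)  # unit vectors: dot product = cosine
    to_delete = set()
    # Compare one batch of rows against all rows at a time to bound memory use.
    for start in range(0, len(unit), batch_size):
        sim = unit[start:start + batch_size] @ unit.T
        rows, cols = np.nonzero(sim > 1.0 - epsilon)
        rows = rows + start
        # Flag a row only when its match has a smaller index (keep first occurrence).
        to_delete.update(rows[cols < rows].tolist())
    return sorted(to_delete)

# Rows 2 and 3 point in (almost) the same direction as rows 0 and 1.
emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.001], [0.0, 2.0]])
dupes = find_duplicates(emb, epsilon=1e-2, batch_size=2)
print(dupes)
```

The similarity matrix for one batch has batch_size × n entries, so a smaller batch_size trades speed for lower peak memory.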
You can also find duplicates between two datasets or splits, where embedding2 holds the embeddings of the second one:
to_delete = deduplicate_embeddings(embedded = embedding, embedded2 = embedding2, epsilon = 1e-2, batch_size = 20000)
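Comparing two sets follows the same idea, except that each row of the first set is checked against every row of the second. The sketch below assumes the returned indices refer to the first set and that epsilon is a cosine-distance tolerance; find_cross_duplicates is a hypothetical helper, not the library's function.

```python
import numpy as np

def find_cross_duplicates(embedded, embedded2, epsilon=1e-2, batch_size=20000):
    """Return indices of rows in `embedded` with a near-duplicate in `embedded2`."""
    def unit(x):
        x = np.asarray(x, dtype=float)
        return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

    a, b = unit(embedded), unit(embedded2)
    flagged = []
    for start in range(0, len(a), batch_size):
        sim = a[start:start + batch_size] @ b.T          # cosine similarities
        hits = np.nonzero((sim > 1.0 - epsilon).any(axis=1))[0]
        flagged.extend((hits + start).tolist())
    return flagged

# Toy example: only the first "train" row matches the single "test" row.
train_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
test_emb = np.array([[2.0, 0.0]])
leaked = find_cross_duplicates(train_emb, test_emb, epsilon=1e-2, batch_size=2)
print(leaked)
```

A typical use is removing training rows that also appear in an evaluation split.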
The full process can be run like this:
deduplicated = deduplicate_dataset(
    dataset = data['train'],
    model = model,
    tokenizer = tokenizer,
    epsilon = 1e-2,
    model_batch_size = 64,
    deduplication_batch_size = 20000,
    num_workers = 16,
    dataset_feature = '_merged'
)
print(f"cleaned: {(1 - len(deduplicated) / len(data['train'])) * 100:.2f}%")
The deduplicated dataset can then be pushed back to the Hub or saved to a local drive.
Command-Line Interface
The semantic cleaning module also includes a command-line interface that can be used to deduplicate datasets:
python semantic-cleaning.py \
--model_path "sentence-transformers/all-mpnet-base-v2" \
--tokenizer_path "sentence-transformers/all-mpnet-base-v2" \
--dataset_path "0-hero/OIG-small-chip2" \
--output_path "./deduplicated_imdb"
The following arguments are available:
- --dataset_path: Path to the dataset to be deduplicated.
- --model_path: The model checkpoint for embeddings. Should be a path or model id in the HuggingFace model hub.
- --tokenizer_path: The tokenizer to be used.
- --epsilon: Threshold for cosine similarity to consider embeddings as duplicates.
- --model_batch_size: Batch size for the model.
- --deduplication_batch_size: Batch size for the deduplication process.
- --num_workers: Number of worker processes for data loading.
- --dataset_feature: Feature in the dataset to be used for deduplication.
- --output_path: Path to save the deduplicated dataset. Can be a local path or a HuggingFace dataset repository.
- --hub_repo: Repository on the Hugging Face hub to push the dataset.
- --hub_token: HuggingFace Hub token to push the dataset to the Hub. Required when hub_repo is provided.
- --device: Device to use for computations (e.g., 'cpu', 'cuda', 'cuda:1'). If not provided, it will use CUDA if available, otherwise CPU.
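For orientation, the options above could be declared with argparse roughly as follows. This is an illustrative sketch, not the script's actual parser: which flags are required and every default value are assumptions (the numeric defaults simply mirror the Python example earlier), and resolve_device is a hypothetical helper for the documented device fallback.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Declare the documented flags; required/default choices are assumptions."""
    p = argparse.ArgumentParser(description="Semantic deduplication (sketch).")
    p.add_argument("--dataset_path", required=True)
    p.add_argument("--model_path", required=True)
    p.add_argument("--tokenizer_path", required=True)
    p.add_argument("--epsilon", type=float, default=1e-2)
    p.add_argument("--model_batch_size", type=int, default=64)
    p.add_argument("--deduplication_batch_size", type=int, default=20000)
    p.add_argument("--num_workers", type=int, default=16)
    p.add_argument("--dataset_feature", default="_merged")
    p.add_argument("--output_path", default=None)
    p.add_argument("--hub_repo", default=None)
    p.add_argument("--hub_token", default=None)
    p.add_argument("--device", default=None)
    return p

def resolve_device(requested, cuda_available: bool) -> str:
    """Use the requested device if given, else CUDA when available, else CPU."""
    if requested:
        return requested
    return "cuda" if cuda_available else "cpu"

args = build_parser().parse_args([
    "--dataset_path", "0-hero/OIG-small-chip2",
    "--model_path", "sentence-transformers/all-mpnet-base-v2",
    "--tokenizer_path", "sentence-transformers/all-mpnet-base-v2",
])
# In a real script, cuda_available would come from torch.cuda.is_available().
device = resolve_device(args.device, cuda_available=False)
print(args.epsilon, device)
```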
You can use the --help flag to get a description of all options:
python semantic-cleaning.py --help