Tools for semantic cleaning of a text dataset
Project description
semantic-cleaning
Install
pip install semantic-cleaning
How to use
from datasets import load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModel
from semantic_cleaning import preprocess_data, compute_embeddings, deduplicate_embeddings, deduplicate_dataset
Preprocess the dataset to build a single sentence per record, e.g. a question and answer or a comment and its response:
data = load_dataset("0-hero/OIG-small-chip2")
_ = preprocess_data(data, schema = ":{user} :{chip2}")
data['train']['_merged'][0]
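To get a feel for what this step produces, the short sketch below reproduces the merge by hand with datasets.map. It is only an assumption about the library's behaviour: each {field} placeholder in the schema is filled from the column of the same name, and the result is stored in a new '_merged' column.

def merge_row(row, schema=":{user} :{chip2}"):
    # Assumed behaviour: fill the schema template from the row's columns.
    return {"_merged": schema.format(user=row["user"], chip2=row["chip2"])}

merged_manually = data["train"].map(merge_row)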
Compute the embeddings for the sentences:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda')
embeddings = compute_embeddings(data = data, embedding_model = model, tokenizer = tokenizer, batch_size = 64, num_workers = 16, dataset_feature = '_merged')
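Under the hood this amounts to standard sentence-embedding inference. The snippet below is a minimal sketch of embedding a small batch yourself, assuming the usual mean-pooling recipe for sentence-transformers models; compute_embeddings takes care of batching over the whole dataset for you.

import torch
import torch.nn.functional as F

texts = data['train']['_merged'][:8]
enc = tokenizer(texts, padding=True, truncation=True, return_tensors='pt').to(model.device)
with torch.no_grad():
    out = model(**enc)
# Mean-pool the token embeddings over the attention mask, then L2-normalize
# so that cosine similarity reduces to a dot product.
mask = enc['attention_mask'].unsqueeze(-1).float()
pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
batch_embeddings = F.normalize(pooled, p=2, dim=1)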
We can get the indices of all duplicated lines with the following command:
to_delete = deduplicate_embeddings(embedded = embeddings, epsilon=1e-2, batch_size=20000)
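The result can then be applied with the standard datasets API. A minimal sketch, assuming to_delete is a collection of integer row indices flagged as duplicates:

duplicate_ids = set(int(i) for i in to_delete)
keep = [i for i in range(len(data['train'])) if i not in duplicate_ids]
cleaned_train = data['train'].select(keep)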
You can also find duplicates between two datasets or splits like this:
to_delete = deduplicate_embeddings(embedded = embeddings, embedded2 = embeddings2, epsilon=1e-2, batch_size=20000)
The full process can be run like this:
deduplicated = deduplicate_dataset(
dataset = data['train'],
model = model,
tokenizer = tokenizer,
epsilon = 1e-2,
model_batch_size = 64,
deduplication_batch_size = 20000,
num_workers = 16,
dataset_feature = '_merged'
)
print(f"cleaned: {(1 - len(deduplicated) / len(data['train'])) * 100:.2f}%")
The deduplicated dataset can then be pushed back to the Hub or saved to a local drive.
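For example, with the standard datasets API (the local path and repository id below are placeholders):

deduplicated.save_to_disk('./OIG-small-chip2-deduplicated')
deduplicated.push_to_hub('your-username/OIG-small-chip2-deduplicated')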
Command-Line Interface
The semantic cleaning module also includes a command-line interface that can be used to deduplicate datasets:
python semantic-cleaning.py \
--model_path "sentence-transformers/all-mpnet-base-v2" \
--tokenizer_path "sentence-transformers/all-mpnet-base-v2" \
--dataset_path "0-hero/OIG-small-chip2" \
--output_path "./deduplicated_imdb"
The following arguments are available:
- --dataset_path: Path to the dataset to be deduplicated.
- --model_path: The model checkpoint for embeddings. Should be a path or model id in the Hugging Face model hub.
- --tokenizer_path: The tokenizer to be used.
- --epsilon: Threshold for cosine similarity to consider embeddings as duplicates.
- --model_batch_size: Batch size for the model.
- --deduplication_batch_size: Batch size for the deduplication process.
- --num_workers: Number of worker processes for data loading.
- --dataset_feature: Feature in the dataset to be used for deduplication.
- --output_path: Path to save the deduplicated dataset. Can be a local path or a Hugging Face dataset repository.
- --hub_repo: Repository on the Hugging Face Hub to push the dataset to.
- --hub_token: Hugging Face Hub token used to push the dataset to the Hub. Required when --hub_repo is provided.
- --device: Device to use for computations (e.g., 'cpu', 'cuda', 'cuda:1'). If not provided, CUDA is used when available, otherwise the CPU.
You can use the --help flag to get a description of all options:
python semantic-cleaning.py --help
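For example, a run that pushes the cleaned dataset to the Hub might look like this (all values are placeholders; only the flags listed above are used):
python semantic-cleaning.py \
  --dataset_path "0-hero/OIG-small-chip2" \
  --model_path "sentence-transformers/all-mpnet-base-v2" \
  --tokenizer_path "sentence-transformers/all-mpnet-base-v2" \
  --epsilon 1e-2 \
  --dataset_feature "_merged" \
  --hub_repo "your-username/OIG-small-chip2-deduplicated" \
  --hub_token "hf_..." \
  --device "cuda"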
Download files
Source Distribution: semantic-cleaning-0.0.4.tar.gz
Built Distribution: semantic_cleaning-0.0.4-py3-none-any.whl
File details
Details for the file semantic-cleaning-0.0.4.tar.gz.
File metadata
- Download URL: semantic-cleaning-0.0.4.tar.gz
- Upload date:
- Size: 31.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | e266112729ac68b70b76ea6f4dc787ff1f7c517b6915b2c9a6ae00b6635daf6d
MD5 | b1ed3a655f2dbce3156e99a36129527f
BLAKE2b-256 | 8d30d74cc9ac94647eeb1cf405bffa0e8bd29eb6b92b20f09e1da1224fe5ff12
File details
Details for the file semantic_cleaning-0.0.4-py3-none-any.whl.
File metadata
- Download URL: semantic_cleaning-0.0.4-py3-none-any.whl
- Upload date:
- Size: 11.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.10
File hashes
Algorithm | Hash digest
---|---
SHA256 | 888004d0424d3d3fcaea3be41aef7d7541939152493a52ba66b9f50f8fdeb519
MD5 | e064a479cfce45836b87caa5a5c34293
BLAKE2b-256 | 7b764f2da82efe19375ef8ca324128c50a33137f514704fa9fd392219447748a