
Tools for semantic cleaning of a text dataset


semantic-cleaning


Install

pip install semantic-cleaning

How to use

from datasets import load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModel
from semantic_cleaning import preprocess_data, compute_embeddings, deduplicate_embeddings, deduplicate_dataset

Preprocess the dataset to merge each record (for example a question and its answer, or a comment and its response) into a single sentence:

data = load_dataset("0-hero/OIG-small-chip2")
_ = preprocess_data(data,schema = ":{user} :{chip2}")
data['train']['_merged'][0]

Compute the embeddings for the sentences:

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda')
embedding = compute_embeddings(data=data, embedding_model=model, tokenizer=tokenizer, batch_size=64, num_workers=16, dataset_feature='_merged')

We can get the indices of all duplicated rows with the following command:

to_delete = deduplicate_embeddings(embedded=embedding, epsilon=1e-2, batch_size=20000)
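As a minimal sketch, assuming to_delete holds the integer row indices flagged as duplicates (an assumption about the return format), the flagged rows can be dropped with the datasets library's select method:

# Drop the rows flagged as duplicates (sketch; assumes to_delete contains row indices).
duplicate_indices = set(int(i) for i in to_delete)
keep_indices = [i for i in range(len(data['train'])) if i not in duplicate_indices]
deduplicated_train = data['train'].select(keep_indices)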

You can also find duplicates between two datasets or splits like this:

to_delete = deduplicate_embeddings(embedded=embedding, embedded2=embedding2, epsilon=1e-2, batch_size=20000)

The full process can be run like this:

deduplicated = deduplicate_dataset(
    dataset=data['train'],
    model=model,
    tokenizer=tokenizer,
    epsilon=1e-2,
    model_batch_size=64,
    deduplication_batch_size=20000,
    num_workers=16,
    dataset_feature='_merged'
)
print (f"cleaned:{(1-len(deduplicated)/len(data['train']))*100:.2f}:%")

The deduplicated dataset can then be pushed back to the Hugging Face Hub or saved to a local drive.
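For example, using the standard datasets methods save_to_disk and push_to_hub (the path and repository name below are placeholders):

# Save the deduplicated split locally, or push it to the Hugging Face Hub
# (pushing requires an authenticated session, e.g. via huggingface-cli login).
deduplicated.save_to_disk("./deduplicated-oig-small-chip2")
deduplicated.push_to_hub("your-username/oig-small-chip2-deduplicated")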

Command-Line Interface

The semantic cleaning module also includes a command-line interface that can be used to deduplicate datasets:

python semantic-cleaning.py \
  --model_path "sentence-transformers/all-mpnet-base-v2" \
  --tokenizer_path "sentence-transformers/all-mpnet-base-v2" \
  --dataset_path "0-hero/OIG-small-chip2" \
  --output_path "./deduplicated_imdb"

The following arguments are available:

  • --dataset_path: Path to the dataset to be deduplicated.
  • --model_path: The model checkpoint for embeddings. Should be a path or model id in the HuggingFace model hub.
  • --tokenizer_path: The tokenizer to be used.
  • --epsilon: Threshold for cosine similarity to consider embeddings as duplicates.
  • --model_batch_size: Batch size for the model.
  • --deduplication_batch_size: Batch size for the deduplication process.
  • --num_workers: Number of worker processes for data loading.
  • --dataset_feature: Feature in the dataset to be used for deduplication.
  • --output_path: Path to save the deduplicated dataset. Can be a local path or a HuggingFace dataset repository.
  • --hub_repo: Repository on the Hugging Face Hub to push the dataset to.
  • --hub_token: HuggingFace Hub token to push the dataset to the Hub. Required when --hub_repo is provided.
  • --device: Device to use for computations (e.g., 'cpu', 'cuda', 'cuda:1'). If not provided, CUDA is used if available, otherwise CPU.
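For instance, a run that also pushes the deduplicated dataset to the Hub could combine the flags above like this (the output path, repository name, and token are placeholders):

python semantic-cleaning.py \
  --dataset_path "0-hero/OIG-small-chip2" \
  --model_path "sentence-transformers/all-mpnet-base-v2" \
  --tokenizer_path "sentence-transformers/all-mpnet-base-v2" \
  --epsilon 1e-2 \
  --output_path "./deduplicated-oig-small-chip2" \
  --hub_repo "your-username/oig-small-chip2-deduplicated" \
  --hub_token "<your-hf-token>" \
  --device "cuda"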

You can use the --help flag to get a description of all options:

python semantic-cleaning.py --help
