
Tools for semantic cleaning of a test dataset

Project description

semantic-cleaning


Install

pip install semantic-cleaning

How to use

from datasets import load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModel
from semantic_cleaning import preprocess_data, compute_embeddings, deduplicate_embeddings, deduplicate_dataset

Preprocess the dataset to merge its fields into a single sentence per example, e.g. a question and answer or a comment and response:

data = load_dataset("0-hero/OIG-small-chip2")
_ = preprocess_data(data, schema = ":{user} :{chip2}")
data['train']['_merged'][0]
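
Under the hood, the schema string acts as a format template: each {field} placeholder is filled from the corresponding dataset column and the result is stored in a new _merged column. A minimal sketch of that idea, assuming a plain str.format-style substitution (the helper name merge_row is just for illustration; the library's actual implementation may differ):

def merge_row(row, schema=":{user} :{chip2}"):
    # Fill each {field} placeholder from the row's columns.
    return {"_merged": schema.format(**row)}

# datasets.Dataset.map applies this row by row, adding the new column:
# data["train"] = data["train"].map(merge_row)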

Compute the embeddings for the sentences:

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda')
embedding = compute_embeddings(data = data, embedding_model = model, tokenizer = tokenizer, batch_size = 64, num_workers = 16, dataset_feature = '_merged')
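
compute_embeddings batches the merged texts through the tokenizer and model and returns one vector per example. A rough sketch of the kind of mean-pooled, L2-normalised sentence embedding this typically produces (the helper below is illustrative; the library's exact pooling and normalisation may differ):

import torch
import torch.nn.functional as F

def embed_batch(texts, model, tokenizer, device="cuda"):
    # Tokenize a batch of merged sentences.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**enc)
    # Mean-pool the token embeddings, ignoring padding, then L2-normalise
    # so cosine similarity reduces to a dot product.
    mask = enc["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(pooled, p=2, dim=1)

# vectors = embed_batch(data["train"]["_merged"][:64], model, tokenizer)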

We can get the indices of all the duplicated rows with the following command:

to_delete = deduplicate_embeddings(embedded = embedding, epsilon=1e-2, batch_size=20000)
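
Conceptually, deduplicate_embeddings compares embeddings pairwise and flags a row as a duplicate of an earlier one when their cosine similarity is within epsilon of 1. A simplified, unbatched sketch of that idea (the library processes this in batches of batch_size; this toy version builds the full similarity matrix, so it only suits small inputs):

import torch

def find_duplicate_indices(embeddings, epsilon=1e-2):
    # Assumes rows are L2-normalised, so the matrix product gives cosine similarities.
    sim = embeddings @ embeddings.T
    to_delete = set()
    for i in range(sim.shape[0]):
        if i in to_delete:
            continue
        # Flag later rows whose similarity to row i is within epsilon of 1.
        close = torch.nonzero(sim[i, i + 1:] > 1 - epsilon).flatten() + i + 1
        to_delete.update(close.tolist())
    return to_delete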

You can also find duplicates between two datasets or splits like this:

to_delete = deduplicate_embeddings(embedded = embedding, embedded2 = embedding2, epsilon=1e-2, batch_size=20000)
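
Either way, assuming to_delete holds the integer indices of the rows judged to be duplicates, the flagged rows can be dropped with a standard datasets select:

# Keep only the rows that were not flagged as duplicates.
drop = set(to_delete)
keep = [i for i in range(len(data["train"])) if i not in drop]
cleaned_train = data["train"].select(keep)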

The full process can be run like this:

deduplicated = deduplicate_dataset(
    dataset = data['train'], 
    model = model, 
    tokenizer = tokenizer,
    epsilon = 1e-2, 
    model_batch_size = 64, 
    deduplication_batch_size = 20000, 
    num_workers = 16,
    dataset_feature = '_merged'
)
print(f"cleaned: {(1 - len(deduplicated) / len(data['train'])) * 100:.2f}%")

The deduplicated dataset can then be pushed back to the Hub or saved to a local drive.
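
For example, with the standard datasets API (the local path and repository id below are placeholders):

# Save to a local directory ...
deduplicated.save_to_disk("./oig-small-chip2-deduplicated")

# ... or push to the Hugging Face Hub (requires an authenticated token).
deduplicated.push_to_hub("your-username/oig-small-chip2-deduplicated")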

Command-Line Interface

The semantic cleaning module also includes a command-line interface that can be used to deduplicate datasets:

python semantic-cleaning.py \
  --model_path "sentence-transformers/all-mpnet-base-v2" \
  --tokenizer_path "sentence-transformers/all-mpnet-base-v2" \
  --dataset_path "0-hero/OIG-small-chip2" \
  --output_path "./deduplicated_imdb"

The following arguments are available:

  • --dataset_path: Path to the dataset to be deduplicated.
  • --model_path: The model checkpoint for embeddings. Should be a path or model id in the HuggingFace model hub.
  • --tokenizer_path: The tokenizer to be used.
  • --epsilon: Threshold for cosine similarity to consider embeddings as duplicates.
  • --model_batch_size: Batch size for the model.
  • --deduplication_batch_size: Batch size for the deduplication process.
  • --num_workers: Number of worker processes for data loading.
  • --dataset_feature: Feature in the dataset to be used for deduplication.
  • --output_path: Path to save the deduplicated dataset. Can be a local path or a HuggingFace dataset repository.
  • --hub_repo: Repository on the Hugging Face Hub to push the dataset to.
  • --hub_token: HuggingFace Hub token to push the dataset to the Hub. Required when --hub_repo is provided.
  • --device: Device to use for computations (e.g., 'cpu', 'cuda', 'cuda:1'). If not provided, CUDA will be used if available, otherwise CPU.

You can use the --help flag to get a description of all options:

python semantic-cleaning.py --help
