Tools for semantic cleaning of a text dataset

Project description

semantic-cleaning

semantic-cleaning provides tools for semantically deduplicating a text dataset: it merges dataset fields into single sentences, embeds them with a sentence-transformer model, and removes near-duplicate rows.

Install

pip install semantic_cleaning

How to use

import os
from tqdm.auto import tqdm
from typing import List, Dict, Set, Union, Callable
import torch
from torch.utils.data import DataLoader
from datasets import Dataset, load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
import torch.nn.functional as F
import transformers

from semantic_cleaning import preprocess_data, compute_embeddings, deduplicate_embeddings, deduplicate_dataset

Preprocess the dataset to merge its fields into a single sentence per row, e.g. a question and answer, or a comment and its response:

data = load_dataset("0-hero/OIG-small-chip2")
_ = preprocess_data(data, schema=":{user} :{chip2}")
data['train']['_merged'][0]
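
The schema string appears to use Python str.format-style placeholders that are filled in from each row's columns. A hypothetical sketch with different column names (qa_data, question, and answer are placeholders, not from the original):

# Hypothetical: a dataset whose rows have 'question' and 'answer' columns.
# Each row's '_merged' field would become ":<question text> :<answer text>".
_ = preprocess_data(qa_data, schema=":{question} :{answer}")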

Compute the embeddings for the sentences:

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda')
embedding = compute_embeddings(data=data, embedding_model=model, tokenizer=tokenizer, batch_size=64, num_workers=16, dataset_feature='_merged')
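
For context, sentence embeddings from all-mpnet-base-v2 are conventionally obtained by mean-pooling the token embeddings and L2-normalizing the result. A minimal sketch of that standard recipe, reusing the model and tokenizer above (this mirrors the usual sentence-transformers usage, not necessarily the exact internals of compute_embeddings):

def embed_batch(sentences: List[str]) -> torch.Tensor:
    # Tokenize and run the model on GPU; take the last hidden state.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt').to('cuda')
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state
    # Mean-pool over tokens, masking out padding positions.
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    pooled = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    # L2-normalize so cosine similarity reduces to a dot product.
    return F.normalize(pooled, p=2, dim=1)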

We can get the indices of all the duplicated lines with the following command:

to_delete = deduplicate_embeddings(embedded=embedding, epsilon=1e-2, batch_size=20000)
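
Assuming to_delete holds the row indices flagged as duplicates (an assumption about the return value), the dataset can be filtered with the standard datasets select method:

# Keep every row whose index was not flagged as a duplicate.
to_delete = set(int(i) for i in to_delete)
keep = [i for i in range(len(data['train'])) if i not in to_delete]
cleaned = data['train'].select(keep)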

The full process can be run like this:

  deduplicated = deduplicate_dataset(
      dataset=data['train'],
      model=model,
      tokenizer=tokenizer,
      epsilon=1e-2,
      model_batch_size=64,
      deduplication_batch_size=20000,
      num_workers=16,
      dataset_feature='_merged'
  )
  print(f"cleaned: {(1 - len(deduplicated) / len(data['train'])) * 100:.2f}%")
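
Here epsilon acts as the distance threshold for treating two rows as semantic duplicates. A conceptual, non-batched sketch of threshold-based deduplication on L2-normalized embeddings (an illustration only, not the library's exact implementation):

# Pairwise cosine similarities of normalized embeddings (dot products).
sims = embedding @ embedding.T
# Compare each unordered pair once; flag the later row of any close pair.
dup = (sims > 1 - 1e-2).triu(diagonal=1)
to_drop = dup.any(dim=0).nonzero().squeeze(-1)

This materializes the full similarity matrix, so for large datasets it must be done in chunks, which is what the deduplication_batch_size parameter above suggests.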

The deduplicated dataset can then be pushed back to the Hugging Face Hub or saved to a local drive.
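
For example, using the standard datasets API (the paths and repository id below are placeholders):

# Save to a local directory, or push to the Hugging Face Hub.
deduplicated.save_to_disk("oig-small-chip2-dedup")
deduplicated.push_to_hub("your-username/oig-small-chip2-dedup")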

Download files


Source Distribution

semantic-cleaning-0.0.3.tar.gz (10.4 kB)


Built Distribution

semantic_cleaning-0.0.3-py3-none-any.whl (10.3 kB)

