
Tools for semantic cleaning of a text dataset

Project description

semantic-cleaning


semantic-cleaning provides tools for finding and removing semantic near-duplicates from a text dataset using sentence embeddings.

Install

pip install semantic_cleaning

How to use

import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel

# The helpers used in the examples below come from this package.
from semantic_cleaning import preprocess_data, compute_embeddings, deduplicate_embeddings, deduplicate_dataset

Preprocess the dataset to merge each example's fields (e.g. a question and its answer, or a comment and its response) into a single string:

data = load_dataset("0-hero/OIG-small-chip2")
_ = preprocess_data(data, schema=":{user} :{chip2}")
data['train']['_merged'][0]
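The schema string appears to be a template whose {column} placeholders are filled from each row's columns. Assuming str.format-style substitution (an assumption about the library's internals, not a documented guarantee), the merged text for the first example is built roughly like this:

# Illustrative only: assumes the schema behaves like a str.format template.
row = data['train'][0]
merged = ":{user} :{chip2}".format(user=row['user'], chip2=row['chip2'])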

Compute the embeddings for the sentences:

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda')
embedding = compute_embeddings(data=data, embedding_model=model, tokenizer=tokenizer, batch_size=64, num_workers=16, dataset_feature='_merged')
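compute_embeddings handles batching internally. For sentence-transformers models like all-mpnet-base-v2, sentence embeddings are typically produced by mean-pooling the token states over the attention mask and L2-normalizing the result. A minimal sketch of that idea, with illustrative names rather than the library's internals:

def embed_batch(texts, tokenizer, model, device='cuda'):
    # Tokenize a batch of strings and run the encoder.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        token_states = model(**inputs).last_hidden_state  # (batch, tokens, hidden)
    # Mean-pool over real tokens only, then L2-normalize so that
    # cosine similarity between rows is a plain dot product.
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    pooled = (token_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(pooled, p=2, dim=1)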

We can get the indices of all duplicated rows with the following command:

to_delete = deduplicate_embeddings(embedded=embedding, epsilon=1e-2, batch_size=20000)
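Here epsilon appears to act as the distance threshold below which two embeddings count as duplicates. Assuming to_delete is an iterable of row indices to drop (an assumption based on the name, not a documented return type), the dataset can be filtered with the standard datasets.Dataset.select:

# Keep every row that was not flagged as a near-duplicate.
drop = {int(i) for i in to_delete}
keep = [i for i in range(len(data['train'])) if i not in drop]
clean = data['train'].select(keep)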

The full process can be run like this:

deduplicated = deduplicate_dataset(
    dataset=data['train'],
    model=model,
    tokenizer=tokenizer,
    epsilon=1e-2,
    model_batch_size=64,
    deduplication_batch_size=20000,
    num_workers=16,
    dataset_feature='_merged'
)
print(f"cleaned: {(1 - len(deduplicated) / len(data['train'])) * 100:.2f}%")

The deduplicated dataset is a regular datasets.Dataset, so it can be pushed back to the Hugging Face Hub or saved to a local drive:
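For example (the repo id below is a placeholder; pushing requires huggingface-cli login):

# Publish the cleaned split to the Hub...
deduplicated.push_to_hub('your-username/oig-small-chip2-deduped')
# ...or keep it local.
deduplicated.save_to_disk('./oig-small-chip2-deduped')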



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

semantic-cleaning-0.0.2.tar.gz (10.4 kB)

Uploaded Source

Built Distribution

semantic_cleaning-0.0.2-py3-none-any.whl (10.3 kB)

Uploaded Python 3

File details

Details for the file semantic-cleaning-0.0.2.tar.gz.

File metadata

  • Download URL: semantic-cleaning-0.0.2.tar.gz
  • Upload date:
  • Size: 10.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.11

File hashes

Hashes for semantic-cleaning-0.0.2.tar.gz

  • SHA256: 70b9e16453413da824fef2a2ddb0cce07f2e1c2fc3e8bf083cd5316ef3deed7d
  • MD5: 44e194f33feabccdaf3f3049a2544866
  • BLAKE2b-256: d87917150ac919d83f75c67b21100f7083196cf275635a92f207137940c89141

See more details on using hashes here.
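For instance, to check a downloaded sdist against the SHA256 published above using Python's standard hashlib module:

import hashlib

# Compare the local file's digest with the published SHA256.
expected = '70b9e16453413da824fef2a2ddb0cce07f2e1c2fc3e8bf083cd5316ef3deed7d'
digest = hashlib.sha256(open('semantic-cleaning-0.0.2.tar.gz', 'rb').read()).hexdigest()
assert digest == expected, 'hash mismatch: re-download the file'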

File details

Details for the file semantic_cleaning-0.0.2-py3-none-any.whl.

File metadata

File hashes

Hashes for semantic_cleaning-0.0.2-py3-none-any.whl

  • SHA256: ec2a8e8aa37e080ac4528d006221316d30f9e2c33597109422e0773c12aada3b
  • MD5: 14cdb623120b55e8b111dd01b5ceffdc
  • BLAKE2b-256: 3a0da721e4c524cb22aa2ac76f5b31e527af60549003fca406f748e066397911

See more details on using hashes here.
