
Tools for semantic cleaning of a test dataset

Project description

semantic-cleaning



Install

pip install semantic_cleaning

How to use

import os
from tqdm.auto import tqdm
from typing import List, Dict, Set, Union, Callable
import torch
from torch.utils.data import DataLoader
from datasets import Dataset, load_dataset
import numpy as np
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
import torch.nn.functional as F
import transformers

Process the dataset to merge each example into a single string, e.g. a question and its answer, or a comment and its response:

data = load_dataset("0-hero/OIG-small-chip2")
_ = preprocess_data(data, schema=":{user} :{chip2}")
data['train']['_merged'][0]
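The schema string determines how each example's fields are combined into the new _merged column. As a rough illustration of the idea (a plain format-string substitution; the library's actual implementation may differ):

schema = ":{user} :{chip2}"
row = {"user": "What is the capital of France?", "chip2": "The capital of France is Paris."}
merged = schema.format(**row)  # ":What is the capital of France? :The capital of France is Paris."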

Compute the embeddings for the sentences:

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda')
embedding = compute_embeddings(data=data, embedding_model=model, tokenizer=tokenizer, batch_size=64, num_workers=16, dataset_feature='_merged')
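For reference, sentence embeddings with all-mpnet-base-v2 are conventionally obtained by mean-pooling the token embeddings over the attention mask and L2-normalizing the result. The sketch below shows that standard recipe for a single batch; it is an assumption about what compute_embeddings does internally, not the library's exact code:

def embed_batch(texts, tokenizer, model, device='cuda'):
    # Tokenize the batch of merged strings.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        out = model(**enc)
    # Mean-pool token embeddings, ignoring padding via the attention mask.
    mask = enc['attention_mask'].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
    # L2-normalize so cosine similarity reduces to a dot product.
    return F.normalize(pooled, p=2, dim=1)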

We can get the indices of all the duplicated rows with the following command:

to_delete = deduplicate_embeddings(embedded=embedding, epsilon=1e-2, batch_size=20000)
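Assuming to_delete is a collection of integer row indices to drop, the dataset can then be filtered with the standard datasets.Dataset.select method, for example:

# Keep only the rows that were not flagged as duplicates.
duplicates = set(int(i) for i in to_delete)
keep = [i for i in range(len(data['train'])) if i not in duplicates]
deduplicated = data['train'].select(keep)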

The full process can be run end to end like this:

  deduplicated = deduplicate_dataset(
      dataset = data['train'], 
      model = model, 
      tokenizer = tokenizer,
      epsilon = 1e-2, 
      model_batch_size = 64, 
      deduplication_batch_size = 20000, 
      num_workers = 16,
      dataset_feature = '_merged'
  )
  print(f"cleaned: {(1 - len(deduplicated) / len(data['train'])) * 100:.2f}%")

The deduplicated dataset can then be pushed back to the Hugging Face Hub or saved to a local drive.
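Both operations use the standard datasets API; the repository id and local path below are placeholders for your own:

# Push the cleaned split to the Hub (replace with your own repo id) ...
deduplicated.push_to_hub("your-username/oig-small-chip2-deduplicated")

# ... or save it to a local directory.
deduplicated.save_to_disk("oig-small-chip2-deduplicated")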

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

semantic-cleaning-0.0.1.tar.gz (10.5 kB)

Uploaded Source

Built Distribution

semantic_cleaning-0.0.1-py3-none-any.whl (10.3 kB)

Uploaded Python 3

File details

Details for the file semantic-cleaning-0.0.1.tar.gz.

File metadata

  • Download URL: semantic-cleaning-0.0.1.tar.gz
  • Upload date:
  • Size: 10.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.11

File hashes

Hashes for semantic-cleaning-0.0.1.tar.gz
Algorithm Hash digest
SHA256 573086f884079d1688342253356acf61f05d60e2b38fb7747c79522f3274727f
MD5 ddb092fe2319c079aea76ddabcace10e
BLAKE2b-256 e0b4670e9f07cd2388421819c5e3aa2f50d24349eee50f0f8fc4e39b0b3fb7a2

See more details on using hashes here.

File details

Details for the file semantic_cleaning-0.0.1-py3-none-any.whl.

File metadata

File hashes

Hashes for semantic_cleaning-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 647f91336f416ba1f206c279c5ff48ea0a478c46766c98649049283f59203a6f
MD5 c46260d9663ebe5a6210818f71ed5b83
BLAKE2b-256 b844b50ae48af802dbebe6efc89abaad17dca152365bab3467b232e612a1f409

See more details on using hashes here.
