Tools for semantic cleaning of a text dataset
semantic-cleaning
Install
pip install semantic_cleaning
How to use
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel
from semantic_cleaning import (
    preprocess_data,
    compute_embeddings,
    deduplicate_embeddings,
    deduplicate_dataset,
)
Preprocess the dataset to merge each record (e.g., a question and its answer, or a comment and its response) into a single string:
data = load_dataset("0-hero/OIG-small-chip2")
_ = preprocess_data(data, schema=":{user} :{chip2}")
data['train']['_merged'][0]
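For intuition, here is a minimal sketch of what this schema-based merging does; merge_with_schema is a hypothetical helper for illustration, not the library's actual implementation:

def merge_with_schema(example: dict, schema: str = ":{user} :{chip2}") -> dict:
    # str.format_map fills each {field} placeholder from the example's columns.
    example['_merged'] = schema.format_map(example)
    return example

# Applied over the whole dataset with datasets' map:
# data = data.map(merge_with_schema)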
Compute the embeddings for the sentences:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-mpnet-base-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-mpnet-base-v2').to('cuda')
embedding = compute_embeddings(data=data, embedding_model=model, tokenizer=tokenizer, batch_size=64, num_workers=16, dataset_feature='_merged')
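For reference, sentence embeddings for all-mpnet-base-v2 are conventionally computed by mean-pooling the token embeddings and L2-normalizing the result. A minimal sketch of one batch follows; embed_batch is an illustrative function under that assumption, not necessarily what compute_embeddings does internally:

import torch
import torch.nn.functional as F

def embed_batch(texts, model, tokenizer, device='cuda'):
    # Pad/truncate so the batch forms a rectangular tensor.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors='pt').to(device)
    with torch.no_grad():
        out = model(**enc)
    # Mean-pool token embeddings, ignoring padding via the attention mask.
    mask = enc['attention_mask'].unsqueeze(-1).float()
    summed = (out.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    # L2-normalize so cosine similarity reduces to a dot product.
    return F.normalize(summed / counts, p=2, dim=1)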
We can get the indices of all duplicated lines with the following command:
to_delete = deduplicate_embeddings(embedded=embedding, epsilon=1e-2, batch_size=20000)
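Conceptually, a row is marked as a duplicate when its embedding is within epsilon of an earlier row's embedding. A rough, unbatched sketch of the idea, assuming L2-normalized embeddings and epsilon as a distance-like threshold on cosine similarity (the library's actual implementation processes the similarity matrix in batches of batch_size rows):

import torch

def find_near_duplicates(embedded: torch.Tensor, epsilon: float = 1e-2) -> set:
    # With L2-normalized embeddings, cosine similarity is a plain matrix product.
    sim = embedded @ embedded.T
    to_delete = set()
    for i in range(sim.shape[0]):
        if i in to_delete:
            continue
        # Mark every later row whose similarity to row i exceeds 1 - epsilon.
        dup = (sim[i, i + 1:] > 1 - epsilon).nonzero().flatten() + i + 1
        to_delete.update(dup.tolist())
    return to_delete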
The full process can be run like this:
deduplicated = deduplicate_dataset(
    dataset=data['train'],
    model=model,
    tokenizer=tokenizer,
    epsilon=1e-2,
    model_batch_size=64,
    deduplication_batch_size=20000,
    num_workers=16,
    dataset_feature='_merged'
)
print (f"cleaned:{(1-len(deduplicated)/len(data['train']))*100:.2f}:%")
The deduplicated dataset can then be pushed back to the Hub or saved to a local drive.
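For example, using the standard datasets API (the repository and directory names below are placeholders):

# Save to a local directory:
deduplicated.save_to_disk('oig-small-chip2-dedup')

# Or push to the Hugging Face Hub (requires authentication):
# deduplicated.push_to_hub('your-username/oig-small-chip2-dedup')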