A Python package to clean text from META tags using a BERT large Longformer NER model.
meta_cleaner
meta_cleaner is a Python package that removes META tags from text using XLM-RoBERTa (large-sized model).
trainer.ipynb is a notebook that creates a dataset and a NER model.
Installation
pip install meta-cleaner
or
pip install git+https://github.com/pirr-me/meta_cleaner.git
Install Locally
To install locally in editable mode (for development):
pip install -e .
Usage
from meta_cleaner.cleaner import MetaCleaner
# Initialize the MetaCleaner
meta_cleaner = MetaCleaner(model_name_or_path='Pirr/longformer-4096-large-ner')
# List of texts to clean
texts = ['[Genre: Fiction, Consensual Sex, Oral Sex, Romance, Teen Male/Teen Female]\nCHAPTER 15\nWe walked together into the school office to turn in our early dismissal note. Mrs. Roscoe laughed when she saw it. "Don\'t trust him to go on his own, eh Cinda? I can\'t say that I blame you. I never let my husband go on his own either. He can never remember what the doctor said. Okay, here\'s your pass." She handed Cinda the blue excuse slip and we walked out into the hallway. One of the first people I saw was Mitch. He didn\'t look very happy.\n"Surely you\'re still not pissed about New Year\'s Eve?"\n"Yeah, but not your part. We got stopped by the cops not even five minutes after leaving your place.']
# Clean the texts with batch inference
cleaned_texts = meta_cleaner.clean(texts, batch_size=8, confidence_threshold=0.8)
# Display the cleaned texts
for i, cleaned_text in enumerate(cleaned_texts):
    print(f"Cleaned Text {i + 1}:\n{cleaned_text}\n")
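How confidence_threshold is applied inside MetaCleaner.clean is not described here. As a rough illustration only, assuming the NER model yields character spans with a confidence score (similar to the aggregated output of a token-classification pipeline), a cleaner might drop every span whose score reaches the threshold and leave lower-scoring spans in place. The function remove_spans and the span dictionaries below are hypothetical and are not part of the package's API.

```python
def remove_spans(text, spans, confidence_threshold=0.8):
    """Remove character spans whose score is at least the threshold.

    `spans` is a list of dicts with 'start', 'end' and 'score' keys.
    This format is an assumption made for the sketch.
    """
    kept = []
    cursor = 0
    # Walk the spans left to right so the cursor only moves forward.
    for span in sorted(spans, key=lambda s: s["start"]):
        if span["score"] < confidence_threshold:
            continue  # low-confidence prediction: keep this text
        if span["start"] < cursor:
            # Overlaps a span that was already removed: extend the removal.
            cursor = max(cursor, span["end"])
            continue
        kept.append(text[cursor:span["start"]])
        cursor = span["end"]
    kept.append(text[cursor:])
    return "".join(kept).strip()


sample = "[Genre: Fiction]\nCHAPTER 15\nWe walked together."
predicted = [
    {"start": 0, "end": 16, "score": 0.97},   # "[Genre: Fiction]"
    {"start": 17, "end": 27, "score": 0.55},  # "CHAPTER 15", below 0.8
]
cleaned = remove_spans(sample, predicted, confidence_threshold=0.8)
print(cleaned)
```

With these made-up scores, a threshold of 0.8 removes only the bracketed genre line, while lowering it to 0.5 would also remove the chapter heading. A higher threshold therefore trades recall of META tags for a lower risk of deleting story text.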
Clean data from GCP
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from google.cloud import storage
from tqdm import tqdm
from datasets import Dataset
from meta_cleaner.cleaner import MetaCleaner
storage_client = storage.Client()
bucket_name = 'pirr-training-data'
prefix = 'Processed_data/cleaned-venus'
bucket = storage_client.bucket(bucket_name)
blobs = list(storage_client.list_blobs(bucket_name, prefix=prefix))
txt_blobs = [blob for blob in blobs if blob.name.endswith('.txt')]
def download_blob(blob):
    return blob.download_as_text()
texts = []
with ThreadPoolExecutor(max_workers=16) as executor:
    future_to_blob = {executor.submit(download_blob, blob): blob for blob in txt_blobs}
    for future in tqdm(as_completed(future_to_blob), total=len(txt_blobs), desc="Downloading files"):
        blob = future_to_blob[future]
        try:
            content = future.result()
            texts.append(content)
        except Exception as e:
            print(f"Error downloading {blob.name}: {e}")
meta_cleaner = MetaCleaner(model_name_or_path='Pirr/longformer-4096-large-ner')
cleaned_texts = meta_cleaner.clean(texts, batch_size=16, confidence_threshold=0.8)
dataset = Dataset.from_dict({"text": cleaned_texts})
#dataset.push_to_hub("...")
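Note that as_completed yields futures in completion order, so the texts list above does not follow the order of txt_blobs, and failed downloads are skipped. If the link between a file name and its cleaned text matters, one option is executor.map, which returns results in input order. A minimal sketch, using a stand-in fetch function and made-up file names in place of real blobs (both are assumptions for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical file names with simulated latencies: the first file is slowest.
DELAYS = {"a.txt": 0.03, "b.txt": 0.02, "c.txt": 0.01}


def fetch(name):
    """Stand-in for blob.download_as_text()."""
    time.sleep(DELAYS[name])
    return f"contents of {name}"


names = list(DELAYS)
with ThreadPoolExecutor(max_workers=3) as executor:
    # executor.map yields results in the order of `names`,
    # regardless of which download finishes first.
    texts = list(executor.map(fetch, names))

# Results can now be paired with their source names.
by_name = dict(zip(names, texts))
print(by_name)
```

One trade-off: executor.map re-raises the first exception when its results are consumed, so per-file error handling as in the as_completed version would need a try/except inside the worker function.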