A Python package that removes META tags from text using a large Longformer NER model.

Project description

meta_cleaner

meta_cleaner is a Python package designed to remove META tags from text using a large Longformer NER model (Pirr/longformer-4096-large-ner).
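For example (illustrative, based on the sample in Usage below), a story that opens with a bracketed tag line has that line stripped while the story text is kept:

Input:  [Genre: Fiction, Romance]\nCHAPTER 15\nWe walked together into the school office...
Output: CHAPTER 15\nWe walked together into the school office...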

trainer.ipynb is a notebook that builds the training dataset and trains the NER model.

Installation

pip install meta-cleaner

or

pip install git+https://github.com/pirr-me/meta_cleaner.git

Install Locally

To install locally in editable mode (for development):

pip install -e .

Usage

from meta_cleaner.cleaner import MetaCleaner

# Initialize the MetaCleaner
meta_cleaner = MetaCleaner(model_name_or_path='Pirr/longformer-4096-large-ner')

# List of texts to clean
texts = ['[Genre: Fiction, Consensual Sex, Oral Sex, Romance, Teen Male/Teen Female]\nCHAPTER 15\nWe walked together into the school office to turn in our early dismissal note. Mrs. Roscoe laughed when she saw it. "Don\'t trust him to go on his own, eh Cinda? I can\'t say that I blame you. I never let my husband go on his own either. He can never remember what the doctor said. Okay, here\'s your pass." She handed Cinda the blue excuse slip and we walked out into the hallway. One of the first people I saw was Mitch. He didn\'t look very happy.\n"Surely you\'re still not pissed about New Year\'s Eve?"\n"Yeah, but not your part. We got stopped by the cops not even five minutes after leaving your place.']

# Clean the texts with batch inference
cleaned_texts = meta_cleaner.clean(texts, batch_size=8, confidence_threshold=0.8)

# Display the cleaned texts
for i, cleaned_text in enumerate(cleaned_texts):
    print(f"Cleaned Text {i + 1}:\n{cleaned_text}\n")

Clean data from GCP

import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from google.cloud import storage
from tqdm import tqdm
from datasets import Dataset
from meta_cleaner.cleaner import MetaCleaner

storage_client = storage.Client()

# Bucket and prefix holding the raw .txt files to clean
bucket_name = 'pirr-training-data'
prefix = 'Processed_data/cleaned-venus'

bucket = storage_client.bucket(bucket_name)
blobs = list(storage_client.list_blobs(bucket_name, prefix=prefix))

# Keep only the plain-text files
txt_blobs = [blob for blob in blobs if blob.name.endswith('.txt')]

def download_blob(blob):
    return blob.download_as_text()

# Download all files in parallel
texts = []
with ThreadPoolExecutor(max_workers=16) as executor:
    future_to_blob = {executor.submit(download_blob, blob): blob for blob in txt_blobs}
    
    for future in tqdm(as_completed(future_to_blob), total=len(txt_blobs), desc="Downloading files"):
        blob = future_to_blob[future]
        try:
            content = future.result()
            texts.append(content)
        except Exception as e:
            print(f"Error downloading {blob.name}: {e}")


# Clean the downloaded texts and wrap them in a Hugging Face dataset
meta_cleaner = MetaCleaner(model_name_or_path='Pirr/longformer-4096-large-ner')
cleaned_texts = meta_cleaner.clean(texts, batch_size=16, confidence_threshold=0.8)
dataset = Dataset.from_dict({"text": cleaned_texts})

# dataset.push_to_hub("...")
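To persist the cleaned corpus you can push it to the Hub (as hinted above) or write the files back to the bucket. A minimal sketch of the latter, using a hypothetical output prefix:

def upload_text(i, text):
    # Hypothetical destination prefix; adjust to your bucket layout.
    blob = bucket.blob(f'Processed_data/cleaned-venus-no-meta/{i:06d}.txt')
    blob.upload_from_string(text)

with ThreadPoolExecutor(max_workers=16) as executor:
    futures = [executor.submit(upload_text, i, t) for i, t in enumerate(cleaned_texts)]
    for future in tqdm(as_completed(futures), total=len(futures), desc="Uploading files"):
        future.result()  # re-raise any upload error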
