A Python library for cleaning and preprocessing text data with asynchronous and multithreading capabilities.

These details have not been verified by PyPI

Project description

TextPrettifier

TextPrettifier is a Python library for cleaning text data by removing HTML tags, URLs, numbers, special characters, emojis, internet words, and stopwords, and by expanding contractions. It also offers asynchronous and multithreaded processing for handling large texts and batches efficiently.

Key Features

Text Cleaning Features

1. Removing Emojis

The remove_emojis method removes emojis from the text.

2. Removing Internet Words

The remove_internet_words method removes internet-specific words from the text.

3. Removing HTML Tags

The remove_html_tags method removes HTML tags from the text.

4. Removing URLs

The remove_urls method removes URLs from the text.

5. Removing Numbers

The remove_numbers method removes numbers from the text.

6. Removing Special Characters

The remove_special_chars method removes special characters from the text.

7. Expanding Contractions

Despite its name, the remove_contractions method expands contractions rather than deleting them (for example, "can't" becomes "cannot").

8. Removing Stopwords

The remove_stopwords method removes stopwords from the text.
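To make the cleaning steps above more concrete, here is a minimal standard-library sketch of a regex-based cleaner. It is not TextPrettifier's implementation: the patterns, the `basic_clean` name, and the order of steps are assumptions chosen for illustration, and the library's own rules may differ.

```python
import html
import re

# Hypothetical patterns for illustration; the library's own rules may differ.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+")
SPECIAL_RE = re.compile(r"[^A-Za-z0-9\s]")
SPACE_RE = re.compile(r"\s+")


def basic_clean(text: str, keep_numbers: bool = False) -> str:
    """Strip tags, URLs, optionally numbers, then special characters."""
    text = html.unescape(text)       # decode entities such as &amp; first
    text = TAG_RE.sub(" ", text)     # drop HTML tags
    text = URL_RE.sub(" ", text)     # drop URLs before punctuation is touched
    if not keep_numbers:
        text = NUMBER_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()  # collapse leftover whitespace


sample = "<p>Visit https://example.com now!</p> 42 apples"
print(basic_clean(sample))                     # Visit now apples
print(basic_clean(sample, keep_numbers=True))  # Visit now 42 apples
```

The order matters in this sketch: URLs are removed before special characters so that the punctuation inside a URL does not leave stray fragments behind.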

Advanced Processing Features

9. Asynchronous Processing

All methods have async counterparts prefixed with 'a' (e.g., aremove_emojis) for non-blocking operations.

10. Batch Processing

Process multiple texts in parallel with process_batch and aprocess_batch.

11. Chunked Processing for Large Texts

Efficiently process large texts with chunk_and_process and achunk_and_process.

12. Lemmatization and Stemming

Apply lemmatization or stemming to text with dedicated methods, or through the is_lemmatize and is_stemming options of sigma_cleaner.

Installation

You can install TextPrettifier using pip:

pip install text-prettifier
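After installing, it can be useful to confirm which version is present. The sketch below uses only the standard library; the `installed_version` helper is a hypothetical name introduced here, not part of TextPrettifier.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("text-prettifier"))
```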

Quick Start

Basic Usage

from text_prettifier import TextPrettifier

# Initialize TextPrettifier
text_prettifier = TextPrettifier()

# Example: Remove Emojis
emoji_text = "Hi,Pythonogist! I ❤️ Python."
cleaned_text = text_prettifier.remove_emojis(emoji_text)
print(cleaned_text)  # Output: Hi,Pythonogist! I Python.

# Example: Apply all cleaning methods
all_text = "<p>Hello, @world!</p> There are 123 apples. I can't do it. This is a test."
all_cleaned = text_prettifier.sigma_cleaner(all_text, is_lower=True)
print(all_cleaned)  # Output: hello world 123 apples cannot test

# Get tokens with cleaning
tokens = text_prettifier.sigma_cleaner(all_text, is_token=True, is_lower=True)
print(tokens)  # Output: ['hello', 'world', '123', 'apples', 'cannot', 'test']

Asynchronous Processing

import asyncio
from text_prettifier import TextPrettifier

async def process_text():
    text_prettifier = TextPrettifier()
    
    text = "Hello, @world! 123 I can't believe it. 😊"
    result = await text_prettifier.asigma_cleaner(text, is_lower=True)
    print(result)  # Output: hello world 123 cannot believe

# Run the async function
asyncio.run(process_text())
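One common way to give a blocking function an async counterpart is to run it in a worker thread. The sketch below shows that pattern with `asyncio.to_thread` (Python 3.9+). Whether TextPrettifier does this internally is an assumption; `clean` and `aclean` are hypothetical stand-ins, not library functions.

```python
import asyncio
import re


def clean(text: str) -> str:
    """Stand-in synchronous cleaner (hypothetical, not the library's)."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).strip()


async def aclean(text: str) -> str:
    # Run the blocking function in a worker thread so the event loop stays free.
    return await asyncio.to_thread(clean, text)


async def main() -> list[str]:
    texts = ["Hello, World!", "Async Threads?"]
    # gather returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(aclean(t) for t in texts))


results = asyncio.run(main())
print(results)  # ['hello world', 'async threads']
```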

Batch Processing

import asyncio
from text_prettifier import TextPrettifier

# Initialize with specific number of worker threads
text_prettifier = TextPrettifier(max_workers=4)

# Process multiple texts in parallel
texts = [
    "Hello, how are you? 😊",
    "<p>This is HTML</p> content",
    "Visit https://example.com for more info",
    "I can't believe it's not butter!"
]

# Synchronous batch processing
results = text_prettifier.process_batch(texts, is_lower=True)
for text, result in zip(texts, results):
    print(f"Original: {text}")
    print(f"Cleaned: {result}")
    print()

# Asynchronous batch processing
async def process_async():
    results = await text_prettifier.aprocess_batch(texts, is_lower=True)
    for text, result in zip(texts, results):
        print(f"Original: {text}")
        print(f"Cleaned: {result}")
        print()

# Run in an async environment
# asyncio.run(process_async())
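Parallel batch processing of this kind can be sketched with a thread pool from the standard library. This is an illustration of the general technique, not the library's code: `clean` and `process_batch_sketch` are hypothetical names, and the `max_workers` default simply mirrors the constructor argument shown above.

```python
from concurrent.futures import ThreadPoolExecutor


def clean(text: str) -> str:
    """Stand-in cleaner (hypothetical): lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def process_batch_sketch(texts: list[str], max_workers: int = 4) -> list[str]:
    # executor.map yields results in input order, even if tasks finish out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(clean, texts))


print(process_batch_sketch(["Hello   World", "  BATCH  processing "]))
# ['hello world', 'batch processing']
```

In CPython, threads mainly help when the work releases the global interpreter lock (I/O, or C-level routines), so the speedup for pure-Python cleaning may be modest.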

Processing Large Texts

import asyncio
from text_prettifier import TextPrettifier

text_prettifier = TextPrettifier()

# Create a large text for demonstration
large_text = "Hello, this is a sample text with some HTML <p>tags</p> and URLs https://example.com and emojis 😊" * 1000

# Process the large text efficiently by chunking
result = text_prettifier.chunk_and_process(
    large_text,
    chunk_size=5000,  # Process in chunks of 5000 characters
    is_lower=True,
    keep_numbers=True
)

print(f"Original length: {len(large_text)}")
print(f"Processed length: {len(result)}")

# Asynchronous processing of large text
async def process_large_async():
    result = await text_prettifier.achunk_and_process(
        large_text,
        chunk_size=5000,
        is_lower=True,
        keep_numbers=True
    )
    print(f"Original length: {len(large_text)}")
    print(f"Processed length: {len(result)}")

# Run in an async environment
# asyncio.run(process_large_async())
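The idea behind chunked processing can be sketched as follows: split the text into pieces of at most `chunk_size` characters, preferring to break at a space so words stay intact, then clean each piece and join the results. The `split_into_chunks` helper is hypothetical, and how chunk_and_process actually chooses its boundaries is not documented here.

```python
def split_into_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into pieces of at most chunk_size characters,
    breaking at a space where possible so words stay intact."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Only back up if the cut would land in the middle of a word.
        if end < len(text) and text[end] != " ":
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end])
        start = end
    return chunks


chunks = split_into_chunks("alpha beta gamma delta", chunk_size=10)
print(chunks)  # ['alpha beta', ' gamma', ' delta']

# Stand-in per-chunk cleaning step, then recombine.
print(" ".join(c.strip().lower() for c in chunks))  # alpha beta gamma delta
```

A caveat with any chunking scheme like this: a pattern that contains spaces, such as an HTML tag with attributes, can straddle a chunk boundary and be missed by per-chunk cleaning.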

Lemmatization and Stemming

from text_prettifier import TextPrettifier

text_prettifier = TextPrettifier()

text = "I am running in the park with friends"

# Apply lemmatization
lemmatized = text_prettifier.sigma_cleaner(text, is_lemmatize=True)
print(lemmatized)  # Output: I run park friend

# Apply stemming
stemmed = text_prettifier.sigma_cleaner(text, is_stemming=True)
print(stemmed)  # Output: I run park friend

Advanced Configuration

TextPrettifier supports various configuration options:

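from text_prettifier import TextPrettifier

text_prettifier = TextPrettifier(max_workers=8)  # Set maximum worker threads

text = "I am running in the park with friends"  # Sample input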

# Configure sigma_cleaner options
result = text_prettifier.sigma_cleaner(
    text,
    is_token=True,       # Return tokens instead of a string
    is_lower=True,       # Convert to lowercase
    is_lemmatize=True,   # Apply lemmatization
    is_stemming=False,   # Don't apply stemming (would override lemmatization)
    keep_numbers=True    # Keep numbers in the text
)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

Release history

2.0.1 (this version): May 12, 2025
1.2.0: May 12, 2025
1.1.4: Aug 17, 2024
1.1.3: May 2, 2024
1.1.2: May 2, 2024
1.1.0: Apr 30, 2024

Download files

Source Distribution

text_prettifier-2.0.1.tar.gz (12.2 kB), uploaded May 12, 2025, Source

Built Distribution

text_prettifier-2.0.1-py3-none-any.whl (11.1 kB), uploaded May 12, 2025, Python 3

File details

Details for the file text_prettifier-2.0.1.tar.gz.

File metadata

Download URL: text_prettifier-2.0.1.tar.gz
Upload date: May 12, 2025
Size: 12.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for text_prettifier-2.0.1.tar.gz
Algorithm	Hash digest
SHA256	`cf5ba304308814481de759bdeba3328ae282ebf5e1de1557477f8df76b4e3846`
MD5	`2b2bea6cfd28fdd2f6db3339387af5a2`
BLAKE2b-256	`10eb8beb41c7b0a12e6114a441edf297a9791de52dc3963ddfd749cd989ab94b`

File details

Details for the file text_prettifier-2.0.1-py3-none-any.whl.

File metadata

Download URL: text_prettifier-2.0.1-py3-none-any.whl
Upload date: May 12, 2025
Size: 11.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for text_prettifier-2.0.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4821c69bd2df7170293c88c3daad457e92d4af5cd35aa970e99222d3c4776e42`
MD5	`3a51cfd0419c722384ca71f96ffaf719`
BLAKE2b-256	`3acb58d105a5735a206591f8c316bcec3b118be39e1fdd5c6d6f733a729c5b92`
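The published digests can be used to check a downloaded file before installing it. Below is a minimal sketch using `hashlib`; the `sha256_of` and `matches` helpers are hypothetical names, and the wheel's location in the current directory is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, block_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Expected digest copied from the SHA256 row listed for the wheel file.
EXPECTED = "4821c69bd2df7170293c88c3daad457e92d4af5cd35aa970e99222d3c4776e42"
wheel = Path("text_prettifier-2.0.1-py3-none-any.whl")  # assumed download location
if wheel.exists():
    print("OK" if matches(wheel, EXPECTED) else "MISMATCH")
```

Reading in blocks keeps memory use flat regardless of file size, which matters little for an 11 kB wheel but makes the helper reusable for larger downloads.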
