SqueakyCleanText

A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.

Clean, well-structured text is crucial in machine learning and natural language processing, both for building effective downstream models and for staying within the token limits of language models.

SqueakyCleanText automates this work by fixing common text issues, so your data ends up clean and well structured with minimal effort.

Key Features

  • Encoding Issues: Corrects text encoding problems and handles bad Unicode characters.
  • HTML and URLs: Removes or replaces HTML tags and URLs with configurable tokens.
  • Contact Information: Handles emails, phone numbers, and other contact details with customizable replacement tokens.
  • Named Entity Recognition (NER):
    • Multi-language support (English, Dutch, German, Spanish)
    • Ensemble voting technique for improved accuracy
    • Configurable confidence thresholds
    • Efficient batch processing
    • Automatic text chunking for long documents
    • GPU acceleration support
  • Text Normalization:
    • Removes isolated letters and symbols
    • Normalizes whitespace
    • Handles currency symbols
    • Year detection and replacement
    • Number standardization
  • Language Support:
    • Automatic language detection
    • Language-specific NER models
    • Language-aware stopword removal
  • Dual Output Formats:
    • Language Model format (preserves structure with tokens)
    • Statistical Model format (optimized for classical ML)
  • Performance Optimization:
    • Batch processing support
    • Configurable batch sizes
    • Memory-efficient processing of large texts
    • GPU memory management
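
To make the token-replacement idea from the feature list concrete, here is a minimal sketch that uses only the standard library. It is not SqueakyCleanText's implementation: the regular expressions, the function name `replace_tokens`, and the default tokens are assumptions chosen for illustration, and the library's own rules are likely more thorough.

```python
import re

# Hypothetical patterns for illustration; the library's own rules may differ.
TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
WS_RE = re.compile(r"\s+")


def replace_tokens(text, url_token="<URL>", email_token="<EMAIL>"):
    """Strip HTML tags, then swap emails and URLs for placeholder tokens."""
    # Tags go first so that the angle-bracket tokens inserted below survive.
    text = TAG_RE.sub(" ", text)
    text = EMAIL_RE.sub(email_token, text)
    text = URL_RE.sub(url_token, text)
    # Collapse the whitespace left behind by the removals.
    return WS_RE.sub(" ", text).strip()


print(replace_tokens("<p>Mail john.doe@example.com or visit https://example.com/page now</p>"))
```

The order of the substitutions matters in this sketch: removing tags after inserting `<URL>` or `<EMAIL>` would delete the tokens themselves.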

Figure: Default flow of text cleaning

Benefits

For Language Models

  • Maintains text structure while anonymizing sensitive information
  • Configurable token replacements
  • Preserves context while removing noise
  • Handles long documents through intelligent chunking
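
Chunking a long document can be sketched as a sliding window over words. The example below is an assumption about how such chunking might look, not the library's actual strategy (which may split on model tokens rather than whitespace); `chunk_words`, `max_words`, and `overlap` are hypothetical names.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into word windows of at most max_words, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
```

The overlap gives an entity that straddles a chunk boundary a chance to appear whole in at least one chunk.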

For Statistical Models

  • Removes stopwords and punctuation
  • Case normalization
  • Special symbol removal
  • Optimized for classification tasks
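
The statistical-model steps listed above (case normalization, punctuation removal, stopword removal) can be sketched in a few lines. This is an illustrative approximation only: the stopword set is a tiny made-up sample, the function name `to_statistical_format` is hypothetical, and the library uses language-aware stopword lists.

```python
import string

# Tiny illustrative stopword set; a real list is much larger and language-specific.
STOPWORDS = {"a", "an", "and", "at", "is", "me", "the", "to"}

# Map every ASCII punctuation character to a space so words stay separated.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def to_statistical_format(text, stopwords=STOPWORDS):
    """Lowercase, drop punctuation, and remove stopwords."""
    words = text.lower().translate(PUNCT_TO_SPACE).split()
    return " ".join(w for w in words if w not in stopwords)


print(to_statistical_format("The model is ready, and the data is clean!"))
```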

Advanced NER Processing

  • Ensemble approach reduces missed entities
  • Language-specific models improve accuracy
  • Confidence thresholds for precision control
  • Efficient batch processing for large datasets
  • Automatic handling of long documents
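
One way to combine several NER models with a confidence threshold is sketched below. The voting scheme here is an assumption, since the README does not describe the library's exact rule: predictions under the threshold are dropped, every remaining span is kept (which is what reduces missed entities), and label conflicts on the same span are settled by vote count, then by summed confidence. The tuple layout `(start, end, label, score)` and the name `ensemble_vote` are hypothetical.

```python
from collections import defaultdict


def ensemble_vote(predictions, threshold=0.85):
    """Merge per-model entity predictions into one list of (start, end, label)."""
    # span -> label -> confidence scores from the models that proposed it
    votes = defaultdict(lambda: defaultdict(list))
    for model_preds in predictions:
        for start, end, label, score in model_preds:
            if score >= threshold:
                votes[(start, end)][label].append(score)

    entities = []
    for (start, end), by_label in votes.items():
        # Most votes wins; ties are broken by the higher summed confidence.
        label = max(by_label, key=lambda name: (len(by_label[name]), sum(by_label[name])))
        entities.append((start, end, label))
    return sorted(entities)


model_a = [(4, 12, "PER", 0.95), (20, 26, "LOC", 0.60)]
model_b = [(4, 12, "PER", 0.90), (30, 36, "ORG", 0.91)]
model_c = [(4, 12, "ORG", 0.88)]
print(ensemble_vote([model_a, model_b, model_c]))
```

Raising `threshold` trades recall for precision: fewer spans survive, but those that do were predicted with higher confidence.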

Installation

pip install SqueakyCleanText

Usage

Basic Usage

from sct import sct

# Initialize the TextCleaner
sx = sct.TextCleaner()

# Process single text
text = "Hey John Doe, email me at john.doe@example.com"
lm_text, stat_text, language = sx.process(text)

# Process multiple texts efficiently
texts = ["Text 1", "Text 2", "Text 3"]
results = sx.process_batch(texts, batch_size=2)

Advanced Configuration

from sct import sct, config

# Customize NER settings
config.CHECK_NER_PROCESS = True
config.NER_CONFIDENCE_THRESHOLD = 0.85
config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']

# Customize replacement tokens
config.REPLACE_WITH_URL = "<URL>"
config.REPLACE_WITH_EMAIL = "<EMAIL>"
config.REPLACE_WITH_PHONE_NUMBERS = "<PHONE>"

# Set known language (skips detection)
config.LANGUAGE = "ENGLISH"  # Options: ENGLISH, DUTCH, GERMAN, SPANISH

# Initialize with custom settings
sx = sct.TextCleaner()

API

sct.TextCleaner

process(text: str) -> Tuple[str, str, str]

Processes the input text and returns a tuple containing:

  • Cleaned text formatted for language models.
  • Cleaned text formatted for statistical models (stopwords removed).
  • Detected language of the text.
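
Because `process` returns a three-element tuple, it can be convenient to collect results into labeled records. The helper below is a hypothetical sketch, not part of the package: `clean_corpus` and the dictionary keys are invented names, and the only assumption about the cleaner is the `process(text)` signature documented above.

```python
from typing import Dict, Iterable, List


def clean_corpus(cleaner, texts: Iterable[str]) -> List[Dict[str, str]]:
    """Run cleaner.process on each text and label the three returned values."""
    rows = []
    for text in texts:
        # Unpack in the documented order: LM text, statistical text, language.
        lm_text, stat_text, language = cleaner.process(text)
        rows.append({"lm_text": lm_text, "stat_text": stat_text, "language": language})
    return rows
```

With the package installed, this could be called as `clean_corpus(sct.TextCleaner(), texts)`. Any object exposing a compatible `process` method works, which also makes the helper easy to test with a stub.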

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or open an issue.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

This package took inspiration from the following repository:
