SqueakyCleanText
A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.
In machine learning and natural language processing, clean and well-structured text is crucial for building effective downstream models and for staying within the token limits of language models.
SqueakyCleanText automatically addresses common text issues, so your data is ready for modeling with minimal effort on your part.
Key Features
- Encoding Issues: Corrects text encoding problems and handles bad Unicode characters.
- HTML and URLs: Removes or replaces HTML tags and URLs with configurable tokens.
- Contact Information: Handles emails, phone numbers, and other contact details with customizable replacement tokens.
- Named Entity Recognition (NER):
  - Multi-language support (English, Dutch, German, Spanish)
  - Ensemble voting technique for improved accuracy
  - Configurable confidence thresholds
  - Lazy model loading (models load on demand per language)
  - Automatic text chunking for long documents
  - GPU acceleration support
- Text Normalization:
  - Removes isolated letters and symbols
  - Normalizes whitespace
  - Handles currency symbols
  - Date and year detection and replacement
  - Number standardization
  - Configurable emoji removal
  - Configurable bracket/brace content removal
- Language Support:
  - Automatic language detection
  - Language-specific NER models
  - Language-aware stopword removal
- Dual Output Formats:
  - Language Model format (preserves structure with tokens)
  - Statistical Model format (optimized for classical ML)
- Performance Optimization:
  - Batch processing support
  - Configurable batch sizes
  - Memory-efficient processing of large texts
  - GPU memory management
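The URL and contact-information handling listed above can be pictured as pattern-based substitution with configurable tokens. The following is a minimal sketch, not SqueakyCleanText's actual implementation: the regular expressions and the `replace_contacts` helper are hypothetical and far simpler than what a production cleaner needs.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration only).
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def replace_contacts(text, url_token="<URL>", email_token="<EMAIL>"):
    """Replace URLs and email addresses with configurable tokens."""
    text = URL_RE.sub(url_token, text)
    text = EMAIL_RE.sub(email_token, text)
    return text

print(replace_contacts("Mail john.doe@company.com or see https://example.com/docs today"))
# Mail <EMAIL> or see <URL> today
```

Passing different `url_token` or `email_token` values mirrors the idea of customizable replacement tokens.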
Benefits
For Language Models
- Maintains text structure while anonymizing sensitive information
- Configurable token replacements
- Preserves context while removing noise
- Handles long documents through intelligent chunking
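Chunking a long document so that each piece fits a model's input limit might look like the sketch below. It splits on whitespace-separated words, which is an assumption made for illustration; the library's own chunking strategy is not described here and may well work on tokenizer tokens instead. The `chunk_words` name is hypothetical.

```python
def chunk_words(text, max_words=200):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

print(chunk_words("one two three four five", max_words=2))
# ['one two', 'three four', 'five']
```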
For Statistical Models
- Removes stopwords and punctuation
- Case normalization
- Special symbol removal
- Optimized for classification tasks
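To make the statistical-model steps above concrete, here is a small sketch using only the standard library. The stopword set and the `to_statistical_format` helper are hypothetical; a real pipeline would use a full, language-specific stopword list.

```python
import string

# Tiny illustrative stopword list (an assumption, not the library's list).
STOPWORDS = {"a", "an", "the", "at", "on", "in", "for", "of", "to", "and"}

def to_statistical_format(text):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in no_punct.split() if w not in STOPWORDS)

print(to_statistical_format("The meeting is on Monday, at the Office!"))
# meeting is monday office
```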
Advanced NER Processing
- Ensemble approach reduces missed entities
- Language-specific models improve accuracy
- Confidence thresholds for precision control
- Efficient batch processing for large datasets
- Automatic handling of long documents
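One way to combine several NER models with a confidence threshold is sketched below. It takes the union of confident spans and resolves overlaps in favor of the higher score. This is only one possible ensemble strategy, chosen for illustration; the exact voting scheme used by SqueakyCleanText is not specified here, and the `merge_entities` helper and its tuple layout are hypothetical.

```python
def merge_entities(predictions, threshold=0.85):
    """Union entity spans from several models, keeping confident, non-overlapping ones.

    predictions: one list per model of (start, end, label, score) tuples.
    """
    # Keep only spans at or above the confidence threshold.
    candidates = [
        span
        for model_spans in predictions
        for span in model_spans
        if span[3] >= threshold
    ]
    # Visit higher-scoring spans first so they win when spans overlap.
    candidates.sort(key=lambda s: s[3], reverse=True)
    kept = []
    for start, end, label, score in candidates:
        overlaps = any(start < k_end and k_start < end for k_start, k_end, _, _ in kept)
        if not overlaps:
            kept.append((start, end, label, score))
    return sorted(kept)  # order by position in the text

model_a = [(8, 16, "PER", 0.97)]
model_b = [(8, 16, "PER", 0.91), (30, 36, "LOC", 0.88), (40, 45, "ORG", 0.40)]
print(merge_entities([model_a, model_b]))
# [(8, 16, 'PER', 0.97), (30, 36, 'LOC', 0.88)]
```

Here the second model contributes a location the first model missed, while its low-confidence organization span is filtered out by the threshold.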
Installation
pip install SqueakyCleanText
Usage
Basic Usage
from sct import TextCleaner
# Initialize the TextCleaner
cleaner = TextCleaner()
# Input text
text = "Contact John Doe at john.doe@company.com. Meeting on 2023-10-01."
# Process the text
lm_text, stat_text, lang = cleaner.process(text)
print(f"Language Model format: {lm_text}")
# Output: "Contact <PERSON> at <EMAIL>. Meeting on <YEAR>."
print(f"Statistical Model format: {stat_text}")
# Output: "contact meeting"
print(f"Detected Language: {lang}")
# Output: "ENGLISH"
Using TextCleanerConfig
from sct import TextCleaner, TextCleanerConfig
# Create an immutable configuration
cfg = TextCleanerConfig(
    check_ner_process=True,
    ner_confidence_threshold=0.85,
    positional_tags=('PER', 'LOC', 'ORG', 'MISC'),
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_numbers="<PHONE>",
    language="ENGLISH",  # Skip auto-detection
)
# Initialize with config
cleaner = TextCleaner(cfg=cfg)
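For readers unfamiliar with immutable configuration objects, the sketch below shows the general pattern with a frozen dataclass. `DemoConfig` is a hypothetical stand-in; whether `TextCleanerConfig` is implemented this way is an assumption, not something stated in this README.

```python
from dataclasses import dataclass, replace, FrozenInstanceError

@dataclass(frozen=True)
class DemoConfig:
    # Hypothetical stand-in, not the real TextCleanerConfig.
    ner_confidence_threshold: float = 0.85
    replace_with_email: str = "<EMAIL>"

base = DemoConfig()
try:
    base.ner_confidence_threshold = 0.5  # mutation is rejected
except FrozenInstanceError:
    print("config is read-only")

# Derive a modified copy instead of mutating in place.
strict = replace(base, ner_confidence_threshold=0.95)
print(base.ner_confidence_threshold, strict.ner_confidence_threshold)
# 0.85 0.95
```

An immutable configuration can be shared safely between cleaners because no caller can change it after construction.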
Legacy Configuration (backward compatible)
from sct import sct, config
# Customize settings via module-level variables
config.CHECK_NER_PROCESS = True
config.NER_CONFIDENCE_THRESHOLD = 0.85
config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
config.REPLACE_WITH_URL = "<URL>"
config.REPLACE_WITH_EMAIL = "<EMAIL>"
config.LANGUAGE = "ENGLISH"
# Initialize (reads from module-level config)
cleaner = sct.TextCleaner()
Batch Processing
from sct import TextCleaner, TextCleanerConfig
cfg = TextCleanerConfig(
    check_remove_stopwords=True,
    check_remove_punctuation=True,
    check_ner_process=True,
    positional_tags=('PER', 'ORG', 'LOC'),
    ner_confidence_threshold=0.90,
)
cleaner = TextCleaner(cfg=cfg)
# Sample texts
texts = [
    "Email maria.garcia@example.es for more info.",  # Spanish
    "Besuchen Sie uns im Büro in Berlin.",  # German
    "Voor vragen, bel +31 20 123 4567.",  # Dutch
]
# Process texts in batch
results = cleaner.process_batch(texts, batch_size=2)
for lm_text, stat_text, lang in results:
    print(f"Language: {lang}")
    print(f"LM Format: {lm_text}")
    print(f"Stat Format: {stat_text}")
    print("-" * 40)
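Conceptually, a `batch_size` argument splits the input list into consecutive slices that are processed one at a time. The generator below is a sketch of that idea only; `iter_batches` is a hypothetical helper and does not claim to reflect how `process_batch` is implemented internally.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

print(list(iter_batches(["first", "second", "third"], batch_size=2)))
# [['first', 'second'], ['third']]
```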
API
TextCleaner
process(text: str) -> Tuple[str, Optional[str], Optional[str]]
Processes the input text and returns a tuple containing:
- Cleaned text formatted for language models.
- Cleaned text formatted for statistical models (None if check_statistical_model_processing is False).
- Detected language of the text (None if language detection is disabled).
process_batch(texts: List[str], batch_size: int = None) -> List[Tuple[str, Optional[str], Optional[str]]]
Processes multiple texts. Each result follows the same format as process().
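Because the second and third tuple elements may be None, calling code should check them before use. The helper below is a sketch that works on any tuple of the documented shape; `summarize` and the `Result` alias are hypothetical names introduced for this example.

```python
from typing import Optional, Tuple

# Shape of one result as documented above: (lm_text, stat_text, language).
Result = Tuple[str, Optional[str], Optional[str]]

def summarize(result: Result) -> str:
    """Format one result, substituting placeholders for disabled outputs."""
    lm_text, stat_text, lang = result
    stat_part = stat_text if stat_text is not None else "(statistical output disabled)"
    lang_part = lang if lang is not None else "(language detection disabled)"
    return f"[{lang_part}] {lm_text} | {stat_part}"

print(summarize(("Contact <PERSON>", None, "ENGLISH")))
# [ENGLISH] Contact <PERSON> | (statistical output disabled)
```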
Contributing
Contributions are welcome! Please feel free to submit a Pull Request or open an issue.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
The package took inspiration from the following repository: