A comprehensive text cleaning and preprocessing pipeline.

SqueakyCleanText

A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.

In machine learning and natural language processing, clean, well-structured text is crucial for building effective downstream models and for staying within the token limits of language models.

SqueakyCleanText automatically fixes common text issues, so you get clean, well-structured data with minimal effort.

Key Features

- Encoding Issues: Corrects text encoding problems and handles bad Unicode characters.
- HTML and URLs: Removes or replaces HTML tags and URLs with configurable tokens.
- Contact Information: Handles emails, phone numbers, and other contact details with customizable replacement tokens.
- Named Entity Recognition (NER):
  - Multi-language support (English, Dutch, German, Spanish)
  - Ensemble voting technique for improved accuracy
  - Configurable confidence thresholds
  - Lazy model loading (models load on demand per language)
  - Automatic text chunking for long documents
  - GPU acceleration support
- Text Normalization:
  - Removes isolated letters and symbols
  - Normalizes whitespace
  - Handles currency symbols
  - Date and year detection and replacement
  - Number standardization
  - Configurable emoji removal
  - Configurable bracket/brace content removal
- Language Support:
  - Automatic language detection
  - Language-specific NER models
  - Language-aware stopword removal
- Dual Output Formats:
  - Language Model format (preserves structure with tokens)
  - Statistical Model format (optimized for classical ML)
- Performance Optimization:
  - Batch processing support
  - Configurable batch sizes
  - Memory-efficient processing of large texts
  - GPU memory management
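
As a rough illustration of the token replacement and whitespace normalization listed above, the sketch below uses plain regular expressions. It is not SqueakyCleanText's implementation: the function name `replace_tokens` and the patterns are assumptions made for this example, and the patterns are intentionally simplistic.

```python
import re

# Deliberately simple patterns for illustration only; real emails and URLs
# have many more edge cases than these expressions cover.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def replace_tokens(text: str, url_token: str = "<URL>", email_token: str = "<EMAIL>") -> str:
    """Hypothetical helper: swap URLs and emails for tokens, then tidy whitespace."""
    text = URL_RE.sub(url_token, text)      # URLs first, so their parts are not re-matched
    text = EMAIL_RE.sub(email_token, text)  # then email addresses
    return WHITESPACE_RE.sub(" ", text).strip()


print(replace_tokens("Mail  jane@example.com or visit https://example.com/docs now"))
# → Mail <EMAIL> or visit <URL> now
```

Keeping the replacement tokens as parameters mirrors the idea of configurable tokens: the same cleaning step can emit whatever placeholder a downstream model expects.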

Default Flow of Cleaning Text

Benefits

For Language Models

- Maintains text structure while anonymizing sensitive information
- Supports configurable replacement tokens
- Preserves context while removing noise
- Handles long documents through intelligent chunking
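
One common way to handle long documents is to split them into overlapping windows, so that an entity straddling a boundary still appears whole in at least one chunk. The sketch below shows that general idea only. How SqueakyCleanText actually chunks text is not described here, and `chunk_words`, `max_words`, and `overlap` are hypothetical names.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Hypothetical helper: split text into word windows that overlap slightly."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
# → ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```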

For Statistical Models

- Removes stopwords and punctuation
- Normalizes case
- Removes special symbols
- Optimized for classification tasks
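
To make the statistical-model steps concrete, here is a minimal sketch of lowercasing, punctuation removal, and stopword filtering using only the standard library. It is an illustration under assumptions, not the package's code: the function `to_statistical_format` and the tiny `STOPWORDS` set are invented for this example.

```python
import string

# Tiny illustrative stopword list; a real pipeline would use a per-language list.
STOPWORDS = frozenset({"the", "a", "an", "at", "on", "of", "for", "and", "is", "to"})

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def to_statistical_format(text: str, stopwords: frozenset = STOPWORDS) -> str:
    """Hypothetical helper: lowercase, drop punctuation, drop stopwords."""
    no_punct = text.lower().translate(_PUNCT_TABLE)
    tokens = [tok for tok in no_punct.split() if tok not in stopwords]
    return " ".join(tokens)


print(to_statistical_format("The meeting is on Monday, at the office!"))
# → meeting monday office
```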

Advanced NER Processing

- Ensemble approach reduces missed entities
- Language-specific models improve accuracy
- Confidence thresholds for precision control
- Efficient batch processing for large datasets
- Automatic handling of long documents
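
The combination of ensemble voting and a confidence threshold can be sketched as follows. This is one possible voting scheme chosen for illustration (the label with the highest summed confidence per span wins, and the span is kept if the mean confidence clears the threshold). The actual scheme used by SqueakyCleanText may differ, and `Entity` and `vote` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Entity(NamedTuple):
    start: int    # character offset where the entity begins
    end: int      # character offset where the entity ends
    label: str    # e.g. "PER", "LOC", "ORG"
    score: float  # model confidence in [0, 1]


def vote(predictions: List[List[Entity]], threshold: float = 0.85) -> List[Entity]:
    """Hypothetical ensemble vote over the outputs of several NER models."""
    # Collect scores per exact span, then per label proposed for that span.
    by_span: Dict[Tuple[int, int], Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for model_output in predictions:
        for ent in model_output:
            by_span[(ent.start, ent.end)][ent.label].append(ent.score)

    kept = []
    for (start, end), labels in sorted(by_span.items()):
        # The winning label is the one with the largest summed confidence.
        label, scores = max(labels.items(), key=lambda item: sum(item[1]))
        mean_score = sum(scores) / len(scores)
        if mean_score >= threshold:  # confidence threshold for precision control
            kept.append(Entity(start, end, label, mean_score))
    return kept


model_a = [Entity(8, 16, "PER", 0.95), Entity(30, 36, "LOC", 0.60)]
model_b = [Entity(8, 16, "PER", 0.91), Entity(30, 36, "ORG", 0.70)]
for ent in vote([model_a, model_b]):
    print(ent.start, ent.end, ent.label)
# → 8 16 PER
```

Raising `threshold` trades recall for precision: fewer spans survive, but those that do are ones the models agree on with high confidence.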

Installation

pip install SqueakyCleanText

Usage

Basic Usage

from sct import TextCleaner

# Initialize the TextCleaner
cleaner = TextCleaner()

# Input text
text = "Contact John Doe at john.doe@company.com. Meeting on 2023-10-01."

# Process the text
lm_text, stat_text, lang = cleaner.process(text)

print(f"Language Model format:    {lm_text}")
# Output: "Contact <PERSON> at <EMAIL>. Meeting on <YEAR>."

print(f"Statistical Model format: {stat_text}")
# Output: "contact meeting"

print(f"Detected Language: {lang}")
# Output: "ENGLISH"

Using TextCleanerConfig

from sct import TextCleaner, TextCleanerConfig

# Create an immutable configuration
cfg = TextCleanerConfig(
    check_ner_process=True,
    ner_confidence_threshold=0.85,
    positional_tags=('PER', 'LOC', 'ORG', 'MISC'),
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_numbers="<PHONE>",
    language="ENGLISH",  # Skip auto-detection
)

# Initialize with config
cleaner = TextCleaner(cfg=cfg)
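
For readers unfamiliar with immutable configuration objects, the sketch below shows the general pattern with a frozen dataclass. It is not the definition of `TextCleanerConfig`: the class `SketchConfig`, its fields, and its default values are assumptions used only to show how an immutable config behaves.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for an immutable config; defaults are illustrative."""
    check_ner_process: bool = True
    ner_confidence_threshold: float = 0.85
    positional_tags: Tuple[str, ...] = ("PER", "LOC", "ORG", "MISC")
    language: Optional[str] = None  # None here stands for "auto-detect"


base = SketchConfig()

# Variations are derived as new objects rather than by mutating the original.
strict = replace(base, ner_confidence_threshold=0.95)

try:
    base.ner_confidence_threshold = 0.5  # in-place changes are rejected
except FrozenInstanceError:
    print("frozen configs reject in-place changes")

print(base.ner_confidence_threshold, strict.ner_confidence_threshold)
# → 0.85 0.95
```

An immutable config can be shared safely between cleaners or threads, because no caller can change settings underneath another.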

Legacy Configuration (backward compatible)

from sct import sct, config

# Customize settings via module-level variables
config.CHECK_NER_PROCESS = True
config.NER_CONFIDENCE_THRESHOLD = 0.85
config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
config.REPLACE_WITH_URL = "<URL>"
config.REPLACE_WITH_EMAIL = "<EMAIL>"
config.LANGUAGE = "ENGLISH"

# Initialize (reads from module-level config)
cleaner = sct.TextCleaner()

Batch Processing

from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    check_remove_stopwords=True,
    check_remove_punctuation=True,
    check_ner_process=True,
    positional_tags=('PER', 'ORG', 'LOC'),
    ner_confidence_threshold=0.90,
)

cleaner = TextCleaner(cfg=cfg)

# Sample texts
texts = [
    "Email maria.garcia@example.es for more info.",  # English (Spanish email domain)
    "Besuchen Sie uns im Büro in Berlin.",           # German
    "Voor vragen, bel +31 20 123 4567.",             # Dutch
]

# Process texts in batch
results = cleaner.process_batch(texts, batch_size=2)

for lm_text, stat_text, lang in results:
    print(f"Language: {lang}")
    print(f"LM Format:    {lm_text}")
    print(f"Stat Format:  {stat_text}")
    print("-" * 40)

API

`TextCleaner`

`process(text: str) -> Tuple[str, Optional[str], Optional[str]]`

Processes the input text and returns a tuple containing:

1. Cleaned text formatted for language models.
2. Cleaned text formatted for statistical models (`None` if `check_statistical_model_processing` is `False`).
3. Detected language of the text (`None` if language detection is disabled).

`process_batch(texts: List[str], batch_size: int = None) -> List[Tuple[str, Optional[str], Optional[str]]]`

Processes multiple texts. Each result follows the same format as process().
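
Because the second and third tuple elements may be `None`, downstream code should handle that case explicitly. The sketch below shows one way to do so, using hand-written tuples in place of real `process_batch()` output. The helper `to_rows` and the `"UNKNOWN"` placeholder are assumptions for this example, not part of the package.

```python
from typing import Dict, List, Optional, Tuple

# Same shape as the documented return type of process().
Result = Tuple[str, Optional[str], Optional[str]]


def to_rows(results: List[Result]) -> List[Dict[str, str]]:
    """Hypothetical helper: turn result tuples into dicts with None made explicit."""
    rows = []
    for lm_text, stat_text, lang in results:
        rows.append({
            "lm_text": lm_text,
            # Statistical output is None when that processing step is disabled.
            "stat_text": stat_text if stat_text is not None else "",
            # Language is None when detection is disabled.
            "language": lang if lang is not None else "UNKNOWN",
        })
    return rows


# Hand-written stand-ins for real output, for illustration only.
fake_results: List[Result] = [
    ("Contact <PERSON> at <EMAIL>.", "contact", "ENGLISH"),
    ("Call <PHONE>.", None, None),
]
for row in to_rows(fake_results):
    print(row["language"], "|", row["lm_text"])
```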

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or open an issue.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

This package was inspired by the following repository:

clean-text

Release history

- 0.6.1: Feb 28, 2026
- 0.6.0: Feb 28, 2026
- 0.5.2: Feb 28, 2026
- 0.5.1: Feb 23, 2026
- 0.5.0: Feb 23, 2026
- 0.4.5: Feb 23, 2026
- 0.4.1: Feb 23, 2026
- 0.4.0: Feb 23, 2026 (this version)
- 0.3.0: Nov 13, 2024
- 0.2.6: Aug 17, 2024
- 0.2.5: Aug 17, 2024
- 0.2.4: Aug 17, 2024
- 0.2.3: Aug 17, 2024
- 0.2.2: Aug 17, 2024
- 0.2.1: Aug 17, 2024
- 0.2.0: Aug 17, 2024
- 0.1.8: Aug 17, 2024
- 0.1.6: Aug 17, 2024
- 0.1.5: Aug 17, 2024
- 0.1.4: Aug 17, 2024
- 0.1.3: Aug 9, 2024
- 0.1.2: Jun 16, 2024
- 0.1.1: Jun 16, 2024
