
Text preprocessing & PII anonymization pipeline for NLP/ML: ONNX NER ensemble, language detection, stopword removal, and configurable token replacement.


SqueakyCleanText


A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.

Using an AI coding assistant? This repo includes an llms.txt with the full API surface, config reference, and Q&A — optimised for Claude, Cursor, Copilot, and ChatGPT.

In the world of machine learning and natural language processing, clean and well-structured text data is crucial for building effective downstream models and managing token limits in language models.

SqueakyCleanText simplifies the process by automatically addressing common text issues — removing PII, anonymizing named entities (persons, organisations, locations), and ensuring your data is clean and well-structured for language models and classical ML pipelines with minimal effort on your part.

Key Features

  • Named Entity Recognition (NER):
    • Multi-backend: ONNX (default, torch-free), PyTorch, GLiNER, and ensemble modes
    • Zero-shot custom entities via GLiNER (e.g., PRODUCT, EVENT, SKILL)
    • Multi-language support (English, Dutch, German, Spanish, French, Portuguese, Italian)
    • Ensemble voting across backends for improved accuracy
    • Configurable confidence thresholds
    • Lazy model loading (models load on demand per language)
    • Shared ONNX sessions across same-model languages (~600 MB RAM saved)
    • Automatic text chunking for long documents (CJK/Arabic safe)
    • GPU acceleration support (CUDA for ONNX and PyTorch)
    • Model warm-up API to pre-load on startup
  • Text Normalization:
    • Corrects text encoding problems and handles bad Unicode characters
    • Removes or replaces HTML tags and URLs with configurable tokens
    • Handles emails, phone numbers, and other contact details
    • Multilingual date detection and replacement (ISO 8601, month names, common formats)
    • Fuzzy date matching for misspelled months (requires [fuzzy] extra)
    • Year and number standardization
    • Configurable emoji removal
    • Configurable bracket/brace content removal
    • Removes isolated letters and symbols
    • Normalizes whitespace and handles currency symbols
    • Smart case folding (preserves NER tokens like <PERSON>)
  • Language Support:
    • Automatic language detection (English, Dutch, German, Spanish)
    • Language-specific NER models; French, Portuguese, Italian via multilingual model
    • Language-aware stopword removal
    • Extensible: add custom languages with stopwords, month names, and NER models
  • Dual Output Formats:
    • Language Model format (preserves structure with tokens)
    • Statistical Model format (optimized for classical ML)
  • Performance:
    • ONNX Runtime inference (torch-free base install, ~3-5x faster than PyTorch)
    • Thread-parallel batch processing via ThreadPoolExecutor
    • Async batch processing (aprocess_batch) for FastAPI / aiohttp
    • Lazy model loading (only loads models as needed)
    • Shared ONNX sessions for same-model languages (saves ~600 MB for FR/PT/IT)
    • Memory-efficient processing of large texts
    • GPU acceleration (CUDA) for both ONNX and PyTorch backends

Default Flow of Cleaning Text

Benefits

For Language Models

  • Maintains text structure while anonymizing sensitive information
  • Configurable token replacements
  • Preserves context while removing noise
  • Handles long documents through intelligent chunking

For Statistical Models

  • Removes stopwords and punctuation
  • Case normalization
  • Special symbol removal
  • Optimized for classification tasks

Advanced NER Processing

  • Ensemble approach reduces missed entities
  • Language-specific models improve accuracy
  • Confidence thresholds for precision control
  • Efficient batch processing for large datasets
  • Automatic handling of long documents

Installation

pip install SqueakyCleanText

The base install uses ONNX Runtime for NER inference — no PyTorch or Transformers required.

Optional Extras

Extra Command What it adds
GPU pip install SqueakyCleanText[gpu] CUDA-accelerated ONNX inference
Fuzzy dates pip install SqueakyCleanText[fuzzy] Fuzzy month name matching (rapidfuzz)
PyTorch NER pip install SqueakyCleanText[torch] PyTorch/Transformers NER backend
GLiNER pip install SqueakyCleanText[gliner] GLiNER zero-shot NER
GLiNER2 pip install SqueakyCleanText[gliner2] GLiNER2 (knowledgator) backend
All NER pip install SqueakyCleanText[all-ner] All NER backends combined
Development pip install SqueakyCleanText[dev] Testing and linting tools

You can combine extras: pip install SqueakyCleanText[gpu,fuzzy,gliner]

Usage

Basic Usage

from sct import TextCleaner

# Initialize the TextCleaner
cleaner = TextCleaner()

# Input text
text = "Contact John Doe at john.doe@company.com. Meeting on 2023-10-01."

# Process the text
lm_text, stat_text, lang = cleaner.process(text)

print(f"Language Model format:    {lm_text}")
# Output: "Contact <PERSON> at <EMAIL>. Meeting on <YEAR>."

print(f"Statistical Model format: {stat_text}")
# Output: "contact meeting"

print(f"Detected Language: {lang}")
# Output: "ENGLISH"

Using TextCleanerConfig

from sct import TextCleaner, TextCleanerConfig

# Create an immutable configuration
cfg = TextCleanerConfig(
    check_ner_process=True,
    ner_confidence_threshold=0.85,
    positional_tags=('PER', 'LOC', 'ORG', 'MISC'),
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_numbers="<PHONE>",
    language="ENGLISH",  # Skip auto-detection
)

# Initialize with config
cleaner = TextCleaner(cfg=cfg)

GLiNER: Zero-Shot Custom NER

Use GLiNER to recognize any entity type without retraining:

from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='gliner',
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location', 'product', 'event'),
    gliner_label_map={
        'person': 'PER', 'organization': 'ORG', 'location': 'LOC',
        # 'product' and 'event' are unmapped — they become <PRODUCT>, <EVENT> tokens
    },
    gliner_threshold=0.4,
)

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process(
    "John bought an iPhone at the Apple Store in Berlin during CES 2025."
)
# lm_text: "<PERSON> bought an <PRODUCT> at the <ORGANISATION> in <LOCATION> during <EVENT>."

Ensemble NER

Combine ONNX/Torch models with GLiNER for improved recall via ensemble voting:

from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='ensemble_onnx',  # or 'ensemble_torch'
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location'),
    gliner_label_map={'person': 'PER', 'organization': 'ORG', 'location': 'LOC'},
)

cleaner = TextCleaner(cfg=cfg)
lm_text, stat_text, lang = cleaner.process("Angela Merkel visited the Bundestag in Berlin.")

Batch Processing

from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    check_remove_stopwords=True,
    check_remove_punctuation=True,
    check_ner_process=True,
    positional_tags=('PER', 'ORG', 'LOC'),
    ner_confidence_threshold=0.90,
)

cleaner = TextCleaner(cfg=cfg)

# Sample texts
texts = [
    "Email maria.garcia@example.es for more info.",  # Spanish
    "Besuchen Sie uns im Büro in Berlin.",           # German
    "Voor vragen, bel +31 20 123 4567.",             # Dutch
]

# Process texts in batch (uses ThreadPoolExecutor for parallel processing)
results = cleaner.process_batch(texts, batch_size=2)

for lm_text, stat_text, lang in results:
    print(f"Language: {lang}")
    print(f"LM Format:    {lm_text}")
    print(f"Stat Format:  {stat_text}")
    print("-" * 40)

Legacy Configuration (backward compatible)

from sct import sct, config

# Customize settings via module-level variables
config.CHECK_NER_PROCESS = True
config.NER_CONFIDENCE_THRESHOLD = 0.85
config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
config.REPLACE_WITH_URL = "<URL>"
config.REPLACE_WITH_EMAIL = "<EMAIL>"
config.LANGUAGE = "ENGLISH"

# Initialize (reads from module-level config)
cleaner = sct.TextCleaner()

Note: The legacy module-level configuration is not thread-safe. For concurrent processing, use TextCleanerConfig instead.

NER Backends

SqueakyCleanText supports five NER backends, selectable via the ner_backend config field:

  • onnx (default): ONNX Runtime inference with quantized XLM-RoBERTa models. Dependencies: base install. Best for production (fast, torch-free).
  • torch: PyTorch/Transformers pipeline with full XLM-RoBERTa models. Dependencies: [torch] extra. Best for compatibility with existing PyTorch workflows.
  • gliner: GLiNER zero-shot NER with custom entity labels. Dependencies: [gliner] or [gliner2] extra. Best for custom entity types (PRODUCT, SKILL, EVENT, etc.).
  • ensemble_onnx: ONNX + GLiNER ensemble voting. Dependencies: [gliner] extra. Best for maximum recall with custom entities.
  • ensemble_torch: Torch + GLiNER ensemble voting. Dependencies: [torch,gliner] extra. Best for maximum recall with PyTorch.

Default NER Models (ONNX)

Language Model
English rhnfzl/xlm-roberta-large-conll03-english-onnx
Dutch rhnfzl/xlm-roberta-large-conll02-dutch-onnx
German rhnfzl/xlm-roberta-large-conll03-german-onnx
Spanish rhnfzl/xlm-roberta-large-conll02-spanish-onnx
French / Portuguese / Italian rhnfzl/wikineural-multilingual-ner-onnx (shared session)
Multilingual (fallback) rhnfzl/wikineural-multilingual-ner-onnx

GLiNER Label Mapping

GLiNER uses lowercase free-text labels (e.g., 'person', 'product'). To map them to standard NER tags used by the anonymizer, use gliner_label_map:

gliner_label_map={
    'person': 'PER',          # → <PERSON>
    'organization': 'ORG',    # → <ORGANISATION>
    'location': 'LOC',        # → <LOCATION>
}
# Unmapped labels are uppercased automatically:
# 'product' → <PRODUCT>, 'event' → <EVENT>, 'skill' → <SKILL>

API

TextCleaner

process(text: str) -> Tuple[str, Optional[str], Optional[str]]

Processes the input text and returns a tuple containing:

  • Cleaned text formatted for language models.
  • Cleaned text formatted for statistical models (None if check_statistical_model_processing is False).
  • Detected language of the text (None if language detection is disabled).

process_batch(texts: List[str], batch_size: Optional[int] = None) -> List[Tuple[str, Optional[str], Optional[str]]]

Processes multiple texts using thread-parallel execution. Each result follows the same format as process().

aprocess_batch(texts: List[str], batch_size: Optional[int] = None) -> List[Tuple[str, Optional[str], Optional[str]]]

Async version of process_batch for use with asyncio-based frameworks (FastAPI, aiohttp). Runs the batch in a thread-pool executor so it does not block the event loop:

from sct import TextCleaner

cleaner = TextCleaner()

# In an async context (FastAPI route, aiohttp handler, etc.)
results = await cleaner.aprocess_batch(texts)
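The thread-pool delegation described above can be sketched in plain asyncio. This is a generic stand-in for the pattern, not the library's internals:

```python
import asyncio

def blocking_batch(texts: list[str]) -> list[str]:
    # Stand-in for the synchronous process_batch() call.
    return [" ".join(t.split()) for t in texts]

async def aprocess_batch_sketch(texts: list[str]) -> list[str]:
    # Hand the blocking work to the default thread-pool executor
    # so the running event loop is never blocked.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_batch, texts)

print(asyncio.run(aprocess_batch_sketch(["  hello   world  "])))
# ['hello world']
```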

warmup(languages: Optional[List[str]] = None) -> None

Pre-loads NER models to avoid first-request latency. Call once during application startup:

cleaner = TextCleaner()
cleaner.warmup(['ENGLISH', 'DUTCH'])  # or warmup() for all supported languages

TextCleanerConfig

Immutable (frozen) dataclass. Create modified copies with dataclasses.replace():

import dataclasses
new_cfg = dataclasses.replace(cfg, check_ner_process=False)
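A hypothetical stand-in frozen dataclass (mimicking two of TextCleanerConfig's fields) shows the frozen semantics: replace() returns a new object, the original is untouched, and in-place assignment raises:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class CfgSketch:
    # Hypothetical stand-in for TextCleanerConfig
    check_ner_process: bool = True
    ner_confidence_threshold: float = 0.85

cfg = CfgSketch()
new_cfg = dataclasses.replace(cfg, check_ner_process=False)

print(cfg.check_ner_process)      # True (original unchanged)
print(new_cfg.check_ner_process)  # False

try:
    cfg.check_ner_process = False  # in-place mutation is rejected
except dataclasses.FrozenInstanceError:
    print("frozen instances cannot be mutated")
```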
Full configuration reference

Pipeline toggles (all bool, default shown):

Field Default Description
check_detect_language True Auto-detect language
check_fix_bad_unicode True Fix encoding issues via ftfy
check_to_ascii_unicode True Transliterate to ASCII
check_replace_html True Strip/replace HTML tags
check_replace_urls True Replace URLs with token
check_replace_emails True Replace emails with token
check_replace_years True Replace years (1900-2099)
check_replace_dates False Replace full dates (ISO 8601, month names)
check_fuzzy_replace_dates False Fuzzy match misspelled months (requires [fuzzy])
check_replace_phone_numbers True Replace phone numbers
check_replace_numbers True Replace standalone numbers
check_replace_currency_symbols True Replace currency symbols
check_ner_process True Run NER entity recognition
check_remove_isolated_letters True Remove single letters
check_remove_isolated_special_symbols True Remove isolated symbols
check_remove_bracket_content True Remove [...] content
check_remove_brace_content True Remove {...} content
check_normalize_whitespace True Normalize whitespace
check_statistical_model_processing True Generate stat model output
check_casefold True Lowercase stat output
check_smart_casefold False Lowercase but preserve NER tokens
check_remove_stopwords True Remove stopwords from stat output
check_remove_punctuation True Remove punctuation from stat output
check_remove_stext_custom_stop_words True Remove custom stop words from stat output
check_remove_emoji False Remove emoji characters

Replacement tokens (str; replace_with_currency_symbols defaults to None):

Field Default
replace_with_url "<URL>"
replace_with_html "<HTML>"
replace_with_email "<EMAIL>"
replace_with_years "<YEAR>"
replace_with_dates "<DATE>"
replace_with_phone_numbers "<PHONE>"
replace_with_numbers "<NUMBER>"
replace_with_currency_symbols None

NER settings:

Field Default Description
ner_backend 'onnx' Backend: onnx, torch, gliner, ensemble_onnx, ensemble_torch
positional_tags ('PER', 'LOC', 'ORG', 'MISC') Entity types to recognize
ner_confidence_threshold 0.85 Minimum confidence score
ner_batch_size 8 Inference batch size (must be >= 1)
ner_models None Language-keyed dict of ONNX model repo IDs
torch_ner_models None Language-keyed dict of PyTorch model repo IDs
gliner_model None GLiNER model ID (required for gliner/ensemble backends)
gliner_variant 'gliner' 'gliner' or 'gliner2'
gliner_labels ('person', 'organization', 'location') GLiNER entity labels
gliner_label_map None Maps GLiNER labels to NER tags
gliner_threshold 0.4 GLiNER confidence threshold
fuzzy_date_score_cutoff 85 Fuzzy matching threshold (0-100) for misspelled months
custom_pipeline_steps () Tuple of (text: str) -> str callables appended after all built-in steps
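As an example of custom_pipeline_steps, here is a hypothetical step that redacts internal ticket IDs; any (text: str) -> str callable works, and the commented line shows where it would be wired into the config:

```python
import re

# Hypothetical custom step: redact internal ticket IDs such as "JIRA-1234".
TICKET_PATTERN = re.compile(r"\b[A-Z]{2,10}-\d+\b")

def redact_ticket_ids(text: str) -> str:
    """Replace ticket references with a <TICKET> placeholder."""
    return TICKET_PATTERN.sub("<TICKET>", text)

# Wiring it in (field name from the table above):
# cfg = TextCleanerConfig(custom_pipeline_steps=(redact_ticket_ids,))

print(redact_ticket_ids("Fixed in JIRA-1234, see also OPS-77."))
# Fixed in <TICKET>, see also <TICKET>.
```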

Language settings:

Field Default Description
language None Pin language (skip detection)
extra_languages () Additional language names for detection
custom_stopwords None {LANG: frozenset({...})} custom stopword sets
custom_month_names None {LANG: ('Jan', 'Feb', ...)} for date detection
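A hedged sketch of the shapes these two fields expect, based on the defaults shown in the table:

```python
# Assumed shapes only; keys are upper-case language names as used
# elsewhere in the config (e.g. "ENGLISH").
custom_stopwords = {
    "ENGLISH": frozenset({"regards", "sincerely", "please"}),
}
custom_month_names = {
    "ENGLISH": ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
}
# Passed via TextCleanerConfig(custom_stopwords=custom_stopwords,
#                              custom_month_names=custom_month_names)
```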

Architecture

SqueakyCleanText processes text through a configurable pipeline of sequential steps:

Input Text
  │
  ├─ Fix Unicode (ftfy)
  ├─ ASCII transliteration (unidecode)
  ├─ Emoji removal
  ├─ HTML replacement
  ├─ URL / Email / Phone replacement
  ├─ Date & Year replacement
  ├─ Number & Currency replacement
  ├─ Isolated letter/symbol removal
  ├─ Whitespace normalization
  │
  ├─ NER Processing (ONNX / Torch / GLiNER / Ensemble)
  │   ├─ Language detection (Lingua)
  │   ├─ Text chunking (token-bounded)
  │   ├─ Entity recognition (per-chunk)
  │   ├─ Ensemble voting (cross-model)
  │   └─ Entity anonymization (Presidio)
  │
  └─ Statistical Model Output
      ├─ Case folding
      ├─ Stopword removal
      └─ Punctuation removal

  ▼
(lm_text, stat_text, language)

Each step is toggled by a TextCleanerConfig field. The pipeline is built once at initialization — disabled steps are skipped entirely (zero overhead).
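That build-once pattern can be sketched generically (illustrative code, not the library's actual implementation):

```python
from typing import Callable

def build_pipeline(enabled: dict[str, bool]) -> Callable[[str], str]:
    """Build the step list once; disabled steps never appear in it."""
    registry = {
        "strip": str.strip,
        "casefold": str.casefold,
        "normalize_whitespace": lambda t: " ".join(t.split()),
    }
    steps = [fn for name, fn in registry.items() if enabled.get(name)]

    def run(text: str) -> str:
        for step in steps:  # only enabled steps are ever iterated
            text = step(text)
        return text

    return run

pipeline = build_pipeline({"strip": True, "normalize_whitespace": True})
print(pipeline("  Hello   World  "))
# Hello World
```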

What's New in v0.5.0

Quality, performance, and API improvements:

Async & API

  • aprocess_batch() — async batch processing for FastAPI / aiohttp (uses get_running_loop, Python 3.12+ compatible)
  • warmup(languages) — public method to pre-load NER models at startup
  • custom_pipeline_steps config field — plug in arbitrary (text: str) -> str callables after the built-in pipeline

Language Support

  • French, Portuguese, Italian now supported out of the box via the multilingual ONNX model
  • ONNX sessions are shared across same-model languages (FR/PT/IT → one session, ~600 MB saved)

Performance & Thread Safety

  • Per-model inference locks replace the coarse per-language lock — true concurrent inference across different language models
  • split_text() is now lock-free (HF fast tokenizer is thread-safe)
  • Conservative chars/token ratio (2×) in _simple_chunk prevents context-window overflow for CJK and Arabic texts

Correctness

  • SENTENCE_BOUNDARY_PATTERN upgraded to the regex library with an abbreviation guard — "Dr. Smith", "Mr. Jones", "U.S. Army" no longer cause false splits during NER chunking
  • ner_batch_size=0 and ner_batch_size=-1 now raise ValueError immediately instead of silently producing empty results
  • Quantized ONNX models are cached to ~/.cache/sct_quantized/ instead of the read-only HuggingFace Hub cache directory

What's New in v0.4.5

Major release with architectural overhaul since v0.3.0:

Architecture

  • Frozen TextCleanerConfig dataclass replaces global mutable config (thread-safe, per-instance)
  • ONNX-first NER inference — torch-free base install (~400MB models vs ~7GB)
  • Thread-parallel batch processing via ThreadPoolExecutor (ONNX releases the GIL)

NER

  • 5 backends: onnx, torch, gliner, ensemble_onnx, ensemble_torch
  • GLiNER zero-shot NER for custom entity types (PRODUCT, EVENT, SKILL, etc.)
  • Ensemble voting across backends for improved recall
  • Lazy per-language model loading (only loads models when needed)
  • Language-keyed model dict replaces fragile positional tuple
  • ONNX-quantized models hosted on HuggingFace Hub

Text Processing

  • Multilingual date detection (ISO 8601, European formats, month names in EN/NL/DE/ES)
  • Fuzzy date matching for misspelled months (via rapidfuzz, empirically calibrated threshold)
  • Configurable emoji removal
  • Configurable bracket/brace content removal
  • Smart case folding (preserves NER replacement tokens)
  • Custom stopwords and month names per language

Dependencies

  • stop-words package replaces NLTK (50KB bundled vs 30MB download)
  • PyTorch/Transformers moved to optional [torch] extra
  • New optional extras: [gpu], [fuzzy], [gliner], [gliner2], [all-ner]
  • Migrated from setup.py to pyproject.toml (PEP 517)

Quality

  • Python 3.11–3.13 support
  • ruff linter (replaces flake8)
  • hypothesis-based property testing with pytest-timeout
  • Collision-safe NER entity keys

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or open an issue.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

This package took inspiration from the following repository:
