SqueakyCleanText

Text preprocessing & PII anonymization pipeline for NLP/ML: ONNX NER ensemble, language detection, stopword removal, and configurable token replacement.
A comprehensive text cleaning and preprocessing pipeline for machine learning and NLP tasks.
Using an AI coding assistant? This repo includes an `llms.txt` with the full API surface, config reference, and Q&A, optimised for Claude, Cursor, Copilot, and ChatGPT.
In the world of machine learning and natural language processing, clean and well-structured text data is crucial for building effective downstream models and managing token limits in language models.
SqueakyCleanText simplifies the process by automatically addressing common text issues - removing PII, anonymizing named entities (persons, organisations, locations), and ensuring your data is clean and well-structured for language models and classical ML pipelines with minimal effort on your part.
Key Features
- Named Entity Recognition (NER):
- Multi-backend: ONNX (default, torch-free), PyTorch, GLiNER, and ensemble modes
- Zero-shot custom entities via GLiNER (e.g., PRODUCT, EVENT, SKILL)
- Multi-language support (English, Dutch, German, Spanish, French, Portuguese, Italian)
- Ensemble voting across backends for improved accuracy
- Configurable confidence thresholds
- Lazy model loading (models load on demand per language)
- Shared ONNX sessions across same-model languages (~600 MB RAM saved)
- Automatic text chunking for long documents (CJK/Arabic safe)
- GPU acceleration support (CUDA for ONNX and PyTorch)
- Model warm-up API to pre-load on startup
- Text Normalization:
- Corrects text encoding problems and handles bad Unicode characters
- Removes or replaces HTML tags and URLs with configurable tokens
- Handles emails, phone numbers, and other contact details
- Multilingual date detection and replacement (ISO 8601, month names, common formats)
- Fuzzy date matching for misspelled months (requires the `[fuzzy]` extra)
- Year and number standardization
- Configurable emoji removal
- Configurable bracket/brace content removal
- Removes isolated letters and symbols
- Normalizes whitespace and handles currency symbols
- Smart case folding (preserves NER tokens like `<PERSON>`)
- Language Support:
- Automatic language detection (English, Dutch, German, Spanish)
- Language-specific NER models; French, Portuguese, Italian via multilingual model
- Language-aware stopword removal
- Extensible: add custom languages with stopwords, month names, and NER models
- Dual Output Formats:
- Language Model format (preserves structure with tokens)
- Statistical Model format (optimized for classical ML)
- Performance:
- ONNX Runtime inference (torch-free base install, ~3-5x faster than PyTorch)
- Thread-parallel batch processing via `ThreadPoolExecutor`
- Async batch processing (`aprocess_batch`) for FastAPI / aiohttp
- Lazy model loading (only loads models as needed)
- Shared ONNX sessions for same-model languages (saves ~600 MB for FR/PT/IT)
- Memory-efficient processing of large texts
- GPU acceleration (CUDA) for both ONNX and PyTorch backends
Benefits
For Language Models
- Maintains text structure while anonymizing sensitive information
- Configurable token replacements
- Preserves context while removing noise
- Handles long documents through intelligent chunking
For Statistical Models
- Removes stopwords and punctuation
- Case normalization
- Special symbol removal
- Optimized for classification tasks
Advanced NER Processing
- Ensemble approach reduces missed entities
- Language-specific models improve accuracy
- Confidence thresholds for precision control
- Efficient batch processing for large datasets
- Automatic handling of long documents
Installation
```shell
pip install SqueakyCleanText
```
The base install uses ONNX Runtime for NER inference - no PyTorch or Transformers required.
Optional Extras
| Extra | Command | What it adds |
|---|---|---|
| GPU | `pip install SqueakyCleanText[gpu]` | CUDA-accelerated ONNX inference |
| Fuzzy dates | `pip install SqueakyCleanText[fuzzy]` | Fuzzy month name matching (rapidfuzz) |
| PyTorch NER | `pip install SqueakyCleanText[torch]` | PyTorch/Transformers NER backend |
| GLiNER | `pip install SqueakyCleanText[gliner]` | GLiNER zero-shot NER |
| GLiNER2 | `pip install SqueakyCleanText[gliner2]` | GLiNER2 (knowledgator) backend |
| Synthetic | `pip install SqueakyCleanText[synthetic]` | Faker-based synthetic replacement (realistic fake values instead of `<TAG>` tokens) |
| Presidio | `pip install SqueakyCleanText[presidio]` | Presidio-analyzer for the `presidio_gliner` backend |
| Classify | `pip install SqueakyCleanText[classify]` | GLiClass document-level pre-classification |
| All NER | `pip install SqueakyCleanText[all-ner]` | All NER backends combined |
| Development | `pip install SqueakyCleanText[dev]` | Testing and linting tools |
You can combine extras: `pip install SqueakyCleanText[gpu,fuzzy,gliner]`
Usage
Basic Usage
```python
from sct import TextCleaner

# Initialize the TextCleaner
cleaner = TextCleaner()

# Input text
text = "Contact John Doe at john.doe@company.com. Meeting on 2023-10-01."

# Process the text
lm_text, stat_text, lang = cleaner.process(text)

print(f"Language Model format: {lm_text}")
# Output: "Contact <PERSON> at <EMAIL>. Meeting on <YEAR>."

print(f"Statistical Model format: {stat_text}")
# Output: "contact meeting"

print(f"Detected Language: {lang}")
# Output: "ENGLISH"
```
Using TextCleanerConfig
```python
from sct import TextCleaner, TextCleanerConfig

# Create an immutable configuration
cfg = TextCleanerConfig(
    check_ner_process=True,
    ner_confidence_threshold=0.85,
    positional_tags=('PER', 'LOC', 'ORG', 'MISC'),
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_numbers="<PHONE>",
    language="en",  # Pin to English (also accepts 'ENGLISH', 'eng')
)

# Initialize with config
cleaner = TextCleaner(cfg=cfg)
```
Language Specification
All language parameters accept Lingua names (`'ENGLISH'`), ISO 639-1 (`'en'`), or ISO 639-3 (`'eng'`) codes:

```python
# Pin to one language (skip auto-detection)
cfg = TextCleanerConfig(language='de', check_ner_process=False)

# Restrict detection to specific languages (auto-detect among them)
cfg = TextCleanerConfig(language=('en', 'nl', 'de'), check_ner_process=False)

# Add extra languages for detection
cfg = TextCleanerConfig(extra_languages=('fr', 'pt'), check_ner_process=False)
```
GLiNER: Zero-Shot Custom NER
Use GLiNER to recognize any entity type without retraining:
```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='gliner',
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location', 'product', 'event'),
    gliner_label_map={
        'person': 'PER', 'organization': 'ORG', 'location': 'LOC',
        # 'product' and 'event' are unmapped - they become <PRODUCT>, <EVENT> tokens
    },
    gliner_threshold=0.4,
)
cleaner = TextCleaner(cfg=cfg)

lm_text, stat_text, lang = cleaner.process(
    "John bought an iPhone at the Apple Store in Berlin during CES 2025."
)
# lm_text: "<PERSON> bought an <PRODUCT> at the <ORGANISATION> in <LOCATION> during <EVENT>."
```
Ensemble NER
Combine ONNX/Torch models with GLiNER for improved recall via ensemble voting:
```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='ensemble_onnx',  # or 'ensemble_torch'
    gliner_model='urchade/gliner_large-v2.1',
    gliner_labels=('person', 'organization', 'location'),
    gliner_label_map={'person': 'PER', 'organization': 'ORG', 'location': 'LOC'},
)
cleaner = TextCleaner(cfg=cfg)

lm_text, stat_text, lang = cleaner.process("Angela Merkel visited the Bundestag in Berlin.")
```
PII Detection Mode
Automatically configure GLiNER for comprehensive PII detection with 60+ entity types (personal, financial, healthcare, identity, digital):
```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(ner_mode='pii')
cleaner = TextCleaner(cfg=cfg)

lm_text, stat_text, lang = cleaner.process(
    "John Smith's SSN is 123-45-6789, email john@example.com, DOB 1990-01-15"
)
# Entities are anonymized: names, SSNs, emails, dates of birth, and 50+ more PII types
```

PII mode auto-configures `ner_backend='gliner'`, uses `knowledgator/gliner-pii-base-v1.0`, sets the threshold to 0.3 (recall-focused), and expands positional tags. User-provided values always take priority.
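As a sketch of that precedence rule, an explicitly passed `gliner_threshold` should win over the 0.3 default set by PII mode (fields as documented in this README):

```python
from sct import TextCleaner, TextCleanerConfig

# PII mode, but with a stricter, precision-focused threshold;
# the explicit value overrides the 0.3 default from ner_mode='pii'
cfg = TextCleanerConfig(ner_mode='pii', gliner_threshold=0.6)
cleaner = TextCleaner(cfg=cfg)
```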
Alternative PII models (pass as `gliner_model`):

| Model | Type | Size | Labels | F1 |
|---|---|---|---|---|
| `knowledgator/gliner-pii-base-v1.0` | Uni-encoder | 330MB (ONNX FP16) | 60+ | 80.99% |
| `nvidia/gliner-PII` | Bi-encoder | 570MB | 55+ | — |
| `gretelai/gretel-gliner-bi-base-v1.0` | Bi-encoder | ~800MB | 40+ | 95% |
| `urchade/gliner_multi_pii-v1` | Multilingual | — | — | — |
Synthetic Replacement
Replace detected entities with realistic fake values (via Faker) instead of `<TAG>` placeholder tokens:

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_mode='pii',
    replacement_mode='synthetic',  # pip install squeakycleantext[synthetic]
)
cleaner = TextCleaner(cfg=cfg)

lm_text, stat_text, lang = cleaner.process(
    "Contact John Smith at john.smith@company.com or +1-555-0123"
)
# Output: "Contact Jennifer Williams at lisa45@example.net or +1-555-0198"
# Same entity always maps to same fake value within a document
```
Note: Synthetic replacement preserves data utility for downstream ML tasks but is NOT GDPR-compliant anonymization. Same-document consistency is maintained (same entity text always maps to the same fake value).
Reversible Anonymization
Replace entities with indexed placeholders (`<PERSON_0>`, `<LOCATION_1>`) and get a mapping for round-trip deanonymization:

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_mode='pii',
    replacement_mode='reversible',
)
cleaner = TextCleaner(cfg=cfg)

result = cleaner.process("John Smith works at Google in London.")
print(result.lm_text)
# "<PERSON_0> works at <ORGANISATION_0> in <LOCATION_0>."

# Access the anonymization map via metadata
anon_map = result.metadata['anon_map']
restored = anon_map.deanonymize(result.lm_text)
# "John Smith works at Google in London."

# Serialize the map for storage
import json
json.dumps(anon_map.to_dict())
```
Note: the `ProcessResult` returned by `process()` unpacks as a 3-tuple `(lm_text, stat_text, language)` for backward compatibility, but also exposes `.metadata` for reversible maps and document classification.
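Both access styles work on the same result object; a minimal sketch, using the attribute names documented in this README (`lm_text`, `metadata`):

```python
from sct import TextCleaner

cleaner = TextCleaner()
result = cleaner.process("Email jane@example.com today.")

# Tuple-style unpacking (backward compatible)
lm_text, stat_text, lang = result

# Attribute-style access plus metadata
assert lm_text == result.lm_text
print(result.metadata)
```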
Document Classification (GLiClass)
Classify documents before processing using zero-shot classification with GLiClass:
```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    check_classify_document=True,
    gliclass_labels=('email', 'code', 'legal', 'medical'),
    # gliclass_model defaults to 'knowledgator/gliclass-edge-v3.0' (32.7M params)
)
cleaner = TextCleaner(cfg=cfg)  # pip install squeakycleantext[classify]

result = cleaner.process("Dear Sir, please find attached the contract...")

# Classification results in metadata
print(result.metadata['classes'])
# [{"label": "email", "score": 0.92}, {"label": "legal", "score": 0.78}]
```
Bi-Encoder GLiNER Models
Bi-encoder models (ModernBERT, etc.) are auto-detected and leverage pre-computed label embeddings for faster inference with larger context windows:
```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='gliner',
    gliner_model='knowledgator/gliner-bi-base-v2.0',
    gliner_labels=('person', 'organization', 'location'),
)
cleaner = TextCleaner(cfg=cfg)
# Auto-detects bi-encoder → caches label embeddings → uses 2048+ token context window
```
Entity Description Labels (ZERONER-Style)
Provide natural-language descriptions for labels to improve zero-shot recognition accuracy:
```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    ner_backend='gliner',
    gliner_model='knowledgator/gliner-bi-base-v2.0',
    gliner_label_descriptions={
        'person': "a person's full legal name",
        'location': "a geographical place or address",
        'organization': "a company, institution, or government body",
    },
)
cleaner = TextCleaner(cfg=cfg)
# Descriptions are used for inference; results are mapped back to the original label names
```
Batch Processing
```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    check_remove_stopwords=True,
    check_remove_punctuation=True,
    check_ner_process=True,
    positional_tags=('PER', 'ORG', 'LOC'),
    ner_confidence_threshold=0.90,
)
cleaner = TextCleaner(cfg=cfg)

# Sample texts
texts = [
    "Email maria.garcia@example.es for more info.",  # Spanish
    "Besuchen Sie uns im Büro in Berlin.",           # German
    "Voor vragen, bel +31 20 123 4567.",             # Dutch
]

# Process texts in batch (uses ThreadPoolExecutor for parallel processing)
results = cleaner.process_batch(texts, batch_size=2)

for lm_text, stat_text, lang in results:
    print(f"Language: {lang}")
    print(f"LM Format: {lm_text}")
    print(f"Stat Format: {stat_text}")
    print("-" * 40)
```
Legacy Configuration (backward compatible)
```python
from sct import sct, config

# Customize settings via module-level variables
config.CHECK_NER_PROCESS = True
config.NER_CONFIDENCE_THRESHOLD = 0.85
config.POSITIONAL_TAGS = ['PER', 'LOC', 'ORG']
config.REPLACE_WITH_URL = "<URL>"
config.REPLACE_WITH_EMAIL = "<EMAIL>"
config.LANGUAGE = "ENGLISH"

# Initialize (reads from module-level config)
cleaner = sct.TextCleaner()
```

Note: The legacy module-level configuration is not thread-safe. For concurrent processing, use `TextCleanerConfig` instead.
NER Backends
SqueakyCleanText supports six NER backends, selectable via the `ner_backend` config field:

| Backend | Description | Dependencies | Best for |
|---|---|---|---|
| `onnx` (default) | ONNX Runtime inference with quantized XLM-RoBERTa models | Base install | Production: fast, torch-free |
| `torch` | PyTorch/Transformers pipeline with full XLM-RoBERTa models | `[torch]` extra | Compatibility with existing PyTorch workflows |
| `gliner` | GLiNER zero-shot NER with custom entity labels | `[gliner]` or `[gliner2]` extra | Custom entity types, PII detection, bi-encoder models |
| `ensemble_onnx` | ONNX + GLiNER ensemble voting | `[gliner]` extra | Maximum recall with custom entities |
| `ensemble_torch` | Torch + GLiNER ensemble voting | `[torch,gliner]` extras | Maximum recall with PyTorch |
| `presidio_gliner` | Presidio + GLiNER recognizer (beta) | `presidio-analyzer`, `[gliner]` | Context-aware NER via Presidio's pipeline |
Default NER Models (ONNX)
| Language | Model |
|---|---|
| English | rhnfzl/xlm-roberta-large-conll03-english-onnx |
| Dutch | rhnfzl/xlm-roberta-large-conll02-dutch-onnx |
| German | rhnfzl/xlm-roberta-large-conll03-german-onnx |
| Spanish | rhnfzl/xlm-roberta-large-conll02-spanish-onnx |
| French / Portuguese / Italian | rhnfzl/wikineural-multilingual-ner-onnx (shared session) |
| Multilingual (fallback) | rhnfzl/wikineural-multilingual-ner-onnx |
GLiNER Model Recommendations
| Model | Architecture | Context | Languages | Best for |
|---|---|---|---|---|
| `knowledgator/gliner-bi-base-v2.0` | Bi-encoder (ModernBERT) | 2048 | Multi | General NER, long documents |
| `knowledgator/gliner-pii-base-v1.0` | Bi-encoder | 2048 | Multi | PII detection (60+ entity types) |
| `urchade/gliner_large-v2.1` | Uni-encoder (DeBERTa) | 512 | Multi | Legacy, high accuracy on short texts |
| `MatteoFasulo/ModernBERT-base-NER` | ModernBERT | 8192 | English | English-only, very long context |
GLiNER2 note: `pip install squeakycleantext[gliner2]` installs Knowledgator's gliner2 package, not Fastino AI's GLiNER2 from EMNLP 2025 (a different API).
GLiNER Label Mapping
GLiNER uses lowercase free-text labels (e.g., `'person'`, `'product'`). To map them to the standard NER tags used by the anonymizer, use `gliner_label_map`:

```python
gliner_label_map={
    'person': 'PER',         # → <PERSON>
    'organization': 'ORG',   # → <ORGANISATION>
    'location': 'LOC',       # → <LOCATION>
}
# Unmapped labels are uppercased automatically:
# 'product' → <PRODUCT>, 'event' → <EVENT>, 'skill' → <SKILL>
```
API
TextCleaner
`process(text: str) -> Tuple[str, Optional[str], Optional[str]]`

Processes the input text and returns a tuple containing:
- Cleaned text formatted for language models.
- Cleaned text formatted for statistical models (`None` if `check_statistical_model_processing` is `False`).
- Detected language of the text (`None` if language detection is disabled).
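For instance, disabling statistical output yields `None` in the second tuple slot; a minimal sketch using the config field above:

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(check_statistical_model_processing=False)
cleaner = TextCleaner(cfg=cfg)

lm_text, stat_text, lang = cleaner.process("Hello from Amsterdam!")
# stat_text is None because statistical processing is disabled
```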
`process_batch(texts: List[str], batch_size: int = None) -> List[Tuple[str, Optional[str], Optional[str]]]`

Processes multiple texts using thread-parallel execution. Each result follows the same format as `process()`.
`aprocess_batch(texts: List[str], batch_size: int = None) -> List[Tuple[str, Optional[str], Optional[str]]]`

Async version of `process_batch` for use with asyncio-based frameworks (FastAPI, aiohttp). Runs the batch in a thread-pool executor so it does not block the event loop:

```python
from sct import TextCleaner

cleaner = TextCleaner()

# In an async context (FastAPI route, aiohttp handler, etc.)
results = await cleaner.aprocess_batch(texts)
```
`warmup(languages: Optional[List[str]] = None) -> None`

Pre-loads NER models to avoid first-request latency. Call once during application startup:

```python
cleaner = TextCleaner()
cleaner.warmup(['ENGLISH', 'DUTCH'])  # or warmup() for all supported languages
```
TextCleanerConfig
Immutable (frozen) dataclass. Create modified copies with `dataclasses.replace()`:

```python
import dataclasses

new_cfg = dataclasses.replace(cfg, check_ner_process=False)
```
Full configuration reference
Pipeline toggles (all bool, default shown):
| Field | Default | Description |
|---|---|---|
| `check_detect_language` | `True` | Auto-detect language |
| `check_fix_bad_unicode` | `True` | Fix encoding issues via ftfy |
| `check_to_ascii_unicode` | `True` | Transliterate to ASCII |
| `check_replace_html` | `True` | Strip/replace HTML tags |
| `check_replace_urls` | `True` | Replace URLs with token |
| `check_replace_emails` | `True` | Replace emails with token |
| `check_replace_years` | `True` | Replace years (1900-2099) |
| `check_replace_dates` | `False` | Replace full dates (ISO 8601, month names) |
| `check_fuzzy_replace_dates` | `False` | Fuzzy match misspelled months (requires `[fuzzy]`) |
| `check_replace_phone_numbers` | `True` | Replace phone numbers |
| `check_replace_numbers` | `True` | Replace standalone numbers |
| `check_replace_currency_symbols` | `True` | Replace currency symbols |
| `check_ner_process` | `True` | Run NER entity recognition |
| `check_remove_isolated_letters` | `True` | Remove single letters |
| `check_remove_isolated_special_symbols` | `True` | Remove isolated symbols |
| `check_remove_bracket_content` | `True` | Remove `[...]` content |
| `check_remove_brace_content` | `True` | Remove `{...}` content |
| `check_normalize_whitespace` | `True` | Normalize whitespace |
| `check_statistical_model_processing` | `True` | Generate stat model output |
| `check_casefold` | `True` | Lowercase stat output |
| `check_smart_casefold` | `False` | Lowercase but preserve NER tokens |
| `check_remove_stopwords` | `True` | Remove stopwords from stat output |
| `check_remove_punctuation` | `True` | Remove punctuation from stat output |
| `check_remove_stext_custom_stop_words` | `True` | Remove custom stop words from stat output |
| `check_remove_emoji` | `False` | Remove emoji characters |
Replacement tokens (all `str`, except `replace_with_currency_symbols`, which defaults to `None`):

| Field | Default |
|---|---|
| `replace_with_url` | `"<URL>"` |
| `replace_with_html` | `"<HTML>"` |
| `replace_with_email` | `"<EMAIL>"` |
| `replace_with_years` | `"<YEAR>"` |
| `replace_with_dates` | `"<DATE>"` |
| `replace_with_phone_numbers` | `"<PHONE>"` |
| `replace_with_numbers` | `"<NUMBER>"` |
| `replace_with_currency_symbols` | `None` |
NER settings:
| Field | Default | Description |
|---|---|---|
| `ner_backend` | `'onnx'` | Backend: `onnx`, `torch`, `gliner`, `ensemble_onnx`, `ensemble_torch`, `presidio_gliner` |
| `ner_mode` | `'standard'` | `'standard'` or `'pii'` (auto-configures GLiNER for PII detection) |
| `replacement_mode` | `'placeholder'` | `'placeholder'`, `'synthetic'` (Faker), or `'reversible'` (indexed placeholders + deanonymize map) |
| `positional_tags` | `('PER', 'LOC', 'ORG', 'MISC')` | Entity types to recognize |
| `ner_confidence_threshold` | `0.85` | Minimum confidence score |
| `ner_batch_size` | `8` | Inference batch size (must be >= 1) |
| `ner_models` | `None` | Language-keyed dict of ONNX model repo IDs |
| `torch_ner_models` | `None` | Language-keyed dict of PyTorch model repo IDs |
| `gliner_model` | `None` | GLiNER model ID (required for gliner/ensemble backends) |
| `gliner_variant` | `'gliner'` | `'gliner'` or `'gliner2'` |
| `gliner_labels` | `('person', 'organization', 'location')` | GLiNER entity labels |
| `gliner_label_map` | `None` | Maps GLiNER labels to NER tags |
| `gliner_threshold` | `0.4` | GLiNER confidence threshold |
| `gliner_label_descriptions` | `None` | ZERONER-style `{label: "description"}` for improved zero-shot accuracy |
| `fuzzy_date_score_cutoff` | `85` | Fuzzy matching threshold (0-100) for misspelled months |
| `custom_pipeline_steps` | `()` | Tuple of `(text: str) -> str` callables appended after all built-in steps |
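A custom pipeline step is just a string-to-string callable. A minimal sketch (the step name is illustrative; `custom_pipeline_steps` is the field from the table above):

```python
import re

def squeeze_blank_lines(text: str) -> str:
    """Collapse runs of blank lines into a single newline."""
    return re.sub(r'\n{2,}', '\n', text)

# Steps run after all built-in cleaning, in tuple order:
# cfg = TextCleanerConfig(custom_pipeline_steps=(squeeze_blank_lines,))
print(squeeze_blank_lines("a\n\n\nb"))
```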
Language settings:
| Field | Default | Description |
|---|---|---|
| `language` | `None` | Pin a language (`'en'`), restrict detection to a set (`('en', 'nl')`), or `None` for auto-detect. Accepts Lingua names, ISO 639-1, and ISO 639-3 codes. |
| `extra_languages` | `()` | Additional language names/codes for detection |
| `custom_stopwords` | `None` | `{LANG: frozenset({...})}` custom stopword sets |
| `custom_month_names` | `None` | `{LANG: ('Jan', 'Feb', ...)}` for date detection |
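A sketch of the two custom-language fields, assuming Lingua-style upper-case language keys as used elsewhere in this README:

```python
from sct import TextCleaner, TextCleanerConfig

cfg = TextCleanerConfig(
    custom_stopwords={'ENGLISH': frozenset({'regarding', 'hereby'})},
    custom_month_names={'ENGLISH': ('Jan', 'Feb', 'Mar', 'Apr')},
    check_replace_dates=True,  # month names feed date detection
)
cleaner = TextCleaner(cfg=cfg)
```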
Architecture
SqueakyCleanText processes text through a configurable pipeline of sequential steps:
```
Input Text
  │
  ├─ Fix Unicode (ftfy)
  ├─ ASCII transliteration (unidecode)
  ├─ Emoji removal
  ├─ HTML replacement
  ├─ URL / Email / Phone replacement
  ├─ Date & Year replacement
  ├─ Number & Currency replacement
  ├─ Isolated letter/symbol removal
  ├─ Whitespace normalization
  │
  ├─ NER Processing (ONNX / Torch / GLiNER / Ensemble)
  │   ├─ Language detection (Lingua)
  │   ├─ Text chunking (token-bounded)
  │   ├─ Entity recognition (per-chunk)
  │   ├─ Ensemble voting (cross-model)
  │   └─ Entity anonymization (Presidio)
  │
  └─ Statistical Model Output
      ├─ Case folding
      ├─ Stopword removal
      └─ Punctuation removal
          ▼
(lm_text, stat_text, language)
```

Each step is toggled by a `TextCleanerConfig` field. The pipeline is built once at initialization; disabled steps are skipped entirely (zero overhead).
What's New
v0.6.0
- PII detection mode (`ner_mode='pii'`): auto-configures GLiNER with 60+ PII entity labels (personal, financial, healthcare, identity, digital)
- Synthetic replacement (`replacement_mode='synthetic'`): Faker-generated realistic values instead of `<TAG>` placeholders, with per-document consistency
- Reversible anonymization (`replacement_mode='reversible'`): indexed placeholders (`<PERSON_0>`) with `AnonymizationMap` for round-trip deanonymization
- Document classification (`check_classify_document=True`): zero-shot GLiClass pre-classification before text processing
- ProcessResult: `process()` returns `ProcessResult` (backward-compatible 3-tuple) with `.metadata` for anonymization maps and classification results
- GLiNER ONNX mode (`gliner_onnx=True`): load GLiNER with pre-built ONNX weights from HuggingFace Hub (auto-set for PII + ONNX backend)
- Bi-encoder support: auto-detects ModernBERT and other bi-encoder GLiNER models, caches label embeddings, dynamic context windows (2048-8192 tokens)
- Entity description labels: ZERONER-style natural-language descriptions for improved zero-shot accuracy
- Presidio GLiNER backend (beta): opt-in `ner_backend='presidio_gliner'` for Presidio's context-aware recognition pipeline
- ModernBERT ONNX export: updated export script with ModernBERT support (English, 8192 token context)
- Dynamic chunk sizing: GLiNER chunk size adapts to the model's actual context window instead of a hardcoded 384
v0.5.x
- `aprocess_batch()`: async batch processing for FastAPI / aiohttp integrations
- `warmup(languages)`: pre-load NER models at startup to eliminate first-request latency
- `custom_pipeline_steps`: attach arbitrary `(text: str) -> str` callables after the built-in pipeline
- French, Portuguese, and Italian NER support via a shared multilingual ONNX session
- Improved NER sentence boundary detection with abbreviation guard
v0.4.5
- Frozen `TextCleanerConfig` dataclass: immutable, thread-safe, per-instance configuration
- ONNX-first NER inference: torch-free base install (~400 MB models vs ~7 GB)
- Thread-parallel batch processing via `ThreadPoolExecutor`
- Five NER backends: `onnx`, `torch`, `gliner`, `ensemble_onnx`, `ensemble_torch`
- GLiNER zero-shot NER for custom entity types (PRODUCT, EVENT, SKILL, etc.)
- Ensemble voting across backends for improved recall
- Lazy per-language model loading
- Multilingual date detection and fuzzy date matching
- Configurable emoji removal, bracket/brace content removal, and smart case folding
- `stop-words` replaces NLTK (50 KB bundled vs 30 MB download)
- PyTorch and Transformers moved to optional extras
- Migrated to `pyproject.toml` (PEP 517), Python 3.11-3.13, ruff linter
Contributing
Contributions are welcome! Please feel free to submit a Pull Request or open an issue.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
The package took inspiration from the following repo: