cleanmonkey
One-call text cleanup for invisible characters, smart quotes, and whitespace normalization.
Install
pip install cleanmonkey
Quick Start
from cleanmonkey import clean
# Sensible defaults handle the common garbage
clean("hello\u00a0world\u2019s \u2014 test")
# → "hello world's - test"
# Idempotent: safe to call twice
text = "hello\u00a0world"
assert clean(clean(text)) == clean(text)
What It Cleans (by default)
| Category | Examples | Result |
|---|---|---|
| Non-breaking spaces | \u00a0, \u2007, \u202f | Regular space |
| Zero-width chars | \u200b, \u200c, \u200d, \ufeff | Removed |
| Smart quotes | \u2018 \u2019 \u201c \u201d | ' and " |
| Dashes | \u2013 (en), \u2014 (em) | - |
| Ellipsis | \u2026 | ... |
| Control chars | null, form feed, vertical tab | Removed |
| Line endings | \r\n, \r | \n |
| Multiple spaces | "hello   world" | "hello world" |
| Leading/trailing | "  hello  " | "hello" |
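To make the table concrete, here is a rough standard-library approximation of the default rules. It is a sketch only: `sketch_clean` is a hypothetical name, the order of the steps is an assumption, and none of this is cleanmonkey's actual implementation.

```python
import re

# Hypothetical stand-in for the defaults listed in the table above;
# not cleanmonkey's real code.
_TABLE = str.maketrans({
    "\u00a0": " ", "\u2007": " ", "\u202f": " ",   # non-breaking spaces
    "\u200b": None, "\u200c": None,                 # zero-width chars
    "\u200d": None, "\ufeff": None,
    "\u2018": "'", "\u2019": "'",                   # smart single quotes
    "\u201c": '"', "\u201d": '"',                   # smart double quotes
    "\u2013": "-", "\u2014": "-",                   # en and em dashes
    "\u2026": "...",                                # ellipsis
    "\x00": None, "\x0b": None, "\x0c": None,       # null, vertical tab, form feed
})


def sketch_clean(text: str) -> str:
    # Normalize CRLF and bare CR to LF before anything else.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Apply the per-character replacements and removals.
    text = text.translate(_TABLE)
    # Collapse runs of spaces, then trim both ends.
    text = re.sub(r" {2,}", " ", text)
    return text.strip()


print(sketch_clean("hello\u00a0world\u2019s \u2014 test"))
```

Running the character mapping before the space collapse matters: a zero-width character sitting between two spaces would otherwise leave a double space behind after it is removed.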
Granular Control
Override any default:
clean(text, smart_quotes=False) # keep curly quotes
clean(text, dashes=False) # keep em/en dashes
clean(text, fullwidth=True) # also normalize fullwidth digits/letters
clean(text, collapse_spaces=False) # keep multiple spaces
clean(text, strip=False) # keep leading/trailing whitespace
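For readers unfamiliar with the `fullwidth` option, the sketch below shows what folding fullwidth characters to ASCII can look like. It relies on the fact that the fullwidth ASCII variants (U+FF01 to U+FF5E) sit at a fixed offset of 0xFEE0 from their ASCII counterparts. `fold_fullwidth` is a hypothetical helper, and it folds the whole block, which may be broader than the digits and letters cleanmonkey handles.

```python
def fold_fullwidth(text: str) -> str:
    """Map fullwidth ASCII variants (U+FF01..U+FF5E) to plain ASCII.

    Illustrative only; not cleanmonkey's implementation.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFF01 <= cp <= 0xFF5E:
            # Fullwidth forms are offset from ASCII by exactly 0xFEE0.
            out.append(chr(cp - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


# Fullwidth "ABC123" written with escapes so the source stays ASCII.
print(fold_fullwidth("\uff21\uff22\uff23\uff11\uff12\uff13"))
```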
Profiles
clean(text, profile="default") # all normalizations (the default)
clean(text, profile="csv") # default + fullwidth normalization
clean(text, profile="sql") # default + fullwidth normalization
clean(text, profile="display") # keep smart quotes & dashes; still clean invisible, control, whitespace, line endings
clean(text, profile="minimal") # invisible chars only, no collapsing or stripping
clean(text, profile="aggressive") # everything including fullwidth
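One way to picture a profile is as a named bundle of the keyword flags from the Granular Control section. The sketch below is an assumption about how such presets could be resolved, not a description of cleanmonkey's internals: `PRESETS` and `resolve_options` are hypothetical names, the flag values are inferred from the comments above, and only three profiles are modeled because the remaining ones depend on flags this page does not name.

```python
# Hypothetical: flag names come from the Granular Control examples,
# default values are inferred from the profile comments.
DEFAULT_FLAGS = {
    "smart_quotes": True,
    "dashes": True,
    "fullwidth": False,
    "collapse_spaces": True,
    "strip": True,
}

PRESETS = {
    "default": {},
    "csv": {"fullwidth": True},
    "display": {"smart_quotes": False, "dashes": False},
}


def resolve_options(profile: str = "default", **overrides: bool) -> dict:
    """Merge defaults, the chosen preset, and explicit overrides (in that order)."""
    if profile not in PRESETS:
        raise ValueError(f"unknown profile: {profile!r}")
    unknown = set(overrides) - set(DEFAULT_FLAGS)
    if unknown:
        raise TypeError(f"unknown option(s): {sorted(unknown)}")
    return {**DEFAULT_FLAGS, **PRESETS[profile], **overrides}


print(resolve_options("display", strip=False))
```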
Batch Helpers
from cleanmonkey import clean_column, clean_dict
# Clean a list (non-strings pass through)
clean_column(["hello\u00a0world", 42, None])
# → ["hello world", 42, None]
# Recursively clean dict values
clean_dict({"name": "John\u00a0Doe", "nested": {"val": "test\u200b"}})
# → {"name": "John Doe", "nested": {"val": "test"}}
# Also clean keys
clean_dict({"key\u00a0name": "val"}, keys=True)
# → {"key name": "val"}
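The recursive behavior shown above can be sketched with a small helper that applies any string function to nested lists and dicts. This is an illustration under stated assumptions: `map_strings` and `strip_invisible` are hypothetical names, and `strip_invisible` is a tiny stand-in for `clean`, not the library function.

```python
from typing import Any, Callable


def strip_invisible(s: str) -> str:
    # Stand-in for clean(): handles only two characters, for demonstration.
    return s.replace("\u00a0", " ").replace("\u200b", "")


def map_strings(value: Any, fn: Callable[[str], str], keys: bool = False) -> Any:
    """Apply fn to every string in nested dicts/lists; pass other values through."""
    if isinstance(value, str):
        return fn(value)
    if isinstance(value, dict):
        return {
            (fn(k) if keys and isinstance(k, str) else k): map_strings(v, fn, keys)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [map_strings(v, fn, keys) for v in value]
    return value  # ints, None, and other non-strings are left untouched


print(map_strings({"key\u00a0name": ["a\u200b", 1, None]}, strip_invisible, keys=True))
```

One caveat with cleaning keys in a sketch like this: two keys that differ only by an invisible character collapse into one after cleaning, and the later value silently wins.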
Inspect
Find out what's lurking in your text:
from cleanmonkey import inspect
for info in inspect("hello\u00a0world\u200b"):
    print(f"{info.codepoint} {info.name} count={info.count} at {info.positions}")
# U+00A0 NO-BREAK SPACE count=1 at [5]
# U+200B ZERO WIDTH SPACE count=1 at [11]
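A report like the one above can be approximated with the standard library's `unicodedata` module. The sketch below is not how cleanmonkey's `inspect` works internally; `find_suspects` and the `SUSPECTS` set are hypothetical, and the set covers only the space-like and zero-width characters listed earlier on this page.

```python
import unicodedata
from collections import defaultdict

# Assumed watch list: the non-breaking and zero-width characters named above.
SUSPECTS = {"\u00a0", "\u2007", "\u202f", "\u200b", "\u200c", "\u200d", "\ufeff"}


def find_suspects(text: str) -> list:
    """Return (codepoint, name, count, positions) for each suspect character found."""
    positions = defaultdict(list)
    for index, ch in enumerate(text):
        if ch in SUSPECTS:
            positions[ch].append(index)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), len(found), found)
        for ch, found in positions.items()
    ]


for row in find_suspects("hello\u00a0world\u200b"):
    print(row)
```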
CLI
# Clean a file
cleanmonkey input.txt -o output.txt
# Pipe through stdin
cat dirty.csv | cleanmonkey > clean.csv
# Use a profile
cleanmonkey --profile csv input.txt
# Inspect mode — report what's in a file
cleanmonkey --inspect input.txt
# Machine-readable JSON inspect output
cleanmonkey --json input.txt
# Selective overrides
cleanmonkey --no-smart-quotes --fullwidth input.txt
# Preserve whitespace structure
cleanmonkey --no-strip --no-collapse-spaces input.txt
# Preserve line endings (CR/CRLF)
cleanmonkey --no-line-endings input.txt
Built for LLMs
cleanmonkey is designed to work well as a tool for large language models. Invisible characters are a constant source of silent bugs in LLM-driven data pipelines: non-breaking spaces break splits, zero-width characters corrupt comparisons, and smart quotes fail exact matches. Without cleanmonkey, LLMs end up generating repetitive .replace() chains that miss edge cases and waste tokens. A single clean() call handles all of it and is idempotent, so no multi-step prompting or character-by-character debugging is required. Fewer tokens in, clean data out.
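The three failure modes named above are easy to reproduce with plain Python strings, no library required. The variable names are illustrative.

```python
# A non-breaking space is not a literal " ", so splitting on " " finds nothing.
# (str.split() with no argument does treat U+00A0 as whitespace, which is why
# this bug tends to appear only when a separator is passed explicitly.)
cells = "hello\u00a0world".split(" ")

# A trailing zero-width space makes two visually identical strings unequal.
same_key = "admin\u200b" == "admin"

# A curly apostrophe does not match a straight one.
quote_match = "it\u2019s" == "it's"

print(cells, same_key, quote_match)
```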
License
MIT
File details
Details for the file cleanmonkey-0.1.0.tar.gz.
File metadata
- Download URL: cleanmonkey-0.1.0.tar.gz
- Upload date:
- Size: 31.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8a7feddcb79d08ec48a0bd56c7c187a07090800ffc27d9033d1972ff58e2f572 |
| MD5 | 17b81d5941df668a64439f06914bee10 |
| BLAKE2b-256 | 2b21134420533b43cabf51b0b1c2a8af27409baff8a2e1d0bb067e2ac31ee728 |
File details
Details for the file cleanmonkey-0.1.0-py3-none-any.whl.
File metadata
- Download URL: cleanmonkey-0.1.0-py3-none-any.whl
- Upload date:
- Size: 16.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 575914eb151c0059d0a28bd1e88d9f271ca346235fb77f64ccdce3dc0138bec6 |
| MD5 | 156a7bc406ff70d2aaeb9757f8569b8b |
| BLAKE2b-256 | 275cfdc27a9fc865525f62cda91a914afec91d0f6d56578416f2270afc0050bd |