cleanmonkey

One-call text cleanup for invisible characters, smart quotes, and whitespace normalization.

Install

pip install cleanmonkey

Quick Start

from cleanmonkey import clean

# Sensible defaults handle the common garbage
clean("hello\u00a0world\u2019s \u2014 test")
# → "hello world's - test"

# Idempotent: cleaning already-clean text changes nothing
text = "hello\u00a0world"
assert clean(clean(text)) == clean(text)

What It Cleans (by default)

Category             Examples                          Result
Non-breaking spaces  \u00a0, \u2007, \u202f            Regular space
Zero-width chars     \u200b, \u200c, \u200d, \ufeff    Removed
Smart quotes         \u2018 \u2019 \u201c \u201d       ' and "
Dashes               \u2013 (en), \u2014 (em)          -
Ellipsis             \u2026                            ...
Control chars        null, form feed, vertical tab     Removed
Line endings         \r\n, \r                          \n
Multiple spaces      "hello   world"                   "hello world"
Leading/trailing     "  hello  "                       "hello"
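
To make the table concrete, here is a rough standard-library sketch of the same default mappings. It is an illustration only, not cleanmonkey's implementation: the name sketch_clean and the ordering of the steps are assumptions, and the real library may cover more characters than the ones listed above.

```python
import re

# Hypothetical stand-in built from the table above; NOT cleanmonkey's code.
_TABLE = str.maketrans({
    "\u00a0": " ", "\u2007": " ", "\u202f": " ",        # non-breaking spaces
    "\u200b": None, "\u200c": None, "\u200d": None,     # zero-width chars
    "\ufeff": None,                                     # BOM / zero-width no-break space
    "\u2018": "'", "\u2019": "'",                       # curly single quotes
    "\u201c": '"', "\u201d": '"',                       # curly double quotes
    "\u2013": "-", "\u2014": "-",                       # en and em dashes
    "\u2026": "...",                                    # ellipsis
    "\x00": None, "\x0b": None, "\x0c": None,           # null, vertical tab, form feed
})


def sketch_clean(text: str) -> str:
    """Apply the table's mappings in one pass, then tidy whitespace."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line endings
    text = text.translate(_TABLE)                          # per-character mappings
    text = re.sub(r" {2,}", " ", text)                     # collapse runs of spaces
    return text.strip()                                    # drop leading/trailing whitespace


print(sketch_clean("hello\u00a0world\u2019s \u2014 test"))
```

Collapsing spaces after the character mapping is what keeps this sketch idempotent: a removed zero-width character can leave two spaces side by side, and those are merged in the same call.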

Granular Control

Override any default:

clean(text, smart_quotes=False)       # keep curly quotes
clean(text, dashes=False)             # keep em/en dashes
clean(text, fullwidth=True)           # also normalize fullwidth digits/letters
clean(text, collapse_spaces=False)    # keep multiple spaces
clean(text, strip=False)              # keep leading/trailing whitespace

Profiles

clean(text, profile="default")     # all normalizations (the default)
clean(text, profile="csv")         # default + fullwidth normalization
clean(text, profile="sql")         # default + fullwidth normalization
clean(text, profile="display")     # keep smart quotes & dashes; still clean invisible, control, whitespace, line endings
clean(text, profile="minimal")     # invisible chars only, no collapsing or stripping
clean(text, profile="aggressive")  # everything including fullwidth

Batch Helpers

from cleanmonkey import clean_column, clean_dict

# Clean a list (non-strings pass through)
clean_column(["hello\u00a0world", 42, None])
# → ["hello world", 42, None]

# Recursively clean dict values
clean_dict({"name": "John\u00a0Doe", "nested": {"val": "test\u200b"}})
# → {"name": "John Doe", "nested": {"val": "test"}}

# Also clean keys
clean_dict({"key\u00a0name": "val"}, keys=True)
# → {"key name": "val"}
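
The recursive behavior shown above can be sketched as a plain dictionary walk. This is a hedged illustration, not the library's code: sketch_clean_dict and strip_invisible are hypothetical names, the cleaner is a tiny stand-in for clean(), and how the real clean_dict treats lists or other containers nested inside a dict is not stated in this document.

```python
from typing import Any, Callable, Dict


def strip_invisible(s: str) -> str:
    """Tiny stand-in cleaner (assumption): handles only U+00A0 and U+200B."""
    return s.replace("\u00a0", " ").replace("\u200b", "")


def sketch_clean_dict(
    data: Dict[Any, Any],
    cleaner: Callable[[str], str] = strip_invisible,
    keys: bool = False,
) -> Dict[Any, Any]:
    """Return a new dict with string values (and optionally keys) cleaned."""
    out: Dict[Any, Any] = {}
    for key, value in data.items():
        if keys and isinstance(key, str):
            key = cleaner(key)
        if isinstance(value, dict):
            value = sketch_clean_dict(value, cleaner, keys)  # recurse into nested dicts
        elif isinstance(value, str):
            value = cleaner(value)
        # any other type (int, None, ...) passes through unchanged
        out[key] = value
    return out


print(sketch_clean_dict({"name": "John\u00a0Doe", "nested": {"val": "test\u200b"}}))
```

Building a new dict rather than mutating the input avoids changing keys while iterating, which matters once keys=True is in play.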

Inspect

Find out what's lurking in your text:

from cleanmonkey import inspect

for info in inspect("hello\u00a0world\u200b"):
    print(f"{info.codepoint} {info.name} count={info.count} at {info.positions}")
# U+00A0 NO-BREAK SPACE count=1 at [5]
# U+200B ZERO WIDTH SPACE count=1 at [11]
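
For a sense of how such a report can be produced, here is a minimal sketch using only the standard unicodedata module. It is an assumption-laden illustration, not cleanmonkey's inspect: the function name sketch_inspect, the tuple layout, and the choice of Unicode categories to flag are all hypothetical, and this version does not flag smart quotes or dashes.

```python
import unicodedata
from collections import defaultdict

# Assumption: treat control, format, and separator characters as suspicious,
# except for the everyday space, newline, and tab.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}
ORDINARY = {" ", "\n", "\t"}


def sketch_inspect(text: str):
    """Return (codepoint, name, count, positions) for each suspicious character."""
    positions = defaultdict(list)
    for index, ch in enumerate(text):
        if ch not in ORDINARY and unicodedata.category(ch) in SUSPECT_CATEGORIES:
            positions[ch].append(index)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), len(found), found)
        for ch, found in positions.items()
    ]


for codepoint, name, count, where in sketch_inspect("hello\u00a0world\u200b"):
    print(f"{codepoint} {name} count={count} at {where}")
```

The default passed to unicodedata.name matters because control characters such as null have no name in the Unicode database and would otherwise raise ValueError.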

CLI

# Clean a file
cleanmonkey input.txt -o output.txt

# Pipe through stdin
cat dirty.csv | cleanmonkey > clean.csv

# Use a profile
cleanmonkey --profile csv input.txt

# Inspect mode — report what's in a file
cleanmonkey --inspect input.txt

# Machine-readable JSON inspect output
cleanmonkey --json input.txt

# Selective overrides
cleanmonkey --no-smart-quotes --fullwidth input.txt

# Preserve whitespace structure
cleanmonkey --no-strip --no-collapse-spaces input.txt

# Preserve line endings (CR/CRLF)
cleanmonkey --no-line-endings input.txt

Built for LLMs

cleanmonkey is designed to work well as a tool for large language models. Invisible characters are a constant source of silent bugs in LLM-driven data pipelines: non-breaking spaces break splits on a literal space, zero-width characters corrupt string comparisons, and smart quotes fail exact matches. Without a dedicated cleaner, LLMs tend to generate repetitive .replace() chains that miss edge cases and waste tokens. A single clean() call handles all of these cases and is idempotent, so no multi-step prompting or character-by-character debugging is required. Fewer tokens in, clean data out.
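
The three failure modes named here are easy to reproduce with plain Python strings. The snippet below uses no cleanmonkey calls; the variable names and sample values are illustrative only.

```python
# 1. A non-breaking space is not matched by a literal-space split.
name = "alice\u00a0smith"
parts_literal = name.split(" ")   # one piece: U+00A0 is not " "
parts_any_ws = name.split()       # two pieces: no-argument split treats U+00A0 as whitespace

# 2. A trailing zero-width space makes visually identical strings unequal.
typed = "cafe"
pasted = "cafe\u200b"
same = typed == pasted

# 3. A curly apostrophe does not match a straight one.
curly = "it\u2019s"
matches = curly == "it's"

print(parts_literal, parts_any_ws, same, len(pasted), matches)
```

Note that the split problem depends on how the split is written: str.split() with no argument already handles U+00A0, while split(" ") and most CSV or regex patterns written around a literal space do not.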

License

MIT

Download files

Source Distribution

cleanmonkey-0.1.0.tar.gz (31.6 kB)

Uploaded Source

Built Distribution

cleanmonkey-0.1.0-py3-none-any.whl (16.2 kB)

Uploaded Python 3

File details

Details for the file cleanmonkey-0.1.0.tar.gz.

File metadata

  • Download URL: cleanmonkey-0.1.0.tar.gz
  • Upload date:
  • Size: 31.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for cleanmonkey-0.1.0.tar.gz
Algorithm Hash digest
SHA256 8a7feddcb79d08ec48a0bd56c7c187a07090800ffc27d9033d1972ff58e2f572
MD5 17b81d5941df668a64439f06914bee10
BLAKE2b-256 2b21134420533b43cabf51b0b1c2a8af27409baff8a2e1d0bb067e2ac31ee728

File details

Details for the file cleanmonkey-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: cleanmonkey-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 16.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for cleanmonkey-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 575914eb151c0059d0a28bd1e88d9f271ca346235fb77f64ccdce3dc0138bec6
MD5 156a7bc406ff70d2aaeb9757f8569b8b
BLAKE2b-256 275cfdc27a9fc865525f62cda91a914afec91d0f6d56578416f2270afc0050bd
