Utilities for cleaning and normalizing raw LLM output

llmclean

Small Python library for cleaning the noise out of raw LLM output. Strips markdown fences, repairs malformed JSON, trims runaway repetitions, removes reasoning traces and conversational filler, flattens markdown to prose, normalizes invisible/typographic Unicode, and reports model degeneration instead of silently hiding it. Zero runtime dependencies — pure standard library.

I built this because my other projects (Sakhi, Resume-parser) kept reinventing the same five or six regex passes against the same recurring failure modes. The changelog documents what production traffic — and, for 0.3.0, a five-model local sweep — taught me to fix here.

Install

pip install llmclean

Parse JSON out of messy LLM output

import llmclean

llmclean.load_json('Sure! Here you go:\n```json\n{"ok": True, "n": [1,2,3,]}\n```')
# → {'ok': True, 'n': [1, 2, 3]}          a real dict, not a string

That one call strips any reasoning trace, unwraps the fence, discards the prose around it, converts the Python True to JSON true, removes the trailing comma, and parses. It returns None (or your default=) if there's genuinely no JSON in there, and it never raises.
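To make those steps concrete, here is a minimal standard-library sketch of the same sequence. It is not llmclean's implementation: `sketch_load_json` and its regexes are hypothetical, the reasoning-trace step is left out, and the literal and comma repairs are naive (they do not skip over string contents).

```python
import json
import re

FENCE_MARK = "`" * 3  # built at runtime so this example can sit inside a fence
_FENCE = re.compile(
    re.escape(FENCE_MARK) + r"[A-Za-z0-9_-]*\s*\n(.*?)" + re.escape(FENCE_MARK),
    re.DOTALL,
)

def sketch_load_json(raw, default=None):
    """Hypothetical stand-in for the steps described above."""
    if not isinstance(raw, str):
        return default
    # 1. Prefer the body of the first fenced block, if there is one.
    fenced = _FENCE.search(raw)
    text = fenced.group(1) if fenced else raw
    # 2. Discard prose around the outermost {...} or [...] span.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        return default
    start = min(starts)
    end = max(text.rfind("}"), text.rfind("]"))
    if end < start:
        return default
    candidate = text[start:end + 1]
    # 3. Naive repairs: Python literals, then trailing commas.
    for py, js in (("True", "true"), ("False", "false"), ("None", "null")):
        candidate = re.sub(rf"\b{py}\b", js, candidate)
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    # 4. Parse, falling back to the default instead of raising.
    try:
        return json.loads(candidate)
    except ValueError:
        return default

raw = ('Sure! Here you go:\n' + FENCE_MARK + 'json\n'
       '{"ok": True, "n": [1,2,3,]}\n' + FENCE_MARK)
print(sketch_load_json(raw))  # {'ok': True, 'n': [1, 2, 3]}
```

The point of the sketch is the shape of the pipeline: every stage either narrows the text or returns the default, so there is no path that raises.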

Three front doors

Nearly everyone arrives with one of three goals:

data   = llmclean.load_json(raw)    # I asked for JSON  → dict/list, or None
text   = llmclean.clean_text(raw)   # I asked for prose → clean plain text
report = llmclean.check(raw)        # is this output broken?

# Strip <think> blocks, "Sure! Here's...", markdown, and smart quotes in one pass
llmclean.clean_text('<think>hmm</think>Sure! Here is the answer:\n\n# Title\n\n- **bold** point')
# → 'Title\n\nbold point'

# Detect degeneration instead of silently hiding it
llmclean.check("We tune the parameter parameter parameter values today.")
# → {'degenerate': True,
#    'rules_fired': ['top_token_frac', 'adjacent_dup_rate'], ...}
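For intuition about the two rule names in that report, the sketch below computes one plausible reading of each: the share of tokens taken by the most common token, and the share of adjacent token pairs that are identical. The definitions, the thresholds, and the `min_tokens` guard are all assumptions made for illustration; the library's actual rules and cut-offs are not documented here.

```python
from collections import Counter

def sketch_check(text, top_frac_limit=0.35, adjacent_limit=0.2, min_tokens=5):
    """Two illustrative repetition signals; every threshold here is made up."""
    tokens = text.lower().split() if isinstance(text, str) else []
    report = {"degenerate": False, "rules_fired": [],
              "top_token_frac": 0.0, "adjacent_dup_rate": 0.0}
    if len(tokens) < min_tokens:
        return report  # too short to judge either signal
    # Fraction of all tokens taken by the single most common token.
    report["top_token_frac"] = Counter(tokens).most_common(1)[0][1] / len(tokens)
    # Fraction of neighbouring token pairs that are exact repeats.
    repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
    report["adjacent_dup_rate"] = repeats / (len(tokens) - 1)
    if report["top_token_frac"] > top_frac_limit:
        report["rules_fired"].append("top_token_frac")
    if report["adjacent_dup_rate"] > adjacent_limit:
        report["rules_fired"].append("adjacent_dup_rate")
    report["degenerate"] = bool(report["rules_fired"])
    return report

result = sketch_check("We tune the parameter parameter parameter values today.")
print(result["rules_fired"])  # ['top_token_frac', 'adjacent_dup_rate']
```

On that sentence the most common token covers 3 of 8 tokens (0.375) and 2 of 7 adjacent pairs repeat (about 0.29), so both hypothetical rules fire.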

Going deeper

The front doors are compositions of a dozen single-purpose functions. Call those functions directly when you want to control the pipeline:

Function	Removes / does
`strip_fences`	``` and `~~~` wrappers, incl. CRLF and unclosed fences
`enforce_json`	repairs malformed JSON (returns a string)
`trim_repetition`	runaway repeated sentences at the tail
`strip_reasoning_trace`	`<think>…</think>` blocks, incl. DeepSeek's lone `</think>`
`strip_preamble`	"Sure! Here's…" / "Hope that helps!"
`strip_markdown`	headers, bold, bullets, links → plain prose
`strip_invisibles`	zero-width and control characters
`normalize_typography`	smart quotes, em dashes, ellipsis → ASCII
`degeneracy_score`	full degeneration report (5 rules)
`collapse_word_runs`	`"parameter parameter parameter"` → `"parameter"`
`collapse_intra_word_runs`	`"thresholdinginginging"` → `"thresholding"`

Every one of them returns its input unchanged on failure or wrong type, so any order composes without an exception path. Full examples in USAGE.md.
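The pass-through behaviour is what makes arbitrary ordering safe. A small sketch of the idea, using regex stand-ins for the two collapse functions: the `sketch_` names, the patterns, and the `pipeline` helper are hypothetical and only show the composition property, not the library's code.

```python
import re

# A whole word followed by one or more repeats of itself.
_WORD_RUN = re.compile(r"\b(\w+)(?:\s+\1\b)+")
# A fragment of 2+ characters that appears three or more times in a row.
_INTRA_RUN = re.compile(r"(\w{2,}?)\1{2,}")

def sketch_collapse_word_runs(text):
    if not isinstance(text, str):
        return text  # wrong type: hand the input back untouched
    return _WORD_RUN.sub(r"\1", text)

def sketch_collapse_intra_word_runs(text):
    if not isinstance(text, str):
        return text
    return _INTRA_RUN.sub(r"\1", text)

def pipeline(value, *steps):
    """Apply steps left to right; no step can raise on an odd input."""
    for step in steps:
        value = step(value)
    return value

steps = (sketch_collapse_word_runs, sketch_collapse_intra_word_runs)
print(pipeline("parameter parameter parameter", *steps))  # parameter
print(pipeline("thresholdinginginging", *steps))          # thresholding
print(pipeline(None, *steps))                             # None
```

These patterns are deliberately blunt: the word-run regex would also collapse a legitimate "had had", which is the kind of trade-off a real implementation has to weigh.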

Scope is measured, not assumed: the text functions come from a five-model local sweep (llama3.1 / gemma4 / qwen2.5 / deepseek-r1 / mistral). The changelog records what reproduced locally (markdown, fences, fullwidth-in-prose) versus what is a frontier-cloud-model trait tested with synthetic fixtures (smart quotes, em dashes, zero-width characters).
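The synthetic-fixture traits (smart quotes, em dashes, zero-width characters) are easy to reproduce with the standard library alone. The sketch below is one possible approach, not the library's: the `sketch_` names are hypothetical, the exact ASCII replacements are assumptions, and dropping every format character (category Cf) also removes zero-width joiners inside emoji sequences, which a real implementation might want to keep.

```python
import unicodedata

# Assumed replacements; the library's own mapping may differ.
_TYPOGRAPHY = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single smart quotes
    "\u201c": '"', "\u201d": '"',   # double smart quotes
    "\u2014": "-", "\u2013": "-",   # em dash and en dash
    "\u2026": "...",                # horizontal ellipsis
})

def sketch_normalize_typography(text):
    if not isinstance(text, str):
        return text
    return text.translate(_TYPOGRAPHY)

def sketch_strip_invisibles(text):
    if not isinstance(text, str):
        return text
    keep = {"\n", "\t", "\r"}  # ordinary whitespace controls stay
    return "".join(
        ch for ch in text
        # Cc = control characters, Cf = format characters (zero-width etc.)
        if ch in keep or unicodedata.category(ch) not in {"Cc", "Cf"}
    )

print(sketch_normalize_typography("\u201cHi\u201d \u2014 ok\u2026"))  # "Hi" - ok...
print(repr(sketch_strip_invisibles("a\u200bb\x00c\n")))                # 'abc\n'
```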

Common tasks

If you're trying to…	Use
parse JSON from an LLM response in Python	`llmclean.load_json(raw)`
fix invalid JSON returned by GPT / Claude / Llama	`llmclean.load_json(raw)`
remove `<think>` tags from DeepSeek-R1 output	`strip_reasoning_trace(raw)`
strip markdown from an LLM response for TTS	`strip_markdown(raw)`
remove "Sure! Here's" preamble from a model answer	`strip_preamble(raw)`
remove em dashes / smart quotes from AI text	`normalize_typography(raw)`
remove zero-width / invisible characters from AI text	`strip_invisibles(raw)`
detect when a model is repeating itself or degenerating	`llmclean.check(raw)`
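As an illustration of the `<think>` row, here is a rough sketch of reasoning-trace removal that covers both the paired form and a lone closing tag. The function name is hypothetical, and treating everything before a lone `</think>` as reasoning is an assumption about the intended behaviour, not a documented rule.

```python
import re

_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def sketch_strip_reasoning_trace(text):
    if not isinstance(text, str):
        return text
    # Paired form: drop each <think>...</think> block and trailing whitespace.
    text = _THINK_BLOCK.sub("", text)
    # Lone closing tag: assume everything before the last one was reasoning.
    _head, sep, tail = text.rpartition("</think>")
    return tail.lstrip() if sep else text

print(sketch_strip_reasoning_trace("<think>hmm</think>Answer"))          # Answer
print(sketch_strip_reasoning_trace("step one...\n</think>\n\nAnswer"))   # Answer
```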

Debugging

Every public function is never-raise: on failure it returns its input unchanged. That guarantee used to make internal bugs invisible. It no longer does — each fallback logs to a standard named logger, silent by default:

import llmclean
llmclean.enable_debug_logging()      # or configure logging.getLogger("llmclean")

llmclean.strip_markdown(weird_input)
# WARNING llmclean: llmclean.strip_markdown returned its input unchanged
#   after AttributeError: ...
# Traceback (most recent call last): ...

Unexpected failures log at WARNING with a full traceback; expected misses (no JSON present) log at DEBUG. Nothing is emitted unless your application configures logging, per standard library practice. A broken log handler still can't make a call raise.
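The pattern described here can be sketched with a small decorator. This is an illustration under assumptions rather than the library's code: the logger name `sketch_cleaner`, the `never_raise` decorator, and `upper_text` are all hypothetical.

```python
import functools
import logging

log = logging.getLogger("sketch_cleaner")  # hypothetical logger name
log.addHandler(logging.NullHandler())      # silent unless the app configures logging

def never_raise(func):
    """Return the first argument unchanged if func fails, logging the failure."""
    @functools.wraps(func)
    def wrapper(value, *args, **kwargs):
        try:
            return func(value, *args, **kwargs)
        except Exception as exc:
            try:
                # exc_info=True attaches the traceback to the log record.
                log.warning("%s returned its input unchanged after %s",
                            func.__name__, type(exc).__name__, exc_info=True)
            except Exception:
                pass  # a broken handler must not turn into a raised call
            return value
    return wrapper

@never_raise
def upper_text(value):
    return value.upper()  # raises AttributeError for non-strings

print(upper_text("ok"))   # OK
print(upper_text(None))   # None
```

The inner `try` matters because a custom handler whose `emit` raises would otherwise propagate out of `log.warning` and break the never-raise guarantee.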

What it doesn't do (and the thing to use instead)

Validate JSON against a schema — use jsonschema or pydantic
Re-prompt the model to fix its output — use instructor
Constrain the model at generation time so it can't produce broken output — use outlines

These are different problems with different tools. llmclean handles the post-hoc cleanup pass; compose it with the above if you need more.

Design choices

Three constraints I kept while iterating:

The library should never raise. Every public function returns the original input on failure, so it composes safely in pipelines that can't afford an exception path.

Stay zero-dep. The standard library is sufficient for what this needs to do, and pulling in a dependency would force every downstream user to deal with version conflicts they didn't sign up for.

Be predictable. Same input always produces the same output. No external state, no model calls, no fuzzy heuristics that change behaviour silently across versions.

Known limitations

Some inputs land llmclean in known false-positive territory. Two worth flagging:

strip_fences will remove a single language name if it's the only content inside a fence — so if your model literally emits ```\njson\n``` as a one-word answer, that content disappears. The aggressive language-tag cleanup catches stray tags from real-world fence variants, and the trade-off is documented in the test test_lone_language_word_as_content_gets_stripped.

enforce_json's double-quote collapse only handles the symmetric form ""text"". The asymmetric variants Sakhi's pipeline also handles (: ""x and x"") corrupt legitimate empty-string values, so they're deliberately omitted here.
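The empty-string trade-off is easier to see with a toy reproduction. In the sketch below both regex-level repairs are hypothetical stand-ins (neither is taken from llmclean or Sakhi); they only show why a symmetric rule can leave `""` alone while a naive asymmetric rule cannot.

```python
import json
import re

# Symmetric form only: two quotes, some content, two quotes.
_SYMMETRIC = re.compile(r'""(\w[^"]*)""')

def collapse_symmetric(text):
    return _SYMMETRIC.sub(r'"\1"', text)

def collapse_asymmetric_naive(text):
    # Treats any doubled opening quote after a colon as a typo.
    return text.replace(': ""', ': "')

def parses(text):
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

doubled = '{"name": ""Ada"", "note": ""}'
print(collapse_symmetric(doubled))   # {"name": "Ada", "note": ""}

valid = '{"note": "", "n": 1}'
print(parses(valid))                              # True
print(parses(collapse_asymmetric_naive(valid)))   # False
```

The asymmetric rule turns a perfectly valid empty string into an unterminated one, which is the kind of corruption the paragraph above describes.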

Tests

pip install "llmclean[dev]"
pytest -v

194 tests across the modules at 0.4.0. Includes characterization tests for known trade-offs (empty-string preservation, lone-language-tag strip) and real-model-output fixtures (deepseek-r1 reasoning trace, gemma4 markdown) so future changes can't silently regress them.

License

MIT.

Project details

Release history

Version	Released
0.4.0 (this version)	Jul 20, 2026
0.3.0	Jun 21, 2026
0.2.0	May 11, 2026
0.1.0	Mar 9, 2026

Download files

Source Distribution

llmclean-0.4.0.tar.gz (58.6 kB), uploaded Jul 20, 2026

Built Distribution

llmclean-0.4.0-py3-none-any.whl (31.7 kB), uploaded Jul 20, 2026, Python 3

File details

Details for the file llmclean-0.4.0.tar.gz.

File metadata

Download URL: llmclean-0.4.0.tar.gz
Upload date: Jul 20, 2026
Size: 58.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for llmclean-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`aacbd2fcc6c6677a535d0c0d54eea5451645c7e892f9e995471d33515573fd3f`
MD5	`a7f213d066d8d61b0eafe837875e1f07`
BLAKE2b-256	`291eb3002d627f6605aae312126001c6e3dec2635032460d3f5f5e21df302290`


File details

Details for the file llmclean-0.4.0-py3-none-any.whl.

File metadata

Download URL: llmclean-0.4.0-py3-none-any.whl
Upload date: Jul 20, 2026
Size: 31.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for llmclean-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fcfd9d8551e978bbaed527887eb0efab5bd33aacf2127d6ecc129e958bc7b96d`
MD5	`b07ead2bc86989c62560044a494ec28e`
BLAKE2b-256	`d7e440446592686c1e655921525ba37aac1c6b09e00f8be033601be2f07aad02`

