Utilities for cleaning and normalizing raw LLM output
llmclean
Small Python library for cleaning the noise out of raw LLM output. Strips markdown fences, repairs malformed JSON, trims runaway repetitions. Zero runtime dependencies — pure standard library.
I built this because my other projects (Sakhi, Resume-parser) kept reinventing the same five or six regex passes against the same recurring failure modes. The 0.2.0 changelog documents what production traffic on those projects taught me to fix here.
Install
pip install llmclean
What it does
from llmclean import strip_fences, enforce_json, trim_repetition
# ```lang ... ``` wrappers, including tilde fences and CRLF line endings
strip_fences('```json\n{"name": "Alice"}\n```')
# → '{"name": "Alice"}'
# JSON buried in prose, with trailing comma + Python literals
enforce_json('Here you go: {"ok": True, "items": [1,2,3,]}')
# → '{\n "ok": true,\n "items": [1, 2, 3]\n}'
# Model looped on the same sentence
trim_repetition("The answer is 42. This is final. This is final.")
# → 'The answer is 42. This is final.'
enforce_json runs a pipeline of strategies in order and stops at the first one that produces parseable JSON. Strategies cover: existing valid JSON, fences, prose around the JSON, BOM at position 0, doubled-quote overruns like ""value"", trailing commas, Python True/False/None, single-quoted strings, unquoted keys, and unclosed brackets. Full set in USAGE.md.
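The "try strategies in order, stop at the first parseable result" idea can be sketched with the standard library alone. This is not llmclean's implementation: the names first_parseable, extract_braces, drop_trailing_commas and python_literals are hypothetical, only three of the listed strategies are shown, and the sketch assumes repairs accumulate from one step to the next, which the description above does not specify.

```python
import json
import re


def extract_braces(text: str) -> str:
    """Keep only the span from the first '{' to the last '}'."""
    start, end = text.find("{"), text.rfind("}")
    return text[start:end + 1] if 0 <= start < end else text


def drop_trailing_commas(text: str) -> str:
    """Remove a comma that directly precedes a closing bracket or brace."""
    return re.sub(r",\s*([\]}])", r"\1", text)


def python_literals(text: str) -> str:
    """Map True/False/None to JSON. Naive: also rewrites them inside strings."""
    for py, js in (("True", "true"), ("False", "false"), ("None", "null")):
        text = re.sub(rf"\b{py}\b", js, text)
    return text


# Hypothetical ordering; the real library's order and strategy set may differ.
STRATEGIES = [extract_braces, drop_trailing_commas, python_literals]


def first_parseable(text: str, strategies=STRATEGIES) -> str:
    """Apply repairs one at a time and stop as soon as json.loads succeeds."""
    candidate = text
    # The leading identity step covers input that is already valid JSON.
    for step in [lambda s: s, *strategies]:
        candidate = step(candidate)
        try:
            return json.dumps(json.loads(candidate), indent=2)
        except json.JSONDecodeError:
            continue
    # Nothing parsed: hand back the original input untouched.
    return text
```

Calling first_parseable('Here you go: {"ok": True, "items": [1,2,3,]}') needs all three repairs before it parses, while already-valid input returns after the identity step without any repair being applied.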
What it doesn't do (and the thing to use instead)
- Validate JSON against a schema: use jsonschema or pydantic
- Re-prompt the model to fix its output: use instructor
- Constrain the model at generation time so it can't produce broken output: use outlines
These are different problems with different tools. llmclean handles the post-hoc cleanup pass; compose it with the above if you need more.
Design choices
Three constraints kept while iterating:
- The library should never raise. Every public function returns the original input on failure, so it composes safely in pipelines that can't afford an exception path.
- Stay zero-dep. The standard library is sufficient for what this needs to do, and pulling in a dependency would force every downstream user to deal with version conflicts they didn't sign up for.
- Be predictable. Same input always produces the same output. No external state, no model calls, no fuzzy heuristics that change behaviour silently across versions.
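The never-raise constraint can be pictured as a small decorator that falls back to the original input. This is a minimal sketch of the pattern, not how llmclean is written internally: never_raise and first_line are hypothetical names invented for the illustration.

```python
import functools
from typing import Callable


def never_raise(func: Callable[..., str]) -> Callable[..., str]:
    """Wrap a cleaner so that any exception returns the input unchanged."""
    @functools.wraps(func)
    def wrapper(text, *args, **kwargs):
        try:
            return func(text, *args, **kwargs)
        except Exception:
            # Fall back to the original input instead of propagating.
            return text
    return wrapper


@never_raise
def first_line(text: str) -> str:
    """Hypothetical cleaner: raises IndexError when no non-blank line exists."""
    return [line for line in text.splitlines() if line.strip()][0]
```

With the wrapper in place, first_line("") returns "" instead of raising, so a caller can chain cleaners without an exception path. The cost is that failures are silent, which is why the fallback value is always the untouched input and never a partial result.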
Known limitations
Some inputs land llmclean in known false-positive territory. Two worth flagging:
strip_fences will remove a single language name if it's the only content inside a fence — so if your model literally emits ```\njson\n``` as a one-word answer, that content disappears. The aggressive language-tag cleanup catches stray tags from real-world fence variants, and the trade-off is documented in the test test_lone_language_word_as_content_gets_stripped.
enforce_json's double-quote collapse only handles the symmetric form ""text"". The asymmetric variants that Sakhi's pipeline also handles (an opening : ""x or a closing x"") would corrupt legitimate empty-string values such as "key": "", so they are deliberately omitted here.
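To see why the asymmetric form is risky, here is a small sketch comparing the two kinds of rule. Both regexes are hypothetical stand-ins written for this illustration; llmclean's and Sakhi's actual patterns may differ.

```python
import json
import re

# Hypothetical patterns, not taken from llmclean or Sakhi.
SYMMETRIC = re.compile(r'""(\w[^"]*)""')         # ""text"" -> "text"
ASYMMETRIC_OPEN = re.compile(r':\s*""(?=\S)')    # : ""x    -> : "x


def collapse_symmetric(text: str) -> str:
    """Collapse doubled quotes only when they wrap the value on both sides."""
    return SYMMETRIC.sub(r'"\1"', text)


def collapse_asymmetric(text: str) -> str:
    """Collapse a doubled opening quote with no check on the closing side."""
    return ASYMMETRIC_OPEN.sub(': "', text)


def parses(text: str) -> bool:
    """Report whether text is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


overrun = '{"name": ""Alice""}'          # doubled-quote overrun
legit = '{"note": "", "ok": true}'       # valid JSON with an empty string

repaired = collapse_symmetric(overrun)       # becomes valid JSON
untouched = collapse_symmetric(legit)        # empty string is left alone
damaged = collapse_asymmetric(legit)         # empty string loses a quote
```

In this sketch the symmetric rule repairs the overrun and leaves the empty string intact, while the asymmetric rule turns a valid document into one that no longer parses.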
Tests
pip install "llmclean[dev]"
pytest -v
78 tests across the three modules at v0.2.0. Includes characterization tests for known trade-offs (empty-string preservation, lone-language-tag strip) so future changes can't silently regress them.
License
MIT.
Download files
File details
Details for the file llmclean-0.2.0.tar.gz.
File metadata
- Download URL: llmclean-0.2.0.tar.gz
- Upload date:
- Size: 25.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 307a9285e0700a3c8698f5e643790076678eb4ab691a05a81fe5098f5282900f |
| MD5 | 94aeb932537bf4f31d1e1874ca4a6b3c |
| BLAKE2b-256 | 3cca06d6098828587ca547a5ef433bf110353f6f1c957a6740955e1e27505727 |
File details
Details for the file llmclean-0.2.0-py3-none-any.whl.
File metadata
- Download URL: llmclean-0.2.0-py3-none-any.whl
- Upload date:
- Size: 14.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ab7fd56b273ab5118c70da5b195cbc00e778185abdb3be6b6882cdf089574452 |
| MD5 | 70ef80557cdfdf2e60e372a7b8438d15 |
| BLAKE2b-256 | 72fd519547a4eea160754d0c4e0e3b41043122dacc60e66d84dd755708ffee4b |