Utilities for cleaning and normalizing raw LLM output

llmclean

Small Python library for cleaning the noise out of raw LLM output. Strips markdown fences, repairs malformed JSON, trims runaway repetitions. Zero runtime dependencies — pure standard library.

I built this because my other projects (Sakhi, Resume-parser) kept reinventing the same five or six regex passes against the same recurring failure modes. The 0.2.0 changelog documents what production traffic on those projects taught me to fix here.

Install

pip install llmclean

What it does

from llmclean import strip_fences, enforce_json, trim_repetition

# ```lang ... ``` wrappers, including tilde fences and CRLF line endings
strip_fences('```json\n{"name": "Alice"}\n```')
# → '{"name": "Alice"}'

# JSON buried in prose, with trailing comma + Python literals
enforce_json('Here you go: {"ok": True, "items": [1,2,3,]}')
# → '{\n  "ok": true,\n  "items": [1, 2, 3]\n}'

# Model looped on the same sentence
trim_repetition("The answer is 42. This is final. This is final.")
# → 'The answer is 42. This is final.'

enforce_json runs a pipeline of strategies in order and stops at the first one that produces parseable JSON. Strategies cover: existing valid JSON, fences, prose around the JSON, BOM at position 0, doubled-quote overruns like ""value"", trailing commas, Python True/False/None, single-quoted strings, unquoted keys, and unclosed brackets. Full set in USAGE.md.
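The first-success pipeline idea can be sketched with the standard library alone. This is a minimal illustration, not llmclean's actual code: the names first_parseable, as_is, extract_braced and drop_trailing_commas are hypothetical, and only three of the strategies listed above are imitated.

```python
import json
import re

# Hypothetical strategies for illustration only; each maps text to a candidate string.
def as_is(text: str) -> str:
    """Try the input unchanged (covers already-valid JSON)."""
    return text

def extract_braced(text: str) -> str:
    """Keep the span from the first '{' or '[' to the last '}' or ']'."""
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if not starts or end == -1:
        return text
    return text[min(starts): end + 1]

def drop_trailing_commas(text: str) -> str:
    """Remove commas directly before a closing bracket (naive: ignores string contents)."""
    return re.sub(r",\s*([}\]])", r"\1", extract_braced(text))

STRATEGIES = [as_is, extract_braced, drop_trailing_commas]

def first_parseable(text: str, strategies=STRATEGIES) -> str:
    """Return the first candidate that json.loads accepts, else the original input."""
    for strategy in strategies:
        candidate = strategy(text)
        try:
            json.loads(candidate)
        except (ValueError, TypeError):
            continue  # this strategy did not yield parseable JSON; try the next one
        return candidate
    return text  # nothing parsed: hand back the input untouched

print(first_parseable('Here: {"a": [1, 2,]} thanks'))
# → {"a": [1, 2]}
```

Ordering the cheap, non-destructive strategies first means valid input passes through without being rewritten.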

What it doesn't do (and the thing to use instead)

  • Validate JSON against a schema — use jsonschema or pydantic
  • Re-prompt the model to fix its output — use instructor
  • Constrain the model at generation time so it can't produce broken output — use outlines

These are different problems with different tools. llmclean handles the post-hoc cleanup pass; compose it with the above if you need more.

Design choices

Three constraints I kept while iterating:

The library should never raise. Every public function returns the original input on failure, so it composes safely in pipelines that can't afford an exception path.
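One way to get a "never raise" contract is a wrapper that falls back to the input. The sketch below assumes that approach; returns_input_on_failure and parse_and_reserialize are hypothetical names, not part of llmclean's public API.

```python
import functools
import json
from typing import Callable

def returns_input_on_failure(func: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a text-cleaning function so any exception yields the original input."""
    @functools.wraps(func)
    def wrapper(text: str) -> str:
        try:
            return func(text)
        except Exception:
            # Deliberately broad: the contract is "never raise", whatever went wrong.
            return text
    return wrapper

@returns_input_on_failure
def parse_and_reserialize(text: str) -> str:
    """Hypothetical cleaner: round-trip through json to normalize formatting."""
    return json.dumps(json.loads(text), sort_keys=True)

print(parse_and_reserialize('{"b": 1, "a": 2}'))  # → {"a": 2, "b": 1}
print(parse_and_reserialize("not json"))          # → not json
```

The cost of this contract is that a caller cannot tell a successful no-op from a failure by the return value alone, so callers that need to know should re-parse the result themselves.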

Stay zero-dep. The standard library is sufficient for what this needs to do, and pulling in a dependency would force every downstream user to deal with version conflicts they didn't sign up for.

Be predictable. Same input always produces the same output. No external state, no model calls, no fuzzy heuristics that change behaviour silently across versions.

Known limitations

Some inputs land llmclean in known false-positive territory. Two worth flagging:

strip_fences will remove a single language name if it's the only content inside a fence — so if your model literally emits ```\njson\n``` as a one-word answer, that content disappears. The aggressive language-tag cleanup catches stray tags from real-world fence variants, and the trade-off is documented in the test test_lone_language_word_as_content_gets_stripped.
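To see how this trade-off can arise, here is a rough sketch of a fence stripper that also drops a stray language tag on the first line inside the fence. It is an assumption about the general technique, not the library's code: strip_fences_sketch, LANG_TAGS and the regex are all hypothetical.

```python
import re

TICKS = "`" * 3  # built this way so the example contains no literal fence marker

# Assumed, illustrative tag list; the library's real list may differ.
LANG_TAGS = {"json", "python", "yaml", "xml", "text"}

# Opening fence (backticks or tildes), optional info string, body, closing fence.
FENCE = re.compile(r"^\s*(?:`{3}|~{3})([^\n]*)\n(.*?)\n?(?:`{3}|~{3})\s*$", re.DOTALL)

def strip_fences_sketch(text: str) -> str:
    match = FENCE.match(text.replace("\r\n", "\n"))  # normalize CRLF first
    if not match:
        return text  # no fence found: return the input unchanged
    body = match.group(2)
    first_line, _, rest = body.partition("\n")
    # Aggressive cleanup: a bare language word on the first line is treated as a tag.
    if first_line.strip().lower() in LANG_TAGS:
        body = rest
    return body.strip()

print(strip_fences_sketch(TICKS + '\njson\n{"a": 1}\n' + TICKS))  # → {"a": 1}
print(repr(strip_fences_sketch(TICKS + "\njson\n" + TICKS)))      # → ''
```

The second call shows the false positive: the same rule that rescues a tag left on its own line cannot tell it apart from a one-word answer.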

enforce_json's double-quote collapse only handles the symmetric form ""text"". The asymmetric variants that Sakhi's pipeline also handles (a doubled quote on one side only, as in : ""x or x"") corrupt legitimate empty-string values, so they're deliberately omitted here.
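The difference between the two forms is easy to reproduce with a pair of regexes. The patterns below are assumptions written for this illustration, not the ones llmclean or Sakhi use, and both function names are hypothetical.

```python
import json
import re

# Symmetric form: a doubled quote on both sides of some text, e.g. ""value"".
# Requiring a word character after the opening pair leaves "" and ["", ""] alone.
SYMMETRIC = re.compile(r'""(\w[^"\n]*)""')

# One hypothetical asymmetric rule: a doubled opening quote after a colon.
ASYMMETRIC_OPEN = re.compile(r':\s*""(?=[^"])')

def collapse_symmetric(text: str) -> str:
    return SYMMETRIC.sub(r'"\1"', text)

def collapse_asymmetric_open(text: str) -> str:
    return ASYMMETRIC_OPEN.sub(': "', text)

def parses(text: str) -> bool:
    """True if json.loads accepts the text."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True

valid = '{"note": "", "n": 1}'  # legitimate empty-string value
print(collapse_symmetric('{"name": ""Alice"", "note": ""}'))
# → {"name": "Alice", "note": ""}
print(collapse_symmetric(valid) == valid)       # → True (left untouched)
print(parses(collapse_asymmetric_open(valid)))  # → False (valid input got corrupted)
```

The asymmetric rule cannot distinguish an overrun from an empty string followed by a comma, which is the corruption described above.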

Tests

pip install "llmclean[dev]"
pytest -v

78 tests across the three modules at v0.2.0. Includes characterization tests for known trade-offs (empty-string preservation, lone-language-tag strip) so future changes can't silently regress them.

License

MIT.
