Probabilistic JSON repair library powered by Rust - fixes broken JSON from LLMs

llmjson

Make LLM “JSON” outputs production‑grade.

LLMs are good at roughly structured output, but real pipelines still see markdown fences, extra prose (“Here’s the JSON…”, or the Korean “json입니다~”, roughly “this is the JSON”), trailing commas, smart quotes, missing commas, missing closers, and so on. Strict parsers (json, orjson, …) treat any of these as a hard failure, which leads to retries, added latency, and brittle tool/function calls.

llmjson is a Rust-powered JSON repair pipeline with Python bindings:

Extract the JSON span from arbitrary text
Repair common errors cheaply first (deterministic heuristics)
Recover intent via probabilistic Top‑K parsing + confidence + repair trace
Optionally ask an LLM for a minimal byte-offset patch only when needed, then re-validate

Want zero-integration friction? Enable the bundled orjson-compatible shim:

export JSONPROB_ORJSON_MODE=auto

Features

Extraction: Strip markdown fences + prefix/suffix garbage and isolate the JSON span
Fast path: Valid JSON parses immediately
Heuristic repair: Low-cost automatic fixes applied before beam search
Probabilistic Top‑K repair: Returns multiple candidates with confidence scores + repair traces
Schema-aware ranking (optional): Lightweight schema hints help choose the right candidate
Deterministic mode (seeded): Make probabilistic results reproducible via deterministic_seed
LLM fallback (optional): Ask an LLM for a minimal patch only when local repairs are low-confidence
Scale pipeline (huge JSON): Safe split-point parallelism + optional tape/IR, with recursive parsing for large nested containers
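
To make the extraction step concrete, here is a minimal pure-Python sketch of isolating a JSON span from surrounding prose and markdown fences. It is not llmjson's Rust implementation; the helper name `extract_json_span` is hypothetical, and the approach (bracket-depth tracking that skips string literals) is only one plausible way to do it.

```python
import json
from typing import Optional


def extract_json_span(text: str) -> Optional[str]:
    """Return the first balanced {...} or [...] span in text, or None."""
    closers = {"{": "}", "[": "]"}
    stack = []         # expected closing brackets, innermost last
    start = None       # index where the JSON span begins
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if start is None:
            # Skip prefix garbage until the first opening bracket.
            if ch in closers:
                start = i
                stack.append(closers[ch])
            continue
        if in_string:
            # Inside a string literal: brackets do not count.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in closers:
            stack.append(closers[ch])
        elif ch == stack[-1]:
            stack.pop()
            if not stack:
                return text[start : i + 1]
    return None


fence = "`" * 3
raw = (
    "Here is the JSON:\n" + fence + "json\n"
    '{"name": "Alice", "tags": ["a}", "b"]}\n' + fence + "\nHope that helps!"
)
span = extract_json_span(raw)
print(span)
print(json.loads(span))
```

The sketch only handles input whose brackets are balanced; unclosed strings or brackets are left to the later repair stages.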

Built for LLM Pipelines

Accepts raw model text (not just pure JSON) and extracts the JSON span
Produces strict JSON (or returns Top‑K strict candidates), so downstream schema validation stays simple
Returns a repair trace (ops + byte spans) that’s useful for debugging, audits, or “show the model what you meant”
Uses an LLM only as a last resort (minimal patch + re-validate), keeping latency/cost predictable

In the included “LLM messy JSON” suite, strict parsers fail while llmjson succeeds end‑to‑end (see Benchmarks below).

Common LLM Failure Modes

Issue	Example	Fixed
Unquoted keys	`{name: "Alice"}`	`{"name": "Alice"}`
Single quotes	`{'key': 'value'}`	`{"key": "value"}`
Python literals	`{"a": True, "b": None}`	`{"a": true, "b": null}`
Trailing commas	`{"a": 1, "b": 2,}`	`{"a": 1, "b": 2}`
Missing commas	`{"a": 1 "b": 2}`	`{"a": 1, "b": 2}`
JS comments	`{/* comment */ "a": 1}`	`{"a": 1}`
Unquoted array values	`[admin, user]`	`["admin", "user"]`
Markdown code fences	```json {...} ```	`{...}`
Prefix/suffix garbage	`Response: {...} EOF`	`{...}`
Unclosed strings/brackets	`{"a": "hello`	`{"a": "hello"}`
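
As an illustration of what "cheap" deterministic fixes can look like, the sketch below handles two rows of the table (trailing commas and Python literals) in plain Python. The function name `cheap_fixes` is hypothetical and this is not the library's heuristic code; the point is that such fixes must leave string contents untouched.

```python
import json
import re

PY_LITERALS = {"True": "true", "False": "false", "None": "null"}
TRAILING_CLOSER = re.compile(r"\s*[}\]]")     # whitespace, then a closer
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")  # bare identifier


def cheap_fixes(text: str) -> str:
    """Drop trailing commas and map Python literals to JSON, outside strings."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == '"':
            # Copy a double-quoted string verbatim, honoring backslash escapes.
            j = i + 1
            while j < n and text[j] != '"':
                j += 2 if text[j] == "\\" else 1
            out.append(text[i : j + 1])
            i = j + 1
        elif ch == "," and TRAILING_CLOSER.match(text, i + 1):
            i += 1  # comma directly before } or ]: skip it
        else:
            m = WORD.match(text, i)
            if m:
                out.append(PY_LITERALS.get(m.group(), m.group()))
                i = m.end()
            else:
                out.append(ch)
                i += 1
    return "".join(out)


fixed = cheap_fixes('{"a": True, "b": None, "c": [1, 2,],}')
print(fixed)
print(json.loads(fixed))
```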

Installation

Install (recommended)

uv add llmjson
# or: python -m pip install llmjson

Build from source (development)

1) Install Rust toolchain

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

2) Build and install the PyO3 extension

# From the cloned repository root: create and activate a virtual environment
uv venv
source .venv/bin/activate  # or `.venv\Scripts\activate` on Windows

# Install maturin and build
uv pip install maturin
maturin develop -m rust-pyo3/Cargo.toml

# Install the Python package (editable)
uv pip install -e .

Quick Start

Python Library

from llmjson import RepairOptions, parse

# Simple usage
result = parse('{"a": 1, "b": 2,}')  # trailing comma
print(result.status)           # "repaired"
print(result.best.value)       # {'a': 1, 'b': 2}

# With options
result = parse(
    '''```json
    {
        name: "Alice",
        age: 30,
        active: True,
        roles: [admin, user,]
    }
    ```''',
    RepairOptions(
        mode="auto",
        top_k=3,
        beam_width=32,
        max_repairs=50,
    ),
)

print(result.status)                    # "repaired"
print(result.best.value)                # {'name': 'Alice', 'age': 30, ...}
print(len(result.best.repairs))         # number of repairs applied
print(result.metrics.elapsed_ms)        # processing time

Reproducible Top‑K (deterministic_seed)

Beam search can have ties; for debugging and stable output ordering, set deterministic_seed:

result = parse(
    '{"a": 1 "b": 2}',  # missing comma
    RepairOptions(
        mode="probabilistic",
        top_k=5,
        deterministic_seed=42,
    ),
)
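
Why a seed helps with ties can be shown with a small standalone sketch. This is an assumption about the general technique, not a description of how llmjson uses `deterministic_seed` internally: candidates with equal cost are ordered by a hash of the seed and the candidate text, so the ranking no longer depends on the order in which the search happened to produce them. The `rank` function and the `(cost, json_text)` tuples are hypothetical.

```python
import hashlib
from typing import List, Tuple

Candidate = Tuple[float, str]  # (repair cost, candidate JSON text)


def rank(candidates: List[Candidate], seed: int) -> List[Candidate]:
    """Sort by cost; break ties with a seeded, run-independent hash."""

    def tie_break(text: str) -> int:
        digest = hashlib.sha256(f"{seed}:{text}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(candidates, key=lambda c: (c[0], tie_break(c[1])))


found = [
    (1.0, '{"a": 1, "b": 2}'),
    (1.0, '{"a": "1 \\"b\\": 2"}'),
    (0.5, '{"a": 1}'),
]
# The same seed gives the same order, whatever order the candidates arrive in.
print(rank(found, seed=42) == rank(list(reversed(found)), seed=42))
```

`hashlib` is used instead of the built-in `hash()` because Python randomizes string hashes per process, which would defeat reproducibility.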

Schema Hints (pick the right candidate)

When input is ambiguous, return Top‑K and let llmjson re-rank candidates using a lightweight schema hint:

schema = {
    "required_keys": ["name", "age"],
    "types": {"name": "str", "age": "int"},
}

result = parse(
    '```json\n{name: "Alice", age: 30,}\n```',
    RepairOptions(mode="probabilistic", top_k=5, schema=schema),
)

print(result.best.validations.schema_match)  # 0.0 .. 1.0
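
For intuition about what a score between 0.0 and 1.0 could mean, here is a minimal sketch that scores candidates against the same hint format. The formula (fraction of satisfied key and type checks) and the names `schema_match_score` and `TYPE_NAMES` are assumptions made for illustration; the library's actual scoring may differ.

```python
TYPE_NAMES = {"str": str, "int": int, "float": float, "bool": bool,
              "list": list, "dict": dict}


def schema_match_score(value, schema) -> float:
    """Fraction of required-key and type checks that the value satisfies."""
    if not isinstance(value, dict):
        return 0.0
    checks = [key in value for key in schema.get("required_keys", [])]
    for key, type_name in schema.get("types", {}).items():
        expected = TYPE_NAMES.get(type_name)
        ok = key in value and expected is not None and isinstance(value[key], expected)
        # bool is a subclass of int in Python; do not count True as an int.
        if ok and expected is int and isinstance(value[key], bool):
            ok = False
        checks.append(ok)
    return sum(checks) / len(checks) if checks else 1.0


hint = {
    "required_keys": ["name", "age"],
    "types": {"name": "str", "age": "int"},
}
candidates = [
    {"name": "Alice", "age": "30"},  # age recovered as a string
    {"name": "Alice", "age": 30},    # age recovered as an integer
]
scores = [schema_match_score(c, hint) for c in candidates]
best = max(candidates, key=lambda c: schema_match_score(c, hint))
print(scores)
print(best)
```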

CLI

# From stdin
echo '{"a": 1, "b": 2,}' | llmjson

# From file
llmjson --input broken.json

# With options
llmjson --input broken.json \
    --mode probabilistic \
    --beam-width 64 \
    --max-repairs 100 \
    --top-k 5

CLI Options

Option	Default	Description
`--input`, `-i`	stdin	Input file path
`--mode`	`auto`	`auto`, `strict_only`, `fast_repair`, `probabilistic`, `scale_pipeline`
`--scale-output`	`dom`	`dom` (materialize JSON) or `tape` (return IR only; value will be null)
`--top-k`	5	Number of candidate repairs to return
`--beam-width`	32	Beam search width
`--max-repairs`	20	Maximum repair operations per candidate
`--partial-ok`	true	Allow partial results on failure
`--allow-llm`	false	Enable LLM fallback for extreme cases
`--llm-provider`	`none`	`none`, `anthropic`, `claude_agent_sdk`
`--llm-mode`	`patch_suggest`	`patch_suggest` or `token_suggest` (patch is recommended)
`--llm-min-confidence`	0.2	Trigger LLM when best confidence is below this
`--debug`	false	Include debug information

What is `tape`?

tape is an internal IR (intermediate representation) for large JSON:

A flat list of TapeEntry records (token type plus byte offset and length into the original input).
Containers (array_start / object_start) store a “jump” payload pointing to their matching end entry.
This makes huge payloads cheaper to handle (no full in-memory DOM is built) and enables safe parallel parse+merge in scale_pipeline.

When scale_output="tape":

result.best.value is None
result.best.ir["tape"] contains tape metadata (and, with debug=True, a truncated preview of entries)
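
The tape idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not llmjson's data layout: the `TapeEntry` fields, the kind names other than array_start / object_start, and the `build_tape` function are hypothetical, and the tokenizer only accepts valid JSON.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

TOKEN = re.compile(
    rb'\s*(?:(?P<string>"(?:[^"\\]|\\.)*")'
    rb'|(?P<number>-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)'
    rb'|(?P<literal>true|false|null)'
    rb'|(?P<punct>[{}\[\],:]))'
)


@dataclass
class TapeEntry:
    kind: str                   # token type
    offset: int                 # byte offset into the original input
    length: int                 # byte length of the token
    jump: Optional[int] = None  # for container starts: index of the end entry


def build_tape(data: bytes) -> List[TapeEntry]:
    tape: List[TapeEntry] = []
    open_stack: List[int] = []  # tape indices of containers still open
    pos = 0
    while pos < len(data):
        m = TOKEN.match(data, pos)
        if m is None:
            if not data[pos:].strip():
                break  # only trailing whitespace remains
            raise ValueError(f"unexpected byte at offset {pos}")
        kind = m.lastgroup
        start, end = m.span(kind)
        pos = m.end()
        if kind != "punct":
            tape.append(TapeEntry(kind, start, end - start))
            continue
        ch = data[start:end]
        if ch in (b",", b":"):
            continue  # separators carry no information on the tape
        if ch in (b"{", b"["):
            open_stack.append(len(tape))
            name = "object_start" if ch == b"{" else "array_start"
        else:
            tape[open_stack.pop()].jump = len(tape)
            name = "object_end" if ch == b"}" else "array_end"
        tape.append(TapeEntry(name, start, 1))
    return tape


data = b'{"a": [1, 2], "b": null}'
tape = build_tape(data)
for index, entry in enumerate(tape):
    print(index, entry)
# Skip the whole array in one step by following its jump.
after_array = tape[tape[2].jump + 1]
print(data[after_array.offset : after_array.offset + after_array.length])
```

Because each entry only records offsets into the input, no Python objects are built for the values, and a reader can hop over a large container without visiting its contents.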

FAQ (LLM + JSON)

“We already use structured output / function calling. Why do we need this?”
Because in production you still get near-JSON (code fences, extra prose, a trailing comma, a missing closer). Strict JSON parsing turns that into retries (latency/cost) or brittle failures. llmjson is the guardrail: it converts raw model text into strict JSON (or Top‑K strict candidates) and tells you exactly what it changed.

“Why Top‑K?”
When JSON is corrupted, there can be multiple plausible “intents”. Returning Top‑K candidates + confidence (and optional schema hints) lets you pick the right one deterministically instead of guessing.

“Is the scale pipeline always faster?”
No. Parallel split/merge has overhead. The scale pipeline is designed for huge valid JSON (GB‑scale root arrays or large nested containers) where scan/parse time dominates. For small inputs, strict parsing is faster.

Rust CLI (development) — mmap + deterministic seed

For batch parsing of very large files without allocating a giant Vec<u8> up front, the Rust CLI in rust/ uses mmap by default:

cd rust
cargo build --release
./target/release/llmjson --input huge.json --mode scale_pipeline --scale-output tape

Disable mmap: --no-mmap
Reproducible beam ordering: --deterministic-seed 42

orjson Drop-in Shim

Most LLM/agent stacks already call orjson.loads() everywhere. llmjson bundles an orjson-compatible shim so you can keep those call sites unchanged and still recover from “near‑JSON” outputs:

import orjson

data = orjson.loads(b'{"a": 1}')
blob = orjson.dumps({"a": 1})

By default the shim is strict (like real orjson). To enable repair/scale fallback without changing call sites:

export JSONPROB_ORJSON_MODE=auto
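
The behavior described here (strict by default, repair only when the environment variable asks for it) follows a common wrapper pattern, sketched below with the standard library. This is not the shim's source: `loads_with_mode`, `strip_trailing_commas`, and the default value "strict" are hypothetical, and `json.loads` stands in for the strict parser.

```python
import json
import os
import re


def strip_trailing_commas(text: str) -> str:
    """Toy repair step (not string-aware); stands in for a real repair pipeline."""
    return re.sub(r",\s*([}\]])", r"\1", text)


def loads_with_mode(data, repair=strip_trailing_commas):
    """Parse strictly; fall back to repair only when the mode is 'auto'."""
    mode = os.environ.get("JSONPROB_ORJSON_MODE", "strict")
    text = data.decode("utf-8") if isinstance(data, (bytes, bytearray)) else data
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        if mode != "auto":
            raise  # strict mode keeps the original failure
        return json.loads(repair(text))


os.environ["JSONPROB_ORJSON_MODE"] = "auto"
print(loads_with_mode(b'{"a": 1, "b": 2,}'))
```

Valid input never reaches the repair step, so call sites that already receive clean JSON pay only for the strict parse.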

Benchmarks

Benchmarks were run on Python 3.12.0, macOS 14.1 (arm64) using benchmarks/bench.py.

For a detailed walkthrough with concrete Slack-context examples, see BENCHMARK.md.

1) LLM messy JSON suite (primary)

This suite reflects the motivating use case: LLM outputs with leading prose (such as the Korean “json입니다~ …”, roughly “this is the JSON”), markdown fences, single quotes, unquoted keys, trailing commas, Python literals, missing commas, smart quotes, and missing closers.

Library / mode	Success	Correct	Best time / case
`json` (strict)	0/10	0/10	n/a
`ujson` (strict)	0/10	0/10	n/a
`orjson` (strict, real)	0/10	0/10	n/a
`orjson` (auto, llmjson shim)	10/10	10/10	23.9 µs
`llmjson.parse(mode=auto)`	10/10	10/10	20.0 µs
`llmjson.parse(mode=probabilistic)`	10/10	10/10	19.9 µs

Key point: on this suite, drop-in call sites (import orjson; orjson.loads(...)) go from 0/10 to 10/10 successful parses just by setting JSONPROB_ORJSON_MODE=auto.

2) Top‑K repair suite (secondary)

This suite checks whether the “intended” JSON object is recovered as the best candidate vs anywhere in the Top‑K (K=5) candidates.

Metric	Value
Top‑1 hit rate	7/8
Top‑K hit rate (K=5)	8/8
Avg candidates returned	1.25
Avg best confidence	0.57
Best time / case	38.7 µs

3) Large root-array parsing (big data angle)

Valid JSON only (parsing a single large root array).

Library	5 MB	20 MB
`json.loads(str)`	53.7 ms	209.9 ms
`ujson.loads(str)`	45.1 ms	172.0 ms
`orjson.loads(bytes)` (real)	26.8 ms	106.2 ms

The same script also benchmarks llmjson.scale(serial|parallel). On 5–20 MB inputs the parallel path is slower because of split/merge overhead; it is intended for much larger payloads (GB‑scale root arrays).

3b) Nested `corpus` split (targeted huge value)

If your payload looks like { "corpus": [ ... huge ... ], ... }, benchmarks/bench.py includes a nested_corpus_suite that benchmarks scale_target_keys=["corpus"] (and compares allow_parallel on/off). This is the practical “nested huge value” case from the Slack thread (and where PR‑102A style recursion/targeting matters).

In scale_output="tape" mode, large nested arrays/objects can be parsed recursively (and in parallel when enabled). Each segment is validated (strict tape parse) and falls back to a single strict parse on any mismatch, preserving correctness.
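
The split, validate, and fall-back pattern described above can be sketched for a root array in plain Python. This is an illustration under assumptions, not the library's scale pipeline: `top_level_commas` and `parse_root_array` are hypothetical names, and because of the GIL the thread pool here shows the structure of the approach rather than a real speedup.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import List


def top_level_commas(text: str) -> List[int]:
    """Offsets of commas that separate elements of the root array."""
    depth, in_string, escaped, commas = 0, False, False, []
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            depth += 1
        elif ch in "]}":
            depth -= 1
        elif ch == "," and depth == 1:
            commas.append(i)
    return commas


def parse_root_array(text: str, parts: int = 4):
    stripped = text.strip()
    if not (stripped.startswith("[") and stripped.endswith("]")):
        return json.loads(stripped)  # not a root array: strict parse
    commas = top_level_commas(stripped)
    if len(commas) < parts:
        return json.loads(stripped)  # too small to be worth splitting
    step = len(commas) // parts
    # Segment boundaries: the opening bracket, chosen commas, the closing bracket.
    bounds = [0] + [commas[k * step] for k in range(1, parts)] + [len(stripped) - 1]
    segments = ["[" + stripped[a + 1 : b] + "]" for a, b in zip(bounds, bounds[1:])]
    try:
        with ThreadPoolExecutor(max_workers=parts) as pool:
            chunks = list(pool.map(json.loads, segments))
    except json.JSONDecodeError:
        return json.loads(stripped)  # a segment failed: single strict parse
    merged = [item for chunk in chunks for item in chunk]
    if len(merged) != len(commas) + 1:
        return json.loads(stripped)  # element count mismatch: fall back
    return merged


items = [{"id": i, "s": "a,b]{"} for i in range(20)]
text = json.dumps(items)
print(parse_root_array(text) == json.loads(text))
```

Split points are only taken at commas outside strings and at depth 1, which is what makes each segment independently parseable.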

3c) CLI mmap suite (PR‑006)

If you care about batch/CLI parsing of very large files without allocating a giant Vec<u8> up front, set BENCH_CLI_MMAP_MB to run cli_mmap_suite (default mmap vs --no-mmap). You need the Rust CLI binary built first:

cd rust && cargo build --release

Reproduce

Because llmjson provides a top-level orjson shim, benchmark real orjson and the shim in separate environments:

# Env A: real orjson
python -m venv .venv-orjson
source .venv-orjson/bin/activate
python -m pip install orjson ujson
python benchmarks/bench.py

# Env B: llmjson (includes the shim)
python -m venv .venv-llmjson
source .venv-llmjson/bin/activate
python -m pip install llmjson ujson
python benchmarks/bench.py

Tune run sizes with env vars:

BENCH_MICRO_NUMBER=20000 BENCH_MICRO_REPEAT=5 \
BENCH_MESSY_NUMBER=2000 BENCH_MESSY_REPEAT=5 \
BENCH_TOPK_NUMBER=500 BENCH_TOPK_REPEAT=5 \
BENCH_LARGE_MB=5,20 BENCH_LARGE_NUMBER=3 BENCH_LARGE_REPEAT=3 \
BENCH_NESTED_MB=5,20 BENCH_NESTED_NUMBER=1 BENCH_NESTED_REPEAT=3 \
BENCH_CLI_MMAP_MB=512 \
python benchmarks/bench.py

Repair Pipeline

Input Text
    │
    ▼
┌─────────────────┐
│ 1. Extraction   │  Strip markdown fences, prefix/suffix garbage
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 2. Heuristics   │  Fast fixes: quotes, comments, literals, commas
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ 3. Strict Parse │  Try standard JSON parse
└────────┬────────┘
         │ (if fails)
         ▼
┌─────────────────┐
│ 4. Beam Search  │  Probabilistic repair with Top-K candidates
└────────┬────────┘
         │ (if low confidence)
         ▼
┌─────────────────┐
│ 5. LLM Fallback │  Optional: Claude-assisted repair
└────────┬────────┘
         │
         ▼
    RepairResult

LLM Deep Repair (Optional)

For severely corrupted JSON where beam search is low-confidence, you can enable LLM-assisted repair.
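
The "minimal byte-offset patch, then re-validate" idea can be shown without any LLM call. In the sketch below the patch is hand-written, standing in for what a model might suggest; the `(start, end, replacement)` tuple format and the function names are assumptions for illustration, not llmjson's patch schema.

```python
import json
from typing import List, Optional, Tuple

Patch = Tuple[int, int, bytes]  # (start, end, replacement), byte offsets


def apply_patches(data: bytes, patches: List[Patch]) -> bytes:
    """Apply patches right to left so earlier offsets stay valid."""
    out = bytearray(data)
    for start, end, replacement in sorted(patches, key=lambda p: p[0], reverse=True):
        out[start:end] = replacement
    return bytes(out)


def patch_and_validate(data: bytes, patches: List[Patch]) -> Optional[object]:
    """Return the parsed value if the patched bytes are strict JSON, else None."""
    try:
        return json.loads(apply_patches(data, patches))
    except ValueError:  # covers JSONDecodeError and invalid UTF-8
        return None


broken = b'{"a":1,"b":2, completely broken garbage here'
cut = broken.index(b", completely")
# Hand-written stand-in for a suggested patch: replace the garbage tail with "}".
suggested = [(cut, len(broken), b"}")]
print(patch_and_validate(broken, suggested))
print(patch_and_validate(broken, [(0, 1, b"")]))  # a bad patch is rejected
```

Re-validating with a strict parser means a wrong suggestion can only fail; it cannot silently produce invalid output.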

Option A) Anthropic SDK

python -m pip install anthropic
export ANTHROPIC_API_KEY=...
export CLAUDE_MODEL=claude-3-5-sonnet-latest

from llmjson import AnthropicPatchSuggestProvider, RepairOptions, parse

result = parse(
    '{"a":1,"b":2, completely broken garbage here',
    RepairOptions(
        mode="probabilistic",
        allow_llm=True,
        llm_mode="patch_suggest",
        llm_min_confidence=0.2,
        llm_provider=AnthropicPatchSuggestProvider(),
    ),
)

print(result.metrics.llm_calls)
print(result.metrics.llm_time_ms)

Option B) Claude Agent SDK

from llmjson import RepairOptions, parse
from llmjson.claude_agent_sdk_provider import ClaudeAgentSDKProvider

# Set up your Claude Agent SDK agent
agent = ...  # your agent instance
provider = ClaudeAgentSDKProvider(agent=agent)

result = parse(
    '{"a":1,"b":2, completely broken garbage here',
    RepairOptions(
        mode="probabilistic",
        allow_llm=True,
        llm_mode="patch_suggest",
        llm_min_confidence=0.2,
        llm_provider=provider,
    ),
)

print(result.metrics.llm_calls)     # number of LLM calls made
print(result.metrics.llm_time_ms)   # LLM processing time

Result Structure

result = parse(text, options)

result.status          # "strict_ok" | "repaired" | "partial" | "failed"
result.best            # Best candidate (shortcut for candidates[best_index])
result.best_index      # Index of best candidate
result.candidates      # List of repair candidates

# Each candidate has:
candidate.value           # Parsed Python object
candidate.normalized_json # Normalized JSON string
candidate.confidence      # Confidence score (0-1)
candidate.cost            # Total repair cost
candidate.repairs         # List of repair operations applied

# Each repair operation:
repair.op        # Operation name (e.g., "wrap_unquoted_key")
repair.span      # (start, end) byte positions
repair.cost_delta # Cost of this repair
repair.note      # Human-readable description
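
One way to use the repair fields is to render a human-readable trace for logs or audits. The sketch below uses a stand-in dataclass that mirrors the four fields listed above, so it runs without the library; the `Repair` class, `format_trace`, the sample values, and the op name "remove_trailing_comma" are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Repair:  # stand-in mirroring the documented fields, not llmjson's class
    op: str
    span: Tuple[int, int]  # (start, end) byte positions in the input
    cost_delta: float
    note: str


def format_trace(original: bytes, repairs: List[Repair]) -> List[str]:
    """One line per repair, showing the bytes each operation touched."""
    lines = []
    for r in repairs:
        start, end = r.span
        snippet = original[start:end].decode("utf-8", errors="replace")
        lines.append(f"{r.op} @ {start}..{end} {snippet!r} (+{r.cost_delta}): {r.note}")
    return lines


original = b'{name: "Alice",}'
repairs = [
    Repair("wrap_unquoted_key", (1, 5), 1.0, "quoted bare key"),
    Repair("remove_trailing_comma", (14, 15), 0.5, "dropped comma before }"),
]
for line in format_trace(original, repairs):
    print(line)
```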

Development

Run Tests

# Rust tests
cd rust && cargo test

# Python tests (parse tests are skipped unless the PyO3 extension is installed)
PYTHONPATH=src python -m unittest discover -s tests -p 'test*.py' -v

Build Rust CLI (standalone)

cd rust
cargo build --release
./target/release/llmjson --input ../demo/broken.json

Architecture

llmjson/
├── rust/                    # Core Rust library
│   └── src/
│       ├── heuristic.rs     # Heuristic repairs
│       ├── beam.rs          # Beam search algorithm
│       ├── pipeline.rs      # Parse pipeline orchestration
│       └── ...
├── rust-pyo3/               # PyO3 Python bindings
│   └── src/lib.rs
└── src/json_prob_parser/    # Python package
    ├── arbiter.py           # Python orchestrator (Rust + optional LLM)
    ├── rust_core.py         # Thin PyO3 bridge
    ├── anthropic_provider.py
    ├── claude_agent_sdk_provider.py
    ├── llm.py               # LLM payload + patch ops
    └── types.py             # Data classes

License

MIT OR Apache-2.0

Release history

0.1.2: Dec 14, 2025
0.1.1: Dec 13, 2025
0.1.0: Dec 13, 2025 (this version)

Download files

Source distributions: none available for this release.

Built distribution: agentjson-0.1.0-cp312-cp312-macosx_11_0_arm64.whl (517.4 kB)

Upload date: Dec 13, 2025
Tags: CPython 3.12, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.14

File hashes

Algorithm	Hash digest
SHA256	`0a2d611d1b6d3d6f28dc023722ff934627e3537cb05322422619f70dc0795500`
MD5	`28ec2afc2bdb0ffdc507792f359e08b7`
BLAKE2b-256	`a535fcec0faaf3661407ab9957f3b315a52885a5371e6b3a3b075da6f9922a5c`
