llm-salvage

Salvage structured data from LLM responses that didn't follow instructions.

```
pip install llm-salvage
```
What this is for
You ask a local model for structured output. It mostly does what you said, but:
- It wrapped the JSON in `` ```json `` fences when you said not to.
- It used `"sentiment"` instead of `"verdict"`, or `Bullish` instead of `BULLISH`.
- It misspelled a tag name - `[VERDCIT]` instead of `[VERDICT]`.
- It returned trailing commas, smart quotes, or nested objects where you wanted strings.
- It wrote a thoughtful paragraph before the structured output you asked for.
You can prompt around these problems, retry with stricter instructions, or switch to a model with better tool-calling support. Or you can accept that local models do this sometimes and parse what you got.
llm-salvage is the third option. It applies deterministic corrections,
extracts data in tagged or JSON or assignment formats, validates against a
schema, and returns a result you can inspect - with a record of every fix
that was applied along the way.
What this is not
It does not call any LLM. It does not retry. It does not depend on Pydantic, PyYAML, or any other library by default. It does not know what model produced the text it's parsing. It is not a replacement for Instructor or PydanticAI - if you have a frontier model with reliable tool-calling, those libraries are simpler and more powerful. This library is for when tool-calling isn't available or isn't reliable, and you need to make sense of raw text output.
Quick start
````python
from llm_salvage import ResponseParser, Schema, Field

schema = Schema(fields={
    "sentiment": Field(choices=["positive", "negative", "neutral"]),
    "confidence": Field(choices=["high", "medium", "low"]),
    "summary": Field(min_length=20),
})

response = '''
```json
{
  "sentiment": "Positive",
  "confidence": "HIGH",
  "summary": "The product launch exceeded expectations across all key metrics.",
}
```
'''

result = ResponseParser(schema).parse(response)

if result.ok:
    print(result.data["sentiment"])  # "positive"
    print(result.corrections)  # ['stripped_code_fences', 'removed_trailing_commas', ...]
else:
    for error in result.errors:
        print(error)
````
The parser stripped the code fences, repaired the trailing comma, normalized
"Positive" to match the schema's choices, and recorded each fix as a
correction code. The response text never raised an exception - the parser
returns a ParseResult you inspect.
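The choice normalization it performs is easy to picture in isolation. Here is a standalone sketch of the technique in plain Python - an illustration, not llm-salvage's internal code, and the exact correction-label format is an assumption:

```python
def normalize_choice(value: str, choices: list[str]) -> tuple[str, list[str]]:
    """Case-insensitively map a value onto a schema's allowed choices.

    Returns the canonical choice plus a list of correction labels
    (label format is illustrative, not llm-salvage's actual codes).
    """
    if value in choices:
        return value, []                      # exact match, nothing to fix
    by_fold = {c.casefold(): c for c in choices}
    folded = value.casefold()
    if folded in by_fold:
        return by_fold[folded], [f"case_normalized_{value}"]
    raise ValueError(f"{value!r} not among {choices}")

print(normalize_choice("Positive", ["positive", "negative", "neutral"]))
# ('positive', ['case_normalized_Positive'])
```

Note that normalization maps to the schema's canonical casing rather than forcing upper or lower case, so whatever casing you declared in the schema is what comes back.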
How it works
Four passes, in order:
1. Structural corrections. Code fence removal, BOM stripping, line ending normalization, tag-name typo correction (when a typo map is configured), auto-closing of unclosed tags whose names match schema fields.
2. Extraction. The parser detects whether the response uses tagged, JSON, or assignment format and tries them in order. JSON keys are matched against schema field names directly, with optional aliases for legacy or domain-specific naming.
3. Validation. Field types are checked, choices are normalized case-insensitively, probability dicts are summed, week-range strings are parsed into structured form. Validation never modifies data destructively - if a value can't be normalized, it's reported as an error.
4. Telemetry (optional). Each parse can write a JSONL event recording which corrections were applied, what errors remained, and which model the response came from. Over time this builds a corpus you can query to see which models need which corrections.
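Pass 1 is the most mechanical of the four. A standalone sketch of that kind of structural-corrections pass, using only the standard library (an illustration of the technique, not llm-salvage's actual implementation), might look like:

```python
import json
import re

def structural_corrections(text: str) -> tuple[str, list[str]]:
    """Apply deterministic pre-parse fixes and record which ones fired."""
    applied = []
    if text.startswith("\ufeff"):            # strip a UTF-8 BOM
        text = text.lstrip("\ufeff")
        applied.append("stripped_bom")
    if "\r\n" in text:                        # normalize CRLF line endings
        text = text.replace("\r\n", "\n")
        applied.append("normalized_line_endings")
    fenced = re.search(r"```(?:json)?\s*\n(.*?)\n```", text, re.DOTALL)
    if fenced:                                # unwrap a ```json code fence
        text = fenced.group(1)
        applied.append("stripped_code_fences")
    return text, applied

raw = "\ufeff```json\r\n{\"sentiment\": \"positive\"}\r\n```"
cleaned, fixes = structural_corrections(raw)
print(fixes)
# ['stripped_bom', 'normalized_line_endings', 'stripped_code_fences']
print(json.loads(cleaned))  # {'sentiment': 'positive'}
```

The point of recording each fix as it fires is what makes pass 4 possible: the correction list is exactly what gets written to the telemetry log.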
Schema definition
Schemas can be defined in code:
```python
from llm_salvage import Schema, Field, FieldType, Formats

schema = Schema(
    fields={
        "topic": Field(choices=["billing", "technical", "general"]),
        "priority": Field(choices=["urgent", "normal", "low"]),
        "summary": Field(min_length=10, max_length=500),
        "needs_human_review": Field(type=FieldType.STRING, required=False, default="no"),
    },
    formats=[Formats.TAGGED, Formats.JSON],
)
```
Or loaded from a file:
```python
from llm_salvage import Schema

schema = Schema.from_file("schemas/support_ticket.yaml")
```
Where support_ticket.yaml looks like:
```yaml
fields:
  topic:
    choices: [billing, technical, general]
  priority:
    choices: [urgent, normal, low]
  summary:
    min_length: 10
    max_length: 500
  needs_human_review:
    type: string
    required: false
    default: "no"
formats: [tagged, json]
```
YAML, JSON, and TOML are all supported. YAML requires `pip install 'llm-salvage[yaml]'`.
Field types
| Type | Use for |
|---|---|
| `STRING` | Free-form text with optional `min_length`/`max_length` |
| `CHOICE` | Enum of allowed values, case-insensitive |
| `INTEGER` | Whole numbers |
| `FLOAT` | Decimal numbers |
| `PROBABILITY` | Dict of label→int that should sum to ~100 |
| `WEEK_RANGE` | Strings like "2-4 weeks" parsed to `{min, max}` |
A field's type is inferred from its arguments - `Field(choices=[...])` is a
`CHOICE` field, `Field(min_length=20)` is a `STRING` field. Specify
`type=FieldType.X` explicitly when the inference would be wrong.
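The `WEEK_RANGE` behavior is the most specialized of these, so a concrete sketch helps. This is an illustrative parser for the "2-4 weeks" shape described in the table, not the library's own code:

```python
import re

def parse_week_range(value: str) -> dict:
    """Parse strings like '2-4 weeks' or '3 to 6 weeks' into {'min', 'max'}."""
    m = re.match(r"\s*(\d+)\s*(?:-|to)\s*(\d+)\s*weeks?\s*$", value, re.IGNORECASE)
    if not m:
        raise ValueError(f"not a week range: {value!r}")
    lo, hi = int(m.group(1)), int(m.group(2))
    # Tolerate reversed bounds rather than rejecting them.
    return {"min": min(lo, hi), "max": max(lo, hi)}

print(parse_week_range("2-4 weeks"))   # {'min': 2, 'max': 4}
```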
Adapters
Optional integrations that activate when their dependency is installed.
Pydantic - convert between Schema and Pydantic models:
```python
# pip install 'llm-salvage[pydantic]'
from llm_salvage.adapters.pydantic import schema_from_pydantic, to_pydantic
from pydantic import BaseModel

class Ticket(BaseModel):
    topic: str
    priority: str
    summary: str

schema = schema_from_pydantic(Ticket)
result = ResponseParser(schema).parse(response)
ticket = to_pydantic(result, Ticket)
```
json-repair - use the json-repair library for more robust JSON repair:
```python
# pip install 'llm-salvage[repair]'
# No code change needed - the parser uses json-repair automatically when installed.
```
Telemetry
When you pass a log_path, the parser writes one JSONL event per parse
attempt, recording corrections applied, errors encountered, and the model
name. This is opt-in:
```python
parser = ResponseParser(
    schema,
    log_path="parses.jsonl",
    model="llama3.2:3b",
)

for response in responses:
    parser.parse(response, task_id=response.task_id)
```
After a few hundred parses, you can ask the corpus what each model needs:
```python
from llm_salvage import model_profile

profile = model_profile("parses.jsonl", "llama3.2:3b")
# {
#     "model": "llama3.2:3b",
#     "events": 847,
#     "valid_pct": 89.4,
#     "corrections": {
#         "stripped_code_fences": 612,
#         "case_normalized_BULLISH": 243,
#         ...
#     },
#     "top_correction": "stripped_code_fences"
# }
```
This is the most useful piece of the library for ongoing operations: it turns the parser into a feedback loop. Seeing which corrections each model consistently needs tells you which prompt changes would have the biggest effect.
Set `log_corrections_only=True` if you only want to record events where
corrections were actually applied - useful when you're parsing high volume
and don't need a record of every clean parse.
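Because the log is plain JSONL, you can also tally it yourself without `model_profile`. A sketch using only the standard library, assuming events shaped the way this section describes (a model name plus lists of corrections and errors):

```python
import json
from collections import Counter
from io import StringIO

# Hypothetical telemetry events in the shape described above.
events_jsonl = StringIO("\n".join(json.dumps(e) for e in [
    {"model": "llama3.2:3b", "corrections": ["stripped_code_fences"], "errors": []},
    {"model": "llama3.2:3b",
     "corrections": ["stripped_code_fences", "removed_trailing_commas"], "errors": []},
    {"model": "qwen2.5:7b", "corrections": [], "errors": ["missing_field:summary"]},
]))

# Count corrections per model, one JSON object per line.
tally = Counter()
for line in events_jsonl:
    event = json.loads(line)
    if event["model"] == "llama3.2:3b":
        tally.update(event["corrections"])

print(tally.most_common(1))  # [('stripped_code_fences', 2)]
```

The same loop works against a real `parses.jsonl` by swapping the `StringIO` for `open("parses.jsonl")`.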
Comparison with other libraries
| Library | Use when |
|---|---|
| Instructor | You're using a model with reliable tool-calling. |
| PydanticAI | You're building agents and want a full framework. |
| json-repair | You only need JSON repair, no schema or tagged formats. |
| llm-salvage (this) | Local models, mixed formats, post-hoc parsing. |
These compose. You can use Instructor for your frontier-model path and
llm-salvage for your local-model fallback in the same codebase.
Examples
The examples/ directory has end-to-end examples covering
several common domains:
- `examples/sentiment_analysis.py` - review classification
- `examples/support_triage.py` - customer ticket routing
- `examples/content_moderation.py` - flag and category extraction
- `examples/product_extraction.py` - pulling structured product data from descriptions
- `examples/code_review.py` - extracting findings from LLM code review
- `examples/medical_triage.py` - symptom severity classification
Documentation
- `docs/comparison.md` - when to reach for which library
- `docs/schema-files.md` - YAML/JSON/TOML schema syntax
- `docs/telemetry.md` - interpreting JSONL telemetry
- `docs/adapters.md` - Pydantic and json-repair adapters
- `docs/limitations.md` - known v0.1.0 limitations and workarounds
Status
v0.1.0 is alpha. The API may change before 1.0. If you find a parsing
case that should work but doesn't, opening an issue with the response text
is the most useful contribution - telemetry corpora from real workloads
beat invented test cases.
License
MIT - see LICENSE.