Schema-aligned parsing
parsantic
The structured extraction toolkit: parse, stream, extract, update, patch, and coerce LLM output — locally, deterministically, with one clean API.
Install
uv add parsantic
For LLM extraction and update features (OpenAI, Anthropic, Gemini, etc.):
uv add "parsantic[ai]"
[!IMPORTANT] Use parsantic[ai] for extraction and update features. Use parsantic[vision] when you want local PDF rasterization or image preprocessing.
What it does
LLM output is messy. Models wrap JSON in markdown, add trailing commas, use
wrong-case enum values, and return partial objects mid-stream. Most tools
deal with this by retrying the LLM call. parsantic fixes it locally in
one pass:
from enum import Enum
from pydantic import BaseModel
from parsantic import parse
class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"
class Task(BaseModel):
    title: str
    priority: Priority
    days_left: int
    done: bool = False
# The LLM returned this mess:
llm_output = """
Sure! Here's the task you requested:
```json
{
// Task details
"Title": "Fix the login bug",
"priority": "HIGH",
"Days-Left": "3",
"done": false,
}
```
Let me know if you need anything else!
"""
task = parse(llm_output, Task).value
# Task(title='Fix the login bug', priority=<Priority.HIGH: 'high'>, days_left=3, done=False)
One call. No retry. Markdown fences, comments, surrounding prose, wrong-case keys, kebab-to-snake key normalization, enum coercion, string-to-int coercion, trailing commas — all handled.
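To make the idea of local repair concrete, here is a minimal standard-library sketch of a few of the steps listed above (fence stripping, prose removal, comment removal, trailing commas, key normalization). It is not parsantic's parser: the helper names `repair_json_text` and `normalize_key` are hypothetical, the regexes are deliberately naive, and type coercion such as "3" to 3 is left to the schema layer.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block


def repair_json_text(text: str) -> dict:
    """Best-effort cleanup of JSON-like LLM output (illustrative, not parsantic's parser)."""
    # 1. Prefer the body of a fenced block when one is present.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # 2. Keep only the outermost {...} span, dropping surrounding prose.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    candidate = candidate[start : end + 1]
    # 3. Drop lines that contain only a // comment (naive: inline comments are not handled).
    candidate = re.sub(r"^[ \t]*//.*$", "", candidate, flags=re.MULTILINE)
    # 4. Remove trailing commas directly before a closing brace or bracket.
    #    Naive: this would also touch a ",}" sequence inside a string value.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    return json.loads(candidate)


def normalize_key(key: str) -> str:
    """Map keys such as 'Days-Left' or 'Title' to snake_case field names."""
    return key.strip().lower().replace("-", "_")


messy = (
    "Sure! Here's the task you requested:\n"
    + FENCE + "json\n"
    + '{\n  // Task details\n  "Title": "Fix the login bug",\n  "Days-Left": "3",\n}\n'
    + FENCE + "\nLet me know if you need anything else!\n"
)
raw = repair_json_text(messy)
task = {normalize_key(k): v for k, v in raw.items()}
print(task)  # {'title': 'Fix the login bug', 'days_left': '3'}
```

A real repair pass has to track string boundaries instead of relying on regexes, which is part of why a dedicated parser is worth having.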
It works even when the JSON is still arriving. Feed tokens as they come and get valid, typed partial objects back while the LLM is still generating:
from parsantic import parse_stream
stream = parse_stream(Task)
stream.feed('{"title": "Stre')
print(stream.parse_partial().value)
# TaskPartial(title='Stre', priority=None, days_left=None, done=None) ← partial but typed
stream.feed('aming task", "pr')
print(stream.parse_partial().value)
# TaskPartial(title='Streaming task', priority=None, days_left=None, done=None)
stream.feed('iority": "low", "days_left": 5}')
task = stream.finish().value
# Task(title='Streaming task', priority=<Priority.LOW: 'low'>, days_left=5, done=False)
Every call to parse_partial() returns a valid Pydantic object (a generated
TaskPartial with all-optional fields) with whatever values are available so far.
No waiting for the full response.
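One way to picture partial parsing is to close whatever is still open in the received prefix and, if that fails, fall back to the last complete key/value pair. The sketch below does this with the standard library only. It returns plain dicts rather than typed partial models, and the function name `parse_partial_prefix` is hypothetical, not part of parsantic.

```python
import json


def parse_partial_prefix(prefix: str):
    """Parse a truncated JSON prefix by closing open strings, objects, and arrays."""
    stack = []        # closers still owed for open objects/arrays
    cuts = []         # (comma index, closers owed at that point) as fallback cut points
    in_string = False
    escaped = False
    for i, ch in enumerate(prefix):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()
        elif ch == ",":
            cuts.append((i, "".join(reversed(stack))))
    # First attempt: close the dangling string (if any) and every open container.
    tail = ('"' if in_string else "") + "".join(reversed(stack))
    attempts = [prefix + tail]
    # Fallbacks: drop everything after the most recent comma, then the one before, etc.
    attempts += [prefix[:i] + closers for i, closers in reversed(cuts)]
    for text in attempts:
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue
    return None


print(parse_partial_prefix('{"title": "Stre'))
# {'title': 'Stre'}
print(parse_partial_prefix('{"title": "Streaming task", "pr'))
# {'title': 'Streaming task'}
```

The second call shows the fallback: the dangling key "pr" has no value yet, so the sketch keeps only the fields that are already complete.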
Extract from text
Turn unstructured text into typed objects — with source grounding:
from pydantic import BaseModel
from parsantic import extract
class Person(BaseModel):
    name: str
    role: str
    years_experience: int
result = extract(
    "Dr. Sarah Chen is a principal ML engineer at Anthropic (3 years).",
    Person,
    model="gemini:gemini-2.5-flash-lite",
)
result.value
# Person(name='Sarah Chen', role='principal ML engineer', years_experience=3)
# Every extracted value is grounded back to the source text
result.evidence[0]
# FieldEvidence(path='/name', value_preview='Sarah Chen', char_interval=(4, 14), ...)
# Per-field support metadata is available on the result
result.field_statuses[0]
# FieldStatus(path='/name', support='exact', confidence=1.0)
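The char_interval in the evidence above reads as a half-open (start, end) range into the source string. As a rough illustration of exact-match grounding only, the sketch below locates a value with str.find; the helper name `ground_exact` is hypothetical, and parsantic's own alignment is presumably more involved than a substring search.

```python
def ground_exact(value: str, source: str):
    """Return the half-open (start, end) character interval of value in source, or None."""
    start = source.find(value)
    if start == -1:
        return None  # no exact support in the source text
    return (start, start + len(value))


source = "Dr. Sarah Chen is a principal ML engineer at Anthropic (3 years)."
interval = ground_exact("Sarah Chen", source)
print(interval)                           # (4, 14)
print(source[interval[0] : interval[1]])  # Sarah Chen
```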
Coerce tool arguments
LLM tool calls return raw dicts with wrong types and casing.
coerce() fixes them against your schema — no string parsing needed:
from parsantic import coerce
# Raw dict from an LLM tool call
tool_args = {"title": "Deploy", "priority": "HIGH", "days_left": "2", "done": "true"}
task = coerce(tool_args, Task).value
# Task(title='Deploy', priority=<Priority.HIGH: 'high'>, days_left=2, done=True)
The coercion engine handles case-insensitive and accent-insensitive enum matching, string-to-number conversion, key normalization, and more — each tracked with a penalty score so the least-edited interpretation always wins.
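The matching rules can be sketched roughly as follows. This is an assumption-laden illustration, not the library's engine: the function names and the penalty values 0, 1, and 2 are made up for the example, and only enum matching is shown.

```python
import unicodedata
from enum import Enum


class Priority(str, Enum):
    LOW = "low"
    HIGH = "high"
    CRITICAL = "critical"


def fold(text: str) -> str:
    """Strip combining accents and case so 'Crítical' folds to 'critical'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def match_enum(raw: str, enum_cls):
    """Return (member, penalty, flags); a lower penalty means fewer edits were needed."""
    candidates = []
    for member in enum_cls:
        if raw == member.value:
            candidates.append((0, (), member))  # exact match costs nothing
        elif raw.casefold() == member.value.casefold():
            candidates.append((1, ("case_insensitive",), member))
        elif fold(raw) == fold(member.value):
            candidates.append((2, ("accent_insensitive",), member))
    if not candidates:
        raise ValueError(f"{raw!r} matches no member of {enum_cls.__name__}")
    penalty, flags, member = min(candidates, key=lambda c: c[0])
    return member, penalty, flags


print(match_enum("HIGH", Priority))
print(match_enum("Cr\u00edtical", Priority))
```

Keeping the penalty next to the flags is what lets a caller prefer the least-edited interpretation when several members could match.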
Update existing objects
Once you've extracted a large object, new information may arrive. Asking the
LLM to regenerate all 50 fields risks silently dropping data it wasn't
paying attention to. update() handles this — it asks the LLM to produce
only the changes as JSON Patch operations, applies them, and validates the
result:
from pydantic import BaseModel
from parsantic import update
class User(BaseModel):
    name: str
    role: str
    skills: list[str]
    years_experience: int
profile = {
    "name": "Alex Chen",
    "role": "Software Engineer",
    "skills": ["Python", "TypeScript", "SQL"],
    "years_experience": 3,
}
result = update(
    existing=profile,
    instruction="Alex got promoted to Senior Engineer and picked up Rust.",
    target=User,
    model="openai:gpt-4o-mini",
)
result.value
# User(name='Alex Chen', role='Senior Software Engineer',
#      skills=['Python', 'TypeScript', 'SQL', 'Rust'], years_experience=3)
result.patches
# [JsonPatchOp(op='replace', path='/role', value='Senior Software Engineer'),
#  JsonPatchOp(op='add', path='/skills/-', value='Rust')]
The original document is never mutated. Under the hood, update() prompts
the LLM for RFC 6902 patches, parses the messy response with parse(),
applies the patches with safety rails (remove disabled by default), and
validates the result with schema-aware coercion. If validation fails, it
automatically retries with the error context.
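The patch-application step can be sketched with the standard library. The code below handles only a small subset of RFC 6902 (add, replace, remove) on non-root paths, copies the document before touching it, and rejects remove unless explicitly allowed. The function name `apply_patch` and the `allow_remove` flag are hypothetical stand-ins for the safety rails described above, not parsantic's API.

```python
import copy


def apply_patch(doc, ops, allow_remove=False):
    """Apply add/replace/remove operations to a deep copy of doc and return the copy."""
    result = copy.deepcopy(doc)  # the caller's document is never mutated
    for op in ops:
        kind = op["op"]
        if kind not in {"add", "replace", "remove"}:
            raise ValueError(f"unsupported op {kind!r}")
        if kind == "remove" and not allow_remove:
            raise PermissionError("remove operations are disabled")
        # Decode JSON Pointer tokens (RFC 6901): "~1" means "/", "~0" means "~".
        tokens = [t.replace("~1", "/").replace("~0", "~") for t in op["path"].split("/")[1:]]
        parent = result
        for token in tokens[:-1]:
            parent = parent[int(token)] if isinstance(parent, list) else parent[token]
        last = tokens[-1]
        if isinstance(parent, list):
            if kind == "add":
                index = len(parent) if last == "-" else int(last)  # "-" appends
                parent.insert(index, op["value"])
            elif kind == "replace":
                parent[int(last)] = op["value"]
            else:
                del parent[int(last)]
        elif kind == "remove":
            del parent[last]
        else:
            if kind == "replace" and last not in parent:
                raise KeyError(f"cannot replace missing key {last!r}")
            parent[last] = op["value"]
    return result


profile = {"name": "Alex Chen", "role": "Software Engineer", "skills": ["Python", "SQL"]}
ops = [
    {"op": "replace", "path": "/role", "value": "Senior Software Engineer"},
    {"op": "add", "path": "/skills/-", "value": "Rust"},
]
updated = apply_patch(profile, ops)
print(updated["role"], updated["skills"])
print(profile["skills"])  # original list is unchanged
```

After this step the result would still need to be validated against the target schema; that part is omitted here.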
Extract from PDFs and images
Pass a Document instead of a string to extract structured data from visual
content — scanned invoices, screenshots, research papers:
from pathlib import Path
from pydantic import BaseModel
from parsantic import extract
from parsantic.extract import Document
class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    total: float
result = extract(
    Document.from_pdf(Path("invoice.pdf")),
    Invoice,
    model="gemini:gemini-3.1-flash-lite-preview",
)
result.value
# Invoice(invoice_number='INV-2024-001', vendor='Acme Corp', total=1250.00)
Images work the same way:
result = extract(
    Document.from_image(Path("receipt.jpg")),
    Invoice,
    model="openai:gpt-4o-mini",
)
By default, PDFs with a text layer are extracted as text (no vision cost); otherwise pages are rasterized to images.
[!IMPORTANT] Multi-PDF extraction into a single Document is temporarily disabled while attachment-aware provenance is being implemented. For case-level workflows, create one Document per PDF and run them with extract_batch() or aextract_batch().
For multi-document async extraction, prefer one Document per input PDF and
keep concurrency at the batch layer instead of mixing app-level fanout with
thread-wrapped sync calls:
import asyncio
from pathlib import Path
from pydantic import BaseModel
from parsantic.extract import Document, aextract_batch
class Invoice(BaseModel):
    invoice_number: str = ""
    vendor: str = ""
    total: float = 0.0
async def main() -> None:
    docs = [
        Document.from_pdf(Path("invoice-1.pdf"), document_id="invoice-1"),
        Document.from_pdf(Path("invoice-2.pdf"), document_id="invoice-2"),
    ]
    result = await aextract_batch(
        docs,
        Invoice,
        model="gemini:gemini-3.1-flash-lite-preview",
    )
    print([item.value for item in result.results])
asyncio.run(main())
For a richer end-to-end example, see:
- examples/demo_pdf.py for a synthetic oncology summary extracted into a FHIR-shaped bundle with page provenance
- examples/demo_pdf_modes.py for a side-by-side comparison of the PDF modes
- examples/demo_page_selection.py for deterministic page pruning before extraction
Use mode when you want to force a higher-level PDF strategy:
from parsantic.extract import ExtractOptions
# Whole document in one call.
result = extract(
    Document.from_pdf(pdf_bytes),
    Invoice,
    model="gemini:gemini-3.1-flash-lite-preview",
    options=ExtractOptions(mode="document", document_input="native"),
)
# Page-by-page vision with page provenance.
result = extract(
    Document.from_pdf(pdf_bytes),
    Invoice,
    model="gemini:gemini-3.1-flash-lite-preview",
    options=ExtractOptions(mode="page"),
)
# Hybrid: whole-document native PDF + page images for page-grounded fields.
result = extract(
    Document.from_pdf(pdf_bytes),
    Invoice,
    model="gemini:gemini-3.1-flash-lite-preview",
    options=ExtractOptions(
        mode="hybrid",
        document_input="native",
        page_input="image",
    ),
)
print(result.sources["/total"])   # SourceRef(scope="page", pages=(1,))
print(result.sources["/vendor"])  # SourceRef(scope="document", pages=())
Use strategy when you want the explicit whole-document grounded plan:
from parsantic.extract import ExtractOptions, Strategy
result = extract(
    Document.from_pdf(pdf_bytes),
    Invoice,
    model="gemini:gemini-3.1-flash-lite-preview",
    options=ExtractOptions(
        strategy=Strategy(plan="document_grounded"),
    ),
)
Deterministic page selection
For long PDFs with sparse relevant content, you can run a cheap page-analysis pass first, deterministically select a subset of pages, then extract only from that subset.
from pathlib import Path
from pydantic import BaseModel, Field
from parsantic import extract
from parsantic.extract import Document, analyze_pdf_source, select_pdf_pages
class LabsOnly(BaseModel):
    hemoglobin_g_dl: float = Field(description="Hemoglobin lab value in g/dL")
    creatinine_mg_dl: float = Field(description="Creatinine lab value in mg/dL")
pdf_path = Path("oncology-packet.pdf")
analysis = analyze_pdf_source(pdf_path)
selection = select_pdf_pages(analysis, LabsOnly, window=1, max_pages=4)
result = extract(
    Document.from_pdf(pdf_path, page_indices=selection.page_indices),
    LabsOnly,
    model="gemini:gemini-2.5-flash-lite",
)
print(selection.page_indices)
print(selection.fallback_reason)
print(result.value)
This v1 flow is intentionally narrow:
- deterministic only
- page-level only
- opt-in
- fail-open: when selection is too broad or uncertain, it falls back to the full document
When page_indices are used with native PDF input, Parsantic now uploads a
PDF containing only those selected pages. If local PDF rewriting support is
unavailable, it falls back to sending the original PDF with page hints instead.
This works best on long PDFs with a usable text layer. For scan-heavy PDFs, the selector will often fail open and keep the full document.
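The selection behaviour described above (deterministic, page-level, fail-open) can be illustrated with a toy keyword selector. How parsantic actually scores pages is not documented here, so everything in this sketch is an assumption: the function name `select_pages`, the keyword matching, and the fallback reason strings are all hypothetical.

```python
def select_pages(page_texts, keywords, window=1, max_pages=4):
    """Return (page_indices, fallback_reason) from simple keyword hits per page."""
    all_pages = list(range(len(page_texts)))
    lowered = [kw.lower() for kw in keywords]
    hits = [
        i for i, text in enumerate(page_texts)
        if any(kw in text.lower() for kw in lowered)
    ]
    if not hits:
        # Fail open: nothing matched (for example, a scanned PDF with no text layer).
        return all_pages, "no_matching_pages"
    selected = set()
    for i in hits:
        # Keep `window` neighbouring pages on each side of every hit.
        lo, hi = max(0, i - window), min(len(page_texts) - 1, i + window)
        selected.update(range(lo, hi + 1))
    if len(selected) > max_pages:
        # Fail open: the selection is too broad to be worth pruning.
        return all_pages, "selection_too_broad"
    return sorted(selected), None


pages = [
    "Cover sheet",
    "Clinical history",
    "Labs: hemoglobin 13.1 g/dL",
    "Labs continued: creatinine 0.9 mg/dL",
    "Treatment plan",
    "Billing",
]
print(select_pages(pages, ["hemoglobin", "creatinine"]))
# ([1, 2, 3, 4], None)
print(select_pages(pages, ["hemoglobin", "creatinine"], max_pages=2))
# ([0, 1, 2, 3, 4, 5], 'selection_too_broad')
```

Because the selector only uses the page text and fixed parameters, the same input always yields the same page subset.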
Lower-level PDF and image control
For lower-level PDF/image control, MediaOptions is still available:
from parsantic.extract.options import ExtractOptions, MediaOptions
result = extract(
    Document.from_pdf(pdf_bytes),
    Invoice,
    model="openai:gpt-4o-mini",
    options=ExtractOptions(
        media=MediaOptions(pdf_mode="raster", page_strategy="single"),
    ),
)
For rasterized PDFs/images, page_strategy="single" is usually the best default for
flat schemas. Use page_strategy="map_reduce" when you need per-page provenance or
the document is too long to bundle in one request.
| mode | Behavior |
|---|---|
| "auto" | Use text extraction for text-layer PDFs, otherwise rasterize pages (default) |
| "document" | Run one whole-document extraction |
| "page" | Run page-by-page extraction |
| "hybrid" | Run both a whole-document branch and a page branch, then merge |

Strategy plans:
| strategy | Behavior |
|---|---|
| "balanced" | Default strategy preset; resolves to the document_auto path |
| Strategy(plan="document_grounded") | Whole-document extraction with page-aware evidence grounding when page text boundaries are available |

| document_input | Behavior |
|---|---|
| "auto" | Let parsantic choose the whole-document representation |
| "native" | Send the raw PDF binary to the model |
| "image" | Rasterize the PDF and bundle page images into one whole-document request |

| page_input | Behavior |
|---|---|
| "auto" | Use the default page-grounded representation |
| "image" | Rasterize each PDF page to an image |

Advanced MediaOptions:
| pdf_mode | Behavior |
|---|---|
| "auto" | Text layer → text extraction; otherwise rasterize (default) |
| "native" | Send raw PDF binary to the model |
| "raster" | Convert every page to JPEG/PNG |
Vertex AI support
Use vertex: prefix with any Gemini model to route through Vertex AI:
result = extract(
    "Dr. Sarah Chen is a principal ML engineer at Anthropic.",
    Person,
    model="vertex:gemini-2.5-flash",
    provider_kwargs={"project_id": "my-project", "region": "us-central1"},
)
Credentials are resolved automatically from environment variables
(VERTEX_PROJECT_ID, VERTEX_REGION, GOOGLE_APPLICATION_CREDENTIALS)
or from gcloud auth application-default login.
Native structured output
When the model supports it (Gemini, OpenAI, etc.), parsantic can use the
provider's native JSON schema constraints instead of prompt-based extraction.
This is enabled by default ("auto") and falls back to prompt mode
transparently:
result = extract(
    text,
    MySchema,
    model="gemini:gemini-3.1-flash-lite-preview",
    options=ExtractOptions(structured_output="native"),  # or "auto" (default)
)
| structured_output | Behavior |
|---|---|
| "auto" | Use native mode if the model supports it, otherwise prompt (default) |
| "native" | Force native JSON schema constraints |
| "prompt" | Always use prompt-based extraction |

If native mode fails validation, parsantic automatically recovers the raw JSON from the response and runs it through the local repair pipeline.
Streaming extraction
When the provider can stream structured output, parsantic can surface
typed partial objects during extraction, not just during low-level parsing.
For PDFs, the current streaming path is intentionally narrow: use a
single whole-document request (pdf_mode="native" or page_strategy="single"),
not page map-reduce or hybrid mode.
from pydantic import BaseModel
from parsantic import extract_stream
from parsantic.extract import Document, ExtractOptions, MediaOptions
class Invoice(BaseModel):
    vendor: str = ""
    total: float
events = extract_stream(
    Document.from_pdf("invoice.pdf"),
    Invoice,
    model="gemini:gemini-2.5-flash",
    options=ExtractOptions(
        structured_output="native",
        media=MediaOptions(pdf_mode="native", page_strategy="single"),
    ),
)
for event in events:
    if event.is_final:
        print("final:", event.result.value)
    else:
        print("partial:", event.value)
Candidate scoring
When the input is ambiguous, parsantic generates multiple candidate
interpretations and picks the one requiring the fewest transformations:
from parsantic import parse_debug
debug = parse_debug('{"title": "Review PR", "priority": "Critical", "days_left": 1}', Task)
for c in debug.candidates:
    print(f" score={c.score} flags={c.flags}")
# score=-1 flags=() ← direct JSON parse (failed validation)
# score=3 flags=('case_insensitive',) ← coerced "Critical" → Priority.CRITICAL
print(debug.value)
# Task(title='Review PR', priority=<Priority.CRITICAL: 'critical'>, days_left=1, done=False)
Every coercion is tagged with a flag and a cost. You can inspect exactly what happened and why.
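As a mental model, candidate selection can be thought of as summing a cost per flag and keeping the cheapest candidate that passes validation. The sketch below assumes exactly that. The weights are invented for the example (3 for case_insensitive simply mirrors the score printed above), and `Candidate`, `FLAG_COSTS`, and `pick_best` are hypothetical names, not parsantic's internals.

```python
from dataclasses import dataclass

# Hypothetical weights; only the value for "case_insensitive" echoes the output above.
FLAG_COSTS = {"case_insensitive": 3, "string_to_int": 2, "key_normalized": 1}


@dataclass(frozen=True)
class Candidate:
    value: dict
    flags: tuple = ()
    valid: bool = True

    @property
    def score(self) -> int:
        """Total cost of all coercions applied to reach this candidate."""
        return sum(FLAG_COSTS[flag] for flag in self.flags)


def pick_best(candidates):
    """Return the valid candidate that needed the cheapest set of coercions."""
    viable = [c for c in candidates if c.valid]
    if not viable:
        raise ValueError("no candidate passed validation")
    return min(viable, key=lambda c: c.score)


candidates = [
    Candidate({"priority": "Critical", "days_left": 1}, valid=False),  # raw parse
    Candidate({"priority": "critical", "days_left": 1}, ("case_insensitive",)),
    Candidate({"priority": "critical", "days_left": 1}, ("case_insensitive", "key_normalized")),
]
best = pick_best(candidates)
print(best.score, best.flags)  # 3 ('case_insensitive',)
```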
Comparison with similar libraries
parsantic focuses on one thing: getting a valid typed object from messy
LLM text with the least effort and fewest LLM calls.
Quick comparison
| | parsantic | BAML | trustcall | llguidance | LangExtract |
|---|---|---|---|---|---|
| Approach | Fix output locally | Fix output locally | Patch via LLM retry | Prevent at token level | LLM extraction pipeline |
| Handles invalid JSON text | Yes (local repair) | Yes (local repair) | N/A (tool calling) | N/A (constrained decoding) | No (expects valid JSON/YAML; can strip fences) |
| Streaming | Typed partial objects (Partial model) | Typed partial objects (generated Partial types) | — | Token-level masks | — |
| Updates | JSON Patch (targeted) | — | JSON Patch (LLM-generated) | — | Re-run extraction |
| Source grounding | Char/token spans + alignment | — | — | — | Char spans + fuzzy alignment |
| Schema | Pydantic models | .baml DSL | Pydantic / functions / JSON Schema | JSON Schema (subset) / CFG / regex | Example-driven (own format) |
| Candidate scoring | Weighted flags, inspectable | Scoring heuristics (internal) | — | — | — |
| Install | pip install | pip install (+ BAML CLI / codegen) | pip install | pip install (Rust-backed) | pip install |
| Extra LLM calls on validation failure | 0 | 0 | Yes (patch retries; configurable) | 0 | 0 |
Detailed comparison
Each tool takes a fundamentally different approach to the structured-output problem. Here is how they differ in practice.
BAML — Schema-Aligned Parsing in Rust
BAML is the closest in philosophy: let the LLM generate freely, then fix the output locally. Its Rust-based parser handles the same classes of breakage (markdown fences, trailing commas, wrong-case keys, partial objects) and applies schema-aware coercion with a cost function.
Where BAML goes further:
- Dedicated .baml DSL with multi-language code generation (Python, TS, Ruby, Go, etc.)
- VS Code playground for live prompt testing
- Compact prompt schema (BAML docs claim ~80% fewer tokens than JSON Schema; varies by schema)
- @check/@assert validators on output fields
- Dynamic types via TypeBuilder for runtime schema changes
- Multi-modal support (images, audio as first-class prompt inputs)
- Retry policies with exponential backoff and fallback client chains
Where parsantic goes further:
- Pure Python — no DSL, no code generation step
- Native Pydantic models as the schema (no new language to learn)
- JSON Patch support for targeted updates without full regeneration
- Source grounding with character/token-level evidence alignment
- Transparent candidate scoring with inspectable flags and costs
- Multi-pass extraction with non-overlapping span merging
trustcall — Patch-Based Retry via LangChain
trustcall wraps LLM tool-calling with automatic validation and repair. When a tool call fails Pydantic validation, it asks the LLM to generate JSON Patch operations to fix the error rather than regenerating the entire output.
Where trustcall goes further:
- Simultaneous updates and insertions in one pass (enable_inserts=True)
- Works with LangChain chat models that support tool calling (broad provider coverage)
- Supports Pydantic models, plain functions, JSON Schema, and LangChain tools as input
- Graph-based execution with parallel tool-call validation (LangGraph)
- Optional deletes + policies for existing docs (enable_deletes, existing_schema_policy)
Where parsantic goes further:
- Zero extra LLM calls — all repairs are deterministic and local
- Streaming partial objects while the LLM is still generating
- Schema-aware coercion (enum matching, type conversion) without LLM involvement
- Candidate scoring shows exactly what was changed and why
- Source grounding ties extracted values back to source text positions
- No LangChain / LangGraph dependency
llguidance — Constrained Decoding at the Token Level
llguidance takes the opposite approach: instead of fixing broken output, it prevents invalid output from being generated. At each decoding step it computes a bitmask of valid tokens and blocks everything else.
Where llguidance goes further:
- Guarantees output that conforms to the provided grammar — no post-processing needed
- Supports context-free grammars (Lark-like syntax) beyond JSON Schema
- Parametric grammars for combinatorial structures (permutations, unique lists)
- ~50 μs per token mask for a 128k tokenizer (highly optimized Rust; depends on grammar)
- Powers OpenAI Structured Outputs; integrated into vLLM, SGLang, llama.cpp, and Chromium
Where parsantic goes further:
- Works with any LLM API — no inference-engine access needed (though llguidance is also available transparently via OpenAI's Structured Outputs API)
- Handles output that is already generated (logs, cached responses, tool-call results)
- JSON Patch updates, streaming partial objects, source grounding
- Candidate scoring with transparent coercion flags
- Pure Python, no Rust compilation or special deployment
- Handles messy real-world output (markdown, comments, surrounding text) that constrained decoding never produces but APIs frequently return
LangExtract — Extraction Pipeline with Visualization
LangExtract (Google) is an extraction-focused pipeline. It chunks long documents, runs few-shot prompting in parallel, and aligns results back to source text with interactive HTML visualization.
Where LangExtract goes further:
- Interactive HTML visualization with hover tooltips and colored highlighting
- Native Vertex AI Batch API integration for cost-efficient large-scale extraction
- Provider plugin system; Gemini provider supports schema-constrained output
- Schema derived from examples (no separate schema definition needed)
Where parsantic goes further:
- Local JSON fixing and schema-aware coercion (LangExtract parses JSON/YAML but does not repair invalid JSON)
- Streaming partial objects during generation
- JSON Patch for targeted document updates
- Candidate scoring with inspectable coercion trace
- Pydantic models as the schema (type-safe, IDE-friendly)
- Works as a standalone parser without any LLM — useful for cached/logged responses
When to use what
| Scenario | Recommended tool |
|---|---|
| Parse messy LLM output into Pydantic models, no extra LLM calls | parsantic |
| Apply small updates to existing objects without regeneration | parsantic (JSON Patch) or trustcall (LLM-assisted) |
| Need source-grounded evidence spans from extracted data | parsantic or LangExtract |
| Guaranteed valid structure via cloud API (no self-hosting) | llguidance (via OpenAI Structured Outputs) |
| Own the inference engine and want grammar-level control | llguidance |
| Want a full DSL with code generation, VS Code tooling, and multi-modal | BAML |
| Production LLM orchestration with retries and fallback chains | BAML |
| Complex nested schemas that fail standard tool calling | BAML (SAP parsing) or trustcall (patch retries) |
| Validate and repair multiple tool calls in parallel | trustcall |
| Large-scale batch extraction with Vertex AI | LangExtract |
| Streaming typed partial objects during generation | parsantic or BAML |
Benchmarks
For current recommendations, strategy tradeoffs, and measured latency / provenance snapshots, see:
Development
uv sync
make test # 530 tests
make check # lint + format
make fmt # auto-fix
License
Apache-2.0