Schema-aligned parsing

parsantic

The structured extraction toolkit: parse, stream, extract, update, patch, and coerce LLM output — locally, deterministically, with one clean API.

Install

uv add parsantic

For LLM extraction and update features (OpenAI, Anthropic, Gemini, etc.):

uv add "parsantic[ai]"

What it does

LLM output is messy. Models wrap JSON in markdown, add trailing commas, use wrong-case enum values, and return partial objects mid-stream. Most tools deal with this by retrying the LLM call. parsantic fixes it locally in one pass:

from enum import Enum
from pydantic import BaseModel
from parsantic import parse

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class Task(BaseModel):
    title: str
    priority: Priority
    days_left: int
    done: bool = False

# The LLM returned this mess:
llm_output = """
Sure! Here's the task you requested:

```json
{
    // Task details
    "Title": "Fix the login bug",
    "priority": "HIGH",
    "Days-Left": "3",
    "done": false,
}
```

Let me know if you need anything else!
"""

task = parse(llm_output, Task).value
# Task(title='Fix the login bug', priority=<Priority.HIGH: 'high'>, days_left=3, done=False)

One call. No retry. Markdown fences, comments, surrounding prose, wrong-case keys, kebab-to-snake key normalization, enum coercion, string-to-int coercion, trailing commas — all handled.
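To make the idea of local repair concrete, here is a minimal standard-library sketch of a few of the fixes listed above (surrounding prose, fences, `//` comments, trailing commas, key casing). It is not parsantic's implementation: `repair_json_text` and `normalize_key` are hypothetical helper names, and the regexes are deliberately naive.

```python
import json
import re


def repair_json_text(text: str) -> dict:
    """Pull the first JSON object out of chatty LLM output and parse it."""
    # Keep only the span from the first "{" to the last "}".
    # This drops surrounding prose and markdown fences in one step.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    body = text[start : end + 1]
    # Remove // line comments. Naive: assumes "//" never occurs inside a string.
    body = re.sub(r"//[^\n]*", "", body)
    # Remove trailing commas that sit right before a closing brace or bracket.
    body = re.sub(r",\s*([}\]])", r"\1", body)
    return json.loads(body)


def normalize_key(key: str) -> str:
    """Map keys such as 'Days-Left' or 'Title' to 'days_left' and 'title'."""
    return key.strip().lower().replace("-", "_")


fence = "`" * 3  # built at runtime so the example contains no literal fence
raw = "\n".join([
    "Sure! Here's the task you requested:",
    fence + "json",
    "{",
    "    // Task details",
    '    "Title": "Fix the login bug",',
    '    "Days-Left": "3",',
    '    "done": false,',
    "}",
    fence,
    "Let me know if you need anything else!",
])

data = {normalize_key(k): v for k, v in repair_json_text(raw).items()}
print(data)
# {'title': 'Fix the login bug', 'days_left': '3', 'done': False}
```

Values are left exactly as parsed here (`days_left` is still the string "3"); coercing them against the schema is a separate step.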

It works even when the JSON is still arriving. Feed tokens as they come and get valid, typed partial objects back while the LLM is still generating:

from parsantic import parse_stream

stream = parse_stream(Task)

stream.feed('{"title": "Stre')
print(stream.parse_partial().value)
# TaskPartial(title='Stre', priority=None, days_left=None, done=None)  ← partial but typed

stream.feed('aming task", "pr')
print(stream.parse_partial().value)
# TaskPartial(title='Streaming task', priority=None, days_left=None, done=None)

stream.feed('iority": "low", "days_left": 5}')
task = stream.finish().value
# Task(title='Streaming task', priority=<Priority.LOW: 'low'>, days_left=5, done=False)

Every call to parse_partial() returns a valid Pydantic object (a generated TaskPartial with all-optional fields) with whatever values are available so far. No waiting for the full response.
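One way to picture partial parsing is to close whatever is still open in the buffered prefix and parse the result. The sketch below does this with the standard library only. It is an illustration under simplifying assumptions, not parsantic's algorithm: `close_partial_json` and `parse_partial` are hypothetical names, the output is a plain dict rather than a typed partial model, and the fallback of cutting at the last comma is a crude heuristic.

```python
import json


def close_partial_json(fragment: str) -> str:
    """Append the closers needed to balance a JSON prefix."""
    stack = []          # pending closers, innermost last
    in_string = False
    escaped = False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()
    closed = fragment + ('"' if in_string else "")
    return closed + "".join(reversed(stack))


def parse_partial(buffer: str):
    """Best-effort parse of a JSON prefix; returns None if nothing parses yet."""
    candidate = buffer
    while candidate:
        try:
            return json.loads(close_partial_json(candidate))
        except json.JSONDecodeError:
            # Drop the incomplete tail after the last comma and try again.
            cut = candidate.rfind(",")
            if cut == -1:
                return None
            candidate = candidate[:cut]
    return None


buffer = ""
snapshots = []
for chunk in ['{"title": "Stre', 'aming task", "pr', 'iority": "low", "days_left": 5}']:
    buffer += chunk
    snapshots.append(parse_partial(buffer))

print(snapshots[0])  # {'title': 'Stre'}
print(snapshots[1])  # {'title': 'Streaming task'}
print(snapshots[2])  # {'title': 'Streaming task', 'priority': 'low', 'days_left': 5}
```

The second snapshot shows why the fallback matters: the dangling key `"pr` cannot be closed into valid JSON, so the sketch discards it and keeps the fields that are already complete.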

Extract from text (requires `[ai]` extra)

Turn unstructured text into typed objects — with source grounding:

from pydantic import BaseModel
from parsantic import extract

class Person(BaseModel):
    name: str
    role: str
    years_experience: int

result = extract(
    "Dr. Sarah Chen is a principal ML engineer at Anthropic (3 years).",
    Person,
    model="openai:gpt-4o-mini",
)
result.value
# Person(name='Sarah Chen', role='principal ML engineer', years_experience=3)

# Every extracted value is grounded back to the source text
result.evidence[0]
# FieldEvidence(path='/name', value_preview='Sarah Chen', char_interval=(4, 14), ...)
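As a rough picture of what a character interval means, the sketch below locates each extracted value in the source by exact substring search. This is only an assumption-laden illustration: `find_span` is a hypothetical helper, and exact matching is a stand-in for whatever alignment the library actually performs.

```python
def find_span(source: str, value: str):
    """Return (start, end) of the first exact occurrence of value, or None."""
    start = source.find(value)
    if start == -1:
        return None
    return (start, start + len(value))


source = "Dr. Sarah Chen is a principal ML engineer at Anthropic (3 years)."
fields = {
    "/name": "Sarah Chen",
    "/role": "principal ML engineer",
    "/years_experience": 3,
}

# Non-string values are searched by their string form.
spans = {path: find_span(source, str(value)) for path, value in fields.items()}
print(spans["/name"])  # (4, 14)
```

Slicing the source with a span (`source[4:14]`) gives back the grounded text, which is what makes the interval useful for highlighting or auditing.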

Coerce tool arguments

LLM tool calls return raw dicts with wrong types and casing. coerce() fixes them against your schema — no string parsing needed:

from parsantic import coerce

# Raw dict from an LLM tool call
tool_args = {"title": "Deploy", "priority": "HIGH", "days_left": "2", "done": "true"}

task = coerce(tool_args, Task).value
# Task(title='Deploy', priority=<Priority.HIGH: 'high'>, days_left=2, done=True)

The coercion engine handles case-insensitive and accent-insensitive enum matching, string-to-number conversion, key normalization, and more — each tracked with a penalty score so the least-edited interpretation always wins.
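The penalty idea can be sketched for a single enum field as follows. This is a simplified illustration, not the library's engine: `coerce_enum` and `strip_accents` are hypothetical helpers, the flag name `accent_insensitive` is invented for the example, and the penalty numbers are arbitrary assumptions.

```python
import unicodedata
from enum import Enum


class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


def strip_accents(text: str) -> str:
    """Decompose characters and drop combining marks (e.g. 'í' becomes 'i')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def coerce_enum(raw: str, enum_cls):
    """Return (member, penalty, flags) for the cheapest match, or None."""
    candidates = []
    for member in enum_cls:
        if raw == member.value:
            candidates.append((0, (), member))  # exact match, no edits
        elif raw.casefold() == member.value.casefold():
            candidates.append((3, ("case_insensitive",), member))
        elif strip_accents(raw).casefold() == strip_accents(member.value).casefold():
            candidates.append((4, ("accent_insensitive",), member))
    if not candidates:
        return None
    # The least-edited interpretation wins.
    penalty, flags, member = min(candidates, key=lambda c: c[0])
    return member, penalty, flags


print(coerce_enum("high", Priority))      # exact match, penalty 0
print(coerce_enum("HIGH", Priority))      # case-insensitive match, penalty 3
print(coerce_enum("Crítical", Priority))  # accent-insensitive match, penalty 4
print(coerce_enum("urgent", Priority))    # None
```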

Update existing objects (requires `[ai]` extra)

Once you've extracted a large object, new information may arrive. Asking the LLM to regenerate all 50 fields risks silently dropping data it wasn't paying attention to. update() handles this — it asks the LLM to produce only the changes as JSON Patch operations, applies them, and validates the result:

from pydantic import BaseModel
from parsantic import update

class User(BaseModel):
    name: str
    role: str
    skills: list[str]
    years_experience: int

profile = {
    "name": "Alex Chen",
    "role": "Software Engineer",
    "skills": ["Python", "TypeScript", "SQL"],
    "years_experience": 3,
}

result = update(
    existing=profile,
    instruction="Alex got promoted to Senior Engineer and picked up Rust.",
    target=User,
    model="openai:gpt-4o-mini",
)
result.value
# User(name='Alex Chen', role='Senior Software Engineer',
#      skills=['Python', 'TypeScript', 'SQL', 'Rust'], years_experience=3)
result.patches
# [JsonPatchOp(op='replace', path='/role', value='Senior Software Engineer'),
#  JsonPatchOp(op='add', path='/skills/-', value='Rust')]

The original document is never mutated. Under the hood, update() prompts the LLM for RFC 6902 patches, parses the messy response with parse(), applies the patches with safety rails (remove disabled by default), and validates the result with schema-aware coercion. If validation fails, it automatically retries with the error context.
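The patch-application step can be illustrated with a small standard-library sketch. It covers only the `add`, `replace`, and `remove` operations of RFC 6902, works on a deep copy, and rejects `remove` unless explicitly allowed. `apply_patch` and its `allow_remove` parameter are hypothetical names for this example and are not parsantic's API.

```python
import copy


def apply_patch(document, ops, allow_remove=False):
    """Apply add/replace/remove operations (a subset of RFC 6902) to a deep copy."""
    result = copy.deepcopy(document)  # the caller's object is never mutated
    for op in ops:
        kind, path = op["op"], op["path"]
        if kind not in ("add", "replace", "remove"):
            raise ValueError(f"unsupported op: {kind}")
        if kind == "remove" and not allow_remove:
            raise ValueError("remove is disabled unless allow_remove=True")
        # JSON Pointer (RFC 6901): split on "/" and unescape ~1, then ~0.
        tokens = [t.replace("~1", "/").replace("~0", "~") for t in path.split("/")[1:]]
        if not tokens:
            raise ValueError("whole-document paths are not handled in this sketch")
        parent = result
        for token in tokens[:-1]:
            parent = parent[int(token)] if isinstance(parent, list) else parent[token]
        last = tokens[-1]
        if isinstance(parent, list):
            if kind == "add":
                # "-" means "append to the end of the array".
                parent.insert(len(parent) if last == "-" else int(last), op["value"])
            elif kind == "replace":
                parent[int(last)] = op["value"]
            else:
                del parent[int(last)]
        else:
            if kind != "add" and last not in parent:
                raise KeyError(f"path does not exist: {path}")
            if kind == "remove":
                del parent[last]
            else:
                parent[last] = op["value"]
    return result


profile = {
    "name": "Alex Chen",
    "role": "Software Engineer",
    "skills": ["Python", "TypeScript", "SQL"],
    "years_experience": 3,
}
patches = [
    {"op": "replace", "path": "/role", "value": "Senior Software Engineer"},
    {"op": "add", "path": "/skills/-", "value": "Rust"},
]

updated = apply_patch(profile, patches)
print(updated["role"])    # Senior Software Engineer
print(updated["skills"])  # ['Python', 'TypeScript', 'SQL', 'Rust']
print(profile["skills"])  # ['Python', 'TypeScript', 'SQL']
```

Because the patch touches only `/role` and `/skills`, every other field is carried over untouched, which is the property that makes targeted updates safer than full regeneration.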

Extract from PDFs and images (requires `[ai]` extra)

Pass a Document instead of a string to extract structured data from visual content — scanned invoices, screenshots, research papers:

from pathlib import Path
from pydantic import BaseModel
from parsantic import extract
from parsantic.extract import Document

class Invoice(BaseModel):
    invoice_number: str
    vendor: str
    total: float

result = extract(
    Document.from_pdf(Path("invoice.pdf")),
    Invoice,
    model="gemini:gemini-3.1-flash-lite-preview",
)
result.value
# Invoice(invoice_number='INV-2024-001', vendor='Acme Corp', total=1250.00)

Images work the same way:

result = extract(
    Document.from_image(Path("receipt.jpg")),
    Invoice,
    model="openai:gpt-4o-mini",
)

By default, PDFs with a text layer are extracted as text (no vision cost); otherwise pages are rasterized to images.

For richer end-to-end examples, see:

examples/demo_pdf.py for a synthetic oncology summary extracted into a FHIR-shaped bundle with page provenance
examples/demo_pdf_modes.py for a side-by-side comparison of the PDF modes

Use mode when you want to force a higher-level PDF strategy:

from parsantic.extract import ExtractOptions

# Whole document in one call.
result = extract(
    Document.from_pdf(pdf_bytes),
    Invoice,
    model="gemini:gemini-3.1-flash-lite-preview",
    options=ExtractOptions(mode="document", document_input="native"),
)

# Page-by-page vision with page provenance.
result = extract(
    Document.from_pdf(pdf_bytes),
    Invoice,
    model="gemini:gemini-3.1-flash-lite-preview",
    options=ExtractOptions(mode="page"),
)

# Hybrid: whole-document native PDF + page images for page-grounded fields.
result = extract(
    Document.from_pdf(pdf_bytes),
    Invoice,
    model="gemini:gemini-3.1-flash-lite-preview",
    options=ExtractOptions(
        mode="hybrid",
        document_input="native",
        page_input="image",
    ),
)

print(result.sources["/total"])   # SourceRef(scope="page", pages=(1,))
print(result.sources["/vendor"])  # SourceRef(scope="document", pages=())

For lower-level PDF/image control, MediaOptions is still available:

from parsantic.extract.options import ExtractOptions, MediaOptions

result = extract(
    Document.from_pdf(pdf_bytes),
    Invoice,
    model="openai:gpt-4o-mini",
    options=ExtractOptions(
        media=MediaOptions(pdf_mode="raster", page_strategy="map_reduce"),
    ),
)

`mode`	Behavior
`"auto"`	Use text extraction for text-layer PDFs, otherwise rasterize pages (default)
`"document"`	Run one whole-document extraction
`"page"`	Run page-by-page extraction
`"hybrid"`	Run both a whole-document branch and a page branch, then merge

`document_input`	Behavior
`"auto"`	Let parsantic choose the whole-document representation
`"native"`	Send the raw PDF binary to the model
`"image"`	Rasterize the PDF and bundle page images into one whole-document request

`page_input`	Behavior
`"auto"`	Use the default page-grounded representation
`"image"`	Rasterize each PDF page to an image

Advanced MediaOptions:

`pdf_mode`	Behavior
`"auto"`	Text layer → text extraction; otherwise rasterize (default)
`"native"`	Send raw PDF binary to the model
`"raster"`	Convert every page to JPEG/PNG

Vertex AI support

Use the vertex: prefix with any Gemini model to route requests through Vertex AI:

result = extract(
    "Dr. Sarah Chen is a principal ML engineer at Anthropic.",
    Person,
    model="vertex:gemini-2.5-flash",
    provider_kwargs={"project_id": "my-project", "region": "us-central1"},
)

Credentials are resolved automatically from environment variables (VERTEX_PROJECT_ID, VERTEX_REGION, GOOGLE_APPLICATION_CREDENTIALS) or from gcloud auth application-default login.

Native structured output

When the model supports it (Gemini, OpenAI, etc.), parsantic can use the provider's native JSON schema constraints instead of prompt-based extraction. This is enabled by default ("auto") and falls back to prompt mode transparently:

result = extract(
    text,
    MySchema,
    model="gemini:gemini-3.1-flash-lite-preview",
    options=ExtractOptions(structured_output="native"),  # or "auto" (default)
)

`structured_output`	Behavior
`"auto"`	Use native mode if the model supports it, otherwise prompt (default)
`"native"`	Force native JSON schema constraints
`"prompt"`	Always use prompt-based extraction

If native mode fails validation, parsantic automatically recovers the raw JSON from the response and runs it through the local repair pipeline.

Streaming extraction

When the provider can stream structured output, parsantic can surface typed partial objects during extraction, not just during low-level parsing.

For PDFs, the current streaming path is intentionally narrow: use a single whole-document request (pdf_mode="native" or page_strategy="single"), not page map-reduce or hybrid mode.

from pydantic import BaseModel

from parsantic import extract_stream
from parsantic.extract import Document, ExtractOptions, MediaOptions


class Invoice(BaseModel):
    vendor: str = ""
    total: float


events = extract_stream(
    Document.from_pdf("invoice.pdf"),
    Invoice,
    model="gemini:gemini-2.5-flash",
    options=ExtractOptions(
        structured_output="native",
        media=MediaOptions(pdf_mode="native", page_strategy="single"),
    ),
)

for event in events:
    if event.is_final:
        print("final:", event.result.value)
    else:
        print("partial:", event.value)

Candidate scoring

When the input is ambiguous, parsantic generates multiple candidate interpretations and picks the one requiring the fewest transformations:

from parsantic import parse_debug

debug = parse_debug('{"title": "Review PR", "priority": "Critical", "days_left": 1}', Task)
for c in debug.candidates:
    print(f"  score={c.score}  flags={c.flags}")
# score=-1  flags=()                    ← direct JSON parse (failed validation)
# score=3   flags=('case_insensitive',) ← coerced "Critical" → Priority.CRITICAL
print(debug.value)
# Task(title='Review PR', priority=<Priority.CRITICAL: 'critical'>, days_left=1, done=False)

Every coercion is tagged with a flag and a cost. You can inspect exactly what happened and why.
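The selection rule can be sketched as "sum the cost of each flag, discard candidates that fail validation, keep the cheapest". The example below is a toy model under stated assumptions: `Candidate`, `FLAG_COST`, and `pick_best` are hypothetical names, and the costs are made up apart from mirroring the score of 3 shown for `case_insensitive` in the output above.

```python
from dataclasses import dataclass

# Hypothetical per-flag costs; the real weights are not given in this document.
FLAG_COST = {"case_insensitive": 3, "string_to_int": 2, "key_normalized": 1}


@dataclass
class Candidate:
    value: dict
    flags: tuple = ()
    valid: bool = True

    @property
    def score(self) -> int:
        # Failed candidates get -1, following the debug output shown above.
        if not self.valid:
            return -1
        return sum(FLAG_COST[flag] for flag in self.flags)


def pick_best(candidates):
    """Return the valid candidate with the lowest total cost."""
    viable = [c for c in candidates if c.valid]
    if not viable:
        raise ValueError("no candidate passed validation")
    return min(viable, key=lambda c: c.score)


candidates = [
    # Direct parse: "Critical" is not an exact enum value, so validation fails.
    Candidate({"title": "Review PR", "priority": "Critical", "days_left": 1}, valid=False),
    # Coerced parse: enum matched case-insensitively.
    Candidate(
        {"title": "Review PR", "priority": "critical", "days_left": 1},
        flags=("case_insensitive",),
    ),
]

best = pick_best(candidates)
print(best.score, best.flags)  # 3 ('case_insensitive',)
```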

Comparison with similar libraries

parsantic focuses on one thing: getting a valid typed object from messy LLM text with the least effort and fewest LLM calls.

Quick comparison

	parsantic	BAML	trustcall	llguidance	LangExtract
Approach	Fix output locally	Fix output locally	Patch via LLM retry	Prevent at token level	LLM extraction pipeline
Handles invalid JSON text	Yes (local repair)	Yes (local repair)	N/A (tool calling)	N/A (constrained decoding)	No (expects valid JSON/YAML; can strip fences)
Streaming	Typed partial objects (Partial model)	Typed partial objects (generated Partial types)	—	Token-level masks	—
Updates	JSON Patch (targeted)	—	JSON Patch (LLM-generated)	—	Re-run extraction
Source grounding	Char/token spans + alignment	—	—	—	Char spans + fuzzy alignment
Schema	Pydantic models	`.baml` DSL	Pydantic / functions / JSON Schema	JSON Schema (subset) / CFG / regex	Example-driven (own format)
Candidate scoring	Weighted flags, inspectable	Scoring heuristics (internal)	—	—	—
Install	`pip install`	`pip install` (+ BAML CLI / codegen)	`pip install`	`pip install` (Rust-backed)	`pip install`
Extra LLM calls on validation failure	0	0	Yes (patch retries; configurable)	0	0

Detailed comparison

Each tool takes a fundamentally different approach to the structured-output problem. Here is how they differ in practice.

BAML — Schema-Aligned Parsing in Rust

BAML is the closest in philosophy: let the LLM generate freely, then fix the output locally. Its Rust-based parser handles the same classes of breakage (markdown fences, trailing commas, wrong-case keys, partial objects) and applies schema-aware coercion with a cost function.

Where BAML goes further:

Dedicated .baml DSL with multi-language code generation (Python, TS, Ruby, Go, etc.)
VS Code playground for live prompt testing
Compact prompt schema (BAML docs claim ~80 % fewer tokens than JSON Schema; varies by schema)
@check / @assert validators on output fields
Dynamic types via TypeBuilder for runtime schema changes
Multi-modal support (images, audio as first-class prompt inputs)
Retry policies with exponential backoff and fallback client chains

Where parsantic goes further:

Pure Python — no DSL, no code generation step
Native Pydantic models as the schema (no new language to learn)
JSON Patch support for targeted updates without full regeneration
Source grounding with character/token-level evidence alignment
Transparent candidate scoring with inspectable flags and costs
Multi-pass extraction with non-overlapping span merging

trustcall — Patch-Based Retry via LangChain

trustcall wraps LLM tool-calling with automatic validation and repair. When a tool call fails Pydantic validation, it asks the LLM to generate JSON Patch operations to fix the error rather than regenerating the entire output.

Where trustcall goes further:

Simultaneous updates and insertions in one pass (enable_inserts=True)
Works with LangChain chat models that support tool calling (broad provider coverage)
Supports Pydantic models, plain functions, JSON Schema, and LangChain tools as input
Graph-based execution with parallel tool-call validation (LangGraph)
Optional deletes + policies for existing docs (enable_deletes, existing_schema_policy)

Where parsantic goes further:

Zero extra LLM calls — all repairs are deterministic and local
Streaming partial objects while the LLM is still generating
Schema-aware coercion (enum matching, type conversion) without LLM involvement
Candidate scoring shows exactly what was changed and why
Source grounding ties extracted values back to source text positions
No LangChain / LangGraph dependency

llguidance — Constrained Decoding at the Token Level

llguidance takes the opposite approach: instead of fixing broken output, it prevents invalid output from being generated. At each decoding step it computes a bitmask of valid tokens and blocks everything else.
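A toy version of the masking step helps show the contrast with post-hoc repair. In the sketch below the "grammar" is just a set of allowed strings and the vocabulary is a handful of made-up tokens; `token_mask` is a hypothetical function and bears no relation to llguidance's actual API or performance.

```python
def token_mask(prefix: str, vocab: list, allowed: set) -> list:
    """Mark each token that keeps prefix + token a prefix of some allowed string."""
    return [any(target.startswith(prefix + tok) for target in allowed) for tok in vocab]


vocab = ["lo", "w", "hi", "gh", "x", "h"]
allowed = {"low", "high"}

print(token_mask("", vocab, allowed))
# [True, False, True, False, False, True]
print(token_mask("lo", vocab, allowed))
# [False, True, False, False, False, False]
print(token_mask("hi", vocab, allowed))
# [False, False, False, True, False, False]
```

A decoder that samples only from tokens marked True can never emit a string outside the allowed set, which is why no repair step is needed afterwards.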

Where llguidance goes further:

Guarantees output that conforms to the provided grammar — no post-processing needed
Supports context-free grammars (Lark-like syntax) beyond JSON Schema
Parametric grammars for combinatorial structures (permutations, unique lists)
~50 μs per token mask for a 128k tokenizer (highly optimized Rust; depends on grammar)
Powers OpenAI Structured Outputs; integrated into vLLM, SGLang, llama.cpp, and Chromium

Where parsantic goes further:

Works with any LLM API — no inference-engine access needed (though llguidance is also available transparently via OpenAI's Structured Outputs API)
Handles output that is already generated (logs, cached responses, tool-call results)
JSON Patch updates, streaming partial objects, source grounding
Candidate scoring with transparent coercion flags
Pure Python, no Rust compilation or special deployment
Handles messy real-world output (markdown, comments, surrounding text) that constrained decoding never produces but APIs frequently return

LangExtract — Extraction Pipeline with Visualization

LangExtract (Google) is an extraction-focused pipeline. It chunks long documents, runs few-shot prompting in parallel, and aligns results back to source text with interactive HTML visualization.

Where LangExtract goes further:

Interactive HTML visualization with hover tooltips and colored highlighting
Native Vertex AI Batch API integration for cost-efficient large-scale extraction
Provider plugin system; Gemini provider supports schema-constrained output
Schema derived from examples (no separate schema definition needed)

Where parsantic goes further:

Local JSON fixing and schema-aware coercion (LangExtract parses JSON/YAML but does not repair invalid JSON)
Streaming partial objects during generation
JSON Patch for targeted document updates
Candidate scoring with inspectable coercion trace
Pydantic models as the schema (type-safe, IDE-friendly)
Works as a standalone parser without any LLM — useful for cached/logged responses

When to use what

Scenario	Recommended tool
Parse messy LLM output into Pydantic models, no extra LLM calls	parsantic
Apply small updates to existing objects without regeneration	parsantic (JSON Patch) or trustcall (LLM-assisted)
Need source-grounded evidence spans from extracted data	parsantic or LangExtract
Guaranteed valid structure via cloud API (no self-hosting)	llguidance (via OpenAI Structured Outputs)
Own the inference engine and want grammar-level control	llguidance
Want a full DSL with code generation, VS Code tooling, and multi-modal	BAML
Production LLM orchestration with retries and fallback chains	BAML
Complex nested schemas that fail standard tool calling	BAML (SAP parsing) or trustcall (patch retries)
Validate and repair multiple tool calls in parallel	trustcall
Large-scale batch extraction with Vertex AI	LangExtract
Streaming typed partial objects during generation	parsantic or BAML

Development

uv sync
make test        # 514 tests
make check       # lint + format
make fmt         # auto-fix

License

Apache-2.0

Project details

Maintainers

elyase

Release history

Version	Released
0.7.3	Mar 13, 2026
0.7.2	Mar 12, 2026
0.7.1	Mar 12, 2026
0.7.0	Mar 10, 2026
0.6.0	Mar 9, 2026
0.5.1	Mar 9, 2026
0.5.0 (this version)	Mar 6, 2026
0.4.3	Mar 6, 2026
0.4.1	Mar 4, 2026
0.4.0	Mar 4, 2026
0.3.0	Feb 24, 2026
0.2.0	Feb 16, 2026
0.1.1	Feb 4, 2026
0.1.0	Feb 4, 2026

Download files

Source Distribution

parsantic-0.5.0.tar.gz (348.6 kB)

Uploaded Mar 6, 2026 Source

Built Distribution

parsantic-0.5.0-py3-none-any.whl (114.2 kB)

Uploaded Mar 6, 2026 Python 3

File details

Details for the file parsantic-0.5.0.tar.gz.

File metadata

Download URL: parsantic-0.5.0.tar.gz
Upload date: Mar 6, 2026
Size: 348.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for parsantic-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`93c2cbac8ad09a3637ccdabf204d75699e6ec02a1a074f4f445b71a0593f830b`
MD5	`a17f50f2e138c7c0a613067cba4229e4`
BLAKE2b-256	`d40a69813849c44a7ce4832f2ae9c83a4db26758eec06022b5813271b073db74`

Provenance

The following attestation bundles were made for parsantic-0.5.0.tar.gz:

Publisher: release.yml on elyase/parsantic

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: parsantic-0.5.0.tar.gz
- Subject digest: 93c2cbac8ad09a3637ccdabf204d75699e6ec02a1a074f4f445b71a0593f830b
- Sigstore transparency entry: 1050648286
- Sigstore integration time: Mar 6, 2026
Source repository:
- Permalink: elyase/parsantic@542b6554536633233fb3fd957a71f87106878f6c
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/elyase
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@542b6554536633233fb3fd957a71f87106878f6c
- Trigger Event: push

File details

Details for the file parsantic-0.5.0-py3-none-any.whl.

File metadata

Download URL: parsantic-0.5.0-py3-none-any.whl
Upload date: Mar 6, 2026
Size: 114.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for parsantic-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7d52789c455c138dd64448f131fd5032e15183e7d0a28f9e9f4f7a4183fa941b`
MD5	`b0715a65dc960f6d2b3092f24e835877`
BLAKE2b-256	`43ebe549e4cac6c82d4ac9f7ccdd55be86391aa25a5f09362debc0ed27a26214`

Provenance

The following attestation bundles were made for parsantic-0.5.0-py3-none-any.whl:

Publisher: release.yml on elyase/parsantic

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: parsantic-0.5.0-py3-none-any.whl
- Subject digest: 7d52789c455c138dd64448f131fd5032e15183e7d0a28f9e9f4f7a4183fa941b
- Sigstore transparency entry: 1050648289
- Sigstore integration time: Mar 6, 2026
Source repository:
- Permalink: elyase/parsantic@542b6554536633233fb3fd957a71f87106878f6c
- Branch / Tag: refs/tags/v0.5.0
- Owner: https://github.com/elyase
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@542b6554536633233fb3fd957a71f87106878f6c
- Trigger Event: push

parsantic 0.5.0
