Deterministic structured extraction from noisy LLM and OCR output. Zero LLM round-trips. Works with msgspec, Pydantic v2, and dataclasses.

These details have not been verified by PyPI

Project description

confident-extract — Structured Extraction from LLM Output

Zero LLM round-trips. Microsecond latency. Confidence score on every result.

confident-extract deterministically repairs malformed JSON from LLM or OCR output and validates it against your schema, with no extra model calls, no network, and no randomness. It works with msgspec, Pydantic v2, and standard Python dataclasses.

from confident_extract import extract

result = extract(llm_output_text, Invoice)

result.data               # Invoice — fully typed and validated
result.confidence.score   # 0.95 — how clean was the input?
result.confidence.label   # "high"
result.strategy_trace     # ("remove_trailing_commas",)
result.latency_ms         # 0.074 (milliseconds, i.e. about 74 µs)

Why confident-extract?
Install
Quickstart with Anthropic
Quickstart with OpenAI
Schema support: msgspec, Pydantic, dataclass
What the repair engine fixes
Confidence scoring
Batch extraction
Async API
Confidence-based routing and fallback
Custom repair strategies
Performance
How it compares
Architecture
FAQ
GitHub topics

Why confident-extract?

When you ask an LLM to return JSON, the output is usually valid or nearly so. When it breaks, it breaks in a handful of predictable ways that can be repaired deterministically:

Trailing commas: {"id": 1,}
Single quotes: {'id': 1}
Bare keys: {id: 1}
Python literals: {"active": True}
Comments: {"id": 1 // primary key}
JSON buried in prose: "Here is the data: {...} — done."

You have three choices:

Approach	Extra cost	Latency	Reliability
Retry with structured-output prompt	+1 LLM call ($$$)	+seconds	Good
Parse and hope (`json.loads`)	zero	~0 µs	Fragile
confident-extract	zero	7–400 µs (p50)	High + scored

confident-extract is the first pass — deterministic, offline, typed, and scored. If the confidence is too low, then retry.
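
To make the "parse and hope" row concrete, the sketch below feeds one sample of each failure mode listed above to the standard library's `json.loads` and records which ones it rejects. It uses only the stdlib and does not call confident-extract; the prose sample is filled in with a small object purely for illustration.

```python
import json

# One sample per failure mode from the list above.
SAMPLES = {
    "trailing comma": '{"id": 1,}',
    "single quotes": "{'id': 1}",
    "bare keys": "{id: 1}",
    "python literal": '{"active": True}',
    "comment": '{"id": 1 // primary key\n}',
    "prose": 'Here is the data: {"id": 1} - done.',
}


def strict_parse_failures(samples: dict[str, str]) -> list[str]:
    """Return the labels of the samples that json.loads rejects."""
    failed = []
    for label, text in samples.items():
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failed.append(label)
    return failed


print(strict_parse_failures(SAMPLES))
```

Every sample is rejected, which is why a strict parser alone is a fragile first pass.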

Install

# Core (msgspec + orjson only — no LLM SDK required)
pip install confident-extract

# With Pydantic v2 support
pip install "confident-extract[pydantic]"

# With Anthropic adapter (extract directly from Message objects)
pip install "confident-extract[anthropic]"

# With OpenAI adapter (extract directly from ChatCompletion objects)
pip install "confident-extract[openai]"

Quickstart with Anthropic

import anthropic
import msgspec
from confident_extract.providers.anthropic import extract_from_response


class Invoice(msgspec.Struct):
    invoice_id: int
    vendor: str
    total_cents: int
    paid: bool


client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Return the invoice JSON for order #42."}],
)

result = extract_from_response(message, Invoice)

print(result.data.invoice_id)        # 42
print(result.data.paid)              # True
print(result.confidence.label)       # "high"
print(result.confidence.score)       # 1.0 (no repair, so no penalty)
print(result.repair_applied)         # False: model returned clean JSON
print(result.strategy_trace)         # ()

Async version:

from confident_extract.providers.anthropic import extract_from_response_async
result = await extract_from_response_async(message, Invoice)

Quickstart with OpenAI

import openai
from pydantic import BaseModel
from confident_extract.providers.openai import extract_from_response


class Invoice(BaseModel):
    invoice_id: int
    vendor: str
    total_cents: int


client = openai.OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Return the invoice JSON for order #42."}],
)

result = extract_from_response(completion, Invoice)

print(result.data.vendor)
print(result.confidence.score)

Schema support

extract() accepts three schema types with no extra configuration.

msgspec.Struct — fastest, zero-allocation validation

import msgspec
from confident_extract import extract


class LineItem(msgspec.Struct):
    sku: str
    quantity: int
    unit_price_cents: int


result = extract('{"sku": "ABC-1", "quantity": 3, "unit_price_cents": 999}', LineItem)
assert result.data.sku == "ABC-1"

Pydantic v2 BaseModel — most popular

from pydantic import BaseModel, field_validator
from confident_extract import extract


class LineItem(BaseModel):
    sku: str
    quantity: int
    unit_price_cents: int

    @field_validator("quantity")
    @classmethod
    def must_be_positive(cls, v: int) -> int:
        if v <= 0:
            raise ValueError("quantity must be positive")
        return v


result = extract('{"sku": "ABC-1", "quantity": 3, "unit_price_cents": 999}', LineItem)
assert result.data.quantity == 3

Python dataclass — standard library

from dataclasses import dataclass
from confident_extract import extract


@dataclass
class LineItem:
    sku: str
    quantity: int
    unit_price_cents: int


result = extract('{"sku": "ABC-1", "quantity": 3, "unit_price_cents": 999}', LineItem)
assert result.data.sku == "ABC-1"

What the repair engine fixes

All repairs are deterministic. No LLM involved. Each strategy is applied in order and skipped if it does not mutate the input. The engine stops at the first clean parse.

Problem	Input example	Strategy
JSON buried in prose	`"Result: {...} — done."`	`extract_json_from_prose`
C-style comments	`{"id": 1 // key\n}`	`strip_json_comments`
Python literals	`{"active": True, "val": None}`	`fix_python_literals`
Trailing commas	`{"a": 1, "b": [1, 2,],}`	`remove_trailing_commas`
Truncated JSON	`{"id": 1, "name": "Ac`	`close_unterminated_json`
Single-quoted strings	`{'key': 'value'}`	`normalize_single_quotes`
Bare/unquoted keys	`{id: 1, name: "Acme"}`	`repair_unquoted_keys`
Markdown code fences	```json\n{...}\n```	preprocessor
Escaped JSON strings	`"{\"id\": 1}"`	preprocessor
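
The control flow described above (apply strategies in order, skip any that do not mutate the input, stop at the first clean parse) can be sketched with the standard library alone. This is an illustration, not the library's implementation: the two regex-based strategies are deliberately naive stand-ins that borrow the names from the table, and they do not protect text inside string values.

```python
import json
import re
from typing import Any, Callable

Strategy = Callable[[str], str]


def fix_python_literals(text: str) -> str:
    # Naive stand-in: also rewrites True/False/None inside string values.
    text = re.sub(r"\bTrue\b", "true", text)
    text = re.sub(r"\bFalse\b", "false", text)
    return re.sub(r"\bNone\b", "null", text)


def remove_trailing_commas(text: str) -> str:
    # Naive stand-in: drops a comma that directly precedes } or ].
    return re.sub(r",\s*([}\]])", r"\1", text)


STRATEGIES: list[tuple[str, Strategy]] = [
    ("fix_python_literals", fix_python_literals),
    ("remove_trailing_commas", remove_trailing_commas),
]


def repair(text: str) -> tuple[Any, tuple[str, ...]]:
    """Parse text, applying strategies in order until it parses cleanly."""
    try:
        return json.loads(text), ()  # fast path: already valid
    except json.JSONDecodeError:
        pass
    trace: list[str] = []
    for name, strategy in STRATEGIES:
        candidate = strategy(text)
        if candidate == text:
            continue  # strategy did not mutate the input: skip it
        text = candidate
        trace.append(name)
        try:
            return json.loads(text), tuple(trace)  # stop at first clean parse
        except json.JSONDecodeError:
            continue
    raise ValueError(f"could not repair input; strategies tried: {trace}")


payload, trace = repair('{"a": 1, "b": [1, 2,], "ok": True,}')
print(payload, trace)
```

Only strategies that changed the text end up in the trace, which is what makes the trace usable as an input to a confidence score.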

Malformed JSON repair example

from confident_extract import extract
import msgspec


class Invoice(msgspec.Struct):
    invoice_id: int
    status: str
    total_cents: int


# Bare keys + single quotes + trailing comma: all repaired in one extract() call
raw = "{invoice_id: 99, status: 'paid', total_cents: 4999,}"
result = extract(raw, Invoice)

assert result.data.status == "paid"
assert result.repair_applied is True
assert result.repair_attempts == 3  # one per strategy that mutated the input
assert "normalize_single_quotes" in result.strategy_trace
assert "repair_unquoted_keys" in result.strategy_trace

Prose-wrapped JSON

raw = """
I analyzed the order and here are the details:

{"invoice_id": 42, "status": "shipped", "total_cents": 5000}

Let me know if you need anything else.
"""
result = extract(raw, Invoice)
assert result.data.invoice_id == 42
assert result.strategy_trace[0] == "extract_json_from_prose"
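
For intuition, pulling JSON out of surrounding prose can be approximated with the standard library's `json.JSONDecoder.raw_decode`, which parses a value starting at a given index and ignores whatever follows it. The helper name `first_json_value` is hypothetical, and this sketch is not how `extract_json_from_prose` is implemented; it only handles JSON that is already valid once located.

```python
import json
from typing import Any


def first_json_value(text: str) -> Any:
    """Return the first JSON object or array embedded in text."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            value, _end = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; keep scanning
        return value
    raise ValueError("no JSON object or array found")


raw = """
I analyzed the order and here are the details:

{"invoice_id": 42, "status": "shipped", "total_cents": 5000}

Let me know if you need anything else.
"""
print(first_json_value(raw))
```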

Confidence scoring

Every ExtractionResult carries a ConfidenceScore computed from which repair strategies fired and their severity.

result.confidence.score          # float, clamped to 0.10–1.0 (1.0 = no repair needed)
result.confidence.label          # "high" (≥0.8) / "medium" (≥0.5) / "low" (<0.5)
result.confidence.repair_penalty # total deduction from 1.0
result.strategy_trace            # ("normalize_single_quotes", "repair_unquoted_keys")

Strategy fired	Confidence penalty
`extract_json_from_prose`	−0.20
`close_unterminated_json`	−0.15
`normalize_single_quotes`	−0.10
`repair_unquoted_keys`	−0.10
`fix_python_literals`	−0.08
`strip_json_comments`	−0.05
`remove_trailing_commas`	−0.05
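
As a worked example of the arithmetic, the sketch below recomputes a score from a strategy trace using the penalties in the table, the 0.8 and 0.5 label thresholds, and the [0.10, 1.0] clamp shown in the Architecture section. It is a reimplementation for illustration, not the library's code. Two choices here are assumptions: unknown (custom) strategy names contribute no penalty, and the score is rounded to two decimals to avoid floating-point noise.

```python
PENALTIES = {
    "extract_json_from_prose": 0.20,
    "close_unterminated_json": 0.15,
    "normalize_single_quotes": 0.10,
    "repair_unquoted_keys": 0.10,
    "fix_python_literals": 0.08,
    "strip_json_comments": 0.05,
    "remove_trailing_commas": 0.05,
}


def compute_confidence(trace: tuple[str, ...]) -> tuple[float, str]:
    """Return (score, label) for the strategies that fired."""
    # Assumption: names missing from the table carry no penalty.
    penalty = sum(PENALTIES.get(name, 0.0) for name in trace)
    score = round(min(1.0, max(0.10, 1.0 - penalty)), 2)
    if score >= 0.8:
        label = "high"
    elif score >= 0.5:
        label = "medium"
    else:
        label = "low"
    return score, label


print(compute_confidence(("remove_trailing_commas",)))
print(compute_confidence(("normalize_single_quotes", "repair_unquoted_keys")))
```

Under these rules a single trailing-comma repair scores 0.95 ("high"), while a trace of single quotes plus bare keys lands exactly on the 0.8 boundary.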

Use confidence to decide whether to accept, retry, or escalate:

result = extract(text, Invoice)
if result.confidence.label == "high":
    store(result.data)
elif result.confidence.label == "medium":
    store_with_flag(result.data)
else:
    queue_for_human_review(result)

Batch extraction

Process many texts in parallel with a thread pool:

from confident_extract import extract_batch

texts = [msg1.content[0].text, msg2.content[0].text, msg3.content[0].text]
results = extract_batch(texts, Invoice)

for r in results:
    print(r.data.invoice_id, r.confidence.label)

Pass ordered=False for slightly lower latency on uneven workloads:

results = extract_batch(texts, Invoice, ordered=False, max_workers=8)
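
The trade-off behind `ordered` can be sketched with `concurrent.futures` from the standard library. This assumes that unordered mode yields results in completion order, which the text above does not spell out; `parse_batch` is a hypothetical helper, and `json.loads` stands in for the per-item extraction work.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any


def parse_batch(texts: list[str], ordered: bool = True, max_workers: int = 8) -> list[Any]:
    """Parse every text in a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        if ordered:
            # map() yields results in input order, waiting on slow items.
            return list(pool.map(json.loads, texts))
        # as_completed() yields each result as soon as it is ready.
        futures = [pool.submit(json.loads, text) for text in texts]
        return [future.result() for future in as_completed(futures)]


texts = ['{"id": 1}', '{"id": 2}', '{"id": 3}']
print(parse_batch(texts))
print(sorted(item["id"] for item in parse_batch(texts, ordered=False)))
```

In unordered mode the caller needs some key inside each result (here `id`) to match outputs back to inputs.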

Extract lists of items in bulk:

from confident_extract import extract_batch_list

results = extract_batch_list(array_texts, LineItem)
# results: list[ExtractionResult[list[LineItem]]]

Async API

All functions have async equivalents that offload work to a thread pool:

from confident_extract import extract_async, extract_list_async, extract_batch_async

# Single async extraction
result = await extract_async(text, Invoice)

# Async list extraction
result = await extract_list_async(array_text, LineItem)

# Async batch with concurrency cap
results = await extract_batch_async(texts, Invoice, max_concurrency=16)
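
A minimal sketch of the pattern described above, offloading blocking work to a thread and capping concurrency, using only `asyncio`. The names `parse_async` and `parse_batch_async` are hypothetical, `json.loads` stands in for the extraction call, and the semaphore is one plausible way to implement a cap like `max_concurrency`, not a statement about the library's internals.

```python
import asyncio
import json
from typing import Any


async def parse_async(text: str) -> Any:
    """Run the blocking parse in a worker thread."""
    return await asyncio.to_thread(json.loads, text)


async def parse_batch_async(texts: list[str], max_concurrency: int = 16) -> list[Any]:
    """Parse all texts, with at most max_concurrency in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(text: str) -> Any:
        async with semaphore:
            return await parse_async(text)

    # gather() returns results in the order the awaitables were passed.
    return await asyncio.gather(*(bounded(text) for text in texts))


results = asyncio.run(parse_batch_async(['{"id": 1}', '{"id": 2}'], max_concurrency=1))
print(results)
```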

Confidence-based routing and fallback

extract_with_routing — automatic fallback on low confidence

from confident_extract import extract, extract_with_routing, RoutingConfig

def reprompt(result):
    """Re-prompt the model with a stricter instruction."""
    # my_llm_client is a placeholder for your own LLM client wrapper
    new_text = my_llm_client.ask_again(result.raw_input)
    return extract(new_text, Invoice)

config = RoutingConfig(min_confidence=0.8, on_low_confidence=reprompt)
result = extract_with_routing(text, Invoice, config=config)

Raise on low confidence

from confident_extract import extract_with_routing, LowConfidenceError, RoutingConfig

config = RoutingConfig(min_confidence=0.7, raise_on_low_confidence=True)
try:
    result = extract_with_routing(text, Invoice, config=config)
except LowConfidenceError as e:
    print(e.result.strategy_trace)
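
The routing decision itself is small enough to sketch in plain Python. Everything below is hypothetical and independent of the library: `Scored` stands in for an extraction result, `LowConfidence` for the error type, and `route` for the routing function. Two points are assumptions: a fallback callback takes precedence over raising, and a score equal to the threshold is accepted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Scored:
    """Hypothetical stand-in for an extraction result."""
    data: dict
    score: float


class LowConfidence(Exception):
    """Raised when a result falls below the threshold and raising is enabled."""

    def __init__(self, result: Scored) -> None:
        super().__init__(f"confidence {result.score:.2f} is below the threshold")
        self.result = result


def route(
    result: Scored,
    min_confidence: float,
    on_low: Optional[Callable[[Scored], Scored]] = None,
    raise_on_low: bool = False,
) -> Scored:
    """Accept a confident result; otherwise fall back, raise, or pass through."""
    if result.score >= min_confidence:
        return result
    if on_low is not None:
        return on_low(result)  # assumption: the callback wins over raising
    if raise_on_low:
        raise LowConfidence(result)
    return result


low = Scored({"id": 2}, 0.40)
print(route(low, 0.8, on_low=lambda r: Scored(r.data, 1.0)))
```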

filter_by_confidence — split a batch

from confident_extract import filter_by_confidence

confident, uncertain = filter_by_confidence(results, min_score=0.8)
process(confident)
review_queue.extend(uncertain)
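
Conceptually, splitting a batch is a two-way partition on the score. The sketch below uses a hypothetical `Scored` tuple instead of real extraction results and assumes the threshold is inclusive, matching the "≥" convention used for confidence labels.

```python
from typing import NamedTuple


class Scored(NamedTuple):
    """Hypothetical stand-in for an extraction result."""
    name: str
    score: float


def split_by_score(results: list[Scored], min_score: float) -> tuple[list[Scored], list[Scored]]:
    """Partition results into (at or above min_score, below min_score)."""
    confident = [r for r in results if r.score >= min_score]
    uncertain = [r for r in results if r.score < min_score]
    return confident, uncertain


batch = [Scored("a", 1.0), Scored("b", 0.75), Scored("c", 0.8)]
confident, uncertain = split_by_score(batch, min_score=0.8)
print([r.name for r in confident], [r.name for r in uncertain])
```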

Custom repair strategies

from confident_extract import register_strategy

def fix_nan(text: str) -> str:
    """Replace JavaScript NaN with JSON null."""
    if "NaN" not in text:
        return text  # return unchanged to skip — engine checks string identity
    return text.replace(": NaN", ": null").replace(":NaN", ":null")

register_strategy("fix_nan", fix_nan)
# All future extract() calls will try fix_nan after built-in strategies

Custom strategies appear in result.strategy_trace with the name you registered. Use unregister_strategy("fix_nan") to remove, or list_strategies() to inspect.
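
Before registering a strategy, it can help to test it in isolation. The sketch below repeats the `fix_nan` logic from above and checks it against a strict standard-library parser. Python's `json.loads` accepts `NaN` by default, so `parse_constant` is used to reject it, which is closer to how a strict parser behaves. The helper names `reject_constant` and `strict_loads` are hypothetical.

```python
import json
from typing import Any


def fix_nan(text: str) -> str:
    """Replace JavaScript NaN with JSON null (same logic as above)."""
    if "NaN" not in text:
        return text
    return text.replace(": NaN", ": null").replace(":NaN", ":null")


def reject_constant(name: str) -> None:
    # json.loads accepts NaN and Infinity by default; refuse them here.
    raise ValueError(f"non-standard JSON constant: {name}")


def strict_loads(text: str) -> Any:
    return json.loads(text, parse_constant=reject_constant)


raw = '{"reading": NaN, "unit": "C"}'
try:
    strict_loads(raw)
    parsed_before = True
except ValueError:
    parsed_before = False

repaired = fix_nan(raw)
print(parsed_before, strict_loads(repaired))
```

One caveat a test like this surfaces: the plain `str.replace` also rewrites `: NaN` when it appears inside a string value, so `'{"note": "ratio: NaN"}'` becomes `'{"note": "ratio: null"}'`.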

ExtractionResult reference

@dataclass(frozen=True, slots=True)
class ExtractionResult(Generic[T]):
    data: T                          # Validated schema instance
    repair_applied: bool             # Whether any strategy mutated the input
    repair_attempts: int             # Number of strategies that mutated
    raw_input: str                   # Original text as received
    repaired_text: str               # Text after repair, before validation
    latency_ms: float                # End-to-end wall-clock time in ms
    confidence: ConfidenceScore      # score, label, repair_penalty
    strategy_trace: tuple[str, ...]  # Names of strategies that fired, in order

Performance

All measurements on Apple M-series, Python 3.13, May 2026.

Scenario	p50	p99	Throughput
`preprocess()` — already-valid JSON	`1.1 µs`	`1.5 µs`	`890k ops/s`
`preprocess()` — fenced ~10KB	`4.3 µs`	`13 µs`	`215k ops/s`
`repair()` — valid fast path	`6.2 µs`	`31 µs`	`145k ops/s`
`repair()` — trailing comma	`124 µs`	`328 µs`	`7.5k ops/s`
`repair()` — multi-strategy	`611 µs`	`966 µs`	`1.7k ops/s`
`extract()` — valid fast path	`7.4 µs`	`14 µs`	`123k ops/s`
`extract()` — trailing comma	`74 µs`	`173 µs`	`13k ops/s`
`extract()` — multi-strategy	`407 µs`	`821 µs`	`2.2k ops/s`
`extract()` — ~10KB payload	`93 µs`	`168 µs`	`10.5k ops/s`

The preprocessor fast path skips fence-stripping for bare JSON input, cutting overhead by ~50% on the hot path.

Run benchmarks:

python -m pytest benchmarks/ --benchmark-sort=mean
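
For a quick sanity check of p50/p99 figures on your own hardware without the benchmark suite, latencies can be sampled with the standard library. This is a rough sketch, not the project's benchmark harness: `measure_us` and `p50_p99` are hypothetical helpers, and `json.loads` stands in for the function under test.

```python
import json
import statistics
import time
from typing import Any, Callable


def measure_us(func: Callable[[Any], Any], arg: Any, iterations: int = 2000) -> list[float]:
    """Time func(arg) repeatedly; return per-call latencies in microseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        func(arg)
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return samples


def p50_p99(samples: list[float]) -> tuple[float, float]:
    """Return the 50th and 99th percentiles of the samples."""
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points
    return cuts[49], cuts[98]


p50, p99 = p50_p99(measure_us(json.loads, '{"id": 1, "name": "Acme"}'))
print(f"p50={p50:.2f} µs  p99={p99:.2f} µs")
```

Numbers from a loop like this include timer overhead and are noisier than a dedicated benchmark run, so treat them as an order-of-magnitude check.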

How it compares to alternatives

	confident-extract	instructor	guardrails	json-repair
Extra LLM calls to fix bad JSON	0	1–N	1–N	0
Confidence score on output	✓	—	—	—
msgspec.Struct support	✓	—	—	—
Pydantic v2 support	✓	✓	✓	—
Dataclass support	✓	—	—	—
Async API	✓	✓	partial	—
Batch extraction	✓	—	—	—
Confidence routing / fallback	✓	—	—	—
Custom repair strategies	✓	—	—	—
Works offline	✓	—	—	✓
Provider adapters (Anthropic, OpenAI)	✓	✓	✓	—
Core deps: only orjson + msgspec	✓	—	—	✓

Recommended pattern: use confident-extract as the first pass. If confidence.label == "low", optionally retry with instructor or a structured-output prompt. This eliminates retry costs for the ~80–90% of outputs that are clean or lightly malformed.

Architecture

raw input text
      │
      ▼
 preprocess(text)           strip fences · CRLF · unwrap escaped JSON
      │                     fast-path: skips fence work for bare JSON
      ▼
 repair(preprocessed)       try_orjson_parse → on failure, apply in order:
      │                       1. extract_json_from_prose
      │                       2. strip_json_comments
      │                       3. fix_python_literals
      │                       4. remove_trailing_commas
      │                       5. close_unterminated_json
      │                       6. normalize_single_quotes
      │                       7. repair_unquoted_keys
      │                       + any custom registered strategies
      ▼
 validate(payload, schema)  auto-detect schema type → route:
      │                       msgspec.Struct    → msgspec.convert (strict)
      │                       pydantic.BaseModel → model_validate
      │                       dataclass          → msgspec.convert (lenient)
      ▼
 compute_confidence(trace)  score = 1.0 − Σ(per-strategy penalties)
      │                     capped at [0.10, 1.0], labeled "high"/"medium"/"low"
      ▼
 ExtractionResult[T]
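
The first stage of the diagram can be approximated in a few lines. The sketch below is an illustration under assumptions, not the library's preprocessor: it normalizes CRLF, returns early for bare JSON, strips a single surrounding Markdown fence, and unwraps JSON that arrived as an escaped string literal. The fence marker is built with `"`" * 3` only so that the example does not contain a literal fence.

```python
import json
import re

TICKS = "`" * 3  # Markdown code-fence marker
FENCE = re.compile(rf"^{TICKS}[\w-]*[ \t]*\n(.*?)\n?[ \t]*{TICKS}$", re.DOTALL)


def preprocess(text: str) -> str:
    """Strip fences, normalize line endings, unwrap escaped JSON."""
    text = text.replace("\r\n", "\n").strip()
    if text.startswith(("{", "[")):
        return text  # fast path: bare JSON needs no fence or unwrap work
    match = FENCE.match(text)
    if match:
        text = match.group(1).strip()
    if text.startswith('"'):
        try:
            inner = json.loads(text)  # decode the outer string literal
        except json.JSONDecodeError:
            return text
        if isinstance(inner, str) and inner.lstrip().startswith(("{", "[")):
            return inner.strip()
    return text


fenced = TICKS + 'json\n{"id": 1}\n' + TICKS
print(preprocess(fenced))
print(preprocess('"{\\"id\\": 1}"'))
```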

FAQ

How do I extract structured data from LLM output in Python? pip install confident-extract then result = extract(llm_text, MySchema). Works with msgspec, Pydantic, and dataclasses.

Does it work without an internet connection? Yes. The entire pipeline runs in-process with no network calls.

What is the difference between confident-extract and instructor? instructor retries the model when JSON is invalid, spending another LLM call. confident-extract repairs the JSON deterministically in microseconds with no extra cost. Use both together: confident-extract as the fast first pass, instructor as the fallback on confidence.label == "low".

What is the difference between confident-extract and json-repair? json-repair returns a fixed string. confident-extract validates the fixed string against your schema, returns a typed object, and gives you a confidence score.

Does it support async? Yes: await extract_async(text, schema), await extract_batch_async(texts, schema).

How do I handle batch LLM responses? results = extract_batch(list_of_texts, Invoice) — runs in a thread pool. Async: await extract_batch_async(texts, Invoice).

How do I add my own JSON repair logic? register_strategy("my_fix", my_fn) — custom strategies run after built-ins and appear in strategy_trace.

GitHub topics

To maximize GitHub discoverability, the repository uses these topics:

llm · json-repair · structured-extraction · msgspec · pydantic · orjson · schema-validation · anthropic · openai · information-extraction · nlp · python · async · confidence-scoring · json-parsing

Development

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pre-commit install

Quality gates:

ruff check .
mypy .
pytest
pytest benchmarks/ --benchmark-sort=mean

See CONTRIBUTING.md for contributor expectations and release steps.

Citation

If you use confident-extract in research, please cite it using the CITATION.cff file. GitHub renders a "Cite this repository" button automatically.

Built by Hitarth Desai. MIT License.

Release history

This version

0.1.0

Jun 22, 2026

0.1.0a1 pre-release

May 12, 2026

Download files

Source Distribution

confident_extract-0.1.0.tar.gz (45.4 kB), uploaded Jun 22, 2026, Source

Built Distribution

confident_extract-0.1.0-py3-none-any.whl (32.0 kB), uploaded Jun 22, 2026, Python 3

File details

Details for the file confident_extract-0.1.0.tar.gz.

File metadata

Download URL: confident_extract-0.1.0.tar.gz
Upload date: Jun 22, 2026
Size: 45.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for confident_extract-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`df2f06609fe0ec8a7f28f174ca376b5264a5eefc476c062208b8f09551e0d88d`
MD5	`fc4ea5445da4da2c154216f59e56d875`
BLAKE2b-256	`b96928390926e348b3928cd1cce2d82306eb4c905bfe3fafad63734e2631f775`

Provenance

The following attestation bundles were made for confident_extract-0.1.0.tar.gz:

Publisher: publish.yml on hitarthbuilds/confident-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: confident_extract-0.1.0.tar.gz
- Subject digest: df2f06609fe0ec8a7f28f174ca376b5264a5eefc476c062208b8f09551e0d88d
- Sigstore transparency entry: 1907831753
- Sigstore integration time: Jun 22, 2026
Source repository:
- Permalink: hitarthbuilds/confident-extract@76d23156e63996de2eb7d2abc2a683071e3b7a34
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/hitarthbuilds
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@76d23156e63996de2eb7d2abc2a683071e3b7a34
- Trigger Event: release

File details

Details for the file confident_extract-0.1.0-py3-none-any.whl.

File metadata

Download URL: confident_extract-0.1.0-py3-none-any.whl
Upload date: Jun 22, 2026
Size: 32.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for confident_extract-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9e6ce856d068bce7c0cb8c0b1b3a3c4989cbfb560e64384cae71ccc95287587b`
MD5	`9a8a8c65518cbdcd103106b93b13c419`
BLAKE2b-256	`4ba13a89bba5d769a9e3bb73c4d93a6ea1f7068c4b667b7b3edc03e4e99073dc`

Provenance

The following attestation bundles were made for confident_extract-0.1.0-py3-none-any.whl:

Publisher: publish.yml on hitarthbuilds/confident-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: confident_extract-0.1.0-py3-none-any.whl
- Subject digest: 9e6ce856d068bce7c0cb8c0b1b3a3c4989cbfb560e64384cae71ccc95287587b
- Sigstore transparency entry: 1907831851
- Sigstore integration time: Jun 22, 2026
Source repository:
- Permalink: hitarthbuilds/confident-extract@76d23156e63996de2eb7d2abc2a683071e3b7a34
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/hitarthbuilds
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@76d23156e63996de2eb7d2abc2a683071e3b7a34
- Trigger Event: release
