Deterministic structured extraction from noisy JSON-like model output.
Project description
confident-extract
confident-extract is a small Python library for deterministic structured extraction from noisy JSON-like model output.
The current public alpha surface is synchronous and msgspec-first:
from confident_extract import extract
- deterministic preprocessing and JSON repair
- strict msgspec.Struct validation
- lightweight result metadata around the validated output
Project overview
The library is built for the common case where an upstream model or OCR system returns JSON-like text that is close to valid, but not always valid enough to parse or validate directly.
The current sync pipeline is:
- preprocess raw text
- repair malformed JSON conservatively
- validate against a msgspec.Struct schema
- return a typed ExtractionResult
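The stages above can be pictured with standard-library stand-ins. This is only a sketch: the helper names `strip_fences`, `drop_trailing_commas`, `check_fields`, and `run_pipeline` are hypothetical, the regexes are deliberately naive, and the real library validates with msgspec rather than hand-written checks.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable


def strip_fences(text: str) -> str:
    """Stage 1 stand-in: trim whitespace and unwrap one surrounding code fence."""
    text = text.strip()
    match = re.fullmatch(FENCE + r"[\w-]*\s*\n(.*?)\s*" + FENCE, text, flags=re.DOTALL)
    return match.group(1) if match else text


def drop_trailing_commas(text: str) -> str:
    """Stage 2 stand-in: remove commas that sit directly before } or ]."""
    return re.sub(r",\s*([}\]])", r"\1", text)


def check_fields(payload: dict, required: dict[str, type]) -> dict:
    """Stage 3 stand-in: require each key to be present with the expected type."""
    for key, expected in required.items():
        if not isinstance(payload.get(key), expected):
            raise ValueError(f"field {key!r} is missing or not {expected.__name__}")
    return payload


def run_pipeline(raw: str, required: dict[str, type]) -> dict:
    """Stage 4 stand-in: return the checked payload plus a repair flag."""
    cleaned = strip_fences(raw)
    try:
        payload = json.loads(cleaned)
        repair_applied = False
    except json.JSONDecodeError:
        # Only attempt a repair when the text does not parse as-is.
        payload = json.loads(drop_trailing_commas(cleaned))
        repair_applied = True
    return {"data": check_fields(payload, required), "repair_applied": repair_applied}
```

Trying the repair only after a direct parse fails keeps already-valid input on a cheap path, which is one plausible reading of "conservative" repair.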
The package does not currently include provider adapters, retries, async APIs, streaming, confidence scoring, or a pydantic bridge.
Install
Install the published package:
python -m pip install confident-extract
Install for local development:
python -m pip install -e ".[dev]"
Quickstart example
import msgspec
from confident_extract import extract

class Invoice(msgspec.Struct):
    invoice_id: int
    status: str
    total_cents: int

result = extract(
    text='{"invoice_id": 42, "status": "paid", "total_cents": 1999}',
    schema=Invoice,
)

assert result.data == Invoice(invoice_id=42, status="paid", total_cents=1999)
assert result.repair_applied is False
Malformed JSON repair example
import msgspec
from confident_extract import extract

class Invoice(msgspec.Struct):
    invoice_id: int
    status: str
    total_cents: int

raw = "{invoice_id: 42, status: 'paid', total_cents: 1999,}"
result = extract(text=raw, schema=Invoice)

assert result.data.status == "paid"
assert result.repair_applied is True
assert result.repaired_text == (
    '{"invoice_id": 42, "status": "paid", "total_cents": 1999}'
)
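One way to picture the repair in this example is as three small rewrites: quote bare keys, convert single-quoted strings, and drop trailing commas. The sketch below is a naive, regex-based illustration that uses only the standard library. `naive_repair` is a hypothetical helper, not the library's repair code, and it would mangle inputs where these characters appear inside string values.

```python
import json
import re


def naive_repair(text: str) -> str:
    """Apply three naive rewrites, then round-trip through json to normalize."""
    # Quote bare keys: {invoice_id: becomes {"invoice_id":
    text = re.sub(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)', r'\1"\2"\3', text)
    # Convert single-quoted strings that contain no quotes or backslashes.
    text = re.sub(r"'([^'\"\\]*)'", r'"\1"', text)
    # Drop commas that sit directly before a closing brace or bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # json.loads raises if the text is still invalid; json.dumps normalizes spacing.
    return json.dumps(json.loads(text))


raw = "{invoice_id: 42, status: 'paid', total_cents: 1999,}"
repaired = naive_repair(raw)
```

A production repairer would need to track string boundaries rather than rely on regexes, which is why this sketch should be read as an illustration of the idea only.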
Nested schema example
import msgspec
from confident_extract import extract

class Contact(msgspec.Struct):
    email: str
    phone: str | None = None

class Customer(msgspec.Struct):
    name: str
    contact: Contact

class Invoice(msgspec.Struct):
    invoice_id: int
    customer: Customer
    tags: list[str]

raw = """
{
  "invoice_id": 7,
  "customer": {
    "name": "Acme",
    "contact": {"email": "ops@example.com", "phone": "123"}
  },
  "tags": ["paid", "net30"]
}
"""

result = extract(text=raw, schema=Invoice)

assert result.data.customer.contact.email == "ops@example.com"
assert result.data.tags == ["paid", "net30"]
Benchmark snapshot
Current local measurements were captured on May 12, 2026 with Python 3.13.5 using:
python -m pytest benchmarks/test_extract_benchmarks.py --benchmark-json /tmp/confident_extract_benchmarks.json
These are local measurements only. They are useful for regression tracking, not public performance claims.
| Path | Scenario | p50 | p99 | Throughput |
|---|---|---|---|---|
| preprocess() | already-valid JSON | 2.17 us | 2.46 us | 462k ops/s |
| preprocess() | fenced ~10KB payload | 4.25 us | 13.04 us | 215k ops/s |
| repair() | valid fast path | 6.21 us | 31.25 us | 145k ops/s |
| repair() | trailing comma repair | 123.50 us | 328.29 us | 7.5k ops/s |
| repair() | multi-strategy repair | 611.38 us | 965.96 us | 1.7k ops/s |
| validate_with_msgspec() | nested decoded payload | 3.00 us | 3.21 us | 333k ops/s |
| validate_with_msgspec() | ~10KB decoded payload | 27.75 us | 39.00 us | 34.6k ops/s |
| extract() | valid fast path | 7.42 us | 13.88 us | 123k ops/s |
| extract() | trailing comma repair | 73.50 us | 172.63 us | 13.2k ops/s |
| extract() | multi-strategy nested repair | 406.83 us | 820.83 us | 2.2k ops/s |
| extract() | ~10KB nested payload | 92.83 us | 167.75 us | 10.5k ops/s |
| extract() | repeated ~10KB throughput | 94.69 us | 96.21 us | 10.6k ops/s |
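For readers who want to produce similar columns outside the benchmark suite, the sketch below shows one common way to derive p50, p99, and throughput from raw per-call timings with the standard library. The `summarize` and `time_calls` helpers, and the choice of throughput as the reciprocal of the mean, are assumptions for illustration. The published numbers come from the pytest run above, not from this code.

```python
import statistics
import time
from typing import Callable


def summarize(samples_ns: list[int]) -> dict[str, float]:
    """Return p50 and p99 in microseconds and throughput in calls per second."""
    # quantiles(n=100) yields 99 cut points; index 49 is p50 and index 98 is p99.
    cuts = statistics.quantiles(samples_ns, n=100, method="inclusive")
    mean_ns = statistics.fmean(samples_ns)
    return {
        "p50_us": cuts[49] / 1_000,
        "p99_us": cuts[98] / 1_000,
        "ops_per_s": 1e9 / mean_ns,
    }


def time_calls(func: Callable[[], object], rounds: int = 1_000) -> list[int]:
    """Collect one wall-clock sample per call, in nanoseconds."""
    samples = []
    for _ in range(rounds):
        start = time.perf_counter_ns()
        func()
        samples.append(time.perf_counter_ns() - start)
    return samples


stats = summarize(time_calls(lambda: sorted(range(1_000))))
```

As a rough sanity check on the units, one call every 2.17 us corresponds to about 461k calls per second, close to the 462k ops/s reported for the first preprocess() row. Exact agreement is not expected when throughput is based on the mean rather than the median.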
Benchmark caveats
- The current suite is local, deterministic, and provider-free.
- Outlier behavior will vary by machine, Python version, and thermal state.
- The current repo does not yet publish benchmark baselines from CI runners.
- Instructor, Guardrails, and LangChain comparisons are planned, but not yet implemented in this repository.
How to run benchmarks
python -m pytest benchmarks/test_extract_benchmarks.py
python -m pytest benchmarks/test_extract_benchmarks.py --benchmark-sort=mean
python -m pytest benchmarks/test_extract_benchmarks.py --benchmark-json /tmp/confident_extract_benchmarks.json
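The JSON file written by the last command can be inspected with a few lines of standard-library Python. This sketch assumes the report layout that pytest-benchmark normally produces: a top-level `benchmarks` list whose entries carry a `name` and a `stats` mapping with `median` in seconds and `ops` in calls per second. `load_rows` and `main` are hypothetical helpers, and the field names should be checked against the file you actually get.

```python
import json
from pathlib import Path


def load_rows(report: dict) -> list[tuple[str, float, float]]:
    """Return (name, median in microseconds, ops per second), fastest first."""
    rows = []
    for bench in report.get("benchmarks", []):
        stats = bench["stats"]
        rows.append((bench["name"], stats["median"] * 1e6, stats["ops"]))
    return sorted(rows, key=lambda row: row[1])


def main(path: str = "/tmp/confident_extract_benchmarks.json") -> None:
    """Print one line per benchmark found in the report at `path`."""
    report = json.loads(Path(path).read_text(encoding="utf-8"))
    for name, median_us, ops in load_rows(report):
        print(f"{name:<50} {median_us:10.2f} us {ops:12.0f} ops/s")
```

Calling `main()` after a benchmark run prints the rows sorted by median latency, which makes regressions between two saved reports easier to spot by eye.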
Architecture flow diagram
raw input text
      |
      v
preprocess(text)
      |
      v
repair(preprocessed_text)
      |
      v
validate_with_msgspec(parsed payload, schema)
      |
      v
ExtractionResult[T]
  - data
  - repair_applied
  - repair_attempts
  - raw_input
  - repaired_text
  - latency_ms
Feature list
- Minimal sync API: extract(text, schema=Invoice)
- Conservative preprocessing for markdown fences, whitespace normalization, and escaped JSON
- Deterministic JSON repair for trailing commas, unterminated containers, single quotes, and bare keys
- Strict msgspec.Struct validation with field-path extraction on failures
- Frozen, slotted extraction result contract
- Package-root exports for the public sync API
- Local benchmark coverage for preprocess, repair, validation, and full extraction
Roadmap
- Stabilize the sync extraction API for 0.1.x
- Add the optional pydantic bridge outside the hot path
- Add async and streaming APIs
- Add provider adapters for live model integrations
- Add confidence scoring and retry routing
- Add reproducible cross-library benchmark comparisons
Contribution and dev setup
AGENTS.md is the repo-level implementation contract. Read it before changing the code.
Local setup:
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
pre-commit install
Quality gates:
python -m ruff check .
python -m mypy .
python -m pytest
Benchmark and release checks:
python -m pytest benchmarks/test_extract_benchmarks.py
python -m build
twine check dist/*
For contributor expectations, issue filing guidance, and release checks, see CONTRIBUTING.md.
Download files
File details
Details for the file confident_extract-0.1.0a1.tar.gz.
File metadata
- Download URL: confident_extract-0.1.0a1.tar.gz
- Upload date:
- Size: 23.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 10a3b7837591589459c977611cf71ea821d98e51d65209cced380fc95fd4c133 |
| MD5 | a94b842ad9967d8bd3e0cd030b8ddb20 |
| BLAKE2b-256 | 8279df41950afa04cd5bbdb5eab56daa4e8d9c4a73cb72c7916a36cade11aac4 |
Provenance
The following attestation bundles were made for confident_extract-0.1.0a1.tar.gz:
Publisher: publish.yml on hitarthbuilds/confident-extract
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: confident_extract-0.1.0a1.tar.gz
- Subject digest: 10a3b7837591589459c977611cf71ea821d98e51d65209cced380fc95fd4c133
- Sigstore transparency entry: 1518759140
- Sigstore integration time:
- Permalink: hitarthbuilds/confident-extract@033c14728fea7cb7b58c9b6dde966e01accbf25a
- Branch / Tag: refs/tags/v0.1.0a1
- Owner: https://github.com/hitarthbuilds
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@033c14728fea7cb7b58c9b6dde966e01accbf25a
- Trigger Event: release
File details
Details for the file confident_extract-0.1.0a1-py3-none-any.whl.
File metadata
- Download URL: confident_extract-0.1.0a1-py3-none-any.whl
- Upload date:
- Size: 15.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7da089d72cbc43f9a71f29d93c8b2f851f12589a82ae8e9648fa914614a68fd5 |
| MD5 | f33bb2624e686c1cc19343b6f09d04b2 |
| BLAKE2b-256 | 075ff7728a7fb405850149dedeb947871675252a892bd04a5daef9b22e428f6d |
Provenance
The following attestation bundles were made for confident_extract-0.1.0a1-py3-none-any.whl:
Publisher: publish.yml on hitarthbuilds/confident-extract
Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: confident_extract-0.1.0a1-py3-none-any.whl
- Subject digest: 7da089d72cbc43f9a71f29d93c8b2f851f12589a82ae8e9648fa914614a68fd5
- Sigstore transparency entry: 1518759216
- Sigstore integration time:
- Permalink: hitarthbuilds/confident-extract@033c14728fea7cb7b58c9b6dde966e01accbf25a
- Branch / Tag: refs/tags/v0.1.0a1
- Owner: https://github.com/hitarthbuilds
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@033c14728fea7cb7b58c9b6dde966e01accbf25a
- Trigger Event: release