confident-extract

confident-extract is a small Python library for deterministic structured extraction from noisy JSON-like model output.

The current public alpha surface is synchronous and msgspec-first:

  • from confident_extract import extract
  • deterministic preprocessing and JSON repair
  • strict msgspec.Struct validation
  • lightweight result metadata around the validated output

Project overview

The library is built for the common case where an upstream model or OCR system returns JSON-like text that is close to valid, but not always valid enough to parse or validate directly.

The current sync pipeline is:

  1. preprocess raw text
  2. repair malformed JSON conservatively
  3. validate against a msgspec.Struct schema
  4. return a typed ExtractionResult
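
The four steps above can be pictured as a small function pipeline. This is only a sketch under stated assumptions, not the library's implementation: run_pipeline, PipelineOutcome, require_dict, and the toy stage callables are hypothetical names, and a plain callable stands in for msgspec validation.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class PipelineOutcome:
    """Hypothetical stand-in for a typed result; not the library's ExtractionResult."""

    data: Any
    repair_applied: bool


def run_pipeline(
    text: str,
    preprocess: Callable[[str], str],
    repair: Callable[[str], str],
    validate: Callable[[Any], Any],
) -> PipelineOutcome:
    cleaned = preprocess(text)  # step 1: preprocess raw text
    try:
        payload = json.loads(cleaned)  # already valid, so repair is skipped
        repair_applied = False
    except json.JSONDecodeError:
        payload = json.loads(repair(cleaned))  # step 2: repair, then parse again
        repair_applied = True
    # steps 3 and 4: validate, then wrap the validated value in a result
    return PipelineOutcome(validate(payload), repair_applied)


def require_dict(payload: Any) -> dict:
    """Toy validator (assumption): accept only JSON objects."""
    if not isinstance(payload, dict):
        raise TypeError("expected a JSON object")
    return payload


outcome = run_pipeline(
    ' {"invoice_id": 42,} ',
    preprocess=str.strip,
    repair=lambda s: s.replace(",}", "}"),  # toy repair: one trailing comma
    validate=require_dict,
)
print(outcome)
```

Keeping each stage a separate callable is one way to make every step independently testable and benchmarkable.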

The package does not currently include provider adapters, retries, async APIs, streaming, confidence scoring, or a pydantic bridge.

Install

Install the published package:

python -m pip install confident-extract

Install for local development:

python -m pip install -e ".[dev]"

Quickstart example

import msgspec

from confident_extract import extract


class Invoice(msgspec.Struct):
    invoice_id: int
    status: str
    total_cents: int


result = extract(
    text='{"invoice_id": 42, "status": "paid", "total_cents": 1999}',
    schema=Invoice,
)

assert result.data == Invoice(invoice_id=42, status="paid", total_cents=1999)
assert result.repair_applied is False

Malformed JSON repair example

import msgspec

from confident_extract import extract


class Invoice(msgspec.Struct):
    invoice_id: int
    status: str
    total_cents: int


raw = "{invoice_id: 42, status: 'paid', total_cents: 1999,}"
result = extract(text=raw, schema=Invoice)

assert result.data.status == "paid"
assert result.repair_applied is True
assert result.repaired_text == (
    '{"invoice_id": 42, "status": "paid", "total_cents": 1999}'
)
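
To give a feel for what repairing this input involves, here is a deliberately naive, regex-based sketch. It is not how confident-extract repairs text (the source does not describe the internals), and naive_repair is a hypothetical name. The regexes do not understand string contents, so values containing apostrophes, commas, or colons would be corrupted.

```python
import json
import re


def naive_repair(text: str) -> str:
    """Illustrative only: fix single quotes, bare keys, and trailing commas."""
    # 'paid' -> "paid" (assumes no quotes or backslashes inside the value)
    text = re.sub(r"'([^'\\]*)'", r'"\1"', text)
    # {key: or , key:  ->  {"key": or , "key":
    text = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', text)
    # ,} or ,]  ->  } or ]
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return text


raw = "{invoice_id: 42, status: 'paid', total_cents: 1999,}"
repaired = naive_repair(raw)
print(repaired)
print(json.loads(repaired))
```

A production repair step would more likely tokenize the input so that edits never touch characters inside string values.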

Nested schema example

import msgspec

from confident_extract import extract


class Contact(msgspec.Struct):
    email: str
    phone: str | None = None


class Customer(msgspec.Struct):
    name: str
    contact: Contact


class Invoice(msgspec.Struct):
    invoice_id: int
    customer: Customer
    tags: list[str]


raw = """
{
  "invoice_id": 7,
  "customer": {
    "name": "Acme",
    "contact": {"email": "ops@example.com", "phone": "123"}
  },
  "tags": ["paid", "net30"]
}
"""

result = extract(text=raw, schema=Invoice)

assert result.data.customer.contact.email == "ops@example.com"
assert result.data.tags == ["paid", "net30"]

Benchmark snapshot

Current local measurements were captured on May 12, 2026 with Python 3.13.5 using:

python -m pytest benchmarks/test_extract_benchmarks.py --benchmark-json /tmp/confident_extract_benchmarks.json

These are local measurements only. They are useful for regression tracking, not public performance claims.

Path                     Scenario                       p50         p99         Throughput
preprocess()             already-valid JSON             2.17 us     2.46 us     462k ops/s
preprocess()             fenced ~10KB payload           4.25 us     13.04 us    215k ops/s
repair()                 valid fast path                6.21 us     31.25 us    145k ops/s
repair()                 trailing comma repair          123.50 us   328.29 us   7.5k ops/s
repair()                 multi-strategy repair          611.38 us   965.96 us   1.7k ops/s
validate_with_msgspec()  nested decoded payload         3.00 us     3.21 us     333k ops/s
validate_with_msgspec()  ~10KB decoded payload          27.75 us    39.00 us    34.6k ops/s
extract()                valid fast path                7.42 us     13.88 us    123k ops/s
extract()                trailing comma repair          73.50 us    172.63 us   13.2k ops/s
extract()                multi-strategy nested repair   406.83 us   820.83 us   2.2k ops/s
extract()                ~10KB nested payload           92.83 us    167.75 us   10.5k ops/s
extract()                repeated ~10KB throughput      94.69 us    96.21 us    10.6k ops/s
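
For readers unfamiliar with the columns, the sketch below shows one common way to derive p50, p99, and throughput from raw per-call timings. It assumes nearest-rank percentiles and throughput as the inverse of the mean; the benchmark tool may compute these differently. The sample timings are made up.

```python
import math
import statistics


def percentile(samples_us: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of timings in microseconds."""
    ordered = sorted(samples_us)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples_us: list[float]) -> dict[str, float]:
    mean_us = statistics.fmean(samples_us)
    return {
        "p50_us": percentile(samples_us, 50),
        "p99_us": percentile(samples_us, 99),
        # One call takes mean_us microseconds, so calls per second is 1e6 / mean_us.
        "ops_per_s": 1_000_000 / mean_us,
    }


summary = summarize([2.0, 2.1, 2.2, 2.3, 9.0])  # made-up timings with one outlier
print(summary)
```

A single slow call barely moves p50 but dominates p99 and lowers mean-based throughput, which is why the table reports all three.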

Benchmark caveats

  • The current suite is local, deterministic, and provider-free.
  • Outlier behavior will vary by machine, Python version, and thermal state.
  • The current repo does not yet publish benchmark baselines from CI runners.
  • Instructor, Guardrails, and LangChain comparisons are planned, but not yet implemented in this repository.

How to run benchmarks

python -m pytest benchmarks/test_extract_benchmarks.py
python -m pytest benchmarks/test_extract_benchmarks.py --benchmark-sort=mean
python -m pytest benchmarks/test_extract_benchmarks.py --benchmark-json /tmp/confident_extract_benchmarks.json
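
The JSON file written by --benchmark-json can be summarized with a few lines of Python. The sketch below assumes pytest-benchmark's usual layout (a top-level "benchmarks" list whose entries carry "name" and a "stats" mapping with "median" in seconds and "ops"); check these keys against your own file. summarize_report and load_report are hypothetical helpers, and the sample data is made up.

```python
import json
from pathlib import Path


def summarize_report(report: dict) -> list[tuple[str, float, float]]:
    """Return (name, median in microseconds, ops/s) rows, fastest first."""
    rows = []
    for bench in report.get("benchmarks", []):
        stats = bench["stats"]
        rows.append((bench["name"], stats["median"] * 1e6, stats["ops"]))
    return sorted(rows, key=lambda row: row[1])


def load_report(path: str) -> dict:
    """Read a report file such as /tmp/confident_extract_benchmarks.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Made-up sample in the assumed layout, so the example runs without a real file.
sample = {
    "benchmarks": [
        {"name": "test_slow", "stats": {"median": 0.0001, "ops": 9000.0}},
        {"name": "test_fast", "stats": {"median": 0.000002, "ops": 450000.0}},
    ]
}

rows = summarize_report(sample)
for name, median_us, ops in rows:
    print(f"{name:<12} {median_us:>8.2f} us {ops:>10.0f} ops/s")
```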

Architecture flow diagram

raw input text
    |
    v
preprocess(text)
    |
    v
repair(preprocessed_text)
    |
    v
validate_with_msgspec(parsed payload, schema)
    |
    v
ExtractionResult[T]
  - data
  - repair_applied
  - repair_attempts
  - raw_input
  - repaired_text
  - latency_ms

Feature list

  • Minimal sync API: extract(text, schema=Invoice)
  • Conservative preprocessing for markdown fences, whitespace normalization, and escaped JSON
  • Deterministic JSON repair for trailing commas, unterminated containers, single quotes, and bare keys
  • Strict msgspec.Struct validation with field-path extraction on failures
  • Frozen, slotted extraction result contract
  • Package-root exports for the public sync API
  • Local benchmark coverage for preprocess, repair, validation, and full extraction
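
To illustrate the preprocessing bullet above, the sketch below strips a surrounding markdown fence and unwraps JSON that arrived as an escaped string literal. It is an assumption-laden illustration, not the library's preprocess(): strip_fence, unwrap_escaped, and preprocess_sketch are hypothetical names, and only the simplest fence layout is handled.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example never contains a literal fence
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_fence(text: str) -> str:
    """Return the body of a single surrounding fence, else the stripped text."""
    match = _FENCED.match(text)
    return match.group(1).strip() if match else text.strip()


def unwrap_escaped(text: str) -> str:
    """If the text is a JSON string literal, decode it once to reach the inner JSON."""
    if text.startswith('"') and text.endswith('"'):
        try:
            inner = json.loads(text)
        except json.JSONDecodeError:
            return text
        if isinstance(inner, str):
            return inner
    return text


def preprocess_sketch(text: str) -> str:
    return unwrap_escaped(strip_fence(text))


fenced = FENCE + 'json\n{"a": 1}\n' + FENCE
print(preprocess_sketch(fenced))
```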

Roadmap

  • Stabilize the sync extraction API for 0.1.x
  • Add the optional pydantic bridge outside the hot path
  • Add async and streaming APIs
  • Add provider adapters for live model integrations
  • Add confidence scoring and retry routing
  • Add reproducible cross-library benchmark comparisons

Contribution and dev setup

AGENTS.md is the repo-level implementation contract. Read it before changing the code.

Local setup:

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
pre-commit install

Quality gates:

python -m ruff check .
python -m mypy .
python -m pytest

Benchmark and release checks:

python -m pytest benchmarks/test_extract_benchmarks.py
python -m build
twine check dist/*

For contributor expectations, issue filing guidance, and release checks, see CONTRIBUTING.md.
