a schema-first grounded proposal engine

extractx

a schema-first grounded extraction engine.

extractx processes one already-scoped document at a time. Given a document and a pydantic schema, it produces grounded field observations, typed extraction instances, source evidence, and replay artifacts.

Production extraction follows a grounded-classification pattern: deterministic producers find candidate evidence with source spans, then LLM-backed classifiers choose among those bounded candidate IDs. The LLM does not author raw values, normalized values, evidence spans, or domain identity. Formatting, normalization, validation, and sealing remain deterministic after the observation decision.
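The pattern can be illustrated with a small standard-library sketch. It is not extractx's implementation: the Candidate dataclass, produce_candidates, resolve, and the regex are hypothetical names chosen for illustration, and the choose callable stands in for an LLM-backed classifier.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    # Hypothetical shape: an ID plus the evidence text and its byte span.
    candidate_id: str
    text: str
    byte_start: int
    byte_end: int


# Illustrative pattern only; a real producer would be configured per field.
MONEY = re.compile(rb"\$\d+(?:\.\d{2})?")


def produce_candidates(source: bytes) -> list[Candidate]:
    """Deterministic producer: every candidate carries its source span."""
    return [
        Candidate(f"c{i}", m.group().decode("utf-8"), m.start(), m.end())
        for i, m in enumerate(MONEY.finditer(source))
    ]


def resolve(
    candidates: Sequence[Candidate],
    choose: Callable[[tuple[str, ...]], str],
) -> Optional[Candidate]:
    """The chooser only returns an ID; the value comes from the menu."""
    by_id = {c.candidate_id: c for c in candidates}
    chosen_id = choose(tuple(by_id))
    # An ID outside the bounded menu resolves to nothing.
    return by_id.get(chosen_id)


source = b"Subtotal $40.00, total due $42.50"
menu = produce_candidates(source)
picked = resolve(menu, lambda ids: ids[-1])
```

Because the chooser only ever returns an ID, it cannot introduce a value or span that the deterministic producer did not emit.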

Deterministic instance assignment is intentionally not the production multi-instance path. Cardinality.ONE uses one synthetic extraction instance for single-instance fixtures, CI baselines, and simple one-instance documents. Cardinality.MANY requires an instance candidate strategy binding plus an instance proposer binding. The candidate strategy builds the bounded menu; the LLM-backed proposer selects from it.

extractx does not own downstream domain identity. Consumer systems decide whether an extraction instance maps to a business entity such as a tax-return return_id, customer account, case file, or invoice record. That correlation layer belongs outside extractx, using extraction instances, evidence, observations, and replay as inputs.

The intended shape is:

extractx:
  document + schema -> Extraction + ReplayArtifact

consumer:
  Extraction + domain rules -> business entities / facts / state

The planned dry-run surface is an inspectable extraction plan: static dry-run shows compiled bindings and required capabilities; grounded dry-run also shows deterministic field and instance candidate menus without calling an LLM or writing replay.

See docs/architecture.md for the system design, CODEX.md for the repo-local operating guide, and AGENTS.md for the generic working doctrine.

install

uv sync

Optional LLM selector/proposer support lives behind the pydantic_ai extra:

[project]
dependencies = ["extractx[pydantic_ai]"]

[tool.uv.sources]
extractx = { path = "../extractx", editable = true }

Declare the extra on the dependency string; keep the editable path in tool.uv.sources. If your workspace manager cannot compose extras with path sources, add pydantic-ai>=1.99.0 directly in the consuming project.

Optional spaCy NER candidate generation lives behind the spacy extra:

[project]
dependencies = ["extractx[spacy]"]

model_id="en" uses spacy.blank("en"), which is enough for configured EntityRuler patterns and core tests. If you want a pretrained spaCy model, install that model separately and pass its package name as model_id.

candidate strategies

ValueKind says what kind of value a field expects. Candidate strategies say where candidate evidence comes from. Strategies are explicit bindings; extractx does not silently attach NER or regex based on ValueKind.

from typing import Annotated

from pydantic import BaseModel

from extractx import ValueKind, extract_field
from extractx.candidates import NerCandidateStrategy, NerEntityRulerConfig
from extractx.core import StrategyBinding


class InvoiceSummary(BaseModel):
    total_due: Annotated[str, ValueKind.MONEY] = extract_field(
        description="invoice total due",
        strategy_bindings=(
            StrategyBinding(
                cls=NerCandidateStrategy,
                kind="candidate",
                params={
                    "model_id": "en",
                    "entity_rulers": (
                        NerEntityRulerConfig(
                            name="invoice_money",
                            patterns=(
                                {"label": "MONEY", "pattern": "$42.50"},
                            ),
                        ).model_dump(mode="json"),
                    ),
                    "entity_filter": ("MONEY",),
                },
            ),
        ),
    )

NER candidates are text candidates. They flow through the same candidate set, filter, selector, validation, evidence, and replay contracts as regex candidates.

source spans

Evidence.source_span.byte_start and byte_end are byte offsets, not Python string indexes. For text_anchor_space="source_bytes", they index the UTF-8 source bytes stored under source_ref; for text_anchor_space="normalized_text", they index DocumentView.normalized_text.encode("utf-8").

When highlighting inside a Python str, convert the byte span first:

from extractx import slice_utf8_byte_span, utf8_byte_span_to_char_range

start, end = utf8_byte_span_to_char_range(document_text, evidence.source_span)
assert document_text[start:end] == evidence.evidence_text
assert slice_utf8_byte_span(document_text, evidence.source_span) == evidence.evidence_text

Use this projection only when the supplied document_text.encode("utf-8") matches the bytes addressed by the span.
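To see why the conversion matters, here is a standard-library sketch of the idea behind it. It is not the extractx helper: byte_span_to_char_range is a hypothetical name, and it takes plain integers instead of a source span object.

```python
def byte_span_to_char_range(text: str, byte_start: int, byte_end: int) -> tuple[int, int]:
    """Map a UTF-8 byte span onto str indexes for the same text."""
    data = text.encode("utf-8")
    if not 0 <= byte_start <= byte_end <= len(data):
        raise ValueError("byte span is outside the encoded text")
    # Decoding the prefix counts how many characters precede the span.
    # A span boundary that splits a multi-byte character raises
    # UnicodeDecodeError here instead of silently shifting the range.
    start = len(data[:byte_start].decode("utf-8"))
    end = start + len(data[byte_start:byte_end].decode("utf-8"))
    return start, end


text = "Total dû: €42.50"
start, end = byte_span_to_char_range(text, 11, 19)
highlight = text[start:end]   # the span projected onto the str
naive = text[11:19]           # byte offsets misused as str indexes
```

In this example the naive slice drops the currency symbol, because "û" and "€" each occupy more than one byte.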

value kinds and runtime types

ValueKind is not a parser. It is a semantic tag on a Python type. The Python annotation controls the normalized runtime shape.

from typing import Annotated

from extractx import ValueKind

count_text = Annotated[str, ValueKind.CARDINAL]
count_int = Annotated[int, ValueKind.CARDINAL]

If source evidence says "20 items", count_text can normalize to the string "20 items". count_int asks pydantic to produce an int; if the raw candidate text is still the full phrase, validation fails unless the annotation contains a pre-coercion parser or the candidate strategy emits only the numeric span.

For pydantic-backed schemas, put phrase-to-type parsers in the annotation with BeforeValidator so extractx's isolated field validation sees them before pydantic type coercion:

from typing import Annotated

from pydantic import BeforeValidator

from extractx import ValueKind


def parse_count(value: object) -> object:
    if isinstance(value, str) and value.startswith("20 "):
        return 20
    return value


count_int = Annotated[int, BeforeValidator(parse_count), ValueKind.CARDINAL]
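The parser above only recognizes one phrase. A slightly more general pre-coercion parser might look like the following standard-library sketch; parse_leading_count and its regex are hypothetical, and the function is assumed to be dropped into BeforeValidator the same way parse_count is.

```python
import re

# Leading digits followed by whitespace or end of input, e.g. "20 items".
_LEADING_COUNT = re.compile(r"\s*(\d+)(?=\s|$)")


def parse_leading_count(value: object) -> object:
    """Return the leading integer of a phrase; pass anything else through."""
    if isinstance(value, str):
        match = _LEADING_COUNT.match(value)
        if match:
            return int(match.group(1))
    # Unparsed input is returned unchanged so type coercion can reject it.
    return value
```

Returning unrecognized input unchanged keeps the parser narrow: phrases it does not understand still reach validation and fail there rather than being guessed at.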

Class-level pydantic field_validators run after extractx's pydantic coercion step in the pydantic-backed path, so they should validate already-coerced values rather than parse raw evidence phrases.

Use ValueKind.CARDINAL for count-like quantities. Use the Python type and annotation-level validators to define the exact normalized value.

object validators

Use object validators for cross-field checks within one extracted object. They return structured issues instead of raising exceptions, so execution strategies can later decide whether to retry, repair, accept with warning, or fail. The current registration form is schema-method based; use a decorated @staticmethod and call shared helper functions from that method when multiple schemas need the same rule.

from datetime import date
from typing import Annotated

from pydantic import BaseModel

from extractx import ObjectIssue, ValueKind, extract_field, extractx_object_validator


class ScheduledEvent(BaseModel):
    start_date: Annotated[date, ValueKind.DATE] = extract_field(
        description="event start date",
    )
    end_date: Annotated[date, ValueKind.DATE] = extract_field(
        description="event end date",
    )

    @staticmethod
    @extractx_object_validator(implicates=("start_date", "end_date"))
    def dates_ordered(values, evidence):
        del evidence
        if values["end_date"] < values["start_date"]:
            return ObjectIssue(
                code="date_order",
                reason="end_date must be on or after start_date",
            )
        return None
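When several schemas need the same rule, the comparison can live in a shared helper that each schema's decorated @staticmethod calls. The sketch below uses only the standard library; date_order_problem is a hypothetical name, and it returns a plain (code, reason) pair that the calling validator would wrap in ObjectIssue(code=..., reason=...).

```python
from datetime import date
from typing import Mapping, Optional


def date_order_problem(
    values: Mapping[str, date],
    start_key: str,
    end_key: str,
) -> Optional[tuple[str, str]]:
    """Return (code, reason) when the end date precedes the start date."""
    if values[end_key] < values[start_key]:
        return ("date_order", f"{end_key} must be on or after {start_key}")
    return None


ok = date_order_problem(
    {"start_date": date(2024, 1, 1), "end_date": date(2024, 1, 2)},
    "start_date",
    "end_date",
)
bad = date_order_problem(
    {"start_date": date(2024, 1, 2), "end_date": date(2024, 1, 1)},
    "start_date",
    "end_date",
)
```

Passing the field names as parameters lets schemas with differently named date fields reuse the same check.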

Object validators run after field validation and resolution. Warning issues are diagnostic; error issues block the instance and surface as validation negatives with structured object_issues. Error issues do not remove the instance from Extraction.instances; they flip the instance to partial and append a validation negative.

If an ObjectIssue omits implicates, extractx fills them from the @extractx_object_validator(implicates=...) metadata. If the returned issue sets implicates, that narrower issue-specific set is preserved.

Use ExecutorPolicy(strategy="iterative") to enable the bounded repair path for single-instance specs. The executor first runs the normal bounded extraction. If field validation fails, extractx retries that field once with the pydantic or manual validation reason in ContextPack.retry_feedback. It then resolves and runs object validators; if object validators emit error issues, extractx retries only the implicated fields once with the issue reasons in ContextPack.retry_feedback, then validates the object again. Candidate sets are not mechanically filtered during either retry; the same validators still own truth after repair.
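The control flow described above can be sketched as a toy loop. This is not the extractx executor: iterative_extract and its three callables are hypothetical stand-ins, and the sketch omits re-running field validation after the object-level retry.

```python
from typing import Callable, Optional, Sequence

Select = Callable[[str, Optional[str]], object]            # (field, feedback) -> value
ValidateField = Callable[[str, object], Optional[str]]     # -> failure reason or None
ValidateObject = Callable[[dict], list]                    # -> [(implicated, reason)]


def iterative_extract(
    fields: Sequence[str],
    select: Select,
    validate_field: ValidateField,
    validate_object: ValidateObject,
):
    values: dict = {}
    field_errors: dict = {}
    for name in fields:
        value = select(name, None)
        reason = validate_field(name, value)
        if reason is not None:
            # One retry for this field, with the failure reason as feedback.
            value = select(name, reason)
            reason = validate_field(name, value)
        values[name] = value
        if reason is not None:
            field_errors[name] = reason

    issues = validate_object(values)
    if issues:
        # Collect feedback per implicated field, then retry only those fields.
        feedback: dict = {}
        for implicated, reason in issues:
            for name in implicated:
                feedback.setdefault(name, []).append(reason)
        for name, reasons in feedback.items():
            values[name] = select(name, "; ".join(reasons))
        # The same object validators decide the outcome after repair.
        issues = validate_object(values)
    return values, field_errors, issues
```

In this sketch the selector sees the same menu on every call; only the feedback argument changes between the first attempt and the retry.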

test

uv run pytest
