extractx
a schema-first grounded extraction engine.
extractx processes one already-scoped document at a time. Given a document and a pydantic schema, it produces grounded field observations, typed extraction instances, source evidence, and replay artifacts.
Production extraction follows a grounded-classification pattern: deterministic producers find candidate evidence with source spans, then LLM-backed classifiers choose among those bounded candidate IDs. The LLM does not author raw values, normalized values, evidence spans, or domain identity. Formatting, normalization, validation, and sealing remain deterministic after the observation decision.
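The grounded-classification pattern above can be sketched with the standard library alone. This is a minimal illustration, not extractx code: Candidate, produce_money_candidates, select_candidate, and pick_last are hypothetical names, the chooser is a plain function standing in for an LLM-backed classifier, and spans are character offsets for brevity.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """One piece of candidate evidence with its source span (hypothetical shape)."""
    candidate_id: str
    text: str
    char_start: int
    char_end: int


def produce_money_candidates(document: str) -> list[Candidate]:
    """Deterministic producer: every regex match becomes a candidate with a span."""
    pattern = re.compile(r"\$\d+(?:\.\d{2})?")
    return [
        Candidate(f"c{i}", m.group(), m.start(), m.end())
        for i, m in enumerate(pattern.finditer(document))
    ]


def select_candidate(
    candidates: Sequence[Candidate],
    chooser: Callable[[Sequence[Candidate]], str],
) -> Candidate:
    """Ask the chooser for an ID and reject anything outside the bounded menu."""
    by_id = {c.candidate_id: c for c in candidates}
    chosen_id = chooser(candidates)
    if chosen_id not in by_id:
        raise ValueError(f"chooser returned unknown candidate id: {chosen_id!r}")
    return by_id[chosen_id]


def pick_last(candidates: Sequence[Candidate]) -> str:
    """Stand-in for an LLM-backed classifier: it returns only an ID."""
    return candidates[-1].candidate_id


document = "Subtotal $40.00, tax $2.50, total due $42.50."
candidates = produce_money_candidates(document)
chosen = select_candidate(candidates, pick_last)
# The value is read back from the source span, never authored by the chooser.
value = document[chosen.char_start:chosen.char_end]
```

Because the chooser can only return an ID from the menu, any value it "selects" is guaranteed to exist in the source text at a known span.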
Deterministic instance assignment is intentionally not the production
multi-instance path. Cardinality.ONE uses one synthetic extraction
instance for single-instance fixtures, CI baselines, and simple
one-instance documents. Cardinality.MANY requires an instance candidate
strategy binding plus an instance proposer binding. The candidate strategy
builds the bounded menu; the LLM-backed proposer selects from it.
extractx does not own downstream domain identity. Consumer systems decide
whether an extraction instance maps to a business entity such as a
tax-return return_id, customer account, case file, or invoice record.
That correlation layer belongs outside extractx, using
extraction instances, evidence, observations, and replay as inputs.
The intended shape is:
extractx:
    document + schema -> Extraction + ReplayArtifact
consumer:
    Extraction + domain rules -> business entities / facts / state
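As a rough sketch of the consumer side of that split, the correlation step might look like the following. Everything here is hypothetical: InstanceStandIn is a stand-in for whatever an extraction instance exposes, and correlate_invoices is an example of a domain rule that lives in the consuming system rather than in extractx.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceStandIn:
    """Hypothetical stand-in for one extraction instance; not an extractx type."""
    instance_id: str
    values: dict


def correlate_invoices(instances, known_invoice_numbers):
    """Consumer-side domain rule: link instances to invoice records by number."""
    linked, unmatched = {}, []
    for inst in instances:
        number = inst.values.get("invoice_number")
        if number in known_invoice_numbers:
            linked[inst.instance_id] = number
        else:
            # The consumer decides what an unmatched instance means.
            unmatched.append(inst.instance_id)
    return linked, unmatched


instances = [
    InstanceStandIn("i0", {"invoice_number": "INV-1"}),
    InstanceStandIn("i1", {"invoice_number": "INV-9"}),
    InstanceStandIn("i2", {}),
]
linked, unmatched = correlate_invoices(instances, {"INV-1", "INV-2"})
```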
The planned dry-run surface is an inspectable extraction plan: static dry-run shows compiled bindings and required capabilities; grounded dry-run also shows deterministic field and instance candidate menus without calling an LLM or writing replay.
See docs/architecture.md for the system design, CODEX.md for the repo-local operating guide, and AGENTS.md for the generic working doctrine.
install
uv sync
Optional LLM selector/proposer support lives behind the pydantic_ai extra:
[project]
dependencies = ["extractx[pydantic_ai]"]
[tool.uv.sources]
extractx = { path = "../extractx", editable = true }
Declare the extra on the dependency string; keep the editable path in
tool.uv.sources. If your workspace manager cannot compose extras with path
sources, add pydantic-ai>=1.99.0 directly in the consuming project.
Optional spaCy NER candidate generation lives behind the spacy extra:
[project]
dependencies = ["extractx[spacy]"]
model_id="en" uses spacy.blank("en"), which is enough for configured
EntityRuler patterns and core tests. If you want a pretrained spaCy model,
install that model separately and pass its package name as model_id.
candidate strategies
ValueKind says what kind of value a field expects. Candidate strategies say
where candidate evidence comes from. Strategies are explicit bindings; extractx
does not silently attach NER or regex based on ValueKind.
from typing import Annotated

from pydantic import BaseModel

from extractx import ValueKind, extract_field
from extractx.candidates import NerCandidateStrategy, NerEntityRulerConfig
from extractx.core import StrategyBinding


class InvoiceSummary(BaseModel):
    total_due: Annotated[str, ValueKind.MONEY] = extract_field(
        description="invoice total due",
        strategy_bindings=(
            StrategyBinding(
                cls=NerCandidateStrategy,
                kind="candidate",
                params={
                    "model_id": "en",
                    "entity_rulers": (
                        NerEntityRulerConfig(
                            name="invoice_money",
                            patterns=(
                                {"label": "MONEY", "pattern": "$42.50"},
                            ),
                        ).model_dump(mode="json"),
                    ),
                    "entity_filter": ("MONEY",),
                },
            ),
        ),
    )
NER candidates are text candidates. They flow through the same candidate set, filter, selector, validation, evidence, and replay contracts as regex candidates.
source spans
Evidence.source_span.byte_start and byte_end are byte offsets, not Python
string indexes. For text_anchor_space="source_bytes", they index the UTF-8
source bytes stored under source_ref; for text_anchor_space="normalized_text",
they index DocumentView.normalized_text.encode("utf-8").
When highlighting inside a Python str, convert the byte span first:
from extractx import slice_utf8_byte_span, utf8_byte_span_to_char_range
start, end = utf8_byte_span_to_char_range(document_text, evidence.source_span)
assert document_text[start:end] == evidence.evidence_text
assert slice_utf8_byte_span(document_text, evidence.source_span) == evidence.evidence_text
Use this projection only when the supplied document_text.encode("utf-8")
matches the bytes addressed by the span.
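To see why the conversion matters, here is a standard-library sketch of the same idea. byte_span_to_char_range is a hypothetical helper written for this illustration; it is not the implementation of utf8_byte_span_to_char_range and takes raw offsets rather than a span object.

```python
def byte_span_to_char_range(text: str, byte_start: int, byte_end: int) -> tuple[int, int]:
    """Map a UTF-8 byte span onto str indexes by decoding the byte prefixes."""
    data = text.encode("utf-8")
    if not 0 <= byte_start <= byte_end <= len(data):
        raise ValueError("byte span is out of range")
    # Decoding raises UnicodeDecodeError if an offset splits a multi-byte character.
    char_start = len(data[:byte_start].decode("utf-8"))
    char_end = char_start + len(data[byte_start:byte_end].decode("utf-8"))
    return char_start, char_end


text = "Café total: €42.50"
data = text.encode("utf-8")
needle = "€42.50".encode("utf-8")
byte_start = data.index(needle)
byte_end = byte_start + len(needle)

start, end = byte_span_to_char_range(text, byte_start, byte_end)
highlighted = text[start:end]          # correct: uses converted indexes
naive = text[byte_start:byte_end]      # wrong: treats byte offsets as str indexes
```

The two-byte "é" shifts every later byte offset by one relative to the str index, so the naive slice drops the "€" sign.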
value kinds and runtime types
ValueKind is not a parser. It is a semantic tag on a Python type. The Python
annotation controls the normalized runtime shape.
from typing import Annotated
from extractx import ValueKind
count_text = Annotated[str, ValueKind.CARDINAL]
count_int = Annotated[int, ValueKind.CARDINAL]
If source evidence says "20 items", count_text can normalize to the string
"20 items". count_int asks pydantic to produce an int; if the raw
candidate text is still the full phrase, validation fails unless the annotation
contains a pre-coercion parser or the candidate strategy emits only the numeric
span.
For pydantic-backed schemas, put phrase-to-type parsers in the annotation with
BeforeValidator so extractx's isolated field validation sees them before
pydantic type coercion:
from typing import Annotated

from pydantic import BeforeValidator

from extractx import ValueKind


def parse_count(value: object) -> object:
    if isinstance(value, str) and value.startswith("20 "):
        return 20
    return value


count_int = Annotated[int, BeforeValidator(parse_count), ValueKind.CARDINAL]
Class-level pydantic field_validators run after extractx's pydantic coercion
step in the pydantic-backed path, so they should validate already-coerced
values rather than parse raw evidence phrases.
Use ValueKind.CARDINAL for count-like quantities. Use the Python type and
annotation-level validators to define the exact normalized value.
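A pre-coercion parser usually needs to handle more than one literal phrase. The sketch below shows one possible shape using only the standard library; parse_leading_count is a hypothetical helper, and the rule it applies (a leading integer followed by whitespace or end of string) is an assumption about what counts as a count phrase.

```python
import re

# Leading digits must be followed by whitespace or end of string,
# so "3.5 kg" and "20items" are left untouched.
_LEADING_INT = re.compile(r"\s*(\d+)(?:\s|$)")


def parse_leading_count(value: object) -> object:
    """Return the leading integer of a phrase such as "20 items", else pass through."""
    if isinstance(value, str):
        match = _LEADING_INT.match(value)
        if match:
            return int(match.group(1))
    # Anything unrecognized is returned unchanged for later validation to judge.
    return value
```

A function of this shape could be passed to BeforeValidator in an annotation the same way a hand-written parser is. Returning unrecognized input unchanged keeps rejection decisions in the type validation step rather than in the parser.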
object validators
Use object validators for cross-field checks within one extracted object. They
return structured issues instead of raising exceptions, so execution strategies
can later decide whether to retry, repair, accept with warning, or fail.
The current registration form is schema-method based; use a decorated
@staticmethod and call shared helper functions from that method when multiple
schemas need the same rule.
from datetime import date
from typing import Annotated

from pydantic import BaseModel

from extractx import ObjectIssue, ValueKind, extract_field, extractx_object_validator


class ScheduledEvent(BaseModel):
    start_date: Annotated[date, ValueKind.DATE] = extract_field(
        description="event start date",
    )
    end_date: Annotated[date, ValueKind.DATE] = extract_field(
        description="event end date",
    )

    @staticmethod
    @extractx_object_validator(implicates=("start_date", "end_date"))
    def dates_ordered(values, evidence):
        del evidence
        if values["end_date"] < values["start_date"]:
            return ObjectIssue(
                code="date_order",
                reason="end_date must be on or after start_date",
            )
        return None
Object validators run after field validation and resolution. Warning issues are
diagnostic only. Error issues block the instance: they flip it to partial and
append a validation negative carrying structured object_issues. They do not
remove the instance from Extraction.instances.
If an ObjectIssue omits implicates, extractx fills them from the
@extractx_object_validator(implicates=...) metadata. If the returned issue
sets implicates, that narrower issue-specific set is preserved.
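The fill-in rule can be pictured as a small defaulting step. This is a sketch under assumptions, not extractx internals: IssueStandIn is a hypothetical stand-in for ObjectIssue, and with_default_implicates is an invented helper name.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class IssueStandIn:
    """Hypothetical stand-in for an object issue; not the extractx ObjectIssue class."""
    code: str
    reason: str
    implicates: Optional[tuple[str, ...]] = None


def with_default_implicates(issue: IssueStandIn, validator_implicates) -> IssueStandIn:
    """Fill implicates from validator metadata only when the issue left it unset."""
    if issue.implicates is None:
        return replace(issue, implicates=tuple(validator_implicates))
    return issue


metadata = ("start_date", "end_date")
broad = with_default_implicates(IssueStandIn("date_order", "dates out of order"), metadata)
narrow = with_default_implicates(
    IssueStandIn("date_order", "end_date is in the past", implicates=("end_date",)),
    metadata,
)
```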
Use ExecutorPolicy(strategy="iterative") to enable the bounded repair path for
single-instance specs. The executor first runs the normal bounded extraction. If
field validation fails, extractx retries that field once with the pydantic or
manual validation reason in ContextPack.retry_feedback. It then resolves and
runs object validators; if object validators emit error issues, extractx retries
only the implicated fields once with the issue reasons in
ContextPack.retry_feedback, then validates the object again. Candidate sets are
not mechanically filtered during either retry; the same validators still own
truth after repair.
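The field-level half of that repair path amounts to one normal attempt plus at most one retry that carries the failure reason. The sketch below illustrates only that control flow; the Selector and Validator signatures, extract_with_one_retry, and the demo functions are all hypothetical, and the object-validator retry is omitted.

```python
from typing import Callable, Optional

# Hypothetical signatures for illustration; none of these are extractx APIs.
Selector = Callable[[str, Optional[str]], str]   # (field, retry_feedback) -> raw value
Validator = Callable[[str], Optional[str]]       # raw value -> failure reason or None


def extract_with_one_retry(field: str, select: Selector, validate: Validator):
    """Run one normal attempt, then at most one retry that carries the failure reason."""
    value = select(field, None)
    reason = validate(value)
    attempts = 1
    if reason is not None:
        value = select(field, reason)
        reason = validate(value)  # the same validator judges the repaired value
        attempts = 2
    return value, reason, attempts


def demo_select(field: str, retry_feedback: Optional[str]) -> str:
    """Stand-in selector: picks the full phrase first, the numeric span on retry."""
    return "20 items" if retry_feedback is None else "20"


def demo_validate(value: str) -> Optional[str]:
    """Stand-in validator: accepts only digit strings."""
    return None if value.isdigit() else f"{value!r} is not an integer"


value, reason, attempts = extract_with_one_retry("count", demo_select, demo_validate)
```

The retry is bounded: if the second attempt also fails, the failure reason is returned rather than looping again.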
test
uv run pytest