Extract structured data from documents, images, audio, and video using LLMs

These details have not been verified by PyPI

Project description

openextract

openextract turns any document, image, audio, or video file into a typed Pydantic model in a single function call. Point it at a local path or a URL, pass a schema, and get back a validated object you can use directly in your code.

Features

Type-safe output. Define your shape with Pydantic; get back a validated instance.
One function, many modalities. Documents (PDF, DOCX), images, audio, and video.
Local files or URLs. Pass a path or an https:// URL — openextract handles fetching.
Bring your own model. OpenAI, Anthropic, Google, AWS Bedrock, xAI, Cohere, Hugging Face, Groq, Cerebras, Mistral, and Ollama supported out of the box via pydantic-ai.
Explicit error handling. Distinct exceptions for URL fetch, schema validation, and model errors.
100% test coverage, enforced in CI.

Installation

uv add openextract

Or with pip:

pip install openextract

Model calls require a provider SDK. Install the extra for the provider you use, for example openextract[openai], openextract[anthropic], or openextract[all] for every supported provider. The base package ships pydantic-ai-slim without provider SDKs pre-installed. If the requested provider SDK is missing, openextract raises ProviderNotInstalledError with a provider-specific pip install 'openextract[...]' command when the model prefix is known.

Requires Python 3.12+.

Quick start

from pydantic import BaseModel
from openextract import extract


class PdfInfo(BaseModel):
    summary: str
    language: str


result = extract(
    schema=PdfInfo,
    model="xai:grok-4.3",
    input_file="https://example.com/document.pdf",
    instructions="Return a two-sentence summary and the document's primary language.",
)

print(result.summary)
print(result.language)

result is a fully-validated PdfInfo instance — not a dict, not a string.

Usage

Local files

result = extract(
    schema=PdfInfo,
    model="xai:grok-4.3",
    input_file="./reports/q4.pdf",
)

Bytes or file-like objects

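# pdf_bytes holds the raw PDF content, e.g. the result of reading q4.pdf in binary mode.
result = extract(schema=PdfInfo, model="xai:grok-4.3", input_file=pdf_bytes, media_type="application/pdf")
# A file-like object with .read() works too; pass media_type explicitly.
# The context manager closes the file handle once the call returns.
with open("q4.pdf", "rb") as f:
    result = extract(schema=PdfInfo, model="xai:grok-4.3", input_file=f, media_type="application/pdf")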

Retry on transient model errors

result = extract(
    schema=PdfInfo,
    model="xai:grok-4.3",
    input_file="./reports/q4.pdf",
    max_retries=3,
)

max_retries defaults to 0 (a single attempt, no retries) and must be a non-negative integer. When it is greater than 0, extract retries only on ModelError and sleeps retry_backoff * (2 ** attempt) seconds (with up to 25% jitter) between attempts. retry_backoff defaults to 1.0 second and must be positive and finite.
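To make the retry schedule concrete, here is a minimal sketch of the delays the formula above produces. It is not openextract's implementation: the helper name `backoff_delays` is hypothetical, and it assumes that `attempt` starts at 0 and that jitter only lengthens a delay by up to 25% of its base value.

```python
import random


def backoff_delays(max_retries: int, retry_backoff: float = 1.0, jitter: float = 0.25) -> list[float]:
    """Return one sleep duration per retry, following retry_backoff * (2 ** attempt)."""
    delays = []
    for attempt in range(max_retries):  # assumption: the first retry is attempt 0
        base = retry_backoff * (2 ** attempt)
        # Assumption: jitter is additive and at most `jitter` times the base delay.
        delays.append(base * (1 + random.uniform(0, jitter)))
    return delays


# With the defaults, three retries wait about 1, 2, and 4 seconds (each up to 25% longer).
print(backoff_delays(3))
```

Under these assumptions, max_retries=3 adds at most 8.75 seconds of sleeping to a call that fails every time.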

Inspecting token usage

Use extract_with_usage when you want token counts alongside the extracted output (for cost tracking, logging, etc.).

from openextract import extract_with_usage

result, usage = extract_with_usage(
    schema=PdfInfo,
    model="xai:grok-4.3",
    input_file="./reports/q4.pdf",
)

print(result.summary)
print(f"tokens: {usage.input_tokens} in / {usage.output_tokens} out / {usage.total_tokens} total")

usage is a frozen Usage dataclass with input_tokens, output_tokens, and total_tokens fields.
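As an illustration of the cost-tracking use case, the sketch below estimates the price of one call from its token counts. The `Usage` class here is a local stand-in with the three documented fields (it is not imported from openextract), `estimate_cost` is a hypothetical helper, and the prices are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Stand-in with the three documented fields; not the class shipped by openextract."""

    input_tokens: int
    output_tokens: int
    total_tokens: int


def estimate_cost(usage: Usage, input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Estimate the cost of one call from per-million-token prices."""
    return (
        usage.input_tokens * input_price_per_mtok
        + usage.output_tokens * output_price_per_mtok
    ) / 1_000_000


usage = Usage(input_tokens=12_000, output_tokens=500, total_tokens=12_500)
# Hypothetical prices: 2.00 per million input tokens, 8.00 per million output tokens.
print(f"estimated cost: {estimate_cost(usage, 2.00, 8.00):.4f}")
```

Because the dataclass is frozen, assigning to a field after construction raises an error, so a usage record can be logged or cached without risk of later mutation.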

Choosing a model

model follows the pydantic-ai provider prefix convention:

Provider	Example identifier	Install extra
OpenAI	`openai:gpt-5`	`openextract[openai]`
Anthropic	`anthropic:claude-sonnet-4`	`openextract[anthropic]`
Google	`google-gla:gemini-2.5-pro`	`openextract[google]`
AWS Bedrock	`bedrock:anthropic.claude-sonnet-4-20250514-v1:0`	`openextract[bedrock]`
xAI	`xai:grok-4.3`	`openextract[xai]`
Cohere	`cohere:command-r-plus`	`openextract[cohere]`
Hugging Face	`huggingface:meta-llama/Llama-3.3-70B-Instruct`	`openextract[huggingface]`
Groq	`groq:llama-3.3-70b-versatile`	`openextract[groq]`
Cerebras	`cerebras:llama3.1-70b`	`openextract[openai]`
Mistral	`mistral:mistral-large-latest`	`openextract[mistral]`
OpenRouter	`openrouter:anthropic/claude-sonnet-4`	`openextract[openrouter]`
Outlines	`outlines:transformers/meta-llama/Llama-3.2-1B-Instruct`	Install the matching `pydantic-ai-slim[outlines-*]` backend
Ollama	`ollama:llama3`	`openextract[openai]`

Ollama and Cerebras work via the OpenAI-compatible code path, so both use the openextract[openai] extra rather than a dedicated one.

Set the corresponding provider credentials in your environment (e.g. XAI_API_KEY for xAI). openextract loads .env automatically.

OpenRouter and Cerebras are OpenAI-compatible: both go through the openai client under the hood, so their errors are classified by the existing openai path and need no separate exception handling.

Outlines runs models locally (via HuggingFace transformers, llama-cpp, MLX, vLLM, or SGLang) and enforces JSON-schema-conforming output at the token level. Install it separately alongside the backend you want, for example pip install pydantic-ai-slim[outlines-transformers].

Command line

openextract ships with a CLI for one-shot extractions from the shell.

openextract ./reports/q4.pdf \
  --schema mypkg.schemas:Invoice \
  --model xai:grok-4.3 \
  --instructions "Pull totals and line items." \
  --output json

Batch multiple files (JSON array output):

openextract ./invoices/a.pdf ./invoices/b.pdf \
  --schema mypkg.schemas:Invoice \
  --model xai:grok-4.3

Token usage (single file):

openextract ./reports/q4.pdf \
  --schema mypkg.schemas:Invoice \
  --model xai:grok-4.3 \
  --usage

Read from stdin:

cat ./reports/q4.pdf | openextract - \
  --schema mypkg.schemas:Invoice \
  --model xai:grok-4.3 \
  --media-type application/pdf

input_file accepts one or more paths/URLs, or - for stdin (--media-type required for stdin).
--schema is a Python import path of the form module:ClassName resolving to a Pydantic model.
--model is a pydantic-ai model identifier.
--instructions is optional natural-language guidance.
--media-type sets MIME type for stdin or overrides guessing for paths/URLs.
--usage prints a JSON object with result and usage (single input only).
--output is json (default) or repr.
--max-retries / --retry-backoff match the Python API retry behavior.
--continue-on-error (batch only) keeps processing when an input fails; each failure is emitted inline as {"input", "error", "error_type"} and the command exits 7 if any input failed. Without it, a batch aborts on the first failure.
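For readers unfamiliar with the module:ClassName form used by --schema, the sketch below shows one common way to resolve such a path with importlib. The function name `resolve_import_path` is hypothetical and the CLI's actual resolution logic may differ; a real implementation would presumably also check that the class is a Pydantic model.

```python
import importlib


def resolve_import_path(path: str) -> type:
    """Resolve 'module:ClassName' to the class object it names."""
    module_name, sep, class_name = path.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:ClassName', got {path!r}")
    module = importlib.import_module(module_name)  # raises ImportError for a bad module
    obj = getattr(module, class_name)  # raises AttributeError if the name is missing
    if not isinstance(obj, type):
        raise TypeError(f"{path!r} does not name a class")
    return obj


# A standard-library class is used so the example runs without extra packages.
print(resolve_import_path("collections:OrderedDict"))
```

The module must be importable from the directory where the command runs, which is why mypkg.schemas:Invoice in the examples above assumes mypkg is installed or on the Python path.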

Exit codes: 0 success, 2 URL fetch error, 3 schema validation error, 4 model error, 5 other extraction error, 6 missing provider extra, 7 partial batch failure (--continue-on-error), 1 any other failure (including bad --schema paths).

Extraction errors are written to stderr; successful JSON, usage payloads, and --continue-on-error batch arrays are written to stdout. Missing provider extras exit 6 and include the same install hint as the Python API, for example pip install 'openextract[xai]'. Partial batch failures with --continue-on-error still print the full batch array to stdout, write a warning to stderr, and exit 7.
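When calling the CLI from another program, the exit codes above can be turned into readable messages. The following is a rough sketch: `EXIT_CODES`, `describe_exit`, and `run_openextract` are hypothetical names, and the wrapper assumes the openextract command is on PATH.

```python
import subprocess

# Exit codes as documented above.
EXIT_CODES = {
    0: "success",
    1: "other failure (including bad --schema paths)",
    2: "URL fetch error",
    3: "schema validation error",
    4: "model error",
    5: "other extraction error",
    6: "missing provider extra",
    7: "partial batch failure (--continue-on-error)",
}


def describe_exit(code: int) -> str:
    """Map an openextract exit code to its documented meaning."""
    return EXIT_CODES.get(code, f"undocumented exit code {code}")


def run_openextract(args: list[str]) -> tuple[int, str, str]:
    """Run the CLI and return (exit code, stdout, stderr) without raising on failure."""
    proc = subprocess.run(["openextract", *args], capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


print(describe_exit(7))
```

Note that exit code 7 is not a total failure: stdout still holds the full batch array, so a caller can parse it and inspect the entries that carry an error field.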

Examples

Runnable scripts live in examples/, grouped by use case (local files, bytes, URLs, images, batch, async, retries, CLI, and more). See examples/README.md for the full table.

# Run all fixture-based examples (uses OpenAI, Anthropic, and xAI — see examples/README.md)
uv run python examples/run_all.py

# Single example with the bundled sample image
uv run python examples/basic/local_file.py --fixture

See the examples/ directory for the full source.

Error handling

from openextract import (
    extract,
    UrlFetchError,
    SchemaValidationError,
    ModelError,
    ProviderNotInstalledError,
    ExtractionError,
)

url = "https://example.com/document.pdf"

try:
    result = extract(schema=PdfInfo, model="xai:grok-4.3", input_file=url)
except UrlFetchError:
    ...  # The URL could not be fetched
except SchemaValidationError:
    ...  # The model's output did not match your schema
except ProviderNotInstalledError:
    ...  # The provider extra isn't installed (e.g. pip install openextract[xai])
except ModelError:
    ...  # The model provider returned an error
except ExtractionError:
    ...  # Any other extraction failure (base class)

All openextract exceptions inherit from ExtractionError, so you can catch it as a single fallback if you prefer.
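The sketch below illustrates why the specific handlers come before the ExtractionError fallback. The classes are local stand-ins so the example runs without openextract installed. It assumes each documented exception subclasses ExtractionError directly (the real hierarchy may be deeper), and `run_and_label` is a hypothetical helper.

```python
class ExtractionError(Exception):
    """Stand-in for the documented base class."""


class UrlFetchError(ExtractionError):
    """Stand-in: the URL could not be fetched."""


class SchemaValidationError(ExtractionError):
    """Stand-in: the model output did not match the schema."""


class ModelError(ExtractionError):
    """Stand-in: the provider returned an error."""


class ProviderNotInstalledError(ExtractionError):
    """Stand-in: the provider extra is missing."""


def run_and_label(fn) -> str:
    """Call fn and report which documented error category, if any, it raised."""
    try:
        fn()
    except (UrlFetchError, SchemaValidationError, ModelError, ProviderNotInstalledError) as exc:
        return type(exc).__name__  # a specific, documented category
    except ExtractionError:
        return "ExtractionError"  # broad fallback; must come after the specific handlers
    return "ok"


def failing_call():
    raise ModelError("provider returned 500")


print(run_and_label(failing_call))
```

If the ExtractionError handler were listed first, it would catch every subclass and the specific branches would never run.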

API reference

`extract(schema, model, input_file, instructions=None, *, media_type=None, max_retries=0, retry_backoff=1.0)`

Argument	Type	Description
`schema`	`type[BaseModel]`	A Pydantic model class describing the desired output shape.
`model`	`str`	A `pydantic-ai` model identifier (e.g. `"xai:grok-4.3"`).
`input_file`	`str | bytes | BinaryIO`	A local file path, an `https://` URL, raw `bytes`, or a binary file-like object with a `.read()` method.
`instructions`	`str | None`	Optional natural-language guidance for the model.
`media_type`	`str | None` (keyword-only)	MIME type. Required for `bytes` and file-like inputs; overrides the guessed type for `str` inputs when provided.
`max_retries`	`int` (keyword-only)	Extra attempts after a `ModelError`. Must be a non-negative integer. Defaults to `0` (no retry).
`retry_backoff`	`float` (keyword-only)	Base seconds for exponential backoff with jitter between retries. Must be positive and finite.

Returns an instance of schema.

`extract_async(schema, model, input_file, instructions=None, *, media_type=None, max_retries=0, retry_backoff=1.0)`

Async counterpart to extract. Uses Agent.run instead of run_sync. Accepts the same schema, model, input_file, instructions, media_type, max_retries, and retry_backoff arguments.

Returns an instance of schema.

`extract_with_usage(schema, model, input_file, instructions=None, *, media_type=None, max_retries=0, retry_backoff=1.0)`

Like extract, but returns (output, Usage) where Usage is a frozen dataclass with input_tokens, output_tokens, and total_tokens. Useful for cost tracking and logging. Uses the same ModelError retry behavior as extract.

`extract_with_usage_async(schema, model, input_file, instructions=None, *, media_type=None, max_retries=0, retry_backoff=1.0)`

Async sibling of extract_with_usage; returns (output, Usage).

`extract_many(schema, model, input_files, instructions=None, *, media_type=None, max_concurrency=5, return_exceptions=False, max_retries=0, retry_backoff=1.0)`

Run concurrent extractions from synchronous code. Each item in input_files is a path, URL, bytes, or file-like object (same rules as extract). Results are returned in input order.

Argument	Type	Description
`input_files`	`Iterable[str | bytes | BinaryIO]`	One input per extraction.
`media_type`	`str | None` (keyword-only)	Applied uniformly to every item; required if any item is `bytes`/file-like.
`max_concurrency`	`int` (keyword-only)	Maximum in-flight extractions. Must be a positive integer. Defaults to `5`.
`return_exceptions`	`bool` (keyword-only)	If `True`, exceptions appear in the result list instead of being raised.
`max_retries`	`int` (keyword-only)	Per-item extra attempts after a `ModelError`. Must be a non-negative integer. Defaults to `0`.
`retry_backoff`	`float` (keyword-only)	Base seconds for per-item exponential backoff with jitter. Must be positive and finite.

Returns a list of schema instances (or exceptions when return_exceptions=True).
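The combination of max_concurrency, input-order results, and return_exceptions described above can be reproduced with a semaphore and asyncio.gather. This is only a sketch of that general pattern, not openextract's code: `run_bounded` and `fake_extract` are hypothetical, and `fake_extract` stands in for a real extraction call.

```python
import asyncio
from collections.abc import Awaitable, Callable, Iterable


async def run_bounded(
    worker: Callable[[str], Awaitable[str]],
    inputs: Iterable[str],
    max_concurrency: int = 5,
    return_exceptions: bool = False,
) -> list:
    """Run worker over inputs with at most max_concurrency in flight, keeping input order."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be a positive integer")
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: str) -> str:
        async with semaphore:  # blocks while max_concurrency workers are already running
            return await worker(item)

    # gather returns results in the order the awaitables were passed, not completion order.
    return await asyncio.gather(*(guarded(i) for i in inputs), return_exceptions=return_exceptions)


async def fake_extract(path: str) -> str:
    """Hypothetical stand-in for a single extraction call."""
    await asyncio.sleep(0.01)
    if path.endswith(".bad"):
        raise ValueError(f"cannot read {path}")
    return path.upper()


results = asyncio.run(
    run_bounded(fake_extract, ["a.pdf", "b.bad", "c.pdf"], max_concurrency=2, return_exceptions=True)
)
print(results)
```

With return_exceptions=True, the failed item appears as an exception object in its original position, so callers can pair each result with its input by index.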

`extract_many_async(schema, model, input_files, instructions=None, *, media_type=None, max_concurrency=5, return_exceptions=False, max_retries=0, retry_backoff=1.0)`

Async sibling of extract_many; same arguments and return shape.

`Usage`

Frozen dataclass returned by extract_with_usage / extract_with_usage_async:

Field	Type	Description
`input_tokens`	`int`	Prompt tokens consumed.
`output_tokens`	`int`	Completion tokens.
`total_tokens`	`int`	Total tokens for the call.

Public API stability

openextract.__all__ is the public Python API surface. Modules and helpers whose names start with _, including openextract._extract and openextract._cli, are internal implementation details. The CLI command is also user-facing and follows the compatibility notes below even though it is not exported from __all__.

API	Status for 1.0	Notes
`extract`	Stable	Primary synchronous API. Signature, return type, media input forms, retry behavior, and public exception categories are intended to carry into 1.0 unchanged.
`extract_async`	Stable	Async sibling of `extract`; same input contract and retry behavior, with `Agent.run` instead of `run_sync`.
`extract_with_usage`	Stable	Usage-returning sync API. The `(output, Usage)` tuple shape is stable; exact token values depend on provider reporting.
`extract_with_usage_async`	Stable	Async sibling of `extract_with_usage`; same tuple shape and retry behavior.
`extract_many`	Provisional	Batch return ordering, option validation, and `return_exceptions` semantics are intended to remain, but pre-1.0 work should clarify behavior when called inside an active event loop.
`extract_many_async`	Provisional	Async batch API with the same return shape, option constraints, and per-item retry behavior as `extract_many`.
`Usage`	Stable	Frozen dataclass with `input_tokens`, `output_tokens`, and `total_tokens`. New fields, if ever needed, should be additive.
`ExtractionError`	Stable	Base class for all public `openextract` exceptions. Catch this for a broad fallback.
`UrlFetchError`	Stable	Raised for URL fetch and URL safety failures. Message wording may improve, but the exception type is stable.
`SchemaValidationError`	Stable	Raised when model output cannot be validated against the requested schema.
`ModelError`	Stable	Raised for provider/model API failures. Provider-specific classifiers may expand without changing the public type.
`ProviderNotInstalledError`	Stable	Raised when the requested model provider extra is missing. Install hints may become more specific as providers are added.
`openextract` CLI	Provisional	The command, core flags, JSON output, stderr error reporting, provider-install exit code `6`, and partial-batch exit code `7` are intended to remain.

No pre-1.0 signature changes are currently proposed for stable symbols. Known pre-1.0 follow-ups are limited to active-event-loop behavior.

Compatibility and deprecation policy

openextract follows semantic-versioning intent, with extra care while the project is still pre-1.0:

Public API: openextract.__all__ is the public Python API. The documented CLI arguments and exit codes, supported optional extras, documented environment variables, and documented input/output behavior are also user-facing compatibility surfaces.
Private API: modules, functions, classes, and constants whose names start with _ are internal unless they are explicitly documented here. They may change without a deprecation period.
Patch releases: should fix bugs, documentation, packaging, provider error classification, or security issues without intentionally breaking public API.
Minor releases before 1.0: may make breaking public API changes when they are needed for correctness, security, or a clearer long-term contract. These changes must be called out in CHANGELOG.md as breaking changes.
Major releases after 1.0: are the normal place for breaking public API removals or incompatible behavior changes.

Deprecated public APIs should remain available until at least the next minor release before 1.0, unless keeping them would create a security, correctness, or maintenance risk. After 1.0, deprecated public APIs should remain available until the next major release. Deprecations should be documented in CHANGELOG.md with the replacement path and the earliest expected removal version when that is known.

Provider behavior depends partly on pydantic-ai and provider SDKs. Upstream model availability, credential requirements, supported media types, token usage reporting, and provider-specific error shapes can change outside an openextract release. openextract aims to keep its own public contract stable, but provider-specific compatibility notes may be updated as upstream behavior changes.

Python support follows requires-python in pyproject.toml; the current minimum is Python 3.12. Dropping support for a Python minor version is a breaking change and should be announced in CHANGELOG.md.

Security

URL fetching and SSRF

When input_file is an http:// or https:// URL, openextract fetches it directly. To reduce server-side request forgery risk when callers pass untrusted URLs, the fetcher refuses any URL whose host resolves to a non-public address — private RFC 1918 ranges, loopback, link-local (including the 169.254.169.254 cloud-metadata endpoint), multicast, and reserved ranges, for both IPv4 and IPv6 (including IPv4-mapped IPv6 like ::ffff:127.0.0.1). The host is re-validated at every redirect hop, so an attacker cannot use a public URL that redirects to an internal one.
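The address classes listed above map closely onto flags in Python's standard ipaddress module. The sketch below shows a check of that kind for a single, already-resolved IP address. `is_public_address` is a hypothetical name, and openextract's actual fetcher may use different rules; a full check would also resolve the hostname and test every address it returns.

```python
import ipaddress


def is_public_address(addr: str) -> bool:
    """Return True only if addr falls outside the non-public ranges listed above."""
    ip = ipaddress.ip_address(addr)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) so the IPv4 rules apply.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


for candidate in ["8.8.8.8", "10.0.0.1", "127.0.0.1", "169.254.169.254", "::ffff:127.0.0.1", "::1"]:
    print(candidate, is_public_address(candidate))
```

Only the first candidate passes; the rest are private, loopback, or link-local addresses, including the cloud-metadata endpoint and the IPv4-mapped loopback form.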

For workflows that legitimately need to fetch internal URLs (testing against localhost, on-prem services, etc.), set the OPENEXTRACT_ALLOW_PRIVATE_URLS environment variable to 1, true, or yes to disable the check.

Tune fetch behavior with:

OPENEXTRACT_URL_TIMEOUT — HTTP timeout in seconds (default 30)
OPENEXTRACT_MAX_REDIRECTS — maximum redirect hops (default 10)

Invalid or non-positive values fall back to the defaults. If you need a one-off fetch from an internal host without disabling validation globally, fetch the bytes with your own HTTP client and pass them to extract() as bytes/file-like with an explicit media_type.
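The fallback rule for these variables can be sketched as follows. The helper names are hypothetical and this is not openextract's parsing code. The sketch assumes that non-finite values count as invalid and that the OPENEXTRACT_ALLOW_PRIVATE_URLS comparison ignores letter case and surrounding whitespace. An integer variant would be needed for OPENEXTRACT_MAX_REDIRECTS.

```python
import math
from collections.abc import Mapping


def positive_float(env: Mapping[str, str], name: str, default: float) -> float:
    """Read a positive, finite number from env, falling back to default otherwise."""
    try:
        value = float(env.get(name, ""))  # a missing variable becomes "" and fails to parse
    except ValueError:
        return default
    # Assumption: inf and nan are treated as invalid, like non-positive values.
    return value if math.isfinite(value) and value > 0 else default


def allow_private_urls(env: Mapping[str, str]) -> bool:
    """True when OPENEXTRACT_ALLOW_PRIVATE_URLS is 1, true, or yes."""
    # Assumption: surrounding whitespace and letter case are ignored.
    return env.get("OPENEXTRACT_ALLOW_PRIVATE_URLS", "").strip().lower() in {"1", "true", "yes"}


env = {"OPENEXTRACT_URL_TIMEOUT": "-5", "OPENEXTRACT_ALLOW_PRIVATE_URLS": "yes"}
print(positive_float(env, "OPENEXTRACT_URL_TIMEOUT", 30.0))  # non-positive, so the default applies
print(allow_private_urls(env))
```

In real use, os.environ can be passed as env; taking a mapping keeps the helpers easy to test.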

Note: host validation is best-effort; it does not defend against DNS rebinding (where the host resolves to different IPs across calls). Treat URL-based extraction of untrusted input as a privileged operation.

Reporting vulnerabilities

See SECURITY.md.

Development

git clone https://github.com/Mellow-Artificial-Intelligence/openextract.git
cd openextract
uv sync --dev

uv run pytest --cov=openextract            # tests + coverage
uv run ruff check .                        # lint (Astral ruff)
uv run ruff format --check .               # format check
uv run ty check                            # types (Astral ty)

CI runs the test suite on every PR and fails if total coverage drops below 100%.

See CONTRIBUTING.md for the full contributor guide.

Roadmap

The project roadmap lives in the GitHub Wiki.

License

MIT © Cole McIntosh

Project details

Release history

Version	Released
0.9.0 (this version)	Jul 13, 2026
0.8.0	Jun 2, 2026
0.7.0	May 23, 2026
0.6.0	May 17, 2026
0.5.0	May 16, 2026
0.4.0	May 16, 2026
0.3.2	May 5, 2026
0.3.1	Apr 11, 2026
0.3.0	Apr 11, 2026

Download files

Source Distribution

openextract-0.9.0.tar.gz (367.5 kB)

Uploaded Jul 13, 2026 Source

Built Distribution

openextract-0.9.0-py3-none-any.whl (20.9 kB)

Uploaded Jul 13, 2026 Python 3

File details

Details for the file openextract-0.9.0.tar.gz.

File metadata

Download URL: openextract-0.9.0.tar.gz
Upload date: Jul 13, 2026
Size: 367.5 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for openextract-0.9.0.tar.gz
Algorithm	Hash digest
SHA256	`190239bd7282a81b5d1103200117b3061c4cadf1cf157bfdd0483af4bd0b258e`
MD5	`781e2017f025dab2798cb695ab99022d`
BLAKE2b-256	`7b0c049fb37652251ab08b860c2d9b52c0f28bc895e5b73233ac82e7b0849b73`

Provenance

The following attestation bundles were made for openextract-0.9.0.tar.gz:

Publisher: release.yml on Mellow-Artificial-Intelligence/openextract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: openextract-0.9.0.tar.gz
- Subject digest: 190239bd7282a81b5d1103200117b3061c4cadf1cf157bfdd0483af4bd0b258e
- Sigstore transparency entry: 2163492873
- Sigstore integration time: Jul 13, 2026
Source repository:
- Permalink: Mellow-Artificial-Intelligence/openextract@de7e6639593bea09c373861b0d3c4c2de3d2083c
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Mellow-Artificial-Intelligence
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@de7e6639593bea09c373861b0d3c4c2de3d2083c
- Trigger Event: push

File details

Details for the file openextract-0.9.0-py3-none-any.whl.

File metadata

Download URL: openextract-0.9.0-py3-none-any.whl
Upload date: Jul 13, 2026
Size: 20.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.13

File hashes

Hashes for openextract-0.9.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c43bad6a5316c2ffc6937c9f2928dccc95557f06c2d4cc2204b95fd96f1439cb`
MD5	`4c9302e6767b9b8846e19b6715f5473a`
BLAKE2b-256	`a83cb34a87d0c00e13e06d714562e58e3e0ac9088f1627c771bcd4c0b9f7c03e`

Provenance

The following attestation bundles were made for openextract-0.9.0-py3-none-any.whl:

Publisher: release.yml on Mellow-Artificial-Intelligence/openextract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: openextract-0.9.0-py3-none-any.whl
- Subject digest: c43bad6a5316c2ffc6937c9f2928dccc95557f06c2d4cc2204b95fd96f1439cb
- Sigstore transparency entry: 2163492963
- Sigstore integration time: Jul 13, 2026
Source repository:
- Permalink: Mellow-Artificial-Intelligence/openextract@de7e6639593bea09c373861b0d3c4c2de3d2083c
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Mellow-Artificial-Intelligence
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@de7e6639593bea09c373861b0d3c4c2de3d2083c
- Trigger Event: push
