openextract
Extract structured data from documents, images, audio, and video using LLMs.
Documentation · PyPI · Changelog · Issues
openextract turns any document, image, audio, or video file into a typed Pydantic model in a single function call. Point it at a local path or a URL, pass a schema, and get back a validated object you can use directly in your code.
Features
- Type-safe output. Define your shape with Pydantic; get back a validated instance.
- One function, many modalities. Documents (PDF, DOCX), images, audio, and video.
- Local files or URLs. Pass a path or an https:// URL; openextract handles fetching.
- Bring your own model. OpenAI, Anthropic, Google, AWS Bedrock, xAI, Cohere, Hugging Face, Groq, Cerebras, Mistral, and Ollama supported out of the box via pydantic-ai.
- Explicit error handling. Distinct exceptions for URL fetch, schema validation, and model errors.
- 100% test coverage, enforced in CI.
Installation
uv add openextract
Or with pip:
pip install openextract
Requires Python 3.12+.
Quick start
from pydantic import BaseModel

from openextract import extract


class PdfInfo(BaseModel):
    summary: str
    language: str


result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="https://example.com/document.pdf",
    instructions="Return a two-sentence summary and the document's primary language.",
)
print(result.summary)
print(result.language)
result is a fully validated PdfInfo instance, not a dict or a string.
Usage
Local files
result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="./reports/q4.pdf",
)
Bytes or file-like objects
result = extract(schema=PdfInfo, model="openai:gpt-5", input_file=pdf_bytes, media_type="application/pdf")
# A file-like object with .read() works too; pass media_type explicitly.
# The with block closes the file once extraction finishes:
with open("q4.pdf", "rb") as f:
    result = extract(schema=PdfInfo, model="openai:gpt-5", input_file=f, media_type="application/pdf")
Retry on transient model errors
result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="./reports/q4.pdf",
    max_retries=3,
)
max_retries defaults to 0 (single attempt). When set, extract retries only on ModelError and sleeps retry_backoff * (2 ** attempt) seconds (with up to 25% jitter) between attempts. retry_backoff defaults to 1.0 second.
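The backoff formula above can be sketched in a few lines. This is only an illustration, not openextract's implementation: the helper name retry_delay is hypothetical, and it assumes that attempt is zero-based and that jitter only lengthens the delay by a uniform 0 to 25% of the base value.

```python
import random


def retry_delay(attempt: int, retry_backoff: float = 1.0, jitter: float = 0.25) -> float:
    """Seconds to sleep before the retry that follows attempt number `attempt`.

    Assumptions (not confirmed by the library): `attempt` starts at 0 and
    jitter is additive, drawn uniformly from 0% to 25% of the base delay.
    """
    base = retry_backoff * (2 ** attempt)
    return base * (1.0 + random.uniform(0.0, jitter))


# With the default retry_backoff of 1.0, the base delays are 1s, 2s, 4s.
delays = [retry_delay(attempt) for attempt in range(3)]
```

Under these assumptions, max_retries=3 would sleep for at most roughly 1.25 + 2.5 + 5 seconds in total before giving up.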
Inspecting token usage
Use extract_with_usage when you want token counts alongside the extracted output (for cost tracking, logging, etc.).
from openextract import extract_with_usage

result, usage = extract_with_usage(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="./reports/q4.pdf",
)
print(result.summary)
print(f"tokens: {usage.input_tokens} in / {usage.output_tokens} out / {usage.total_tokens} total")
usage is a frozen Usage dataclass with input_tokens, output_tokens, and total_tokens fields.
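For cost tracking, the three token fields are enough to estimate spend per call. The sketch below uses a stand-in Usage dataclass that mirrors the documented fields (it is not the library's class), a hypothetical estimate_cost helper, and made-up per-million-token rates; substitute your provider's actual pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Stand-in with the documented fields; not openextract's own class."""

    input_tokens: int
    output_tokens: int
    total_tokens: int


def estimate_cost(usage: Usage, input_rate: float, output_rate: float) -> float:
    """Estimated cost, given hypothetical prices per one million tokens."""
    return (usage.input_tokens * input_rate + usage.output_tokens * output_rate) / 1_000_000


usage = Usage(input_tokens=1200, output_tokens=300, total_tokens=1500)
# The rates here are placeholders, not real provider prices.
cost = estimate_cost(usage, input_rate=2.0, output_rate=8.0)
```

Because the dataclass is frozen, assigning to a field such as usage.input_tokens raises an error, so a Usage value can be logged or passed around without risk of accidental mutation.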
Choosing a model
model follows the pydantic-ai provider prefix convention:
| Provider | Example identifier |
|---|---|
| OpenAI | openai:gpt-5 |
| Anthropic | anthropic:claude-sonnet-4 |
| Google | google-gla:gemini-2.5-pro |
| AWS Bedrock | bedrock:anthropic.claude-sonnet-4-20250514-v1:0 |
| xAI | xai:grok-4 |
| Cohere | cohere:command-r-plus |
| Hugging Face | huggingface:meta-llama/Llama-3.3-70B-Instruct |
| Groq | groq:llama-3.3-70b-versatile |
| Cerebras | cerebras:llama3.1-70b |
| Mistral | mistral:mistral-large-latest |
| OpenRouter | openrouter:anthropic/claude-sonnet-4 |
| Outlines | outlines:transformers/meta-llama/Llama-3.2-1B-Instruct |
| Ollama | ollama:llama3 |
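Every identifier in the table has the shape provider:model-name, and some model names (the Bedrock one, for instance) contain a colon themselves. The hypothetical helper below illustrates the convention by splitting on the first colon only; pydantic-ai does its own parsing, so treat this purely as an illustration.

```python
def split_model_identifier(identifier: str) -> tuple[str, str]:
    """Split 'provider:model' on the first colon (illustrative helper only)."""
    provider, sep, model_name = identifier.partition(":")
    if not sep or not provider or not model_name:
        raise ValueError(f"expected 'provider:model', got {identifier!r}")
    return provider, model_name


provider, model_name = split_model_identifier(
    "bedrock:anthropic.claude-sonnet-4-20250514-v1:0"
)
# provider is "bedrock"; the trailing ":0" stays part of the model name.
```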
Ollama and Cerebras work via the openai-compatible code path — no dedicated extra is required for either.
Set the corresponding provider credentials in your environment (e.g. OPENAI_API_KEY). openextract loads .env automatically.
OpenRouter and Cerebras are openai-compatible (they go through the openai client under the hood), so their errors are already classified via the existing openai path — no separate exception handling is needed.
Outlines runs models locally (via HuggingFace transformers, llama-cpp, MLX, vLLM, or SGLang) and enforces JSON-schema-conforming output at the token level. Install it separately alongside the backend you want, for example pip install pydantic-ai-slim[outlines-transformers].
Command line
openextract ships with a CLI for one-shot extractions from the shell.
openextract ./reports/q4.pdf \
  --schema mypkg.schemas:Invoice \
  --model openai:gpt-5 \
  --instructions "Pull totals and line items." \
  --output json
- <input_file> is a positional argument: a local path or https:// URL.
- --schema is a Python import path of the form module:ClassName resolving to a Pydantic model.
- --model is a pydantic-ai model identifier.
- --instructions is optional natural-language guidance.
- --output is json (default; prints model_dump_json(indent=2)) or repr.
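A module:ClassName path like the one passed to --schema can be resolved with the standard library. The sketch below shows one plausible way to do it; resolve_schema is a hypothetical name and this is not necessarily how the CLI implements it. A real implementation would presumably also verify that the resolved object is a Pydantic model.

```python
import importlib


def resolve_schema(path: str) -> type:
    """Import 'module:ClassName' and return the named attribute (a sketch)."""
    module_name, sep, class_name = path.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:ClassName', got {path!r}")
    module = importlib.import_module(module_name)  # raises ImportError if missing
    return getattr(module, class_name)  # raises AttributeError if missing


# Demonstrated with a standard-library class so the sketch runs anywhere.
decoder_cls = resolve_schema("json:JSONDecoder")
```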
Exit codes: 0 success, 2 URL fetch error, 3 schema validation error, 4 model error,
5 other extraction error, 1 any other failure (including bad --schema paths).
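When driving the CLI from another script, the exit codes listed above can be turned into readable messages with a small lookup table. The names EXIT_CODES and describe_exit are hypothetical; only the code meanings come from the list above.

```python
# Exit code meanings as documented for the openextract CLI.
EXIT_CODES = {
    0: "success",
    1: "other failure (including bad --schema paths)",
    2: "URL fetch error",
    3: "schema validation error",
    4: "model error",
    5: "other extraction error",
}


def describe_exit(code: int) -> str:
    """Map a CLI exit code to its documented meaning."""
    return EXIT_CODES.get(code, f"unexpected exit code {code}")


message = describe_exit(3)
```

One way to use it is to pass the returncode attribute of the result of subprocess.run to describe_exit after invoking the CLI.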
Examples
Runnable scripts live in the examples/ directory. Each one takes the input path as the first argument and prints a JSON dump of the validated result:
| Script | What it does |
|---|---|
| invoice_extraction.py | PDF invoice -> structured line items |
| receipt_extraction.py | receipt image -> merchant, items, totals |
| meeting_notes.py | audio -> summary, decisions, action items |
Run any example with uv once your provider credentials (e.g. OPENAI_API_KEY) are set:
uv run python examples/invoice_extraction.py ./invoices/q4.pdf
See the examples/ directory for the full source.
Error handling
from openextract import (
    extract,
    UrlFetchError,
    SchemaValidationError,
    ModelError,
    ExtractionError,
)

try:
    result = extract(schema=PdfInfo, model="openai:gpt-5", input_file=url)
except UrlFetchError:
    ...  # The URL could not be fetched
except SchemaValidationError:
    ...  # The model's output did not match your schema
except ModelError:
    ...  # The model provider returned an error
except ExtractionError:
    ...  # Any other extraction failure (base class)
All openextract exceptions inherit from ExtractionError, so you can catch it as a single fallback if you prefer.
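Because the specific exceptions share a base class, the order of except clauses matters: the base class has to come last, or it will swallow the specific cases. The sketch below uses stand-in classes with the documented names (it assumes each one subclasses ExtractionError directly, which the library may or may not do) and a hypothetical classify helper.

```python
# Stand-ins for illustration; import the real classes from openextract instead.
class ExtractionError(Exception): ...
class UrlFetchError(ExtractionError): ...
class SchemaValidationError(ExtractionError): ...
class ModelError(ExtractionError): ...


def classify(exc: Exception) -> str:
    """Return a coarse label, checking the specific class before the base."""
    try:
        raise exc
    except ModelError:
        return "model"  # the only class extract() retries on
    except ExtractionError:
        return "extraction"  # fallback for every other openextract error


labels = [classify(ModelError("boom")), classify(UrlFetchError("boom"))]
```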
API reference
extract(schema, model, input_file, instructions=None, *, media_type=None, max_retries=0, retry_backoff=1.0)
| Argument | Type | Description |
|---|---|---|
| schema | type[BaseModel] | A Pydantic model class describing the desired output shape. |
| model | str | A pydantic-ai model identifier (e.g. "openai:gpt-5"). |
| input_file | str \| bytes \| BinaryIO | A local file path, an https:// URL, raw bytes, or a binary file-like object with a .read() method. |
| instructions | str \| None | Optional natural-language guidance for the model. |
| media_type | str \| None (keyword-only) | MIME type. Required for bytes and file-like inputs; overrides the guessed type for str inputs when provided. |
| max_retries | int (keyword-only) | Extra attempts after a ModelError. Defaults to 0 (no retry). |
| retry_backoff | float (keyword-only) | Base seconds for exponential backoff with jitter between retries. |
Returns an instance of schema.
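The media_type rules in the table can be pictured as a small decision function. The sketch below is an assumption about the behavior, not the library's code: resolve_media_type is a hypothetical name, and guessing from the file name via the standard mimetypes module is a stand-in for whatever openextract does internally.

```python
import mimetypes


def resolve_media_type(input_file, media_type=None):
    """Pick a MIME type following the documented rules (illustrative only)."""
    if media_type is not None:
        return media_type  # an explicit value always wins
    if isinstance(input_file, str):
        # Assumption: paths and URLs are guessed from their extension.
        guessed, _encoding = mimetypes.guess_type(input_file)
        if guessed is None:
            raise ValueError(f"cannot guess media type for {input_file!r}")
        return guessed
    raise ValueError("media_type is required for bytes and file-like inputs")


guessed = resolve_media_type("./reports/q4.pdf")
overridden = resolve_media_type("./reports/q4.pdf", media_type="text/plain")
```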
Security
URL fetching and SSRF
When input_file is an http:// or https:// URL, openextract fetches it
directly. To reduce server-side request forgery risk when callers pass
untrusted URLs, the fetcher refuses any URL whose host resolves to a
non-public address — private RFC 1918 ranges, loopback, link-local (including
the 169.254.169.254 cloud-metadata endpoint), multicast, and reserved
ranges, for both IPv4 and IPv6 (including IPv4-mapped IPv6 like
::ffff:127.0.0.1). The host is re-validated at every redirect hop, so an
attacker cannot use a public URL that redirects to an internal one.
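The address check described above can be approximated with the standard ipaddress module. This is a minimal sketch, not openextract's fetcher: is_public_ip is a hypothetical name, and a real implementation would first resolve the hostname to its addresses and repeat the check at each redirect hop.

```python
import ipaddress


def is_public_ip(text: str) -> bool:
    """Return True only for addresses outside the non-public ranges."""
    addr = ipaddress.ip_address(text)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:127.0.0.1) before classifying.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


blocked = [ip for ip in ("10.0.0.1", "169.254.169.254", "::ffff:127.0.0.1")
           if not is_public_ip(ip)]
```

All three sample addresses end up in blocked: a private RFC 1918 address, the link-local cloud-metadata endpoint, and loopback wrapped in IPv4-mapped IPv6.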
For workflows that legitimately need to fetch internal URLs (testing
against localhost, on-prem services, etc.), set the
OPENEXTRACT_ALLOW_PRIVATE_URLS environment variable to 1, true, or
yes to disable the check. If you need a one-off fetch from an internal
host without disabling validation globally, fetch the bytes with your own
HTTP client and pass them to extract() as bytes/file-like with an
explicit media_type.
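Reading the opt-out flag might look like the sketch below. The accepted values (1, true, yes) come from the text above; the function name, the case-insensitive comparison, and the whitespace stripping are assumptions made for illustration.

```python
import os

TRUTHY_VALUES = {"1", "true", "yes"}


def private_urls_allowed(environ=None) -> bool:
    """Check OPENEXTRACT_ALLOW_PRIVATE_URLS (hypothetical helper).

    Assumption: matching ignores case and surrounding whitespace.
    """
    env = os.environ if environ is None else environ
    value = env.get("OPENEXTRACT_ALLOW_PRIVATE_URLS", "")
    return value.strip().lower() in TRUTHY_VALUES


allowed = private_urls_allowed({"OPENEXTRACT_ALLOW_PRIVATE_URLS": "true"})
```

Accepting an explicit mapping makes the helper easy to exercise in tests without touching the real process environment.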
Note: host validation is best-effort; it does not defend against DNS rebinding (where the host resolves to different IPs across calls). Treat URL-based extraction of untrusted input as a privileged operation.
Reporting vulnerabilities
See SECURITY.md.
Development
git clone https://github.com/Mellow-Artificial-Intelligence/openextract.git
cd openextract
uv sync --dev
uv run pytest --cov=openextract # tests + coverage
uv run ruff check . # lint
uv run ruff format --check . # format check
CI runs the test suite on every PR and fails if total coverage drops below 100%.
See CONTRIBUTING.md for the full contributor guide.
License
MIT © Cole McIntosh