Extract structured data from documents, images, audio, and video using LLMs

openextract


openextract turns any document, image, audio, or video file into a typed Pydantic model in a single function call. Point it at a local path or a URL, pass a schema, and get back a validated object you can use directly in your code.

Features

  • Type-safe output. Define your shape with Pydantic; get back a validated instance.
  • One function, many modalities. Documents (PDF, DOCX), images, audio, and video.
  • Local files or URLs. Pass a path or an https:// URL — openextract handles fetching.
  • Bring your own model. OpenAI, Anthropic, Google, AWS Bedrock, xAI, Cohere, Hugging Face, Groq, Cerebras, Mistral, and Ollama supported out of the box via pydantic-ai.
  • Explicit error handling. Distinct exceptions for URL fetch, schema validation, and model errors.
  • 100% test coverage, enforced in CI.

Installation

uv add openextract

Or with pip:

pip install openextract

Requires Python 3.12+.

Quick start

from pydantic import BaseModel
from openextract import extract


class PdfInfo(BaseModel):
    summary: str
    language: str


result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="https://example.com/document.pdf",
    instructions="Return a two-sentence summary and the document's primary language.",
)

print(result.summary)
print(result.language)

result is a fully validated PdfInfo instance, not a dict or a string.

Usage

Local files

result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="./reports/q4.pdf",
)

Bytes or file-like objects

with open("q4.pdf", "rb") as f:
    pdf_bytes = f.read()
result = extract(schema=PdfInfo, model="openai:gpt-5", input_file=pdf_bytes, media_type="application/pdf")
# A file-like object with .read() works too; pass media_type explicitly:
with open("q4.pdf", "rb") as f:
    result = extract(schema=PdfInfo, model="openai:gpt-5", input_file=f, media_type="application/pdf")

Retry on transient model errors

result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="./reports/q4.pdf",
    max_retries=3,
)

max_retries defaults to 0 (single attempt). When set, extract retries only on ModelError and sleeps retry_backoff * (2 ** attempt) seconds (with up to 25% jitter) between attempts. retry_backoff defaults to 1.0 second.
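
The schedule described above can be sketched as follows. This is an illustration, not openextract's implementation: the helper name backoff_delays is hypothetical, attempt is assumed to count from 0, and the jitter is assumed to be added on top of the base delay.

```python
import random


def backoff_delays(max_retries, retry_backoff=1.0, jitter=0.25, rng=None):
    """Return the sleep before each retry (hypothetical helper, not part of openextract)."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):  # assumption: attempt counts from 0
        base = retry_backoff * (2 ** attempt)
        # Assumption: jitter adds between 0% and 25% of the base delay.
        delays.append(base * (1 + jitter * rng.random()))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays(3), start=1):
        print(f"retry {i}: sleep about {delay:.2f}s")
```

Under these assumptions, max_retries=3 with the default retry_backoff gives base delays of 1, 2, and 4 seconds before jitter.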

Inspecting token usage

Use extract_with_usage when you want token counts alongside the extracted output (for cost tracking, logging, etc.).

from openextract import extract_with_usage

result, usage = extract_with_usage(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="./reports/q4.pdf",
)

print(result.summary)
print(f"tokens: {usage.input_tokens} in / {usage.output_tokens} out / {usage.total_tokens} total")

usage is a frozen Usage dataclass with input_tokens, output_tokens, and total_tokens fields.
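
For cost tracking, the token counts can be combined with per-token prices. The sketch below is a minimal illustration: UsageSketch is a stand-in that only mirrors the three documented fields, estimate_cost is a hypothetical helper, and the prices are placeholders rather than real provider rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSketch:
    """Stand-in with the three documented fields; the real object comes from extract_with_usage."""

    input_tokens: int
    output_tokens: int
    total_tokens: int


def estimate_cost(usage, input_price_per_mtok, output_price_per_mtok):
    """Estimate spend from token counts, given prices per million tokens."""
    return (
        usage.input_tokens * input_price_per_mtok
        + usage.output_tokens * output_price_per_mtok
    ) / 1_000_000


if __name__ == "__main__":
    usage = UsageSketch(input_tokens=12_000, output_tokens=800, total_tokens=12_800)
    # Placeholder prices, not real provider rates.
    print(f"estimated cost: {estimate_cost(usage, 1.0, 10.0):.4f}")
```

Any object with input_tokens and output_tokens attributes can be passed to estimate_cost, so the usage value returned by extract_with_usage should work in place of the stand-in.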

Choosing a model

model follows the pydantic-ai provider prefix convention:

Provider        Example identifier
OpenAI          openai:gpt-5
Anthropic       anthropic:claude-sonnet-4
Google          google-gla:gemini-2.5-pro
AWS Bedrock     bedrock:anthropic.claude-sonnet-4-20250514-v1:0
xAI             xai:grok-4
Cohere          cohere:command-r-plus
Hugging Face    huggingface:meta-llama/Llama-3.3-70B-Instruct
Groq            groq:llama-3.3-70b-versatile
Cerebras        cerebras:llama3.1-70b
Mistral         mistral:mistral-large-latest
OpenRouter      openrouter:anthropic/claude-sonnet-4
Outlines        outlines:transformers/meta-llama/Llama-3.2-1B-Instruct
Ollama          ollama:llama3
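
Some identifiers in the table contain more than one colon (the Bedrock example ends in :0), so the provider prefix is everything before the first colon. The sketch below illustrates that reading of the convention; split_model_id is a hypothetical helper and is not part of openextract or pydantic-ai.

```python
def split_model_id(model):
    """Split "provider:model-name" on the first colon only (hypothetical helper)."""
    provider, sep, name = model.partition(":")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider:model', got {model!r}")
    return provider, name


if __name__ == "__main__":
    print(split_model_id("openai:gpt-5"))
    print(split_model_id("bedrock:anthropic.claude-sonnet-4-20250514-v1:0"))
```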

Ollama and Cerebras work via the openai-compatible code path — no dedicated extra is required for either.

Set the corresponding provider credentials in your environment (e.g. OPENAI_API_KEY). openextract loads .env automatically.

OpenRouter and Cerebras are openai-compatible (they go through the openai client under the hood), so their errors are already classified via the existing openai path — no separate exception handling is needed.

Outlines runs models locally (via Hugging Face transformers, llama-cpp, MLX, vLLM, or SGLang) and enforces JSON-schema-conforming output at the token level. Install it separately alongside the backend you want, for example pip install "pydantic-ai-slim[outlines-transformers]" (the quotes keep shells such as zsh from interpreting the brackets).

Command line

openextract ships with a CLI for one-shot extractions from the shell.

openextract ./reports/q4.pdf \
  --schema mypkg.schemas:Invoice \
  --model openai:gpt-5 \
  --instructions "Pull totals and line items." \
  --output json

  • <input_file> is a positional argument: a local path or an https:// URL.
  • --schema is a Python import path of the form module:ClassName resolving to a Pydantic model.
  • --model is a pydantic-ai model identifier.
  • --instructions is optional natural-language guidance.
  • --output is json (default; prints model_dump_json(indent=2)) or repr.
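
A module:ClassName path like the one passed to --schema can be resolved with the standard library. The sketch below shows one way this might work; resolve_import_path is a hypothetical helper, not the CLI's actual code, and it does not check that the result is a Pydantic model.

```python
import importlib


def resolve_import_path(spec):
    """Resolve "module:ClassName" to the named attribute (hypothetical helper)."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:ClassName', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


if __name__ == "__main__":
    # A standard-library target keeps the demo runnable without extra packages.
    print(resolve_import_path("json:JSONDecoder"))
```

For the path to resolve, the module must be importable from the environment where the CLI runs, for example because the package is installed or the working directory is on sys.path.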

Exit codes: 0 success, 2 URL fetch error, 3 schema validation error, 4 model error, 5 other extraction error, 1 any other failure (including bad --schema paths).
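
When calling the CLI from a script, the exit codes above can be mapped back to readable labels. This is a minimal sketch: the helper names are hypothetical, and run_openextract assumes the openextract command is on PATH.

```python
import subprocess

# Labels copied from the exit-code list above.
EXIT_CODE_MEANINGS = {
    0: "success",
    1: "other failure",
    2: "URL fetch error",
    3: "schema validation error",
    4: "model error",
    5: "other extraction error",
}


def describe_exit_code(code):
    """Translate a CLI exit code into a label (hypothetical helper)."""
    return EXIT_CODE_MEANINGS.get(code, f"unknown exit code {code}")


def run_openextract(args):
    """Run the CLI with the given arguments and return (stdout, label)."""
    completed = subprocess.run(
        ["openextract", *args], capture_output=True, text=True
    )
    return completed.stdout, describe_exit_code(completed.returncode)


if __name__ == "__main__":
    print(describe_exit_code(3))
```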

Examples

Runnable scripts live in the examples/ directory. Each one takes the input path as the first argument and prints a JSON dump of the validated result:

Script                  What it does
invoice_extraction.py   PDF invoice -> structured line items
receipt_extraction.py   receipt image -> merchant, items, totals
meeting_notes.py        audio -> summary, decisions, action items

Run any example with uv once your provider credentials (e.g. OPENAI_API_KEY) are set:

uv run python examples/invoice_extraction.py ./invoices/q4.pdf

See the examples/ directory for the full source.

Error handling

from openextract import (
    extract,
    UrlFetchError,
    SchemaValidationError,
    ModelError,
    ExtractionError,
)

url = "https://example.com/document.pdf"  # PdfInfo is the schema from the Quick start

try:
    result = extract(schema=PdfInfo, model="openai:gpt-5", input_file=url)
except UrlFetchError:
    ...  # The URL could not be fetched
except SchemaValidationError:
    ...  # The model's output did not match your schema
except ModelError:
    ...  # The model provider returned an error
except ExtractionError:
    ...  # Any other extraction failure (base class)

All openextract exceptions inherit from ExtractionError, so you can catch it as a single fallback if you prefer.
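
Because the specific exceptions inherit from ExtractionError, the order of the except clauses matters: specific handlers go first and the base class last. The sketch below uses stand-in classes with the same names (defined locally, not imported from openextract, and assumed to subclass the base directly) to show how the dispatch behaves.

```python
class ExtractionError(Exception):
    """Stand-in for the documented base class."""


class UrlFetchError(ExtractionError):
    """Stand-in: the URL could not be fetched."""


class SchemaValidationError(ExtractionError):
    """Stand-in: the output did not match the schema."""


class ModelError(ExtractionError):
    """Stand-in: the provider returned an error."""


def classify(exc):
    """Raise exc and report which handler catches it."""
    try:
        raise exc
    except UrlFetchError:
        return "fetch"
    except SchemaValidationError:
        return "schema"
    except ModelError:
        return "model"
    except ExtractionError:
        return "other"  # base class last, so it only sees what is left


if __name__ == "__main__":
    print(classify(ModelError("provider unavailable")))
    print(classify(ExtractionError("something else")))
```

If the ExtractionError clause came first, it would catch every subclass and the more specific handlers would never run.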

API reference

extract(schema, model, input_file, instructions=None, *, media_type=None, max_retries=0, retry_backoff=1.0)

Argument        Type                     Description
schema          type[BaseModel]          A Pydantic model class describing the desired output shape.
model           str                      A pydantic-ai model identifier (e.g. "openai:gpt-5").
input_file      str | bytes | BinaryIO   A local file path, an https:// URL, raw bytes, or a binary file-like object with a .read() method.
instructions    str | None               Optional natural-language guidance for the model.
media_type      str | None               (keyword-only) MIME type. Required for bytes and file-like inputs; overrides the guessed type for str inputs when provided.
max_retries     int                      (keyword-only) Extra attempts after a ModelError. Defaults to 0 (no retry).
retry_backoff   float                    (keyword-only) Base seconds for exponential backoff with jitter between retries.

Returns an instance of schema.
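
The media_type rules in the table can be illustrated with the standard library. This sketch is an assumption about the behaviour, not openextract's code: resolve_media_type is a hypothetical helper, guessing is done with mimetypes, and raising when no type can be guessed is a choice made for the example.

```python
import mimetypes


def resolve_media_type(input_file, media_type=None):
    """Pick a MIME type following the documented rules (hypothetical helper)."""
    if media_type is not None:
        return media_type  # an explicit value always wins
    if isinstance(input_file, str):
        # Paths and URLs: guess from the file extension.
        guessed, _encoding = mimetypes.guess_type(input_file)
        if guessed is None:
            raise ValueError(f"cannot guess media type for {input_file!r}")
        return guessed
    # bytes and file-like objects carry no name to guess from.
    raise ValueError("media_type is required for bytes and file-like inputs")


if __name__ == "__main__":
    print(resolve_media_type("./reports/q4.pdf"))
    print(resolve_media_type(b"%PDF-1.7", media_type="application/pdf"))
```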

Security

URL fetching and SSRF

When input_file is an http:// or https:// URL, openextract fetches it directly. To reduce server-side request forgery risk when callers pass untrusted URLs, the fetcher refuses any URL whose host resolves to a non-public address — private RFC 1918 ranges, loopback, link-local (including the 169.254.169.254 cloud-metadata endpoint), multicast, and reserved ranges, for both IPv4 and IPv6 (including IPv4-mapped IPv6 like ::ffff:127.0.0.1). The host is re-validated at every redirect hop, so an attacker cannot use a public URL that redirects to an internal one.
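
The address classes listed above map onto checks in Python's ipaddress module. The sketch below is a minimal illustration of that kind of check, not openextract's fetcher: is_public_address and host_is_public are hypothetical helpers, and the sketch does not follow redirects.

```python
import ipaddress
import socket


def is_public_address(address):
    """Return True if the IP literal is not in a non-public range (hypothetical helper)."""
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 such as ::ffff:127.0.0.1 before checking.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def host_is_public(host, port=443):
    """Resolve host and require every returned address to be public."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return all(is_public_address(info[4][0]) for info in infos)


if __name__ == "__main__":
    for candidate in ("8.8.8.8", "10.0.0.1", "169.254.169.254", "::ffff:127.0.0.1"):
        print(candidate, is_public_address(candidate))
```

A redirect-aware fetcher would repeat a check like host_is_public for the host of every redirect target before following it.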

For workflows that legitimately need to fetch internal URLs (testing against localhost, on-prem services, etc.), set the OPENEXTRACT_ALLOW_PRIVATE_URLS environment variable to 1, true, or yes to disable the check. If you need a one-off fetch from an internal host without disabling validation globally, fetch the bytes with your own HTTP client and pass them to extract() as bytes/file-like with an explicit media_type.
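
Reading the opt-out flag could look like the sketch below. The variable name and accepted values come from the text above; private_urls_allowed is a hypothetical helper, and ignoring case and surrounding whitespace is an assumption.

```python
import os

TRUTHY_VALUES = {"1", "true", "yes"}


def private_urls_allowed(environ=None):
    """Return True when OPENEXTRACT_ALLOW_PRIVATE_URLS is set to an accepted value."""
    environ = os.environ if environ is None else environ
    value = environ.get("OPENEXTRACT_ALLOW_PRIVATE_URLS", "")
    # Assumption: matching ignores case and surrounding whitespace.
    return value.strip().lower() in TRUTHY_VALUES


if __name__ == "__main__":
    print(private_urls_allowed())
```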

Note: host validation is best-effort; it does not defend against DNS rebinding (where the host resolves to different IPs across calls). Treat URL-based extraction of untrusted input as a privileged operation.

Reporting vulnerabilities

See SECURITY.md.

Development

git clone https://github.com/Mellow-Artificial-Intelligence/openextract.git
cd openextract
uv sync --dev

uv run pytest --cov=openextract            # tests + coverage
uv run ruff check .                        # lint
uv run ruff format --check .               # format check

CI runs the test suite on every PR and fails if total coverage drops below 100%.

See CONTRIBUTING.md for the full contributor guide.

License

MIT © Cole McIntosh
