openextract

Extract structured data from documents, images, audio, and video using LLMs.


openextract turns any document, image, audio, or video file into a typed Pydantic model in a single function call. Point it at a local path or a URL, pass a schema, and get back a validated object you can use directly in your code.

Features

  • Type-safe output. Define your shape with Pydantic; get back a validated instance.
  • One function, many modalities. Documents (PDF, DOCX), images, audio, and video.
  • Local files or URLs. Pass a path or an https:// URL — openextract handles fetching.
  • Bring your own model. OpenAI, Google, and Ollama supported out of the box via pydantic-ai.
  • Explicit error handling. Distinct exceptions for URL fetch, schema validation, and model errors.
  • 100% test coverage, enforced in CI.

Installation

uv add openextract

Or with pip:

pip install openextract

Requires Python 3.12+.

Quick start

from pydantic import BaseModel
from openextract import extract


class PdfInfo(BaseModel):
    summary: str
    language: str


result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="https://example.com/document.pdf",
    instructions="Return a two-sentence summary and the document's primary language.",
)

print(result.summary)
print(result.language)

result is a fully-validated PdfInfo instance — not a dict, not a string.
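
Schemas are not limited to flat string fields. The sketch below uses only Pydantic and does not call openextract: it defines a nested schema with field constraints, then runs model_validate on a hand-written dictionary to illustrate the kind of validation a returned instance has already passed. The Invoice and LineItem names and the sample data are hypothetical, and whether openextract forwards field descriptions to the model is an assumption that is not confirmed here.

```python
from pydantic import BaseModel, Field, ValidationError


class LineItem(BaseModel):
    description: str
    quantity: int = Field(ge=1, description="Number of units billed")
    unit_price: float = Field(ge=0, description="Price per unit")


class Invoice(BaseModel):
    vendor: str
    currency: str = Field(min_length=3, max_length=3, description="ISO 4217 code")
    items: list[LineItem]

    def total(self) -> float:
        """Sum of quantity * unit_price over all line items."""
        return sum(item.quantity * item.unit_price for item in self.items)


# Hand-written stand-in for model output; no LLM call is made here.
sample = {
    "vendor": "Acme Corp",
    "currency": "USD",
    "items": [{"description": "Widget", "quantity": 3, "unit_price": 2.5}],
}
invoice = Invoice.model_validate(sample)

# Data that violates the schema raises ValidationError instead of passing silently.
try:
    Invoice.model_validate({**sample, "currency": "dollars"})
    rejected = False
except ValidationError:
    rejected = True
```

A class like Invoice would be passed as schema= in the same way as PdfInfo above.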

Usage

Local files

result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="./reports/q4.pdf",
)

Choosing a model

model follows the pydantic-ai provider prefix convention:

Provider  Example identifier
OpenAI    openai:gpt-5
Google    google-gla:gemini-2.5-pro
Ollama    ollama:llama3

Set the corresponding provider credentials in your environment (e.g. OPENAI_API_KEY). openextract loads .env automatically.
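
A missing credential usually surfaces only when the provider is first called, so it can help to check up front. The sketch below splits the provider prefix off a model identifier and reports which required variables are unset. provider_of, missing_credentials, and REQUIRED_ENV are hypothetical helpers, not openextract APIs, and only the OPENAI_API_KEY entry comes from this README; entries for other providers would need to be taken from their own documentation.

```python
import os
from collections.abc import Mapping

# Only the OpenAI entry comes from the README; add others from each provider's docs.
REQUIRED_ENV: dict[str, tuple[str, ...]] = {
    "openai": ("OPENAI_API_KEY",),
}


def provider_of(model_id: str) -> str:
    """Return the provider prefix of an identifier such as 'openai:gpt-5'."""
    provider, sep, name = model_id.partition(":")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>:<model>', got {model_id!r}")
    return provider


def missing_credentials(
    model_id: str, env: Mapping[str, str] = os.environ
) -> list[str]:
    """List required variables that are unset or empty for the model's provider."""
    required = REQUIRED_ENV.get(provider_of(model_id), ())
    return [name for name in required if not env.get(name)]
```

Because openextract loads .env itself, a variable defined only in .env may not appear in os.environ until the library has loaded it; when that happens is not stated here, so treat a non-empty result as a hint rather than a definitive failure.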

Error handling

from openextract import (
    extract,
    UrlFetchError,
    SchemaValidationError,
    ModelError,
    ExtractionError,
)

url = "https://example.com/document.pdf"

try:
    result = extract(schema=PdfInfo, model="openai:gpt-5", input_file=url)
except UrlFetchError:
    ...  # The URL could not be fetched
except SchemaValidationError:
    ...  # The model's output did not match your schema
except ModelError:
    ...  # The model provider returned an error
except ExtractionError:
    ...  # Any other extraction failure (base class)

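All openextract exceptions inherit from ExtractionError, so you can catch it alone as a single fallback if you prefer. When you catch it alongside the specific exceptions, list it last, because Python runs the first except clause that matches.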
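
Having distinct exception types makes it possible to retry only the failures that might be transient. The sketch below is a generic retry wrapper with exponential backoff. with_retries is a hypothetical helper, not part of openextract, and treating UrlFetchError and ModelError as retryable is an assumption about how those errors behave in practice.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retry_on: tuple[type[Exception], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Run call(), retrying the listed exception types with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error unchanged
            time.sleep(base_delay * 2**attempt)
    raise AssertionError("unreachable")


# Hypothetical usage with the names from the example above:
# result = with_retries(
#     lambda: extract(schema=PdfInfo, model="openai:gpt-5", input_file=url),
#     retry_on=(UrlFetchError, ModelError),
# )
```

Exception types not listed in retry_on propagate immediately, so a SchemaValidationError would still reach the caller on the first failure.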

API reference

extract(schema, model, input_file, instructions=None)

Argument      Type             Description
schema        type[BaseModel]  A Pydantic model class describing the desired output shape.
model         str              A pydantic-ai model identifier (e.g. "openai:gpt-5").
input_file    str              A local file path or an https:// URL.
instructions  str | None       Optional natural-language guidance for the model.

Returns an instance of schema.
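
Since input_file accepts either a local path or an https:// URL, a caller may want to check which one a string is before spending a model call on it. The sketch below is one way to do that with the standard library. classify_input is a hypothetical helper and does not describe how openextract itself distinguishes the two; rejecting http:// and ftp:// is an assumption based on the table mentioning only https://.

```python
from pathlib import Path
from urllib.parse import urlsplit


def classify_input(input_file: str) -> str:
    """Return 'url' for an https:// URL or 'path' for an existing local file."""
    parts = urlsplit(input_file)
    if parts.scheme == "https" and parts.netloc:
        return "url"
    if parts.scheme in ("http", "ftp"):
        # Assumption: only https:// is documented, so other schemes are rejected.
        raise ValueError(f"only https:// URLs are documented, got {input_file!r}")
    if Path(input_file).is_file():
        return "path"
    raise FileNotFoundError(input_file)
```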

Development

git clone https://github.com/Mellow-Artificial-Intelligence/openextract.git
cd openextract
uv sync --dev

uv run pytest --cov=openextract            # tests + coverage
uv run ruff check .                        # lint
uv run ruff format --check .               # format check

CI runs the test suite on every PR and fails if total coverage drops below 100%.

See CONTRIBUTING.md for the full contributor guide.

License

MIT © Cole McIntosh
