openextract

Extract structured data from documents, images, audio, and video using LLMs.

openextract turns any document, image, audio, or video file into a typed Pydantic model in a single function call. Point it at a local path or a URL, pass a schema, and get back a validated object you can use directly in your code.

Features

- Type-safe output. Define your shape with Pydantic; get back a validated instance.
- One function, many modalities. Documents (PDF, DOCX), images, audio, and video.
- Local files or URLs. Pass a path or an https:// URL; openextract handles fetching.
- Bring your own model. OpenAI, Google, and Ollama are supported out of the box via pydantic-ai.
- Explicit error handling. Distinct exceptions for URL fetch, schema validation, and model errors.
- 100% test coverage, enforced in CI.

Installation

uv add openextract

Or with pip:

pip install openextract

Requires Python 3.12+.

Quick start

from pydantic import BaseModel
from openextract import extract


class PdfInfo(BaseModel):
    summary: str
    language: str


result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="https://example.com/document.pdf",
    instructions="Return a two-sentence summary and the document's primary language.",
)

print(result.summary)
print(result.language)

result is a fully validated PdfInfo instance, not a dict or a string.

Usage

Local files

result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="./reports/q4.pdf",
)
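input_file takes either a local path or an https:// URL. How openextract tells the two apart internally is not described here, so the following is only a sketch of how a caller might pre-check the argument before spending a model call. classify_input and check_input are hypothetical helpers, not part of openextract.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(input_file: str) -> str:
    """Return 'url' for https:// URLs with a host, 'path' for anything else."""
    parsed = urlparse(input_file)
    if parsed.scheme == "https" and parsed.netloc:
        return "url"
    # Assumption: everything that is not an https:// URL is a local path.
    return "path"


def check_input(input_file: str) -> None:
    """Raise FileNotFoundError early if a local path does not point at a file."""
    if classify_input(input_file) == "path" and not Path(input_file).is_file():
        raise FileNotFoundError(input_file)
```

Calling check_input("./reports/q4.pdf") before extract surfaces a mistyped path immediately instead of after a provider round trip.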

Choosing a model

The model argument uses the pydantic-ai provider-prefix convention, provider:model-name:

| Provider | Example identifier |
| --- | --- |
| OpenAI | `openai:gpt-5` |
| Google | `google-gla:gemini-2.5-pro` |
| Ollama | `ollama:llama3` |

Set the corresponding provider credentials in your environment (e.g. OPENAI_API_KEY). openextract loads .env automatically.

Error handling

from openextract import (
    extract,
    UrlFetchError,
    SchemaValidationError,
    ModelError,
    ExtractionError,
)

url = "https://example.com/document.pdf"  # any https:// URL or local path

try:
    result = extract(schema=PdfInfo, model="openai:gpt-5", input_file=url)
except UrlFetchError:
    ...  # The URL could not be fetched
except SchemaValidationError:
    ...  # The model's output did not match your schema
except ModelError:
    ...  # The model provider returned an error
except ExtractionError:
    ...  # Any other extraction failure (base class)

All openextract exceptions inherit from ExtractionError, so you can catch it as a single fallback if you prefer.
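The single-fallback pattern can be sketched with stand-in classes that mirror the names above. The real classes come from openextract; the ones here exist only so the example runs on its own, and direct inheritance from ExtractionError is an assumption. describe and run are hypothetical helpers.

```python
# Stand-ins mirroring the documented hierarchy (not the library's own classes).
class ExtractionError(Exception):
    """Base class for every extraction failure."""


class UrlFetchError(ExtractionError):
    """The URL could not be fetched."""


class SchemaValidationError(ExtractionError):
    """The model's output did not match the schema."""


class ModelError(ExtractionError):
    """The model provider returned an error."""


def describe(exc: ExtractionError) -> str:
    """Map an exception to a short label, checking specific classes first."""
    if isinstance(exc, UrlFetchError):
        return "fetch failed"
    if isinstance(exc, SchemaValidationError):
        return "output did not match schema"
    if isinstance(exc, ModelError):
        return "provider error"
    return "extraction failed"


def run(task) -> str:
    """Call task() and report the outcome through one fallback handler."""
    try:
        task()
    except ExtractionError as exc:  # also catches every subclass
        return describe(exc)
    return "ok"
```

When separate except clauses are used instead, the base class has to come last; otherwise it shadows the more specific handlers.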

API reference

`extract(schema, model, input_file, instructions=None)`

| Argument | Type | Description |
| --- | --- | --- |
| `schema` | `type[BaseModel]` | A Pydantic model class describing the desired output shape. |
| `model` | `str` | A `pydantic-ai` model identifier (e.g. `"openai:gpt-5"`). |
| `input_file` | `str` | A local file path or an `https://` URL. |
| `instructions` | `str \| None` | Optional natural-language guidance for the model. |

Returns an instance of schema.

Development

git clone https://github.com/Mellow-Artificial-Intelligence/openextract.git
cd openextract
uv sync --dev

uv run pytest --cov=openextract            # tests + coverage
uv run ruff check .                        # lint
uv run ruff format --check .               # format check

CI runs the test suite on every PR and fails if total coverage drops below 100%.

See CONTRIBUTING.md for the full contributor guide.

License

MIT © Cole McIntosh

Release history

| Version | Released |
| --- | --- |
| 0.7.0 | May 23, 2026 |
| 0.6.0 | May 17, 2026 |
| 0.5.0 | May 16, 2026 |
| 0.4.0 (this version) | May 16, 2026 |
| 0.3.2 | May 5, 2026 |
| 0.3.1 | Apr 11, 2026 |
| 0.3.0 | Apr 11, 2026 |

Download files

Source Distribution

openextract-0.4.0.tar.gz (88.2 kB)

Uploaded May 16, 2026 Source

Built Distribution

openextract-0.4.0-py3-none-any.whl (6.2 kB)

Uploaded May 16, 2026 Python 3

File details

Details for the file openextract-0.4.0.tar.gz.

File metadata

Download URL: openextract-0.4.0.tar.gz
Upload date: May 16, 2026
Size: 88.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for openextract-0.4.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `de8dbaec845bb40631e7d17dab2d9dc25ce450722cebe3647eb14c86e984623b` |
| MD5 | `b3ae9c56fc07f08915f7b984a42654d8` |
| BLAKE2b-256 | `d437b20279c718d75bd53da960051ad114354c9246bb39a5cced4e9c073e5580` |

Provenance

The following attestation bundles were made for openextract-0.4.0.tar.gz:

Publisher: release.yml on Mellow-Artificial-Intelligence/openextract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: openextract-0.4.0.tar.gz
- Subject digest: de8dbaec845bb40631e7d17dab2d9dc25ce450722cebe3647eb14c86e984623b
- Sigstore transparency entry: 1554576254
- Sigstore integration time: May 16, 2026
Source repository:
- Permalink: Mellow-Artificial-Intelligence/openextract@1c7802c9b009da3eb62dde9e7fa1f588b0c0f6a1
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Mellow-Artificial-Intelligence
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1c7802c9b009da3eb62dde9e7fa1f588b0c0f6a1
- Trigger Event: push

File details

Details for the file openextract-0.4.0-py3-none-any.whl.

File metadata

Download URL: openextract-0.4.0-py3-none-any.whl
Upload date: May 16, 2026
Size: 6.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for openextract-0.4.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `ea803a39aeeb072173b388b58afb42f3881a720ef3eb8211fa314e71870859ac` |
| MD5 | `3f22a1219d5312681e2f288c07a42074` |
| BLAKE2b-256 | `05c67ed5914a59b421fd8d3a488d45bed5a29c9da3a85448eb756f804bf6ec13` |

Provenance

The following attestation bundles were made for openextract-0.4.0-py3-none-any.whl:

Publisher: release.yml on Mellow-Artificial-Intelligence/openextract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: openextract-0.4.0-py3-none-any.whl
- Subject digest: ea803a39aeeb072173b388b58afb42f3881a720ef3eb8211fa314e71870859ac
- Sigstore transparency entry: 1554576264
- Sigstore integration time: May 16, 2026
Source repository:
- Permalink: Mellow-Artificial-Intelligence/openextract@1c7802c9b009da3eb62dde9e7fa1f588b0c0f6a1
- Branch / Tag: refs/heads/main
- Owner: https://github.com/Mellow-Artificial-Intelligence
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@1c7802c9b009da3eb62dde9e7fa1f588b0c0f6a1
- Trigger Event: push
