openextract
Extract structured data from documents, images, audio, and video using LLMs.
Documentation · PyPI · Changelog · Issues
openextract turns any document, image, audio, or video file into a typed Pydantic model in a single function call. Point it at a local path or a URL, pass a schema, and get back a validated object you can use directly in your code.
Features
- Type-safe output. Define your shape with Pydantic; get back a validated instance.
- One function, many modalities. Documents (PDF, DOCX), images, audio, and video.
- Local files or URLs. Pass a path or an https:// URL; openextract handles fetching.
- Bring your own model. OpenAI, Google, and Ollama supported out of the box via pydantic-ai.
- Explicit error handling. Distinct exceptions for URL fetch, schema validation, and model errors.
- 100% test coverage, enforced in CI.
Installation
uv add openextract
Or with pip:
pip install openextract
Requires Python 3.12+.
Quick start
from pydantic import BaseModel
from openextract import extract

class PdfInfo(BaseModel):
    summary: str
    language: str

result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="https://example.com/document.pdf",
    instructions="Return a two-sentence summary and the document's primary language.",
)
print(result.summary)
print(result.language)
result is a fully validated PdfInfo instance, not a dict or a string.
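To make "validated" concrete, here is a minimal sketch that uses Pydantic directly, without calling openextract or any model. The two JSON strings are made-up stand-ins for model output, and how openextract runs validation internally is not shown in this README, so treat this only as an illustration of what a Pydantic schema accepts and rejects.

```python
from pydantic import BaseModel, ValidationError


class PdfInfo(BaseModel):
    summary: str
    language: str


# Hypothetical raw model output, invented for this illustration.
good_payload = '{"summary": "Quarterly results improved.", "language": "en"}'
bad_payload = '{"summary": "Missing the language field."}'

# A payload that matches the schema becomes a typed PdfInfo instance.
info = PdfInfo.model_validate_json(good_payload)
print(type(info).__name__, info.language)

# A payload that does not match raises ValidationError instead.
try:
    PdfInfo.model_validate_json(bad_payload)
except ValidationError as exc:
    # Each error entry records the location of the offending field.
    failed_fields = [error["loc"][0] for error in exc.errors()]
    print("validation failed for:", failed_fields)
```

Attribute access such as info.language is what editors and type checkers can follow, which a plain dict would not give you.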
Usage
Local files
result = extract(
    schema=PdfInfo,
    model="openai:gpt-5",
    input_file="./reports/q4.pdf",
)
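If your own code needs to know which kind of input it is about to pass, for example to check that a local file exists first, a small helper along these lines may be useful. Both function names are hypothetical and this is not how openextract itself tells paths from URLs; it only follows the README's statement that input_file is a local path or an https:// URL.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(input_file: str) -> str:
    """Return "url" for https:// inputs and "path" for everything else."""
    scheme = urlparse(input_file).scheme.lower()
    return "url" if scheme == "https" else "path"


def resolve_local(input_file: str) -> Path:
    """Expand ~ and make a local path absolute; the file need not exist."""
    return Path(input_file).expanduser().resolve()


for candidate in ("https://example.com/document.pdf", "./reports/q4.pdf"):
    kind = classify_input(candidate)
    target = candidate if kind == "url" else resolve_local(candidate)
    print(kind, target)
```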
Choosing a model
model follows the pydantic-ai provider prefix convention:
| Provider | Example identifier |
|---|---|
| OpenAI | openai:gpt-5 |
| Google | google-gla:gemini-2.5-pro |
| Ollama | ollama:llama3 |
Set the corresponding provider credentials in your environment (e.g. OPENAI_API_KEY). openextract loads .env automatically.
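A pre-flight check can catch a malformed identifier or a missing key before any request is made. The sketch below is an assumption-laden illustration: split_model_id, missing_credential, and REQUIRED_ENV are hypothetical names, and only OPENAI_API_KEY is taken from this README. Variables for other providers should come from their own documentation.

```python
import os
from typing import Mapping, Optional

# Only OPENAI_API_KEY is named in this README; extend this mapping yourself
# for other providers. It is an assumption, not part of openextract.
REQUIRED_ENV = {"openai": "OPENAI_API_KEY"}


def split_model_id(model: str) -> tuple[str, str]:
    """Split "provider:name" at the first colon."""
    provider, sep, name = model.partition(":")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider:model', got {model!r}")
    return provider, name


def missing_credential(
    model: str, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return the name of an unset required variable, or None if all is well."""
    env = os.environ if env is None else env
    provider, _ = split_model_id(model)
    variable = REQUIRED_ENV.get(provider)
    if variable is not None and not env.get(variable):
        return variable
    return None


print(split_model_id("google-gla:gemini-2.5-pro"))
print(missing_credential("openai:gpt-5", env={}))
```

Note that this check reads the process environment as it is when called, so values that live only in a .env file are visible to it only after that file has been loaded.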
Error handling
from openextract import (
    extract,
    UrlFetchError,
    SchemaValidationError,
    ModelError,
    ExtractionError,
)

try:
    result = extract(schema=PdfInfo, model="openai:gpt-5", input_file=url)
except UrlFetchError:
    ...  # The URL could not be fetched
except SchemaValidationError:
    ...  # The model's output did not match your schema
except ModelError:
    ...  # The model provider returned an error
except ExtractionError:
    ...  # Any other extraction failure (base class)
All openextract exceptions inherit from ExtractionError, so you can catch it as a single fallback if you prefer.
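The single-fallback pattern can be sketched without the library installed. The classes below are stand-ins that copy only the names and the inheritance relationship described above; describe and run_with_fallback are hypothetical helpers, not part of openextract.

```python
# Stand-in classes mirroring the exception names exported by openextract,
# so this sketch runs without the library or a model provider.
class ExtractionError(Exception):
    """Base class for every extraction failure."""


class UrlFetchError(ExtractionError):
    """The URL could not be fetched."""


class SchemaValidationError(ExtractionError):
    """The model's output did not match the schema."""


class ModelError(ExtractionError):
    """The model provider returned an error."""


def describe(exc: ExtractionError) -> str:
    """Label a failure by checking the specific subclasses first."""
    if isinstance(exc, UrlFetchError):
        return "fetch"
    if isinstance(exc, SchemaValidationError):
        return "schema"
    if isinstance(exc, ModelError):
        return "model"
    return "other"


def run_with_fallback(task, default=None):
    """Run task(); on any extraction failure, log it and return default."""
    try:
        return task()
    except ExtractionError as exc:  # one clause covers every subclass
        print(f"extraction failed ({describe(exc)}): {exc}")
        return default


def failing_task():
    # Simulated failure, standing in for a real extract(...) call.
    raise ModelError("provider returned an error")


result = run_with_fallback(failing_task, default="n/a")
```

Because Python matches except clauses in order, the base class belongs last whenever it is combined with the specific ones, as in the example above.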
API reference
extract(schema, model, input_file, instructions=None)
| Argument | Type | Description |
|---|---|---|
| schema | type[BaseModel] | A Pydantic model class describing the desired output shape. |
| model | str | A pydantic-ai model identifier (e.g. "openai:gpt-5"). |
| input_file | str | A local file path or an https:// URL. |
| instructions | str \| None | Optional natural-language guidance for the model. |
Returns an instance of schema.
Development
git clone https://github.com/Mellow-Artificial-Intelligence/openextract.git
cd openextract
uv sync --dev
uv run pytest --cov=openextract # tests + coverage
uv run ruff check . # lint
uv run ruff format --check . # format check
CI runs the test suite on every PR and fails if total coverage drops below 100%.
See CONTRIBUTING.md for the full contributor guide.
License
MIT © Cole McIntosh