Structured extraction framework over Pydantic AI Agent with JSON Schema & Pydantic model outputs.
nextract
nextract is a small, pragmatic framework for structured data extraction from files using the Pydantic AI Agent. It focuses on clean boundaries, strong typing, and JSON Schema/Pydantic-driven outputs—while keeping file handling simple and predictable.
Scope of this build
- Uses Pydantic AI Agent only.
- Takes local file paths and feeds content to the Agent:
  - Text files are read as text and wrapped in delimiters.
  - PDFs and images are attached as binary bytes.
  - Office docs (.doc/.docx/.ppt/.pptx) are converted to PDF first when a converter is available.
  - Excel files: .xlsx is extracted to TSV (in-process); .xls attempts CSV via LibreOffice/unoconv.
- No OCR, no large-file chunking yet (TODO).
- Returns a dict by default, or a Pydantic model instance if you pass a model and request it.
- Tracing via structlog; usage & cost estimation from Agent usage + a simple model pricing table.
Features
- Structured extraction for small files with:
  - JSON Schema (output as dict[str, Any]), or
  - Pydantic v2 models (output as dict by default; optional model instance).
- Pydantic AI Agent integration:
  - Raw binary attachments for PDFs/images.
  - StructuredDict for JSON Schema outputs.
  - Usage metrics retrieved from the run.
- Batch mode runs one file per Agent call in parallel.
- Cost estimation via a simple pricing map (optional).
- Structlog logging to console.
- ZIP files: extract to /tmp and process each contained file “as-is”.
What’s in / out of scope
Supported file types
- Read Text Directly: .txt, .md, .csv, .tsv, .xls, .xlsx, .json, .xml, .yaml, .yml, .html, .htm
- Upload Directly (binary): Images (.png, .jpg, .jpeg, .webp, .gif, .bmp, .tiff), PDF (.pdf)
- ZIP: Extracted to /tmp/nextract-zip-<name> and each file inside is processed “as-is”.
- Accepted as binary, converted to PDF before uploading to LLMs: .doc, .docx, .ppt, .pptx
Not supported for now
- Audio/Video processing
- OCR for scanned PDFs/images
- Large-file chunking & merging (design is stubbed; not implemented)
Installation
# Install from PyPI
pip install nextract
# Or install from source for development
git clone https://github.com/your-username/nextract.git
cd nextract
pip install -e .[dev]
Python: 3.10+
Quick Start
JSON Schema output (default dict)
from nextract import extract

schema = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"}
    },
    "required": ["invoice_number", "total"]
}

res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=schema,
    user_prompt="Extract the invoice fields exactly as defined.",
    include_extra=True,  # adds a top-level `extra` bag for helpful unmodeled fields
)
print(res["data"])    # dict[str, Any] matching your schema (+ optional `extra`)
print(res["report"])  # model, usage, cost_estimate_usd, warnings
Pydantic model output
from pydantic import BaseModel
from nextract import extract

class Invoice(BaseModel):
    invoice_number: str
    date: str | None = None
    total: float

res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=Invoice,
    user_prompt="Extract the invoice fields.",
    # include_extra is ignored for Pydantic model mode
)
# Default behavior returns a dict
print(res["data"])  # -> {'invoice_number': '...', 'date': '...', 'total': ...}
To get the Pydantic model instance instead of a dict:
res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=Invoice,
    user_prompt="Extract the invoice fields.",
    return_pydantic=True,
)
invoice_obj = res["data"]  # -> Invoice instance
Batch extraction (parallel)
Process each file independently (one Agent call per file):
from nextract import batch_extract

schema = {
    "title": "DocSummary",
    "type": "object",
    "properties": {"title": {"type": "string"}, "summary": {"type": "string"}},
    "required": ["title"]
}

res = batch_extract(
    batch=["./a.pdf", "./b.png", "./c.txt"],  # or [["./a1.pdf","./a2.pdf"], ["./b1.pdf"]] to group
    schema_or_model=schema,
    user_prompt="Summarize each document with title + summary.",
    include_extra=False,
    max_concurrency=4,
)
# result is a dict keyed by the first file in each item
print(res.keys())  # -> {"./a.pdf", "./b.png", "./c.txt"}
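The keying rule in the comment above can be pictured with a small standalone sketch. The helper name batch_keys is hypothetical and not part of nextract; it only illustrates how a batch mixing single paths and grouped paths would map to result keys, assuming the key is the first file of each item as stated.

```python
from typing import Sequence, Union

# A batch item is either one path or a group of paths processed together.
BatchItem = Union[str, Sequence[str]]


def batch_keys(batch: Sequence[BatchItem]) -> list[str]:
    """Return one result key per batch item: the first file path in the item."""
    keys = []
    for item in batch:
        if isinstance(item, str):
            keys.append(item)  # a single path is its own key
        else:
            if not item:
                raise ValueError("empty group in batch")
            keys.append(item[0])  # a group is keyed by its first path
    return keys


print(batch_keys(["./a.pdf", ["./b1.pdf", "./b2.pdf"]]))
```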
CLI
# JSON Schema
nextract extract ./invoice.pdf \
  --schema ./invoice.schema.json \
  --prompt "Extract the invoice fields." \
  --include-extra

# Pydantic model (module:Class or module.Class)
nextract extract ./invoice.pdf \
  --pydantic-model mypkg.models:Invoice

# Batch (parallel): schema mode
nextract batch ./a.pdf ./b.png ./c.txt \
  --schema ./summary.schema.json \
  --prompt "Summarize each document." \
  --max-concurrency 4
Run nextract --help, nextract extract --help, or nextract batch --help for more.
Configuration
Environment variables
| Variable | Default | Description |
|---|---|---|
| NEXTRACT_MODEL | openai:gpt-4o | Pydantic AI model string (provider:model-id). |
| NEXTRACT_MAX_CONCURRENCY | 4 | Max parallel Agent calls in batch_extract. |
| NEXTRACT_MAX_RUN_RETRIES | 5 | Max retry attempts around Agent runs. |
| NEXTRACT_PER_CALL_TIMEOUT_SECS | 120 | Per-call timeout in seconds. |
| NEXTRACT_PRICING | (unset) | JSON map for cost estimation (see below). |
Also set the provider credentials that Pydantic AI expects for your chosen provider. For example, for OpenAI:
OPENAI_API_KEY=...
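As an illustration of how the variables in the table could be read, here is a minimal sketch using only the standard library. The EnvSettings class and load_settings function are hypothetical names for this example; they are not nextract's RuntimeConfig, whose actual fields are not shown here. Only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvSettings:
    """Hypothetical container mirroring the documented environment variables."""
    model: str
    max_concurrency: int
    max_run_retries: int
    per_call_timeout_secs: float
    pricing_json: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> EnvSettings:
    """Read the documented variables, falling back to the documented defaults."""
    return EnvSettings(
        model=env.get("NEXTRACT_MODEL", "openai:gpt-4o"),
        max_concurrency=int(env.get("NEXTRACT_MAX_CONCURRENCY", "4")),
        max_run_retries=int(env.get("NEXTRACT_MAX_RUN_RETRIES", "5")),
        per_call_timeout_secs=float(env.get("NEXTRACT_PER_CALL_TIMEOUT_SECS", "120")),
        pricing_json=env.get("NEXTRACT_PRICING"),  # None when unset
    )


print(load_settings({"NEXTRACT_MAX_CONCURRENCY": "8"}))
```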
Pricing configuration
NEXTRACT_PRICING expects a JSON string like:
{
  "openai:gpt-4o": { "input_per_1k": 0.005, "output_per_1k": 0.015 },
  "openai:gpt-4.1-mini": { "input_per_1k": 0.003, "output_per_1k": 0.006 }
}
This is used to compute cost_estimate_usd from the Agent’s token usage. If the current model is missing in this map, cost will be null.
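The arithmetic behind such an estimate can be sketched as follows. This is an assumption about how the map is applied (a linear rate per 1,000 tokens, as the key names suggest), not nextract's actual pricing code; estimate_cost_usd is a hypothetical helper.

```python
import json
from typing import Optional


def estimate_cost_usd(
    pricing_json: Optional[str],
    model: str,
    input_tokens: int,
    output_tokens: int,
) -> Optional[float]:
    """Estimate cost from a NEXTRACT_PRICING-style JSON string; None if the model is missing."""
    pricing = json.loads(pricing_json) if pricing_json else {}
    rates = pricing.get(model)
    if rates is None:
        return None  # mirrors the documented "cost will be null" behavior
    # Assumed formula: tokens / 1000 * per-1k rate, summed over input and output.
    return (input_tokens / 1000) * rates["input_per_1k"] + (
        output_tokens / 1000
    ) * rates["output_per_1k"]


PRICING = '{"openai:gpt-4o": {"input_per_1k": 0.005, "output_per_1k": 0.015}}'
print(estimate_cost_usd(PRICING, "openai:gpt-4o", 123, 456))
print(estimate_cost_usd(PRICING, "openai:unknown-model", 123, 456))
```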
Model selection
By default, nextract uses openai:gpt-4o (vision-capable). You can override per process:
export NEXTRACT_MODEL="provider:model-id"
Or in Python, construct and pass your own RuntimeConfig (advanced—optional).
How it works
- You pass file paths (single or multiple).
- nextract prepares content:
  - Textual files are read and injected as-is into the prompt, wrapped by:
      --- BEGIN FILE: <name> (mime) ---
      <file contents>
      --- END FILE: <name> ---
  - Binary files (PDFs, images, Office docs, others) are attached as binary parts using Pydantic AI’s BinaryContent.
- An Agent is created with:
  - a system prompt that instructs strict, schema-aligned extraction,
  - an output_type of either:
    - StructuredDict(JSON Schema) → outputs a dict, or
    - Your Pydantic Model → outputs a model instance (dumped to dict by default).
- For JSON Schema mode, a jsonschema validator runs as an output validator. On failure, the Agent is asked to retry briefly (limited rounds).
- The result is validated again before returning. You get: data (dict by default), and report with usage and optional cost estimate.
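The delimiter wrapping described above for textual files can be sketched in a few lines. The function name wrap_text_file and the use of mimetypes to guess the MIME label are assumptions for illustration; only the BEGIN/END delimiter format comes from this README.

```python
import mimetypes
from typing import Optional


def wrap_text_file(name: str, contents: str, mime: Optional[str] = None) -> str:
    """Wrap file contents in BEGIN/END FILE delimiters for prompt injection."""
    if mime is None:
        guessed, _ = mimetypes.guess_type(name)
        mime = guessed or "text/plain"  # assumed fallback when the type is unknown
    return (
        f"--- BEGIN FILE: {name} ({mime}) ---\n"
        f"{contents}\n"
        f"--- END FILE: {name} ---"
    )


print(wrap_text_file("notes.txt", "Invoice INV-001, total 123.45", mime="text/plain"))
```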
File type handling
- Text: .txt, .md, .json, .yaml, .yml, .xml, .csv, .tsv, .html, .htm → Read as text (UTF-8 with fallback), injected verbatim with file delimiters.
- Excel: .xlsx (parsed to TSV via in-process XML), .xls (CSV via CLI if available; else raw fallback) → Read as text and injected like other textual files. Best-effort extraction (no styling/formatting).
- PDF / Images: .pdf, .png, .jpg, .jpeg, .webp, .gif, .bmp, .tiff → Attached as binary bytes for the model (vision-capable models recommended).
- Office docs: .doc, .docx, .ppt, .pptx → Converted to PDF via LibreOffice/soffice or unoconv if available; on failure, attached as original binary.
- ZIP: Extracted to /tmp/nextract-zip-<name>; each inner file is processed as above. No nested recursion.
No OCR: Scanned PDFs/images are not OCR’d. If the model can’t read them natively, fields may be missing.
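The routing above amounts to a lookup by file extension. The sketch below restates it as code; the classify function and the category labels are hypothetical and only mirror the list in this section, not nextract's internal mimetypes_map module.

```python
from pathlib import Path

# Extension groups copied from the handling list above.
TEXT_EXTS = {".txt", ".md", ".json", ".yaml", ".yml", ".xml", ".csv", ".tsv", ".html", ".htm"}
EXCEL_EXTS = {".xlsx", ".xls"}
BINARY_EXTS = {".pdf", ".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp", ".tiff"}
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx"}


def classify(path: str) -> str:
    """Map a file path to a handling category based on its (case-insensitive) extension."""
    ext = Path(path).suffix.lower()
    if ext in TEXT_EXTS:
        return "text"    # read as text and wrapped in delimiters
    if ext in EXCEL_EXTS:
        return "excel"   # converted to TSV/CSV text, best effort
    if ext in BINARY_EXTS:
        return "binary"  # attached as raw bytes
    if ext in OFFICE_EXTS:
        return "office"  # converted to PDF when a converter is available
    if ext == ".zip":
        return "zip"     # extracted, then each inner file is classified
    return "other"       # not covered by the list above


for name in ["invoice.PDF", "sheet.xlsx", "deck.pptx", "bundle.zip", "notes.md"]:
    print(name, "->", classify(name))
```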
Office → PDF conversion
nextract attempts to convert .doc/.docx/.ppt/.pptx to PDF using system tools. These are external dependencies and are not installed via pip.
- Preferred: soffice (LibreOffice) in headless mode.
- Fallback: unoconv (uses LibreOffice UNO).
Installation hints:
- macOS (Homebrew): brew install --cask libreoffice
  - Ensure soffice is on your PATH. If not, you can symlink: ln -s "/Applications/LibreOffice.app/Contents/MacOS/soffice" /usr/local/bin/soffice (adjust for Apple Silicon/Homebrew prefix)
- Ubuntu/Debian: sudo apt-get update && sudo apt-get install -y libreoffice
  - Optional: sudo apt-get install -y unoconv
- Fedora/CentOS/RHEL: sudo dnf install -y libreoffice (or yum on older systems)
  - Optional: install unoconv from your distro repos if available.
- Windows:
  - Install LibreOffice from https://www.libreoffice.org/download/ and add soffice.exe to your PATH.
If neither tool is found, nextract logs a warning and falls back to attaching the original Office binary.
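The preference order described in this section (soffice first, then unoconv, else no conversion) could look roughly like the sketch below. It only builds the command line and does not run it; pdf_conversion_command is a hypothetical helper, and the exact invocation nextract uses is not shown in this README. The flags are the standard LibreOffice and unoconv options for PDF export.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def pdf_conversion_command(
    src: str,
    outdir: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[list[str]]:
    """Return a command that converts src to PDF, or None if no converter is on PATH."""
    soffice = which("soffice")
    if soffice:
        # LibreOffice headless conversion writes <stem>.pdf into outdir.
        return [soffice, "--headless", "--convert-to", "pdf", "--outdir", outdir, src]
    unoconv = which("unoconv")
    if unoconv:
        out_pdf = str(Path(outdir) / (Path(src).stem + ".pdf"))
        return [unoconv, "-f", "pdf", "-o", out_pdf, src]
    return None  # caller falls back to attaching the original binary


cmd = pdf_conversion_command("./docs/report.docx", "/tmp/converted")
print(cmd if cmd else "no converter found; attach the original file")
```

A caller would pass the returned list to subprocess.run and treat a non-zero exit code the same way as a missing converter.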
Examples & Few-shot Hints
You can supply examples to guide the model:
Programmatic (examples argument):
- Output-only examples: list[dict]
- Paired input/output: list[tuple[str | None, dict]]
CLI (--examples JSON file):
- Output-only examples:
  [ { "invoice_number": "INV-001", "total": 123.45 } ]
- Paired input/output (use a two-element array):
  [ ["Item: Widget A, Total: 123.45", { "invoice_number": "INV-001", "total": 123.45 }] ]
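Both example shapes can be reduced to one internal form of (input, output) pairs. The sketch below shows one way to do that; normalize_examples is a hypothetical helper written for this illustration and may differ from how nextract's prompt builder treats examples.

```python
import json
from typing import Any, Optional

Example = tuple[Optional[str], dict[str, Any]]


def normalize_examples(raw: list) -> list[Example]:
    """Turn output-only dicts and [input, output] pairs into (input, output) tuples."""
    pairs: list[Example] = []
    for ex in raw:
        if isinstance(ex, dict):
            pairs.append((None, ex))  # output-only example has no input text
        elif (
            isinstance(ex, (list, tuple))
            and len(ex) == 2
            and (ex[0] is None or isinstance(ex[0], str))
            and isinstance(ex[1], dict)
        ):
            pairs.append((ex[0], ex[1]))
        else:
            raise ValueError(f"unsupported example shape: {ex!r}")
    return pairs


output_only = json.loads('[ { "invoice_number": "INV-001", "total": 123.45 } ]')
paired = json.loads(
    '[ ["Item: Widget A, Total: 123.45", { "invoice_number": "INV-001", "total": 123.45 }] ]'
)
print(normalize_examples(output_only))
print(normalize_examples(paired))
```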
“Extra” fields (JSON Schema mode):
If you pass include_extra=True, your schema is augmented with a top-level:
"extra": { "type": "object", "additionalProperties": true }
so the model can place relevant-but-unspecified fields there.
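A minimal sketch of that augmentation, assuming the extra property is added under the schema's top-level properties and the caller's schema is left untouched. The function name add_extra_bag is hypothetical.

```python
import copy


def add_extra_bag(schema: dict) -> dict:
    """Return a copy of the schema with an open-ended top-level `extra` object property."""
    augmented = copy.deepcopy(schema)  # do not mutate the caller's schema
    properties = augmented.setdefault("properties", {})
    # Keep a user-defined `extra` property if one already exists.
    properties.setdefault("extra", {"type": "object", "additionalProperties": True})
    return augmented


invoice_schema = {
    "title": "Invoice",
    "type": "object",
    "properties": {"invoice_number": {"type": "string"}, "total": {"type": "number"}},
    "required": ["invoice_number", "total"],
}
print(add_extra_bag(invoice_schema)["properties"]["extra"])
```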
Return shape
All entry points return a dict with this structure:
{
  "data": { /* your structured result (dict by default) */ },
  "report": {
    "model": "provider:model-id",
    "files": ["..."],
    "usage": {
      "requests": 1,
      "tool_calls": 0,
      "input_tokens": 123,
      "output_tokens": 456,
      "details": { /* provider-dependent */ }
    },
    "cost_estimate_usd": 0.0123,
    "warnings": []
  }
}
- In Pydantic model mode, data is still a dict unless you passed return_pydantic=True, in which case it’s the model instance.
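To show how a caller might consume this structure, here is a small sketch that formats a one-line summary from a result dict. The summarize function is hypothetical, and the sample values are the placeholder numbers from the structure above, not real output.

```python
def summarize(result: dict) -> str:
    """Build a one-line summary from the documented `report` fields."""
    report = result["report"]
    usage = report["usage"]
    cost = report.get("cost_estimate_usd")
    cost_text = "n/a" if cost is None else f"${cost:.4f}"  # cost is null without pricing
    return (
        f"{report['model']}: {usage['input_tokens']} in / "
        f"{usage['output_tokens']} out tokens, cost {cost_text}, "
        f"{len(report['warnings'])} warning(s)"
    )


sample = {
    "data": {"invoice_number": "INV-001", "total": 123.45},
    "report": {
        "model": "provider:model-id",
        "files": ["..."],
        "usage": {"requests": 1, "tool_calls": 0, "input_tokens": 123, "output_tokens": 456, "details": {}},
        "cost_estimate_usd": 0.0123,
        "warnings": [],
    },
}
print(summarize(sample))
```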
Logging & Tracing
- Uses structlog; logs are JSON-formatted to stdout.
- Each extraction logs: model, files, usage, warnings, and cost estimate.
- You can set up your own logging before calling extract/batch_extract. By default, the library initializes logging for you (toggle via setup_logs=False).
Retries, Rate Limits & Timeouts
- Each Agent call is wrapped with exponential backoff (max attempts from NEXTRACT_MAX_RUN_RETRIES).
- Timeout per call is NEXTRACT_PER_CALL_TIMEOUT_SECS (default 120s).
- In batch mode, up to NEXTRACT_MAX_CONCURRENCY tasks run in parallel (default 4).
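The three mechanisms above (backoff, per-call timeout, bounded concurrency) can be combined as in the following asyncio sketch. It is a generic illustration, not nextract's agent_runner; the function names, the base delay, and the doubling schedule are assumptions.

```python
import asyncio


async def call_with_retries(fn, *, max_attempts=5, timeout_secs=120.0, base_delay=0.5):
    """Await fn() with a per-call timeout, retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(fn(), timeout=timeout_secs)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...


async def run_batch(fns, *, max_concurrency=4, **retry_kwargs):
    """Run all calls, allowing at most max_concurrency of them in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(fn):
        async with semaphore:
            return await call_with_retries(fn, **retry_kwargs)

    return await asyncio.gather(*(guarded(fn) for fn in fns))


async def _demo():
    attempts = {"n": 0}

    async def flaky():
        # Fails twice, then succeeds, to exercise the retry path.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise RuntimeError("transient error")
        return "ok"

    result = await call_with_retries(flaky, max_attempts=5, base_delay=0.01)
    return result, attempts["n"]


demo_result, demo_attempts = asyncio.run(_demo())
print(demo_result, demo_attempts)
```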
Large files (TODO)
Planned design (not implemented in this build):
- Chunking (semantic/page) for large inputs.
- Per-chunk extraction with all fields optional, then merge into a full model validated against the target schema/model.
- Pluggable conflict resolution & optional provenance.
Limitations
- No OCR, no readability parsing for HTML.
- Office conversions require soffice (LibreOffice) or unoconv installed; otherwise we fall back to attaching the original binary.
- Office file understanding depends on the model/provider.
- Very large inputs may exceed model or provider limits.
- ZIP extraction writes to /tmp/nextract-zip-<name>; these temp files are not auto-deleted by the library.
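Since the extracted ZIP directories are left in place, callers may want their own cleanup step. The sketch below removes directories matching the documented nextract-zip-<name> pattern; cleanup_zip_dirs is a hypothetical helper, and the demo runs against a throwaway directory rather than the real /tmp.

```python
import shutil
import tempfile
from pathlib import Path


def cleanup_zip_dirs(root: str = "/tmp") -> list[str]:
    """Delete directories named nextract-zip-* directly under root; return what was removed."""
    removed = []
    for path in Path(root).glob("nextract-zip-*"):
        if path.is_dir():
            shutil.rmtree(path, ignore_errors=True)
            removed.append(path.name)
    return sorted(removed)


# Demo against a temporary root so nothing outside it is touched.
with tempfile.TemporaryDirectory() as demo_root:
    (Path(demo_root) / "nextract-zip-invoices").mkdir()
    (Path(demo_root) / "unrelated").mkdir()
    demo_removed = cleanup_zip_dirs(demo_root)
    demo_left = sorted(p.name for p in Path(demo_root).iterdir())

print(demo_removed, demo_left)
```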
FAQ
Q: Which providers/models can I use?
A: Any supported by Pydantic AI Agent. Select via NEXTRACT_MODEL="provider:model-id" and set the provider’s expected credentials (e.g., OPENAI_API_KEY).
Q: What happens if schema validation keeps failing?
A: The Agent is asked to retry a couple of times. Final results are validated once more; if still invalid, you’ll see a final_validation_error entry under report.warnings.
Q: Can I store or inspect attachments that nextract sends?
A: This build sends raw text or binary bytes directly to the Agent. If you need durable storage or redaction, wrap nextract in your own pipeline.
Q: Can I get a Pydantic model out?
A: Yes—pass your model class to schema_or_model and set return_pydantic=True.
Development
Building & Testing
# Install development dependencies
pip install -e .[dev]
# Run tests
pytest
# Run linting
ruff check nextract
# Build package
python -m build
# Test installation
pip install dist/*.whl
CI/CD
This project uses GitHub Actions for continuous integration and automated PyPI publishing:
- CI: Runs on every push/PR, testing across Python 3.10-3.12
- Release: Automatically publishes to PyPI when GitHub releases are created
- Versioning: Managed statically in pyproject.toml (current: 0.0.1)
Creating a Release
- Bump version in pyproject.toml, commit, and push.
- Create a GitHub release (with notes); this triggers automatic PyPI publishing.
License
MIT. Feel free to adapt and extend.
Project Structure (for reference)
nextract/
├─ nextract/
│ ├─ __init__.py # exports extract, batch_extract
│ ├─ version.py
│ ├─ config.py # RuntimeConfig (model, concurrency, timeouts, pricing)
│ ├─ logging.py # structlog setup
│ ├─ mimetypes_map.py # simple mapping & helpers
│ ├─ schema.py # JSON Schema/Pydantic utilities
│ ├─ prompts.py # system prompt + examples builder
│ ├─ files.py # read-as-is; BinaryContent or text
│ ├─ pricing.py # usage → cost estimate
│ ├─ agent_runner.py # Agent wiring, retries, validation, metrics
│ ├─ core.py # public API: extract, batch_extract
│ └─ cli.py # Typer CLI
└─ pyproject.toml