Structured extraction framework over Pydantic AI Agent with JSON Schema & Pydantic model outputs.

nextract

nextract is a small, pragmatic framework for structured data extraction from files using the Pydantic AI Agent. It focuses on clean boundaries, strong typing, and JSON Schema/Pydantic-driven outputs—while keeping file handling simple and predictable.

Scope of this build

Uses Pydantic AI Agent only.
Takes local file paths and feeds content to the Agent:
- Text files are read as text and wrapped in delimiters.
- PDFs and images are attached as binary bytes.
- Office docs (.doc/.docx/.ppt/.pptx) are converted to PDF first when a converter is available.
- Excel files: .xlsx is extracted to TSV (in-process); .xls attempts CSV via LibreOffice/unoconv.
- No OCR, no large-file chunking yet (TODO).
Returns a dict by default, or a Pydantic model instance if you pass a model and request it.
Tracing via structlog; usage & cost estimation from Agent usage + a simple model pricing table.

Features
What’s in / out of scope
Installation
Quick Start
CLI
Configuration
How it works
File type handling
Office → PDF conversion
Examples & Few-shot Hints
Return shape
Logging & Tracing
Retries, Rate Limits & Timeouts
Large files (TODO)
Limitations
FAQ
License

Features

Structured extraction for small files with:
- JSON Schema (output as dict[str, Any]), or
- Pydantic v2 models (output as dict by default; optional model instance).
Pydantic AI Agent integration:
- Raw binary attachments for PDFs/images.
- StructuredDict for JSON Schema outputs.
- Usage metrics retrieved from the run.
Batch mode runs one file per Agent call in parallel.
Cost estimation via a simple pricing map (optional).
Structlog logging to console.
ZIP files: extracted to /tmp/nextract-zip-<name>, with each contained file processed “as-is”.

What’s in / out of scope

Supported file types

Read directly as text: .txt, .md, .csv, .tsv, .xls, .xlsx, .json, .xml, .yaml, .yml, .html, .htm
Uploaded directly (binary): images (.png, .jpg, .jpeg, .webp, .gif, .bmp, .tiff), PDF (.pdf)
ZIP: extracted to /tmp/nextract-zip-<name>; each file inside is processed “as-is”.
Accepted as binary and converted to PDF before upload to the LLM: .doc, .docx, .ppt, .pptx

Not supported for now

Audio/Video processing
OCR for scanned PDFs/images
Large-file chunking & merging (design is stubbed; not implemented)

Installation

# Install from PyPI
pip install nextract

# Or install from source for development
git clone https://github.com/your-username/nextract.git
cd nextract
pip install -e ".[dev]"  # quoted so shells such as zsh do not expand the brackets

Python: 3.10+

Quick Start

JSON Schema output (default dict)

from nextract import extract

schema = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"}
    },
    "required": ["invoice_number", "total"]
}

res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=schema,
    user_prompt="Extract the invoice fields exactly as defined.",
    include_extra=True,  # adds a top-level `extra` bag for helpful unmodeled fields
)

print(res["data"])   # dict[str, Any] matching your schema (+ optional `extra`)
print(res["report"]) # model, usage, cost_estimate_usd, warnings

Pydantic model output

from pydantic import BaseModel
from nextract import extract

class Invoice(BaseModel):
    invoice_number: str
    date: str | None = None
    total: float

res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=Invoice,
    user_prompt="Extract the invoice fields."
    # include_extra is ignored for Pydantic model mode
)

# Default behavior returns a dict
print(res["data"])  # -> {'invoice_number': '...', 'date': '...', 'total': ...}

To get the Pydantic model instance instead of a dict:

res = extract(
    files=["./docs/invoice.pdf"],
    schema_or_model=Invoice,
    user_prompt="Extract the invoice fields.",
    return_pydantic=True,
)
invoice_obj = res["data"]  # -> Invoice instance

Batch extraction (parallel)

Process each file independently (one Agent call per file):

from nextract import batch_extract

schema = {
    "title": "DocSummary",
    "type": "object",
    "properties": {"title": {"type": "string"}, "summary": {"type": "string"}},
    "required": ["title"]
}

res = batch_extract(
    batch=["./a.pdf", "./b.png", "./c.txt"],   # or [["./a1.pdf","./a2.pdf"], ["./b1.pdf"]] to group
    schema_or_model=schema,
    user_prompt="Summarize each document with title + summary.",
    include_extra=False,
    max_concurrency=4,
)

# result is a dict keyed by the first file in each item
print(sorted(res))  # -> ['./a.pdf', './b.png', './c.txt']

CLI

# JSON Schema
nextract extract ./invoice.pdf \
  --schema ./invoice.schema.json \
  --prompt "Extract the invoice fields." \
  --include-extra

# Pydantic model (module:Class or module.Class)
nextract extract ./invoice.pdf \
  --pydantic-model mypkg.models:Invoice

# Batch (parallel): schema mode
nextract batch ./a.pdf ./b.png ./c.txt \
  --schema ./summary.schema.json \
  --prompt "Summarize each document." \
  --max-concurrency 4

Run nextract --help, nextract extract --help, or nextract batch --help for more.

Configuration

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| `NEXTRACT_MODEL` | `openai:gpt-4o` | Pydantic AI model string (`provider:model-id`). |
| `NEXTRACT_MAX_CONCURRENCY` | `4` | Max parallel Agent calls in `batch_extract`. |
| `NEXTRACT_MAX_RUN_RETRIES` | `5` | Max retry attempts around Agent runs. |
| `NEXTRACT_PER_CALL_TIMEOUT_SECS` | `120` | Per-call timeout in seconds. |
| `NEXTRACT_PRICING` | (unset) | JSON map for cost estimation (see below). |

Also set the credentials that Pydantic AI expects for your chosen provider. For example, for OpenAI: OPENAI_API_KEY=...

Pricing configuration

NEXTRACT_PRICING expects a JSON string like:

{
  "openai:gpt-4o": { "input_per_1k": 0.005, "output_per_1k": 0.015 },
  "openai:gpt-4.1-mini": { "input_per_1k": 0.003, "output_per_1k": 0.006 }
}

This is used to compute cost_estimate_usd from the Agent’s token usage. If the current model is missing in this map, cost will be null.
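As a rough illustration of how such a map could turn token usage into a dollar figure, here is a minimal sketch. The helper name `estimate_cost_usd` is hypothetical (it is not part of nextract's API), and the linear per-1k-token formula is an assumption inferred from the key names `input_per_1k` and `output_per_1k`.

```python
import json

# Hypothetical helper, NOT nextract's API. Assumes cost scales linearly
# with tokens, priced per 1,000 tokens as the key names suggest.
def estimate_cost_usd(pricing_json, model, input_tokens, output_tokens):
    pricing = json.loads(pricing_json)
    rates = pricing.get(model)
    if rates is None:
        return None  # model missing from the map: no estimate (null)
    return (
        input_tokens / 1000 * rates["input_per_1k"]
        + output_tokens / 1000 * rates["output_per_1k"]
    )

PRICING = '{"openai:gpt-4o": {"input_per_1k": 0.005, "output_per_1k": 0.015}}'
cost = estimate_cost_usd(PRICING, "openai:gpt-4o", 1000, 2000)
missing = estimate_cost_usd(PRICING, "openai:other", 1000, 2000)
```

With the rates above, 1,000 input tokens and 2,000 output tokens come to 0.005 + 0.030 = 0.035 USD, while a model absent from the map yields `None`.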

Model selection

By default, nextract uses openai:gpt-4o (vision-capable). You can override per process:

export NEXTRACT_MODEL="provider:model-id"

Or in Python, construct and pass your own RuntimeConfig (advanced—optional).

How it works

You pass file paths (single or multiple).
nextract prepares content:
- Textual files are read and injected as-is into the prompt, wrapped by:
```
--- BEGIN FILE: <name> (mime) ---
<file contents>
--- END FILE: <name> ---
```
- Binary files (PDFs, images, Office docs after PDF conversion, and others) are attached as binary parts using Pydantic AI’s BinaryContent.
An Agent is created with:
- a system prompt that instructs strict, schema-aligned extraction,
- an output_type of either:
  - StructuredDict(JSON Schema) → outputs a dict, or
  - Your Pydantic Model → outputs a model instance (dumped to dict by default).
For JSON Schema mode, a jsonschema validator runs as an output validator. On failure, the Agent is asked to retry for a limited number of rounds.
The result is validated again before returning. You get:
- data (dict by default),
- report with usage and optional cost estimate.
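The text-wrapping step described above can be sketched as follows. The function name `wrap_text_file` is hypothetical; only the delimiter format is taken from this README, and the real implementation may differ in whitespace or MIME detection.

```python
# Hypothetical helper, NOT nextract's API: reproduces the delimiter
# format documented above for textual files.
def wrap_text_file(name: str, mime: str, contents: str) -> str:
    return (
        f"--- BEGIN FILE: {name} ({mime}) ---\n"
        f"{contents}\n"
        f"--- END FILE: {name} ---"
    )

wrapped = wrap_text_file("notes.txt", "text/plain", "Total: 123.45")
print(wrapped)
```

Explicit begin/end markers let the model tell several files apart when they share one prompt.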

File type handling

Text: .txt, .md, .json, .yaml, .yml, .xml, .csv, .tsv, .html, .htm → Read as text (UTF‑8 with fallback), injected verbatim with file delimiters.
Excel: .xlsx (parsed to TSV via in‑process XML), .xls (CSV via CLI if available; else raw fallback) → Read as text and injected like other textual files. Best‑effort extraction (no styling/formatting).
PDF / Images: .pdf, .png, .jpg, .jpeg, .webp, .gif, .bmp, .tiff → Attached as binary bytes for the model (vision-capable models recommended).
Office docs: .doc, .docx, .ppt, .pptx → Converted to PDF via LibreOffice/soffice or unoconv if available; on failure, attached as original binary.
ZIP: Extracted to /tmp/nextract-zip-<name>; each inner file is processed as above. No nested recursion.
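The routing described in this list could be expressed as a simple extension lookup. This is a sketch only: `classify` and the category labels are hypothetical names, and the extension sets are copied from the list above rather than from nextract's source.

```python
from pathlib import Path

# Extension sets copied from the README; names and labels are illustrative.
TEXT_EXTS = {".txt", ".md", ".json", ".yaml", ".yml", ".xml",
             ".csv", ".tsv", ".html", ".htm"}
EXCEL_EXTS = {".xlsx", ".xls"}
BINARY_EXTS = {".pdf", ".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp", ".tiff"}
OFFICE_EXTS = {".doc", ".docx", ".ppt", ".pptx"}

def classify(path: str) -> str:
    """Return the handling category for a file, based on its extension."""
    ext = Path(path).suffix.lower()
    if ext in TEXT_EXTS:
        return "text"      # read as text, wrapped in delimiters
    if ext in EXCEL_EXTS:
        return "excel"     # converted to TSV/CSV text, then treated as text
    if ext in BINARY_EXTS:
        return "binary"    # attached as raw bytes
    if ext in OFFICE_EXTS:
        return "office"    # converted to PDF first when possible
    if ext == ".zip":
        return "zip"       # extracted; inner files classified individually
    return "unsupported"
```

For example, `classify("report.DOCX")` returns `"office"` because the comparison is case-insensitive.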

No OCR: Scanned PDFs/images are not OCR’d. If the model can’t read them natively, fields may be missing.

Office → PDF conversion

nextract attempts to convert .doc/.docx/.ppt/.pptx to PDF using system tools. These are external dependencies and are not installed via pip.

Preferred: soffice (LibreOffice) in headless mode.
Fallback: unoconv (uses LibreOffice UNO).

Installation hints:

macOS (Homebrew):
- brew install --cask libreoffice
- Ensure soffice is on your PATH. If not, you can symlink:
  - ln -s "/Applications/LibreOffice.app/Contents/MacOS/soffice" /usr/local/bin/soffice (adjust for Apple Silicon/Homebrew prefix)
Ubuntu/Debian:
- sudo apt-get update && sudo apt-get install -y libreoffice
- Optional: sudo apt-get install -y unoconv
Fedora/CentOS/RHEL:
- sudo dnf install -y libreoffice (or yum on older systems)
- Optional: install unoconv from your distro repos if available.
Windows:
- Install LibreOffice from https://www.libreoffice.org/download/ and add soffice.exe to your PATH.

If neither tool is found, nextract logs a warning and falls back to attaching the original Office binary.
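The preference order above (soffice first, then unoconv, else fall back) can be sketched with `shutil.which`. The function `build_convert_command` is a hypothetical name, and the exact command lines nextract runs are not documented here; the flags shown are the standard LibreOffice headless and unoconv options for PDF conversion.

```python
from __future__ import annotations

import shutil
from typing import Callable, Optional

# Hypothetical helper, NOT nextract's API. Returns the command that would
# convert `src` to PDF in `outdir`, or None if no converter is on PATH.
def build_convert_command(
    src: str,
    outdir: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[list[str]]:
    soffice = which("soffice")
    if soffice:  # preferred: LibreOffice in headless mode
        return [soffice, "--headless", "--convert-to", "pdf",
                "--outdir", outdir, src]
    unoconv = which("unoconv")
    if unoconv:  # fallback: unoconv
        return [unoconv, "-f", "pdf", "-o", outdir, src]
    return None  # caller would attach the original Office binary instead
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; the `which` parameter exists only so the lookup can be stubbed in tests.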

Examples & Few-shot Hints

You can supply examples to guide the model:

Programmatic (examples argument):

Output-only examples: list[dict]
Paired input/output: list[tuple[str | None, dict]]

CLI (--examples JSON file):

Output-only examples:

[
  { "invoice_number": "INV-001", "total": 123.45 }
]

Paired input/output (use a two-element array):

[
  ["Item: Widget A, Total: 123.45", { "invoice_number": "INV-001", "total": 123.45 }]
]
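Both JSON shapes map naturally onto the paired programmatic form `list[tuple[str | None, dict]]`. The sketch below shows one way to normalize a parsed `--examples` file; `normalize_examples` is a hypothetical helper, and how nextract itself parses the file is an assumption.

```python
import json

# Hypothetical helper, NOT nextract's API: converts either JSON shape
# into (input_text_or_None, output_dict) pairs.
def normalize_examples(raw):
    pairs = []
    for item in raw:
        if isinstance(item, dict):
            pairs.append((None, item))            # output-only example
        elif (isinstance(item, (list, tuple)) and len(item) == 2
              and isinstance(item[1], dict)):
            pairs.append((item[0], item[1]))      # paired input/output
        else:
            raise ValueError(f"unrecognized example: {item!r}")
    return pairs

raw = json.loads(
    '[{"invoice_number": "INV-001", "total": 123.45},'
    ' ["Item: Widget A, Total: 123.45",'
    '  {"invoice_number": "INV-001", "total": 123.45}]]'
)
pairs = normalize_examples(raw)
```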

“Extra” fields (JSON Schema mode): If you pass include_extra=True, your schema is augmented with a top-level:

"extra": { "type": "object", "additionalProperties": true }

so the model can place relevant-but-unspecified fields there.
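A minimal sketch of that augmentation, assuming the `extra` property is simply merged into the schema's top-level `properties`. The name `with_extra` is hypothetical and nextract's internal handling may differ.

```python
import copy

# Hypothetical helper, NOT nextract's API: returns a copy of the schema
# with a free-form top-level `extra` object added.
def with_extra(schema: dict) -> dict:
    augmented = copy.deepcopy(schema)  # leave the caller's schema untouched
    props = augmented.setdefault("properties", {})
    props.setdefault("extra", {"type": "object", "additionalProperties": True})
    return augmented

schema = {
    "title": "Invoice",
    "type": "object",
    "properties": {"invoice_number": {"type": "string"}},
    "required": ["invoice_number"],
}
augmented = with_extra(schema)
```

Because `extra` is not added to `required`, output that omits it still validates.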

Return shape

All entry points return a dict with this structure:

{
  "data": { /* your structured result (dict by default) */ },
  "report": {
    "model": "provider:model-id",
    "files": ["..."],
    "usage": {
      "requests": 1,
      "tool_calls": 0,
      "input_tokens": 123,
      "output_tokens": 456,
      "details": { /* provider-dependent */ }
    },
    "cost_estimate_usd": 0.0123,
    "warnings": []
  }
}

In Pydantic model mode, data is still a dict unless you passed return_pydantic=True, in which case it’s the model instance.
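To show how the structure above might be consumed, here is a small sketch that condenses a result into a one-line summary. `summarize` is a hypothetical helper, and the sample values are the illustrative ones from the structure above, not real output.

```python
# Hypothetical helper, NOT nextract's API: reads only the documented keys.
def summarize(result: dict) -> str:
    report = result["report"]
    usage = report["usage"]
    total = usage["input_tokens"] + usage["output_tokens"]
    cost = report.get("cost_estimate_usd")
    cost_text = "n/a" if cost is None else f"${cost:.4f}"  # null when unpriced
    return (f"{report['model']}: {total} tokens, cost {cost_text}, "
            f"{len(report['warnings'])} warning(s)")

sample = {
    "data": {},
    "report": {
        "model": "provider:model-id",
        "files": ["..."],
        "usage": {"requests": 1, "tool_calls": 0,
                  "input_tokens": 123, "output_tokens": 456, "details": {}},
        "cost_estimate_usd": 0.0123,
        "warnings": [],
    },
}
summary = summarize(sample)
```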

Logging & Tracing

Uses structlog; logs are JSON-formatted to stdout.
Each extraction logs: model, files, usage, warnings, and cost estimate.
By default, the library initializes logging for you. To use your own setup, configure logging before calling extract/batch_extract and disable the built-in initialization with setup_logs=False.

Retries, Rate Limits & Timeouts

Each Agent call is wrapped with exponential backoff (max attempts from NEXTRACT_MAX_RUN_RETRIES).
Timeout per call is NEXTRACT_PER_CALL_TIMEOUT_SECS (default 120s).
In batch mode, up to NEXTRACT_MAX_CONCURRENCY tasks run in parallel (default 4).
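The three behaviours above (retry with exponential backoff, per-call timeout, bounded concurrency) can be combined as in the following asyncio sketch. This is not nextract's implementation: the function names, the base delay, and the jitter are assumptions chosen for illustration.

```python
import asyncio
import random

# Illustrative sketch, NOT nextract's code. `make_call` is a zero-argument
# coroutine function standing in for one Agent run.
async def call_with_retries(make_call, *, max_attempts=5,
                            timeout_secs=120.0, base_delay=0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            # Per-call timeout: a slow call fails and counts as an attempt.
            return await asyncio.wait_for(make_call(), timeout=timeout_secs)
        except Exception:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            await asyncio.sleep(delay + random.uniform(0, delay / 10))

async def run_batch(calls, *, max_concurrency=4, **retry_kwargs):
    sem = asyncio.Semaphore(max_concurrency)  # cap on parallel calls

    async def guarded(make_call):
        async with sem:
            return await call_with_retries(make_call, **retry_kwargs)

    return await asyncio.gather(*(guarded(c) for c in calls))

# Demo: a call that fails twice with a transient error, then succeeds.
attempts = {"n": 0}

async def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient")
    return "ok"

results = asyncio.run(run_batch([flaky], max_attempts=5, base_delay=0.01))
```

The small random jitter keeps many retrying tasks from hitting a rate-limited provider at the same instant.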

Large files (TODO)

Planned design (not implemented in this build):

Chunking (semantic/page) for large inputs.
Per-chunk extraction with all fields optional, then merge into a full model validated against the target schema/model.
Pluggable conflict resolution & optional provenance.
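Since this design is not implemented, the following is only a sketch of what the merge step might look like: each chunk yields a partial dict with optional fields, and a pluggable `resolve` callback decides conflicts. All names are hypothetical, and the default policy (keep the first value seen) is an arbitrary choice.

```python
# Speculative sketch of the planned merge step; nothing here exists in
# nextract. `resolve(field, old, new)` is the pluggable conflict policy.
def merge_partials(partials, resolve=lambda field, old, new: old):
    merged, conflicts = {}, []
    for chunk_index, partial in enumerate(partials):
        for field, value in partial.items():
            if value is None:
                continue  # field not found in this chunk
            if field not in merged:
                merged[field] = value
            elif merged[field] != value:
                # Record provenance of the disagreement, then resolve it.
                conflicts.append((field, merged[field], value, chunk_index))
                merged[field] = resolve(field, merged[field], value)
    return merged, conflicts

partials = [
    {"invoice_number": "INV-001", "total": None},
    {"invoice_number": "INV-001", "total": 123.45},
    {"total": 120.0},
]
merged, conflicts = merge_partials(partials)
```

The merged dict would then be validated against the target schema or model, as the plan describes.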

Limitations

No OCR, no readability parsing for HTML.
Office conversions require soffice (LibreOffice) or unoconv installed; otherwise we fall back to attaching the original binary.
Office file understanding depends on the model/provider.
Very large inputs may exceed model or provider limits.
ZIP extraction writes to /tmp/nextract-zip-<name>; these temp files are not auto-deleted by the library.
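Because extracted ZIP contents are left behind, callers may want to clean up after themselves. A minimal sketch, assuming the directories follow the `nextract-zip-<name>` pattern described above; `cleanup_zip_dirs` is a hypothetical helper, not something nextract provides.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical helper, NOT nextract's API: removes leftover extraction
# directories under `root` and returns the paths it deleted.
def cleanup_zip_dirs(root: str = "/tmp") -> list:
    removed = []
    for path in Path(root).glob("nextract-zip-*"):
        if path.is_dir():
            shutil.rmtree(path, ignore_errors=True)
            removed.append(str(path))
    return sorted(removed)

# Demo against a throwaway directory instead of the real /tmp.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "nextract-zip-demo").mkdir()
    (Path(root) / "unrelated").mkdir()
    removed = cleanup_zip_dirs(root)
    survivors = sorted(p.name for p in Path(root).iterdir())
```

Run this only when no extraction is in progress, since it deletes every matching directory.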

FAQ

Q: Which providers/models can I use? A: Any supported by Pydantic AI Agent. Select via NEXTRACT_MODEL="provider:model-id" and set the provider’s expected credentials (e.g., OPENAI_API_KEY).

Q: What happens if schema validation keeps failing? A: The Agent is asked to retry a couple of times. Final results are validated once more; if still invalid, you’ll see a final_validation_error entry under report.warnings.

Q: Can I store or inspect attachments that nextract sends? A: This build sends raw text or binary bytes directly to the Agent. If you need durable storage or redaction, wrap nextract in your own pipeline.

Q: Can I get a Pydantic model out? A: Yes—pass your model class to schema_or_model and set return_pydantic=True.

Development

Building & Testing

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
ruff check nextract

# Build package
python -m build

# Test installation
pip install dist/*.whl

CI/CD

This project uses GitHub Actions for continuous integration and automated PyPI publishing:

CI: Runs on every push/PR, testing across Python 3.10-3.12
Release: Automatically publishes to PyPI when GitHub releases are created
Versioning: Managed statically in pyproject.toml (current: 0.0.1)

Creating a Release

Bump version in pyproject.toml, commit, and push.
Create a GitHub release (with notes) — this triggers automatic PyPI publishing.

License

MIT. Feel free to adapt and extend.

Project Structure (for reference)

nextract/
  ├─ nextract/
  │  ├─ __init__.py            # exports extract, batch_extract
  │  ├─ version.py
  │  ├─ config.py              # RuntimeConfig (model, concurrency, timeouts, pricing)
  │  ├─ logging.py             # structlog setup
  │  ├─ mimetypes_map.py       # simple mapping & helpers
  │  ├─ schema.py              # JSON Schema/Pydantic utilities
  │  ├─ prompts.py             # system prompt + examples builder
  │  ├─ files.py               # read-as-is; BinaryContent or text
  │  ├─ pricing.py             # usage → cost estimate
  │  ├─ agent_runner.py        # Agent wiring, retries, validation, metrics
  │  ├─ core.py                # public API: extract, batch_extract
  │  └─ cli.py                 # Typer CLI
  └─ pyproject.toml

Release history

0.1.3: May 12, 2026
0.1.2: Mar 23, 2026
0.1.1: Oct 13, 2025
0.1.0 (this version): Sep 25, 2025
