Petey — The Easy PDF Extractor
Petey
The Easy PDF Extractor. Define a YAML schema, point at your PDFs, get structured data back.
Petey uses LLMs (OpenAI or Anthropic) to extract structured fields from PDF documents. You describe what you want in a YAML schema, and Petey handles text extraction, LLM prompting, and output formatting.
Install
pip install .
Or in editable/dev mode:
pip install -e ".[dev]"
Quick start
- Set your API key:
export OPENAI_API_KEY=sk-...
# or
export ANTHROPIC_API_KEY=sk-ant-...
- Write a schema (YAML):
name: Invoice
fields:
  vendor:
    type: string
    description: Company name on the invoice
  amount:
    type: number
    description: Total amount due
  date:
    type: date
    description: Invoice date
  status:
    type: enum
    values: [Paid, Unpaid, Overdue]
    description: Payment status
- Run it:
petey extract --schema invoice.yaml ./invoices/ -o results.csv
CLI usage
petey extract --schema SCHEMA PATHS... [options]
| Option | Description |
|---|---|
| `--schema`, `-s` | YAML schema file (required) |
| `--model`, `-m` | Model ID (default: `gpt-4.1-mini`, or set `PETEY_MODEL`) |
| `--output`, `-o` | Output file (`.csv`, `.json`, or `.jsonl`) |
| `--format`, `-f` | Output format (inferred from `-o` if not set) |
| `--concurrency`, `-c` | Concurrent API requests (default: 10) |
| `--instructions`, `-i` | Additional extraction instructions |
PATHS can be individual PDF files or directories (all .pdf files inside will be processed).
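As a rough illustration of the PATHS behavior described above, the sketch below expands a mix of files and directories into a list of PDFs. `expand_pdf_paths` is a hypothetical helper, not part of Petey. It assumes that only the top level of a directory is scanned and that the `.pdf` suffix is matched case-insensitively; Petey's actual rules may differ.

```python
from pathlib import Path


def expand_pdf_paths(paths):
    """Expand files and directories into a sorted, de-duplicated list of PDFs.

    Hypothetical helper for illustration; not part of Petey's API.
    """
    found = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Assumption: only the directory's top level is scanned.
            found.extend(
                p for p in path.iterdir()
                if p.is_file() and p.suffix.lower() == ".pdf"
            )
        elif path.is_file() and path.suffix.lower() == ".pdf":
            found.append(path)
    # dict.fromkeys drops duplicates while keeping the sorted order.
    return list(dict.fromkeys(sorted(found)))
```

Passing both a directory and a file inside it yields that file once, so overlapping arguments would not trigger duplicate work.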
Examples:
# Single file, JSON to stdout
petey extract -s schema.yaml report.pdf -f json
# Directory, CSV output
petey extract -s schema.yaml ./pdfs/ -o results.csv
# Anthropic model, limited concurrency
petey extract -s schema.yaml ./pdfs/ -m claude-haiku-4-5-20251001 -c 5 -o out.jsonl
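The options table says the output format is inferred from `-o` when `-f` is not given. A minimal sketch of how such inference and the three output formats could work is shown below. The names `infer_format` and `render` are hypothetical, and the fallback to JSON when neither option is set is an assumption, not documented Petey behavior.

```python
import csv
import io
import json
from pathlib import Path

FORMATS_BY_SUFFIX = {".csv": "csv", ".json": "json", ".jsonl": "jsonl"}


def infer_format(output=None, fmt=None):
    """Pick an output format: an explicit fmt wins, else the output suffix decides."""
    if fmt:
        return fmt
    if output:
        suffix = Path(output).suffix.lower()
        if suffix in FORMATS_BY_SUFFIX:
            return FORMATS_BY_SUFFIX[suffix]
        raise ValueError(f"cannot infer format from {output!r}")
    # Assumption: with neither option set, fall back to JSON.
    return "json"


def render(records, fmt):
    """Serialize a list of flat dicts as CSV, JSON, or JSON Lines text."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "jsonl":
        return "".join(json.dumps(r) + "\n" for r in records)
    if fmt == "csv":
        buf = io.StringIO()
        fieldnames = list(records[0]) if records else []
        writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)  # None values become empty cells
        return buf.getvalue()
    raise ValueError(f"unknown format: {fmt}")
```

In this sketch a missing field shows up as `null` in JSON and JSON Lines, and as an empty cell in CSV.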
Python API
from petey import load_schema, extract, extract_batch
# Load schema
model, spec = load_schema("schema.yaml")
# Single file (sync)
result = extract("doc.pdf", model, model="gpt-4.1-mini")
print(result.model_dump())
# Batch (async)
import asyncio
results = asyncio.run(
    extract_batch(
        ["a.pdf", "b.pdf", "c.pdf"],
        model,
        model="gpt-4.1-mini",
        concurrency=10,
    )
)
Functions
- `load_schema(path)`: Load a YAML schema; returns `(PydanticModel, spec_dict)`
- `build_model(spec)`: Build a Pydantic model from a spec dict directly
- `extract(pdf_path, response_model, *, model, api_key, instructions)`: Extract from one PDF (sync)
- `extract_async(...)`: Same as above, async
- `extract_batch(pdf_paths, response_model, *, model, api_key, instructions, concurrency, on_result)`: Extract from multiple PDFs concurrently. Optional `on_result(path, data)` callback fires as each file completes.
- `extract_text(pdf_path)`: Just get the raw text from a PDF (PyMuPDF)
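To illustrate the `concurrency` and `on_result` parameters described above, here is a minimal sketch of a bounded-concurrency batch runner built on `asyncio.Semaphore`. It is not Petey's implementation: `run_batch` and `fake_extract` are hypothetical names, and the worker sleeps instead of calling an LLM.

```python
import asyncio


async def run_batch(paths, worker, *, concurrency=10, on_result=None):
    """Run worker(path) for every path with at most `concurrency` in flight.

    Hypothetical stand-in for a batch runner; not Petey's implementation.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(path):
        async with semaphore:
            data = await worker(path)
        if on_result is not None:
            on_result(path, data)  # fires as soon as this file finishes
        return data

    # gather returns results in input order, regardless of completion order.
    return await asyncio.gather(*(run_one(p) for p in paths))


async def fake_extract(path):
    """Placeholder for a real extraction call; sleeps instead of calling an LLM."""
    await asyncio.sleep(0.01)
    return {"file": path}


if __name__ == "__main__":
    finished = []
    results = asyncio.run(
        run_batch(
            ["a.pdf", "b.pdf", "c.pdf"],
            fake_extract,
            concurrency=2,
            on_result=lambda path, data: finished.append(path),
        )
    )
    print(results)
```

A callback of this kind is useful for progress reporting or for writing each row to disk as it arrives, rather than waiting for the whole batch.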
Schema format
name: MySchema            # optional, used for the Pydantic model name
record_type: array        # optional, use for table extraction (multiple records per doc)
instructions: |           # optional, appended to system prompt
  Focus on the header section for dates.
fields:
  field_name:
    type: string          # string, number, date, enum, or array
    description: What this field contains
  category:
    type: enum
    values: [A, B, C]     # optional; omit to let the LLM infer values
  line_items:
    type: array
    description: Table rows
    fields:
      item: { type: string, description: Item name }
      qty: { type: number, description: Quantity }
All fields are nullable — the LLM returns null for anything it can't find.
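The sketch below shows one way a spec dict could become a record type in which every field is nullable. Petey builds Pydantic models; this example swaps in standard-library dataclasses to stay dependency-free, so `build_record_class` and `TYPE_MAP` are hypothetical and the type mapping is an assumption.

```python
import datetime
from dataclasses import asdict, field, make_dataclass
from typing import Optional

# Assumed mapping from schema type names to Python types.
TYPE_MAP = {
    "string": str,
    "number": float,
    "date": datetime.date,
    "enum": str,
    "array": list,
}


def build_record_class(spec):
    """Build a dataclass whose fields are all Optional and default to None."""
    columns = []
    for name, info in spec["fields"].items():
        py_type = TYPE_MAP[info["type"]]
        columns.append((name, Optional[py_type], field(default=None)))
    return make_dataclass(spec.get("name", "Record"), columns)


spec = {
    "name": "Invoice",
    "fields": {
        "vendor": {"type": "string"},
        "amount": {"type": "number"},
        "date": {"type": "date"},
    },
}
Invoice = build_record_class(spec)
record = Invoice(vendor="Acme")  # amount and date were not found
missing = [key for key, value in asdict(record).items() if value is None]
```

Because nothing is required, downstream code should check for `None` before using a value, as the `missing` list does here.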
Development
make install # create venv + install with dev deps
make test # run tests
make clean # remove venv