Petey — The Easy PDF Extractor
Petey
Petey is a framework for PDF data extraction. It wires the PDF parser of your choice to the LLM of your choice, and with a simple schema from the user, pulls data out of PDF documents.
pip install petey
For demos, tutorials, and benchmarks, visit the Petey blog.
Why Petey?
The PDF format was designed to look identical on any screen or printer. It was format- and technology-agnostic, a universal container for the printed page. But only its visual presentation mattered: as long as a file rendered correctly, its internal representation could be anything.
And so the inside of a PDF is often chaotic. It is just a bunch of items — words, characters, shapes, images — and their coordinates, with little or no regard for the relationship between anything. What reads as one cohesive line of text could be three groups of words that happened to be positioned sequentially with the same y-value.
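To make the problem concrete, here is a minimal sketch of how scattered words might be stitched back into lines. The `(x, y, text)` records and the `y_tolerance` value are illustrative assumptions, not the output format of any particular parser.

```python
# Hypothetical word records (x, y, text), deliberately out of reading order.
words = [
    (200.0, 100.0, "due:"),
    (72.0, 100.0, "Total"),
    (130.0, 100.2, "amount"),
    (72.0, 120.0, "Thank"),
    (110.0, 120.0, "you"),
]

def group_into_lines(words, y_tolerance=1.0):
    """Cluster words with nearly equal y-values, then order each line by x."""
    lines = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(words, key=lambda w: w[1]):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]

print(group_into_lines(words))
```

Real documents add columns, rotated text, and tables, which is why a fixed tolerance like this is rarely enough on its own.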
A lot of hard-working folks have developed tools to extract text from PDFs over the years. AI can be a big help too — you don't need a particularly advanced LLM to interpret some fairly difficult documents. But models need infrastructure, and not everyone has time to wire it all together.
Petey does the wiring for you. Just pass it your files and a schema that explains what you want, and it returns your data as JSON or CSV.
How it works
- Parse — extract text from the PDF using a local or cloud parser
- LLM — send the text to an LLM with your schema to get the fields you want back
- Output — return the results as JSON or CSV
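The three steps above can be sketched as a small function that takes the parser and the LLM as plain callables. This is an illustration of the flow only, not Petey's internals: `run_pipeline`, `fake_parser`, and `fake_llm` are hypothetical names, and the stand-ins let the sketch run without a PDF or an API key.

```python
import csv
import io
import json

def run_pipeline(pdf_path, fields, parse_fn, llm_fn, fmt="json"):
    """Parse -> LLM -> Output, with the parser and LLM injected as callables."""
    text = parse_fn(pdf_path)      # 1. Parse: PDF to text
    raw = llm_fn(text, fields)     # 2. LLM: text plus schema fields to a record
    # Keep only the schema's fields; anything the model did not return is None.
    record = {name: raw.get(name) for name in fields}
    if fmt == "json":              # 3. Output
        return json.dumps([record])
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fields))
    writer.writeheader()
    writer.writerow(record)
    return buffer.getvalue()

# Stand-ins (assumptions) so the sketch is runnable offline.
def fake_parser(path):
    return "ACME Corp   Total due: 125.50"

def fake_llm(text, fields):
    return {"vendor": "ACME Corp", "amount": 125.5, "notes": "not in schema"}

print(run_pipeline("invoice.pdf", ["vendor", "amount", "date"], fake_parser, fake_llm))
```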
Parsers
| Parser | Install | Best for |
|---|---|---|
| pymupdf | included | Most documents. Reads embedded text directly, auto-OCRs scanned pages. Fast, free, default. |
| pdfplumber | included | Borderless tables. Layout-preserving spatial extraction. Text-only (no OCR). |
| datalab | included | Scanned/complex layouts. Remote API via Datalab. Requires DATALAB_API_KEY. |
| unstructured | included | General-purpose. Remote API. Requires UNSTRUCTURED_API_KEY. |
Run petey list parsers to see all available parsers.
LLM Backends
Petey auto-detects the right backend from the model name.
| Backend | Models | Auto-detected when |
|---|---|---|
| openai | gpt-4.1-mini, gpt-4o, etc. | Default |
| anthropic | claude-sonnet-4-6, claude-haiku-4-5, etc. | Model starts with claude |
| litellm | Gemini, DeepSeek, Fireworks, Ollama, Bedrock, 100+ more | Model has a provider prefix (e.g. gemini/, deepseek/, fireworks_ai/) |
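The detection rules in the table could be expressed roughly as follows. `detect_backend` is a hypothetical helper written for illustration, and treating any `/` in the model name as a provider prefix is an assumption; Petey's actual logic may differ.

```python
def detect_backend(model: str) -> str:
    """Pick a backend name from a model ID, following the table's rules."""
    if model.startswith("claude"):
        return "anthropic"
    if "/" in model:  # assumed test for a provider prefix such as "gemini/"
        return "litellm"
    return "openai"   # default

for name in ("gpt-4.1-mini", "claude-sonnet-4-6", "gemini/some-model"):
    print(name, "->", detect_backend(name))
```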
Setup
Add your API key to a .env file:
OPENAI_API_KEY=sk-...
Or for other providers:
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
DATALAB_API_KEY=...
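Before a long run it can help to confirm that the keys you need are actually set. The sketch below is a standalone sanity check, assuming a plain KEY=VALUE file; `parse_env` and `missing_keys` are hypothetical helpers, and this is not how Petey itself loads the file.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

def missing_keys(required, env):
    """Return the required key names that are absent or empty."""
    return [name for name in required if not env.get(name)]

sample = '# provider keys\nOPENAI_API_KEY=sk-test\n\nDATALAB_API_KEY="abc"\n'
env = parse_env(sample)
print(missing_keys(["OPENAI_API_KEY", "ANTHROPIC_API_KEY"], env))
```

To check a real file, pass its contents, for example `parse_env(open(".env").read())`.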
Schemas
Every extraction starts with a schema — a YAML file that tells Petey what to look for.
name: Invoice
fields:
  vendor:
    type: string
    description: Company name on the invoice
  amount:
    type: number
    description: Total amount due
  date:
    type: date
  status:
    type: category
    values: [Paid, Unpaid, Overdue]
Field types
| Type | Notes |
|---|---|
| string | Any text value |
| number | Integer or decimal |
| date | Returns ISO 8601 format |
| category | Constrained set of values. List values: to enforce them. Case-insensitive matching. |
All fields are nullable — Petey returns null for anything it can't find rather than guessing.
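One way to picture these rules is a small post-processing step that coerces each raw value to its declared type and falls back to null. This is a sketch under assumptions, not Petey's validation code: `coerce` and `clean_record` are hypothetical names, and the schema is written as a Python dict to avoid a YAML dependency.

```python
from datetime import date

schema_fields = {
    "vendor": {"type": "string"},
    "amount": {"type": "number"},
    "date": {"type": "date"},
    "status": {"type": "category", "values": ["Paid", "Unpaid", "Overdue"]},
}

def coerce(value, spec):
    """Return a cleaned value, or None when it is missing or does not fit."""
    if value is None:
        return None
    kind = spec["type"]
    try:
        if kind == "string":
            return str(value)
        if kind == "number":
            number = float(value)
            return int(number) if number.is_integer() else number
        if kind == "date":
            return date.fromisoformat(str(value)).isoformat()
        if kind == "category":
            # Case-insensitive match, returning the canonical spelling.
            lookup = {v.lower(): v for v in spec["values"]}
            return lookup.get(str(value).strip().lower())
    except (ValueError, TypeError):
        return None
    return None

def clean_record(raw, fields):
    return {name: coerce(raw.get(name), spec) for name, spec in fields.items()}

print(clean_record({"vendor": "ACME", "amount": "125.50", "status": "paid"}, schema_fields))
```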
Schema options
| Option | Description |
|---|---|
| mode: table | Extract multiple records per page (default: query, one record per file) |
| instructions | Extra guidance appended to the prompt |
| header_pages | Number of leading pages to prepend to every chunk (for context like column headers) |
| pages | Page range to process, e.g. "2-5" or "1,3,5-7" |
| input | Default PDF path or directory |
| output | Default output file path |
| parser | Default parser |
| ocr | Default OCR backend |
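The pages and header_pages options can be illustrated with two small helpers. Both are hypothetical sketches: page numbers are assumed to be 1-based, `chunk_size` is an invented parameter, and whether Petey also processes the header pages on their own is not stated here.

```python
def parse_page_range(spec):
    """Expand "2-5" or "1,3,5-7" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)

def chunks_with_header(page_texts, header_pages, chunk_size=1):
    """Prepend the first header_pages pages to every later chunk of pages."""
    header = page_texts[:header_pages]
    body = page_texts[header_pages:]
    return [header + body[i:i + chunk_size] for i in range(0, len(body), chunk_size)]

print(parse_page_range("1,3,5-7"))
print(chunks_with_header(["column headers", "rows A", "rows B"], header_pages=1))
```

Repeating the header pages gives the model the column names on every chunk, which matters when a table runs across many pages.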
CLI
# Basic extraction
petey extract --schema invoice.yaml ./invoices/ -o results.csv
# With options
petey extract --schema schema.yaml --model claude-sonnet-4-6 --parser datalab ./pdfs/
# List available backends
petey list parsers
petey list ocr
petey list llm
| Flag | Default | Description |
|---|---|---|
| --schema / -s | required | Path to YAML schema |
| --model / -m | gpt-4.1-mini | LLM model ID |
| --parser | pymupdf | Text extraction backend |
| --concurrency / -c | 10 | Max concurrent API calls |
| --output / -o | stdout | Output file path |
| --format / -f | inferred | csv, json, or jsonl |
| --mode | from schema | query or table |
| --header-pages | from schema | Header pages to prepend to each chunk |
| --page-range | from schema | Page range to extract |
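The "inferred" default for --format presumably keys off the output file's extension. A minimal sketch of that idea follows; `infer_format` is a hypothetical helper, and the JSON fallback for stdout is an assumption rather than documented behavior.

```python
from pathlib import Path

SUPPORTED = {"csv", "json", "jsonl"}

def infer_format(output_path, explicit=None, fallback="json"):
    """Choose an output format: explicit flag first, then the file extension."""
    if explicit:
        return explicit
    if output_path is None:
        return fallback  # assumed default when writing to stdout
    suffix = Path(output_path).suffix.lower().lstrip(".")
    if suffix in SUPPORTED:
        return suffix
    raise ValueError(f"cannot infer format from {output_path!r}")

print(infer_format("results.csv"))
```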
Python API
from petey import extract, load_schema
schema, spec = load_schema("invoice.yaml")
result = extract("invoice.pdf", schema)
# With options
result = extract(
    "invoice.pdf",
    schema,
    model="claude-sonnet-4-6",
    parser="datalab",
)
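To process many files from Python, one option is to cap the number of calls in flight yourself, much as the CLI's --concurrency flag does. The sketch below is an assumption-laden illustration: `extract_many` is a hypothetical helper, the extraction function is injected so the example runs without Petey, and it assumes that function is safe to call from several threads.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_many(paths, extract_fn, max_workers=10):
    """Apply extract_fn to each path with at most max_workers calls at once.

    Results come back in the same order as paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_fn, paths))

# Stand-in for a real extraction call. With Petey this might be
# lambda path: extract(path, schema), using the call form shown above.
def fake_extract(path):
    return {"file": path, "vendor": None}

print(extract_many(["a.pdf", "b.pdf"], fake_extract, max_workers=2))
```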
Optional Dependencies
pip install petey # Core (pymupdf, pdfplumber, litellm)
pip install petey[unstructured] # + Unstructured API client
pip install petey[all] # Everything