Petey — The Easy PDF Extractor

Petey

Petey is a framework for PDF data extraction. It wires the PDF parser of your choice to the LLM of your choice, and with a simple schema from the user, pulls data out of PDF documents.

pip install petey

For demos, tutorials, and benchmarks, visit the Petey blog.

Why Petey?

The PDF format was designed to look identical on any screen or printer. It was format and technology agnostic, a universal container for the printed page. But all that mattered was its visual presentation. As long as it rendered correctly, the internal representation didn't matter.

And so the inside of a PDF is often chaotic. It is just a bunch of items — words, characters, shapes, images — and their coordinates, with little or no regard for the relationship between anything. What reads as one cohesive line of text could be three groups of words that happened to be positioned sequentially with the same y-value.
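To make the coordinate problem concrete, here is a minimal sketch of rebuilding lines from positioned words. The `(text, x, y)` tuples and the `group_into_lines` helper are hypothetical illustrations, not Petey's internals or any parser's real data model.

```python
from collections import defaultdict

# Hypothetical word boxes as (text, x, y); real parsers expose richer structures.
words = [
    ("Total", 72.0, 540.0),
    ("due:", 110.0, 540.0),
    ("$1,250.00", 400.0, 540.0),
    ("Invoice", 72.0, 700.0),
    ("#1042", 130.0, 700.0),
]


def group_into_lines(words, y_tolerance=2.0):
    """Bucket words by y-value, then sort each bucket left to right."""
    buckets = defaultdict(list)
    for text, x, y in words:
        # Simplification: words that straddle a bucket boundary may be split.
        buckets[round(y / y_tolerance)].append((x, text))
    # In native PDF coordinates the origin is bottom-left, so larger y is higher on the page.
    ordered = sorted(buckets.items(), key=lambda item: -item[0])
    return [" ".join(text for _, text in sorted(items)) for _, items in ordered]


lines = group_into_lines(words)
```

Nothing in the input says that "Total", "due:" and "$1,250.00" belong together; the line only exists because the three words share a y-value.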

A lot of hard-working folks have developed tools to extract text from PDFs over the years. AI can be a big help too — you don't need a particularly advanced LLM to interpret some fairly difficult documents. But models need infrastructure, and not everyone has time to wire it all together.

Petey does the wiring for you. Pass it your files and a schema that describes what you want, and it returns your data as JSON or CSV.

How it works

1. Parse: extract text from the PDF using a local or cloud parser
2. LLM: send the text to an LLM along with your schema to get back the fields you want
3. Output: return the results as JSON or CSV
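The three stages can be sketched as plain functions. This is only an illustration of the data flow: `parse_pdf`, `ask_llm`, `to_output` and `run_pipeline` are hypothetical stand-ins with canned return values, not Petey's actual functions.

```python
import csv
import io
import json


def parse_pdf(path):
    """Stand-in for the Parse step; a real parser would read the file at `path`."""
    return "ACME Corp\nTotal due: 1250.00"


def ask_llm(text, fields):
    """Stand-in for the LLM step; returns one value (or None) per requested field."""
    found = {"vendor": "ACME Corp", "amount": 1250.0}
    return {name: found.get(name) for name in fields}


def to_output(records, fmt="json"):
    """Output step: serialize a list of records as JSON or CSV."""
    if fmt == "json":
        return json.dumps(records)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def run_pipeline(path, fields, fmt="json"):
    text = parse_pdf(path)             # 1. Parse
    record = ask_llm(text, fields)     # 2. LLM
    return to_output([record], fmt)    # 3. Output


result = run_pipeline("invoice.pdf", ["vendor", "amount", "date"])
```

The `date` field comes back as `null` in the JSON because the stand-in LLM found nothing for it.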

Parsers

Parser	Install	Best for
`pymupdf`	included	Most documents. Reads embedded text directly. Fast, free, default.
`pdfplumber`	included	Borderless tables. Layout-preserving spatial extraction.
`tables`	included	Bordered tables. Uses PyMuPDF's table detection.
`marker`	included	Complex/scanned layouts. Remote API via Datalab. Requires `DATALAB_API_KEY`.
`unstructured`	`pip install petey[unstructured]`	General-purpose. Remote API. Requires `UNSTRUCTURED_API_KEY`.

See `petey list parsers` for all available parsers.

OCR Backends

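If a PDF has no embedded text (e.g. a scanned document), Petey can fall back to OCR. The fallback is triggered only when the extracted text is very short, and since the default backend is `none`, you need to select an OCR backend for it to run.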

Backend	Install	How it works
`none`	—	No OCR. Default.
`tesseract`	included	Local OCR. Requires Tesseract binary installed.
`chandra`	included	Cloud OCR via Datalab. Requires `DATALAB_API_KEY`.
`surya`	included	Cloud OCR via Datalab. Requires `DATALAB_API_KEY`.
`mistral`	`pip install petey[mistral-ocr]`	Cloud OCR via Mistral. Requires `MISTRAL_API_KEY`.

See `petey list ocr` for all available backends.
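The fallback rule can be pictured roughly as follows. This is a sketch under assumptions: the `needs_ocr` helper and the 50-character threshold are hypothetical, since the exact cutoff Petey uses for "very short" is not stated here.

```python
def needs_ocr(extracted_text, ocr_backend="none", min_chars=50):
    """Return True when OCR should run: a backend is selected and the text is very short.

    `min_chars` is an assumed threshold for illustration only.
    """
    if ocr_backend == "none":
        return False
    return len(extracted_text.strip()) < min_chars


scanned_page = needs_ocr("", ocr_backend="tesseract")
digital_page = needs_ocr("Invoice text " * 20, ocr_backend="tesseract")
```

A length check like this keeps OCR, which is slow or billed per page, away from documents that already carry embedded text.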

LLM Backends

Petey auto-detects the right backend from the model name.

Backend	Models	Auto-detected when
`openai`	`gpt-4.1-mini`, `gpt-4o`, etc.	Default
`anthropic`	`claude-sonnet-4-6`, `claude-haiku-4-5`, etc.	Model starts with `claude`
`litellm`	Gemini, DeepSeek, Fireworks, Ollama, Bedrock, 100+ more	Model has a provider prefix (e.g. `gemini/`, `deepseek/`, `fireworks_ai/`)
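The detection rules in the table can be read as a small decision function. The sketch below is an interpretation, not Petey's source: `detect_backend` is a hypothetical name, and treating any `/` in the model ID as a provider prefix is an assumption.

```python
def detect_backend(model):
    """Map a model ID to a backend name following the table's rules."""
    if "/" in model:                  # assumed marker of a provider prefix, e.g. "gemini/..."
        return "litellm"
    if model.startswith("claude"):
        return "anthropic"
    return "openai"                   # default


backends = {
    name: detect_backend(name)
    for name in ("gpt-4.1-mini", "claude-sonnet-4-6", "gemini/some-model")
}
```

Here `gemini/some-model` is a placeholder ID used only to show the prefix rule.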

Setup

Add your API key to a `.env` file:

OPENAI_API_KEY=sk-...

Or for other providers:

ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
DATALAB_API_KEY=...
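Before a long run it can help to check that the keys you need are actually set. The following is a standalone preflight sketch using only the standard library; `parse_env` and `missing_keys` are hypothetical helpers, and how Petey itself loads the `.env` file is not shown here.

```python
import os


def parse_env(text):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(required, env_text, environ=None):
    """List required keys that are absent or empty in both the .env text and the environment."""
    environ = os.environ if environ is None else environ
    available = {**parse_env(env_text), **environ}
    return [key for key in required if not available.get(key)]


sample = "OPENAI_API_KEY=sk-test\n# OCR provider\nDATALAB_API_KEY=abc\n"
missing = missing_keys(["OPENAI_API_KEY", "MISTRAL_API_KEY"], sample, environ={})
```

In practice you would pass the contents of your real `.env` file and leave `environ` unset so the process environment is consulted too.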

Schemas

Every extraction starts with a schema — a YAML file that tells Petey what to look for.

name: Invoice
fields:
  vendor:
    type: string
    description: Company name on the invoice
  amount:
    type: number
    description: Total amount due
  date:
    type: date
  status:
    type: category
    values: [Paid, Unpaid, Overdue]

Field types

Type	Notes
`string`	Any text value
`number`	Integer or decimal
`date`	Returns ISO 8601 format
`category`	Constrained set of values. List `values:` to enforce them. Case-insensitive matching.

All fields are nullable — Petey returns null for anything it can't find rather than guessing.
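One way to picture these rules is a normalizer that coerces a raw value to its declared type and returns `None` when it cannot. This is an illustrative sketch only; `normalize_field` is a hypothetical helper and Petey's own validation may differ.

```python
from datetime import date


def normalize_field(value, spec):
    """Coerce `value` according to spec["type"]; return None instead of guessing."""
    if value is None:
        return None
    kind = spec["type"]
    try:
        if kind == "string":
            return str(value)
        if kind == "number":
            number = float(value)
            return int(number) if number.is_integer() else number
        if kind == "date":
            # Accepts only ISO 8601 calendar dates such as 2026-03-29.
            return date.fromisoformat(str(value)).isoformat()
        if kind == "category":
            allowed = {option.lower(): option for option in spec.get("values", [])}
            if not allowed:
                return str(value)
            # Case-insensitive match, returned in the schema's own spelling.
            return allowed.get(str(value).lower())
    except ValueError:
        return None
    return None


status_spec = {"type": "category", "values": ["Paid", "Unpaid", "Overdue"]}
status = normalize_field("paid", status_spec)
```

A value outside the listed categories, or a date that is not in ISO form, comes back as `None` rather than a best guess.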

Schema options

Option	Description
`mode: table`	Extract multiple records per page (default: `query` — one record per file)
`instructions`	Extra guidance appended to the prompt
`header_pages`	Number of leading pages to prepend to every chunk (for context like column headers)
`pages`	Page range to process, e.g. `"2-5"` or `"1,3,5-7"`
`input`	Default PDF path or directory
`output`	Default output file path
`parser`	Default parser
`ocr`	Default OCR backend
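The `pages` and `header_pages` options can be illustrated with a short sketch. Both helpers are hypothetical, and two details are assumptions rather than documented behavior: page numbers are treated as 1-based, and each chunk is taken to be a single page with the header pages prepended.

```python
def parse_pages(spec):
    """Expand a range string such as "1,3,5-7" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(number) for number in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def build_chunks(pages, header_pages=0):
    """Prepend the first `header_pages` pages to every remaining page (one page per chunk)."""
    header = pages[:header_pages]
    body = pages[header_pages:]
    return [header + [page] for page in body]


selected = parse_pages("1,3,5-7")
chunks = build_chunks([1, 2, 3, 4], header_pages=1)
```

With one header page, every chunk carries page 1, so column headers that appear only there stay visible to the LLM on later pages.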

CLI

# Basic extraction
petey extract --schema invoice.yaml ./invoices/ -o results.csv

# With options
petey extract --schema schema.yaml --model claude-sonnet-4-6 --parser marker ./pdfs/

# List available backends
petey list parsers
petey list ocr
petey list llm

Flag	Default	Description
`--schema / -s`	required	Path to YAML schema
`--model / -m`	`gpt-4.1-mini`	LLM model ID
`--parser`	`pymupdf`	Text extraction backend
`--ocr`	`none`	OCR backend
`--concurrency / -c`	`10`	Max concurrent API calls
`--output / -o`	stdout	Output file path
`--format / -f`	inferred	`csv`, `json`, or `jsonl`
`--mode`	from schema	`query` or `table`
`--header-pages`	from schema	Header pages to prepend to each chunk
`--page-range`	from schema	Page range to extract
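If you drive the CLI from a script, assembling the argument list in one place keeps the flags consistent. The helper below is a hypothetical convenience, not part of Petey. It uses only flags from the table above and assumes options may precede the input path.

```python
import shlex


def build_extract_command(schema, source, output=None, model=None,
                          parser=None, ocr=None, concurrency=None):
    """Build the argv list for `petey extract`, omitting flags left at their defaults."""
    command = ["petey", "extract", "--schema", schema]
    optional = [
        ("--model", model),
        ("--parser", parser),
        ("--ocr", ocr),
        ("--concurrency", concurrency),
        ("--output", output),
    ]
    for flag, value in optional:
        if value is not None:
            command += [flag, str(value)]
    command.append(source)
    return command


command = build_extract_command(
    "invoice.yaml", "./invoices/", output="results.csv", parser="pdfplumber"
)
printable = shlex.join(command)  # shell-safe string for logging
```

Pass `command` to `subprocess.run(command, check=True)` to execute it. Using a list rather than a shell string avoids quoting problems with paths that contain spaces.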

Python API

from petey import extract, load_schema

schema, spec = load_schema("invoice.yaml")

result = extract("invoice.pdf", schema)

# With options
result = extract(
    "invoice.pdf",
    schema,
    model="claude-sonnet-4-6",
    parser="marker",
    ocr_backend="chandra",
)
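To process a folder from Python, one option is a small loop around `extract`. The wrapper below is a hypothetical sketch: it takes the extraction function as a parameter so it can be tried without API keys, and it assumes nothing about the shape of the result `extract` returns.

```python
import tempfile
from pathlib import Path


def extract_directory(directory, schema, extract_fn, **options):
    """Call extract_fn on every .pdf in `directory`, recording errors per file."""
    results = {}
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        try:
            results[pdf_path.name] = extract_fn(str(pdf_path), schema, **options)
        except Exception as error:  # keep going if one file fails
            results[pdf_path.name] = {"error": str(error)}
    return results


# Demo with a stand-in extractor and empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("b.pdf", "a.pdf", "notes.txt"):
        Path(folder, name).write_bytes(b"")
    demo = extract_directory(
        folder, schema=None,
        extract_fn=lambda path, schema: {"file": Path(path).name},
    )

# With Petey installed, the real call would look like:
#   from petey import extract, load_schema
#   schema, spec = load_schema("invoice.yaml")
#   results = extract_directory("./invoices", schema, extract, model="claude-sonnet-4-6")
```

Catching exceptions per file means one unreadable PDF does not discard the results already collected for the others.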

Optional Dependencies

pip install petey                    # Core (pymupdf, pdfplumber, tesseract, litellm)
pip install petey[mistral-ocr]       # + Mistral OCR
pip install petey[tabula]            # + Tabula table extraction (requires Java)
pip install petey[unstructured]      # + Unstructured API client
pip install petey[all]               # Everything

Release history

0.5.1

May 27, 2026

0.5.0

May 24, 2026

0.4.1

May 9, 2026

0.4.0

May 9, 2026

0.3.2

Apr 12, 2026

0.3.1

Apr 10, 2026

0.3.0

Mar 31, 2026

This version

0.2.1

Mar 29, 2026

0.2.0

Mar 26, 2026

0.1.9

Mar 24, 2026

0.1.8

Mar 21, 2026

0.1.6

Mar 16, 2026

0.1.5

Mar 16, 2026

0.1.4

Mar 16, 2026

0.1.3

Mar 16, 2026

0.1.2

Mar 10, 2026

0.1.1

Mar 9, 2026

0.1.0

Mar 9, 2026

Download files

Source Distribution

petey-0.2.1.tar.gz (45.3 kB)

Uploaded Mar 29, 2026 Source

Built Distribution

petey-0.2.1-py3-none-any.whl (36.5 kB)

Uploaded Mar 29, 2026 Python 3

File details

Details for the file petey-0.2.1.tar.gz.

File metadata

Download URL: petey-0.2.1.tar.gz
Upload date: Mar 29, 2026
Size: 45.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for petey-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`80fa05ad0171bbab11adaadbf067a532487cbd6d347b86fc4741725e4cccb718`
MD5	`66422ae2ffa6cb696c9a2b8efcfd4bae`
BLAKE2b-256	`bba0fb63834e914cec7ff57299b1bfaa4d56c4207686d53dfaedd50ecbe2e7f0`

File details

Details for the file petey-0.2.1-py3-none-any.whl.

File metadata

Download URL: petey-0.2.1-py3-none-any.whl
Upload date: Mar 29, 2026
Size: 36.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.5

File hashes

Hashes for petey-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`637dfa15f66e4e1949625799825768748e72c0c7fb3cbde59d04fe4e2b8da670`
MD5	`dbfd3f97b1daecdd171ee863754da494`
BLAKE2b-256	`9a35d43b9c29f8e80feb6d447006a5b391b064da9b72561ebf7fba65751496ee`
