Petey — The Easy PDF Extractor

Petey

The Easy PDF Extractor. Define a YAML schema, point at your PDFs, get structured data back.

Petey uses LLMs (OpenAI or Anthropic) to extract structured fields from PDF documents. You describe what you want in a YAML schema, and Petey handles text extraction, LLM prompting, and output formatting.

Install

pip install .

Or in editable/dev mode:

pip install -e ".[dev]"

Quick start

  1. Set your API key:
export OPENAI_API_KEY=sk-...
# or
export ANTHROPIC_API_KEY=sk-ant-...
  2. Write a schema (YAML):
name: Invoice
fields:
  vendor:
    type: string
    description: Company name on the invoice
  amount:
    type: number
    description: Total amount due
  date:
    type: date
    description: Invoice date
  status:
    type: enum
    values: [Paid, Unpaid, Overdue]
    description: Payment status
  3. Run it:
petey extract --schema invoice.yaml ./invoices/ -o results.csv

CLI usage

petey extract --schema SCHEMA PATHS... [options]
Option              Description
--schema, -s        YAML schema file (required)
--model, -m         Model ID (default: gpt-4.1-mini, or set PETEY_MODEL)
--output, -o        Output file (.csv, .json, or .jsonl)
--format, -f        Output format (inferred from -o if not set)
--concurrency, -c   Concurrent API requests (default: 10)
--instructions, -i  Additional extraction instructions
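
When --format is omitted, the format is inferred from the -o extension. The snippet below is a minimal sketch of how that kind of inference might work. `infer_format` and `SUPPORTED` are hypothetical names used for illustration and are not part of Petey's API. What Petey does when neither option is given is not documented here, so the sketch simply returns None in that case.

```python
from pathlib import Path
from typing import Optional

# Extensions listed for --output in the options table above.
SUPPORTED = {".csv": "csv", ".json": "json", ".jsonl": "jsonl"}


def infer_format(output: Optional[str], fmt: Optional[str] = None) -> Optional[str]:
    """Return the explicit format if given, else infer it from the output suffix.

    Hypothetical helper: Petey's real logic may differ.
    """
    if fmt is not None:
        return fmt  # an explicit --format always wins
    if output is None:
        return None  # nothing to infer from; the caller decides
    suffix = Path(output).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"cannot infer format from {output!r}")
    return SUPPORTED[suffix]


print(infer_format("results.csv"))
print(infer_format("report.txt", fmt="json"))
```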

PATHS can be individual PDF files or directories (all .pdf files inside will be processed).
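
The path handling described above can be pictured as a small expansion step. This is a sketch only: `collect_pdfs` is a hypothetical helper, and it assumes directories are scanned non-recursively and that the .pdf match is case-insensitive. Neither detail is confirmed by this README.

```python
from pathlib import Path
from typing import Iterable, List


def collect_pdfs(paths: Iterable[str]) -> List[Path]:
    """Expand a mix of files and directories into a flat list of PDF paths.

    Hypothetical helper; assumes a non-recursive directory scan.
    """
    found: List[Path] = []
    for raw in paths:
        p = Path(raw)
        if p.is_dir():
            # Take every .pdf file directly inside the directory, in sorted order.
            found.extend(
                sorted(q for q in p.iterdir() if q.is_file() and q.suffix.lower() == ".pdf")
            )
        elif p.suffix.lower() == ".pdf":
            found.append(p)
        # Anything else (for example a .txt file) is skipped.
    return found
```

For example, `collect_pdfs(["report.pdf", "./pdfs/"])` would return report.pdf followed by the PDFs found directly inside ./pdfs/.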

Examples:

# Single file, JSON to stdout
petey extract -s schema.yaml report.pdf -f json

# Directory, CSV output
petey extract -s schema.yaml ./pdfs/ -o results.csv

# Anthropic model, limited concurrency
petey extract -s schema.yaml ./pdfs/ -m claude-haiku-4-5-20251001 -c 5 -o out.jsonl

Python API

import asyncio

from petey import load_schema, extract, extract_batch

# Load the schema: returns the generated Pydantic model and the raw spec dict
response_model, spec = load_schema("schema.yaml")

# Single file (sync); the `model` keyword is the LLM model ID
result = extract("doc.pdf", response_model, model="gpt-4.1-mini")
print(result.model_dump())

# Batch (async)
results = asyncio.run(
    extract_batch(
        ["a.pdf", "b.pdf", "c.pdf"],
        response_model,
        model="gpt-4.1-mini",
        concurrency=10,
    )
)
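
Since `result.model_dump()` yields a plain dict, results from the Python API can be written out with the standard library. The following is a sketch under that assumption. `rows_to_csv` is a hypothetical helper, not something Petey ships, and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, object]]) -> str:
    """Serialize a list of dicts to CSV text; None values become empty cells."""
    # Build the header from the keys in first-seen order.
    fieldnames: List[str] = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative rows shaped like model_dump() output for the Invoice schema.
rows = [
    {"vendor": "Acme", "amount": 12.5},
    {"vendor": "Beta", "amount": None},  # a field the LLM could not find
]
print(rows_to_csv(rows))
```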

Functions

  • load_schema(path) — Load a YAML schema, returns (PydanticModel, spec_dict)
  • build_model(spec) — Build a Pydantic model from a spec dict directly
  • extract(pdf_path, response_model, *, model, api_key, instructions) — Extract from one PDF (sync)
  • extract_async(...) — Same as above, async
  • extract_batch(pdf_paths, response_model, *, model, api_key, instructions, concurrency, on_result) — Extract from multiple PDFs concurrently. Optional on_result(path, data) callback fires as each file completes.
  • extract_text(pdf_path) — Just get the raw text from a PDF (PyMuPDF)
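
To illustrate what the `concurrency` and `on_result` parameters of extract_batch mean, here is a dependency-free sketch of a bounded-concurrency batch runner. It is not Petey's implementation: `run_batch` and `fake_extract` are hypothetical stand-ins, and returning results in input order is an assumption of the sketch.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def run_batch(
    paths: List[str],
    worker: Callable[[str], Awaitable[dict]],
    concurrency: int = 10,
    on_result: Optional[Callable[[str, dict], None]] = None,
) -> List[dict]:
    """Run `worker` over all paths with at most `concurrency` calls in flight."""
    sem = asyncio.Semaphore(concurrency)

    async def one(path: str) -> dict:
        async with sem:  # wait for a free slot before starting the request
            data = await worker(path)
        if on_result is not None:
            on_result(path, data)  # fires as soon as this file completes
        return data

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(one(p) for p in paths))


async def fake_extract(path: str) -> dict:
    """Stand-in for a real extraction call (hypothetical)."""
    await asyncio.sleep(0)
    return {"source": path}


seen: List[str] = []
results = asyncio.run(
    run_batch(
        ["a.pdf", "b.pdf"],
        fake_extract,
        concurrency=1,
        on_result=lambda path, data: seen.append(path),
    )
)
print(results)
```

A callback like this is a natural place to update a progress bar or append each record to a file while the rest of the batch is still running.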

Schema format

name: MySchema          # optional, used for the Pydantic model name
record_type: array      # optional, use for table extraction (multiple records per doc)
instructions: |         # optional, appended to system prompt
  Focus on the header section for dates.

fields:
  field_name:
    type: string        # string, number, date, enum, or array
    description: What this field contains

  category:
    type: enum
    values: [A, B, C]   # optional — omit to let the LLM infer values

  line_items:
    type: array
    description: Table rows
    fields:
      item: { type: string, description: Item name }
      qty: { type: number, description: Quantity }

All fields are nullable — the LLM returns null for anything it can't find.
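
As a rough illustration of the nullable rule, the sketch below checks an extracted record against a `fields` spec using only the standard library. Petey itself builds a Pydantic model for this. `PY_TYPES` and `check_record` are hypothetical names, and the type mapping is an assumption made for the example.

```python
from datetime import date
from typing import Any, Dict, List

# Assumed mapping from schema type names to Python types (illustrative only).
PY_TYPES = {"string": str, "number": (int, float), "date": date, "array": list}


def check_record(fields: Dict[str, dict], record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record fits the spec."""
    problems: List[str] = []
    for name, spec in fields.items():
        value = record.get(name)
        if value is None:
            continue  # every field is nullable, so missing or null is fine
        kind = spec["type"]
        if kind == "enum":
            allowed = spec.get("values")  # omitted values means anything goes
            if allowed is not None and value not in allowed:
                problems.append(f"{name}: {value!r} not in {allowed}")
        elif not isinstance(value, PY_TYPES[kind]):
            problems.append(f"{name}: expected {kind}, got {type(value).__name__}")
    return problems


fields = {
    "vendor": {"type": "string"},
    "amount": {"type": "number"},
    "status": {"type": "enum", "values": ["Paid", "Unpaid", "Overdue"]},
}
print(check_record(fields, {"vendor": "Acme", "amount": None}))
```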

Development

make install   # create venv + install with dev deps
make test      # run tests
make clean     # remove venv
