Extract tables from PDFs using Mistral OCR

Alice PDF

CLI tool that extracts tables from PDFs and converts them to machine-readable CSV files. Four engines are available: Camelot (default, free), Mistral OCR (Pixtral vision model), AWS Textract, and pdfplumber.

Dedicated to Alice Corona and Marco Corona, and the entire onData community.

Features

  • Four extraction engines: Camelot (free, local, native PDFs), Mistral (schema-driven, scanned PDFs), AWS Textract (managed service), or pdfplumber (robust, works on both native and scanned PDFs)
  • Extract tables from multi-page PDFs
  • Support page selection (ranges or lists)
  • Optional YAML schema for improved extraction accuracy (Mistral only)
  • CSV output per page or merged into single file
  • Configurable DPI and engine-specific options

Installation

Prerequisites: Python 3.8+.

Quick install (pip): pip install -U alice-pdf

Install globally from PyPI (choose one):

  • pip install alice-pdf
  • uv tool install alice-pdf (requires uv)

Upgrade to the latest release at any time:

pip install -U alice-pdf
# or
uv tool upgrade alice-pdf

Requirements

For Camelot engine:

  • Python 3.8+
  • camelot-py library (included in install)
  • Works with native PDFs (not scanned images)

For Mistral engine:

  • Python 3.8+
  • Mistral API key (see Setup)

For pdfplumber engine:

  • Python 3.8+
  • pdfplumber library (included in install)
  • Works on both native and scanned PDFs
  • Handles complex table structures better than Camelot
  • Free and local extraction

For Textract engine:

  • Python 3.8+
  • AWS credentials with Textract permissions
  • boto3 library (included in install)

Usage

Setup

Camelot (default, no setup needed):

No API key required! Just install and use.

Mistral:

Option 1 - Environment variables (recommended for uv run):

export MISTRAL_API_KEY="your-api-key"

Option 2 - CLI parameters (recommended for uv tool install):

alice-pdf input.pdf output/ --engine mistral --api-key "your-api-key"
# alias: --mistral-api-key

Option 3 - .env file (only works with uv run, not with uv tool install):

# Create .env file in project directory
echo 'MISTRAL_API_KEY="your-api-key"' > .env
uv run alice-pdf input.pdf output/ --engine mistral
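
The options above boil down to "take the key from the CLI if given, otherwise from the environment". The sketch below illustrates that lookup. `resolve_mistral_api_key` is a hypothetical helper, and the precedence (CLI value first, then `MISTRAL_API_KEY`) is an assumption, not confirmed behaviour of alice-pdf.

```python
import os


def resolve_mistral_api_key(cli_value=None, env=None):
    """Return the API key from the CLI value or, failing that, from MISTRAL_API_KEY.

    Assumed precedence: an explicit --api-key wins over the environment.
    """
    env = os.environ if env is None else env
    key = cli_value or env.get("MISTRAL_API_KEY")
    if not key:
        raise ValueError(
            "Mistral API key not found: pass --api-key or set MISTRAL_API_KEY"
        )
    return key
```

Passing `env` explicitly (a plain dict) makes the helper easy to test without touching the real environment.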

Textract:

Option 1 - Environment variables (recommended):

export AWS_ACCESS_KEY_ID="your-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="eu-west-1"

Option 2 - CLI parameters:

alice-pdf input.pdf output/ --engine textract \
  --aws-region eu-west-1 \
  --aws-access-key-id "your-key-id" \
  --aws-secret-access-key "your-secret-key"

Costs: the Textract engine here uses only FeatureTypes=["TABLES"] to keep the cost at about 0.015 USD/page. The FORMS feature (about 0.050 USD/page) is not enabled.
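
For a rough budget, multiply the page count by the per-page price quoted above. This is only a back-of-the-envelope sketch: `estimate_textract_cost` is a hypothetical helper, and the 0.015 USD/page figure is the approximate value from this README, not a live AWS price.

```python
from decimal import Decimal

# Approximate TABLES price per page, as quoted in this README (assumption).
TABLES_USD_PER_PAGE = Decimal("0.015")


def estimate_textract_cost(pages, usd_per_page=TABLES_USD_PER_PAGE):
    """Return the estimated Textract cost in USD for the given number of pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * usd_per_page


# Example: a 200-page PDF costs roughly 3 USD at this rate.
print(estimate_textract_cost(200))
```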

Note: .env file support is only available for Mistral and only when running with uv run. For Textract, always use environment variables or CLI parameters.

Basic commands

# Extract with Camelot (default, free, no API)
alice-pdf input.pdf output/

# Extract with Mistral (for scanned PDFs)
alice-pdf input.pdf output/ --engine mistral

# Extract with Textract
alice-pdf input.pdf output/ --engine textract --aws-region eu-west-1

# Extract with pdfplumber (robust, works on both native and scanned PDFs)
alice-pdf input.pdf output/ --engine pdfplumber

# Extract with pdfplumber with minimum table size constraints
alice-pdf input.pdf output/ --engine pdfplumber --pdfplumber-min-rows 2 --pdfplumber-min-cols 3

# Extract with Camelot's stream flavor (for tables without borders)
alice-pdf input.pdf output/ --engine camelot --camelot-flavor stream

# Camelot: fix for tables with merged cells
alice-pdf input.pdf output/ --engine camelot --camelot-split-text --merge

# Specific pages
alice-pdf input.pdf output/ --pages "1-3,5"

# Merge all tables into one CSV
alice-pdf input.pdf output/ --merge

# With table schema for better accuracy (Mistral only)
alice-pdf input.pdf output/ --engine mistral --schema table_schema.yaml

# Debug mode
alice-pdf input.pdf output/ --debug

Options

Common:

  • --engine {mistral,textract,camelot,pdfplumber}: Extraction engine (default: camelot)
  • --pages: Pages to process (default: all). Examples: "1", "1-3", "1,3,5"
  • --dpi: Image resolution (default: 150)
  • -m, --merge: Merge all tables into single CSV
  • --no-resume: Clear output and reprocess all pages
  • -d, --debug: Enable debug logging
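
The `--pages` syntax (single pages, ranges, comma-separated lists) can be expanded into page numbers in a few lines of Python. This is a minimal sketch of the idea, not necessarily how alice-pdf parses the option: `parse_pages` is a hypothetical helper, and treating a missing value as "all pages" is assumed from the default described above.

```python
def parse_pages(spec, total_pages):
    """Expand a spec such as "1-3,5" into a sorted list of 1-based page numbers."""
    if spec is None or spec.strip().lower() == "all":
        return list(range(1, total_pages + 1))

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))

    out_of_bounds = sorted(p for p in pages if p < 1 or p > total_pages)
    if out_of_bounds:
        raise ValueError(f"pages out of bounds: {out_of_bounds}")
    return sorted(pages)
```

For example, `parse_pages("1-3,5", 10)` yields pages 1, 2, 3 and 5.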

Mistral-specific:

  • --model: Mistral model (default: pixtral-12b-2409)
  • --schema: Path to YAML/JSON schema file for custom prompt generation
  • --prompt: Custom prompt (overrides --schema)
  • --api-key: Mistral API key (alternative to env var)
  • --timeout-ms: HTTP timeout in milliseconds (default: 60000)

Textract-specific:

  • --aws-region: AWS region (or set AWS_DEFAULT_REGION)
  • --aws-access-key-id: AWS access key (or set AWS_ACCESS_KEY_ID)
  • --aws-secret-access-key: AWS secret key (or set AWS_SECRET_ACCESS_KEY)

Camelot-specific:

  • --camelot-flavor {lattice,stream}: Extraction mode (default: lattice)
    • lattice: For tables with visible borders
    • stream: For tables without borders (whitespace-based)
  • --camelot-split-text: Split text spanning multiple cells (useful for complex tables with merged cells)

pdfplumber-specific:

  • --pdfplumber-min-rows: Minimum number of rows for table detection (default: 1)
  • --pdfplumber-min-cols: Minimum number of columns for table detection (default: 1)
  • --pdfplumber-strip-text / --no-pdfplumber-strip-text: Enable/disable whitespace stripping in extracted text (default: strip)

Table Schema

To improve extraction accuracy, create a YAML file describing the table structure:

name: "housing_properties"
description: "Housing properties table"

columns:
  - name: "PROPERTY"
    description: "Property owner name"
    examples:
      - "ATER DI VENEZIA"
      - "COMUNE DI VENEZIA"

  - name: "UNIT"
    description: "Housing unit number"
    examples:
      - "2950010"
      - "170"

notes:
  - "Keep columns separate"
  - "Do NOT merge adjacent cells"
  - "All rows should have exactly N columns"
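
To give an idea of how such a schema can drive extraction, the sketch below turns an already-loaded schema (a plain dict mirroring the YAML above) into a text prompt. `build_prompt` and the prompt wording are hypothetical illustrations; the actual output of `prompt_generator.py` may differ.

```python
# The YAML schema above, as it would look once loaded into Python.
SCHEMA = {
    "name": "housing_properties",
    "description": "Housing properties table",
    "columns": [
        {"name": "PROPERTY", "description": "Property owner name",
         "examples": ["ATER DI VENEZIA", "COMUNE DI VENEZIA"]},
        {"name": "UNIT", "description": "Housing unit number",
         "examples": ["2950010", "170"]},
    ],
    "notes": ["Keep columns separate", "Do NOT merge adjacent cells"],
}


def build_prompt(schema):
    """Render a schema dict as a plain-text extraction prompt (illustrative wording)."""
    lines = [
        "Extract the table {} ({}) as JSON.".format(
            schema["name"], schema["description"]
        ),
        "Columns, in order:",
    ]
    for col in schema["columns"]:
        line = "- {}: {}".format(col["name"], col["description"])
        examples = ", ".join(col.get("examples", []))
        if examples:
            line += " (e.g. {})".format(examples)
        lines.append(line)
    notes = schema.get("notes", [])
    if notes:
        lines.append("Rules:")
        lines.extend("- {}".format(note) for note in notes)
    return "\n".join(lines)


print(build_prompt(SCHEMA))
```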

How it works

Mistral engine

  1. Converts PDF pages to raster images (150 DPI default)
  2. Sends images to Mistral API with structured prompt
  3. Mistral API (Pixtral) analyzes image and extracts tables as JSON
  4. Converts JSON to pandas DataFrame
  5. Saves CSV per page + optional merge
  6. Adds 'page' column for traceability
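
Steps 4 to 6 can be pictured with the sketch below, which converts a JSON table into CSV text and appends the `page` column. Two assumptions apply: the JSON shape (`headers` plus `rows`) is hypothetical, and the standard-library `csv` module stands in for pandas here to keep the example dependency-free.

```python
import csv
import io
import json


def table_json_to_csv(raw_json, page):
    """Convert a JSON table (assumed shape: headers + rows) into CSV text with a page column."""
    table = json.loads(raw_json)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(list(table["headers"]) + ["page"])
    for row in table["rows"]:
        # Every row records the page it came from, for traceability.
        writer.writerow(list(row) + [page])
    return buffer.getvalue()


raw = '{"headers": ["PROPERTY", "UNIT"], "rows": [["ATER DI VENEZIA", "170"]]}'
print(table_json_to_csv(raw, page=3))
```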

Progressive Timeout Retry:

When a page times out, the tool automatically retries with doubled timeouts:

  • Attempt 1: 60 seconds (default timeout)
  • Attempt 2: 120 seconds (2x timeout, if first attempt times out)
  • Attempt 3: 240 seconds (4x timeout, if second attempt times out)

After 3 failed attempts, the page is skipped and processing continues with the next page. Non-timeout errors (authentication, rate limits, etc.) skip retry and move to the next page immediately.
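
The retry policy described above can be sketched as follows. `call_with_progressive_timeout` is a hypothetical helper, and `request` is any callable that accepts a timeout in seconds and raises `TimeoutError` when it expires; this illustrates the policy and is not the tool's actual code.

```python
def call_with_progressive_timeout(request, base_timeout_s=60.0, max_attempts=3):
    """Call request(timeout), doubling the timeout after each timeout.

    Returns the result of the first successful call, or None if every
    attempt timed out (the caller then skips the page). Any other
    exception propagates immediately, without retrying.
    """
    timeout = base_timeout_s
    for _ in range(max_attempts):
        try:
            return request(timeout)
        except TimeoutError:
            timeout *= 2  # 60 s, then 120 s, then 240 s with the defaults
    return None
```

Because only `TimeoutError` is caught, authentication or rate-limit errors surface to the caller right away, matching the "skip retry" behaviour.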

Textract engine

  1. Converts PDF pages to raster images (150 DPI default)
  2. Sends images to AWS Textract API
  3. Textract analyzes document structure and extracts tables
  4. Converts Textract response to pandas DataFrame
  5. Saves CSV per page + optional merge
  6. Adds 'page' column for traceability

Note: Textract does not support schema/prompt customization. Use Mistral if you need custom prompts.

Camelot engine

  1. Reads native PDF structure (no image conversion needed)
  2. Detects tables using borders (lattice) or whitespace (stream)
  3. Converts to pandas DataFrame
  4. Saves CSV per page + optional merge
  5. Adds 'page' column for traceability

Best for: Native PDFs (not scanned) with clear table structure. Fast and free (local processing).

Output

Each extracted table is saved as:

  • {pdf_name}_page{N}_table{i}.csv: CSV per table
  • {pdf_name}_merged.csv: All tables merged (if --merge)
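
If you post-process the results in a script, the naming pattern above can be reproduced like this. Both helpers are hypothetical, and the sketch assumes `{pdf_name}` is the input file name without its extension; whether table indices start at 0 or 1 is not stated here, so check your output folder.

```python
from pathlib import Path


def table_csv_path(pdf_path, output_dir, page, table_index):
    """Build the expected path of the CSV for one table on one page."""
    stem = Path(pdf_path).stem  # assumed meaning of {pdf_name}
    return Path(output_dir) / f"{stem}_page{page}_table{table_index}.csv"


def merged_csv_path(pdf_path, output_dir):
    """Build the expected path of the merged CSV (written with --merge)."""
    return Path(output_dir) / f"{Path(pdf_path).stem}_merged.csv"


print(table_csv_path("input.pdf", "output", page=2, table_index=1).name)
print(merged_csv_path("input.pdf", "output").name)
```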

Examples

Example 1: Basic extraction (Camelot)

alice-pdf document.pdf output/

Example 2: Mistral extraction (for scanned PDFs)

alice-pdf document.pdf output/ \
  --engine mistral \
  --merge

Example 3: pdfplumber extraction (robust, works on both native and scanned PDFs)

alice-pdf document.pdf output/ \
  --engine pdfplumber \
  --pdfplumber-min-rows 2 \
  --pdfplumber-min-cols 3 \
  --merge

Example 4: Textract extraction

alice-pdf document.pdf output/ \
  --engine textract \
  --aws-region eu-west-1 \
  --merge

Example 5: Mistral with schema and merge

alice-pdf document.pdf output/ \
  --engine mistral \
  --schema table_schema.yaml \
  --pages "2-10" \
  --merge

Example 6: High resolution and debug

alice-pdf document.pdf output/ \
  --dpi 300 \
  --debug

Choosing an engine

Use Mistral when:

  • You need custom prompts or schema-driven extraction
  • Tables have complex structure requiring specific instructions
  • You want fine control over extraction behavior

Use Textract when:

  • You need fast, reliable extraction on standard tables
  • You prefer managed AWS infrastructure
  • Schema customization is not required

Use Camelot when:

  • PDF is native (not scanned)
  • Tables have clear structure (borders or consistent spacing)
  • You want local, free extraction (no API costs)
  • Speed is critical for simple PDFs

Use pdfplumber when:

  • PDF can be native or scanned
  • Tables have complex structures or inconsistent borders
  • You want robust local extraction (no API costs)
  • Camelot fails to detect tables properly

Project Structure

alice-pdf/
├── alice_pdf/          # Main package source code
│   ├── cli.py          # CLI entry point and argument parsing
│   ├── extractor.py    # Mistral engine implementation
│   ├── textract_extractor.py  # AWS Textract engine
│   ├── camelot_extractor.py   # Camelot engine
│   ├── pdfplumber_extractor.py # pdfplumber engine
│   └── prompt_generator.py    # YAML schema to prompt converter
├── docs/               # Documentation
│   └── best-practices.md  # Comprehensive usage guide
├── sample/             # Example PDFs and schemas
│   ├── *.pdf           # Sample PDF files for testing
│   └── *.yaml          # Example table schemas
├── openspec/           # OpenSpec specifications
│   ├── AGENTS.md       # Agent instructions
│   └── specs/          # Change proposals and documentation
├── tests/              # Unit tests
└── tmp/                # Temporary test outputs (gitignored)

Key directories:

  • alice_pdf/: Core library code
  • docs/: User guides and best practices
  • sample/: Example files and schemas for testing
  • openspec/: Project specifications using OpenSpec format
  • tmp/: Temporary directory for test outputs (not tracked in git)

License

MIT License - Copyright (c) 2025 Andrea Borruso aborruso@gmail.com
