Extract tables from PDFs using Mistral OCR
Alice PDF
CLI tool to extract tables from PDFs using Camelot (default, free), Mistral OCR (Pixtral vision model), AWS Textract, or pdfplumber and convert them to machine-readable CSV files.
Dedicated to Alice Corona and Marco Corona, and the entire onData community.
Features
- Four extraction engines: Camelot (free, local, native PDFs), Mistral (schema-driven, scanned PDFs), AWS Textract (managed service), or pdfplumber (robust, works on both native and scanned PDFs)
- Extract tables from multi-page PDFs
- Support page selection (ranges or lists)
- Optional YAML schema for improved extraction accuracy (Mistral only)
- CSV output per page or merged into single file
- Configurable DPI and engine-specific options
Installation
Prerequisites: Python 3.8+.
Quick install (pip): pip install -U alice-pdf
Install globally from PyPI (choose one):
- pip install alice-pdf
- uv tool install alice-pdf (requires uv)
Upgrade to the latest release at any time:
pip install -U alice-pdf
# or
uv tool upgrade alice-pdf
Requirements
For Camelot engine:
- Python 3.8+
- camelot-py library (included in install)
- Works with native PDFs (not scanned images)
For Mistral engine:
- Python 3.8+
- Mistral API key (https://console.mistral.ai/)
- Best for scanned PDFs and complex tables
For pdfplumber engine:
- Python 3.8+
- pdfplumber library (included in install)
- Works on both native and scanned PDFs
- Handles complex table structures better than Camelot
- Free and local extraction
For Textract engine:
- Python 3.8+
- AWS credentials with Textract permissions
- boto3 library (included in install)
Usage
Setup
Camelot (default, no setup needed):
No API key required! Just install and use.
Mistral:
Option 1 - Environment variables (recommended for uv run):
export MISTRAL_API_KEY="your-api-key"
Option 2 - CLI parameters (recommended for uv tool install):
alice-pdf input.pdf output/ --engine mistral --api-key "your-api-key"
# alias: --mistral-api-key
Option 3 - .env file (only works with uv run, not with uv tool install):
# Create .env file in project directory
echo 'MISTRAL_API_KEY="your-api-key"' > .env
uv run alice-pdf input.pdf output/ --engine mistral
Textract:
Option 1 - Environment variables (recommended):
export AWS_ACCESS_KEY_ID="your-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="eu-west-1"
Option 2 - CLI parameters:
alice-pdf input.pdf output/ --engine textract \
--aws-region eu-west-1 \
--aws-access-key-id "your-key-id" \
--aws-secret-access-key "your-secret-key"
Costs: the Textract engine here uses only FeatureTypes=["TABLES"] to keep the cost at about 0.015 USD per page. The FORMS feature (about 0.050 USD per page) is not enabled.
Note: .env file support is only available for Mistral and only when running with uv run.
For Textract, always use environment variables or CLI parameters.
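The Textract per-page prices quoted above can be turned into a quick budget estimate. The following is a back-of-the-envelope sketch only: the function name is hypothetical, and the rates are the approximate figures given in this README, not live AWS pricing.

```python
# Approximate per-page rates quoted in this README (USD); not live AWS pricing.
TABLES_RATE = 0.015
FORMS_RATE = 0.050  # listed for comparison only; the Textract engine here does not enable FORMS


def estimate_textract_cost(pages: int, rate: float = TABLES_RATE) -> float:
    """Return an estimated cost in USD for processing `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return round(pages * rate, 3)


# Estimate for a 200-page document with TABLES only.
print(estimate_textract_cost(200))
```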
Basic commands
# Extract with Camelot (default, free, no API)
alice-pdf input.pdf output/
# Extract with Mistral (for scanned PDFs)
alice-pdf input.pdf output/ --engine mistral
# Extract with Textract
alice-pdf input.pdf output/ --engine textract --aws-region eu-west-1
# Extract with pdfplumber (robust, works on both native and scanned PDFs)
alice-pdf input.pdf output/ --engine pdfplumber
# Extract with pdfplumber with minimum table size constraints
alice-pdf input.pdf output/ --engine pdfplumber --pdfplumber-min-rows 2 --pdfplumber-min-cols 3
# Extract with Camelot (local, fast for native PDFs)
alice-pdf input.pdf output/ --engine camelot --camelot-flavor stream
# Camelot: fix for tables with merged cells
alice-pdf input.pdf output/ --engine camelot --camelot-split-text --merge
# Specific pages
alice-pdf input.pdf output/ --pages "1-3,5"
# Merge all tables into one CSV
alice-pdf input.pdf output/ --merge
# With table schema for better accuracy (Mistral only)
alice-pdf input.pdf output/ --engine mistral --schema table_schema.yaml
# Debug mode
alice-pdf input.pdf output/ --debug
Options
Common:
- --engine {mistral,textract,camelot,pdfplumber}: Extraction engine (default: camelot)
- --pages: Pages to process (default: all). Examples: "1", "1-3", "1,3,5"
- --dpi: Image resolution (default: 150)
- -m, --merge: Merge all tables into single CSV
- --no-resume: Clear output and reprocess all pages
- -d, --debug: Enable debug logging
Mistral-specific:
- --model: Mistral model (default: pixtral-12b-2409)
- --schema: Path to YAML/JSON schema file for custom prompt generation
- --prompt: Custom prompt (overrides --schema)
- --api-key: Mistral API key (alternative to env var)
- --timeout-ms: HTTP timeout in milliseconds (default: 60000)
Textract-specific:
- --aws-region: AWS region (or set AWS_DEFAULT_REGION)
- --aws-access-key-id: AWS access key (or set AWS_ACCESS_KEY_ID)
- --aws-secret-access-key: AWS secret key (or set AWS_SECRET_ACCESS_KEY)
Camelot-specific:
- --camelot-flavor {lattice,stream}: Extraction mode (default: lattice)
  - lattice: For tables with visible borders
  - stream: For tables without borders (whitespace-based)
- --camelot-split-text: Split text spanning multiple cells (useful for complex tables with merged cells)
pdfplumber-specific:
- --pdfplumber-min-rows: Minimum number of rows for table detection (default: 1)
- --pdfplumber-min-cols: Minimum number of columns for table detection (default: 1)
- --pdfplumber-strip-text/--no-pdfplumber-strip-text: Enable/disable whitespace stripping in extracted text (default: strip)
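The --pages option accepts single pages, ranges, and comma-separated lists such as "1-3,5". A minimal sketch of how such a specification could be expanded is shown below. The function name parse_pages is hypothetical and this is not necessarily how alice-pdf parses the option internally.

```python
from typing import List


def parse_pages(spec: str) -> List[int]:
    """Expand a page spec such as "1-3,5" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range like "1-3" is inclusive on both ends.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)


print(parse_pages("1-3,5"))  # [1, 2, 3, 5]
```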
Table Schema
To improve extraction accuracy, create a YAML file describing the table structure:
name: "housing_properties"
description: "Housing properties table"
columns:
  - name: "PROPERTY"
    description: "Property owner name"
    examples:
      - "ATER DI VENEZIA"
      - "COMUNE DI VENEZIA"
  - name: "UNIT"
    description: "Housing unit number"
    examples:
      - "2950010"
      - "170"
notes:
  - "Keep columns separate"
  - "Do NOT merge adjacent cells"
  - "All rows should have exactly N columns"
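To illustrate how a schema like this could steer the model, here is a rough sketch that renders the schema as plain-text instructions. It is an assumption-laden illustration, not the code in prompt_generator.py: the function name schema_to_prompt and the prompt wording are hypothetical, and the schema is written as a Python dict (the shape the YAML above would have once loaded) to keep the example dependency-free.

```python
def schema_to_prompt(schema: dict) -> str:
    """Render a table schema as plain-text instructions for a vision model."""
    title = f"Extract the table '{schema['name']}': {schema.get('description', '')}"
    lines = [title.rstrip(), "Columns, in order:"]
    for index, column in enumerate(schema.get("columns", []), start=1):
        line = f"{index}. {column['name']}: {column.get('description', '')}"
        examples = ", ".join(column.get("examples", []))
        if examples:
            line += f" (e.g. {examples})"
        lines.append(line)
    for note in schema.get("notes", []):
        lines.append(f"Note: {note}")
    return "\n".join(lines)


# Same content as the YAML example, expressed as a dict.
schema = {
    "name": "housing_properties",
    "description": "Housing properties table",
    "columns": [
        {"name": "PROPERTY", "description": "Property owner name",
         "examples": ["ATER DI VENEZIA", "COMUNE DI VENEZIA"]},
        {"name": "UNIT", "description": "Housing unit number",
         "examples": ["2950010", "170"]},
    ],
    "notes": ["Keep columns separate", "Do NOT merge adjacent cells"],
}
print(schema_to_prompt(schema))
```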
How it works
Mistral engine
- Converts PDF pages to raster images (150 DPI default)
- Sends images to Mistral API with structured prompt
- Mistral API (Pixtral) analyzes image and extracts tables as JSON
- Converts JSON to pandas DataFrame
- Saves CSV per page + optional merge
- Adds 'page' column for traceability
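The last steps of the Mistral pipeline above (JSON to table, CSV output, 'page' column) can be sketched as follows. The JSON shape with "headers" and "rows" keys is an assumption made for illustration, since the actual response format is not documented here, and the sketch uses the standard csv module instead of pandas to stay self-contained. The function name is hypothetical.

```python
import csv
import io
import json
from typing import List


def table_json_to_csv(payload: str, page: int) -> str:
    """Turn a JSON table into CSV text, appending a 'page' column."""
    table = json.loads(payload)
    headers: List[str] = table["headers"]  # assumed key, see lead-in
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(headers + ["page"])
    for row in table["rows"]:  # assumed key, see lead-in
        # Pad short rows so every line has the same number of columns.
        padded = list(row) + [""] * (len(headers) - len(row))
        writer.writerow(padded[: len(headers)] + [page])
    return buffer.getvalue()


sample = json.dumps({
    "headers": ["PROPERTY", "UNIT"],
    "rows": [["ATER DI VENEZIA", "2950010"], ["COMUNE DI VENEZIA"]],
})
print(table_json_to_csv(sample, page=2))
```

The 'page' column is what keeps rows traceable to their source page once per-page files are merged.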
Progressive Timeout Retry:
When a page times out, the tool automatically retries with doubled timeouts:
- Attempt 1: 60 seconds (default timeout)
- Attempt 2: 120 seconds (2x timeout, if first attempt times out)
- Attempt 3: 240 seconds (4x timeout, if second attempt times out)
After 3 failed attempts, the page is skipped and processing continues with the next page. Non-timeout errors (authentication, rate limits, etc.) skip retry and move to the next page immediately.
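The retry behaviour described above can be sketched as a small wrapper. This is an illustration under stated assumptions, not the tool's actual code: the function name is hypothetical, and the built-in TimeoutError stands in for whatever timeout exception the HTTP client really raises.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_progressive_timeout(
    request: Callable[[float], T],
    base_timeout_s: float = 60.0,
    attempts: int = 3,
) -> Optional[T]:
    """Call `request(timeout)` with 1x, 2x, 4x timeouts; return None if all time out."""
    timeout = base_timeout_s
    for attempt in range(1, attempts + 1):
        try:
            return request(timeout)
        except TimeoutError:
            print(f"attempt {attempt} timed out after {timeout:.0f}s")
            timeout *= 2
        # Any other exception propagates, so the caller can skip the page without retrying.
    return None


attempted = []


def flaky_request(timeout: float) -> str:
    # Stand-in for an API call: times out twice, then succeeds.
    attempted.append(timeout)
    if len(attempted) < 3:
        raise TimeoutError
    return "table data"


result = call_with_progressive_timeout(flaky_request)
print(result, attempted)
```

Only timeouts are caught here, which mirrors the rule that authentication or rate-limit errors move straight on to the next page.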
Textract engine
- Converts PDF pages to raster images (150 DPI default)
- Sends images to AWS Textract API
- Textract analyzes document structure and extracts tables
- Converts Textract response to pandas DataFrame
- Saves CSV per page + optional merge
- Adds 'page' column for traceability
Note: Textract does not support schema/prompt customization. Use Mistral if you need custom prompts.
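For readers unfamiliar with Textract output, the step "converts Textract response" involves walking a flat list of blocks: TABLE blocks point to CELL blocks, which point to WORD blocks. The sketch below rebuilds a grid of cell text from such a list. It is not the code in textract_extractor.py; the function name is hypothetical, the sample blocks are hand-written, and the result is a plain list of rows rather than a pandas DataFrame.

```python
from typing import Dict, List


def textract_tables(blocks: List[dict]) -> List[List[List[str]]]:
    """Rebuild each TABLE block in a Textract response as a list of rows."""
    by_id: Dict[str, dict] = {block["Id"]: block for block in blocks}

    def child_ids(block: dict) -> List[str]:
        ids: List[str] = []
        for relation in block.get("Relationships", []):
            if relation["Type"] == "CHILD":
                ids.extend(relation["Ids"])
        return ids

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = [by_id[i] for i in child_ids(block) if by_id[i]["BlockType"] == "CELL"]
        n_rows = max((cell["RowIndex"] for cell in cells), default=0)
        n_cols = max((cell["ColumnIndex"] for cell in cells), default=0)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            words = [by_id[i]["Text"] for i in child_ids(cell)
                     if by_id[i]["BlockType"] == "WORD"]
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
        tables.append(grid)
    return tables


# Hand-written sample mimicking the shape of a response's "Blocks" list.
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4", "w5"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "PROPERTY"},
    {"Id": "w2", "BlockType": "WORD", "Text": "UNIT"},
    {"Id": "w3", "BlockType": "WORD", "Text": "ATER"},
    {"Id": "w4", "BlockType": "WORD", "Text": "DI"},
    {"Id": "w5", "BlockType": "WORD", "Text": "VENEZIA"},
]
print(textract_tables(sample_blocks))
```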
Camelot engine
- Reads native PDF structure (no image conversion needed)
- Detects tables using borders (lattice) or whitespace (stream)
- Converts to pandas DataFrame
- Saves CSV per page + optional merge
- Adds 'page' column for traceability
Best for: Native PDFs (not scanned) with clear table structure. Fast and free (local processing).
Output
Each extracted table is saved as:
- {pdf_name}_page{N}_table{i}.csv: CSV per table
- {pdf_name}_merged.csv: All tables merged (if --merge)
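When post-processing results in a script, it can help to compute the expected output paths. The helpers below are a hypothetical sketch that follows the naming pattern above, assuming {pdf_name} is the input file name without its extension. Whether the table index {i} starts at 0 or 1 is not stated here, so the caller supplies it.

```python
from pathlib import Path


def table_csv_path(pdf_path: str, output_dir: str, page: int, table: int) -> Path:
    """Path of the CSV for one table, following {pdf_name}_page{N}_table{i}.csv."""
    return Path(output_dir) / f"{Path(pdf_path).stem}_page{page}_table{table}.csv"


def merged_csv_path(pdf_path: str, output_dir: str) -> Path:
    """Path of the merged CSV, following {pdf_name}_merged.csv."""
    return Path(output_dir) / f"{Path(pdf_path).stem}_merged.csv"


print(table_csv_path("docs/report.pdf", "output", page=3, table=0).name)
print(merged_csv_path("docs/report.pdf", "output").name)
```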
Examples
Example 1: Basic extraction (Camelot)
alice-pdf document.pdf output/
Example 2: Mistral extraction (for scanned PDFs)
alice-pdf document.pdf output/ \
--engine mistral \
--merge
Example 3: pdfplumber extraction (robust, works on both native and scanned PDFs)
alice-pdf document.pdf output/ \
--engine pdfplumber \
--pdfplumber-min-rows 2 \
--pdfplumber-min-cols 3 \
--merge
Example 4: Textract extraction
alice-pdf document.pdf output/ \
--engine textract \
--aws-region eu-west-1 \
--merge
Example 5: Mistral with schema and merge
alice-pdf document.pdf output/ \
--engine mistral \
--schema table_schema.yaml \
--pages "2-10" \
--merge
Example 6: High resolution and debug
alice-pdf document.pdf output/ \
--dpi 300 \
--debug
Choosing an engine
Use Mistral when:
- You need custom prompts or schema-driven extraction
- Tables have complex structure requiring specific instructions
- You want fine control over extraction behavior
Use Textract when:
- You need fast, reliable extraction on standard tables
- You prefer managed AWS infrastructure
- Schema customization is not required
Use Camelot when:
- PDF is native (not scanned)
- Tables have clear structure (borders or consistent spacing)
- You want local, free extraction (no API costs)
- Speed is critical for simple PDFs
Use pdfplumber when:
- PDF can be native or scanned
- Tables have complex structures or inconsistent borders
- You want robust local extraction (no API costs)
- Camelot fails to detect tables properly
Project Structure
alice-pdf/
├── alice_pdf/ # Main package source code
│ ├── cli.py # CLI entry point and argument parsing
│ ├── extractor.py # Mistral engine implementation
│ ├── textract_extractor.py # AWS Textract engine
│ ├── camelot_extractor.py # Camelot engine
│ ├── pdfplumber_extractor.py # pdfplumber engine
│ └── prompt_generator.py # YAML schema to prompt converter
├── docs/ # Documentation
│ └── best-practices.md # Comprehensive usage guide
├── sample/ # Example PDFs and schemas
│ ├── *.pdf # Sample PDF files for testing
│ └── *.yaml # Example table schemas
├── openspec/ # OpenSpec specifications
│ ├── AGENTS.md # Agent instructions
│ └── specs/ # Change proposals and documentation
├── tests/ # Unit tests
└── tmp/ # Temporary test outputs (gitignored)
Key directories:
- alice_pdf/: Core library code
- docs/: User guides and best practices
- sample/: Example files and schemas for testing
- openspec/: Project specifications using OpenSpec format
- tmp/: Temporary directory for test outputs (not tracked in git)
License
MIT License - Copyright (c) 2025 Andrea Borruso aborruso@gmail.com