Convert PDF and image tables to Excel using Claude Vision API with automatic validation and multi-page merging

PDF to XLS Vision

An intelligent Python library to convert PDF files containing tables into Excel (XLSX) files using Claude Vision API with automatic rotation detection. Each table found in the PDF becomes a separate sheet in the output Excel file.

Features

Automatic PDF type detection - Intelligently detects text-based vs image-based PDFs
Rotation detection & correction - Automatically detects and corrects rotated pages (90°, 180°, 270°)
Dual extraction modes:
- Text-based PDFs: Fast, direct extraction (free, no API needed)
- Image-based PDFs: Claude Vision API with superior accuracy
Quality validation - Automatically detects poor extraction quality and retries with Vision API
Multi-page table merging - Automatically merges tables that span multiple pages into single continuous tables
Automatic data validation - Compares extracted numbers with source PDF and generates detailed Markdown reports
Improved OCR accuracy - 4x resolution rendering and enhanced Vision API prompts for better character recognition
Incremental saving - Saves progress every 10 pages for large PDFs
Batch processing - Process entire directories with recursive scanning
Python library & CLI - Use as a library in your code or as a command-line tool
Image file support - Process image files (.jpg, .jpeg, .png, .tiff, .tif) directly

Requirements

Python 3.7+
Anthropic API key (for image-based PDFs, image files, and any Vision API retry)
Tesseract OCR (for rotation detection)

Installation

Install from PyPI (Recommended)

The easiest way to install:

pip install pdf-to-xls-vision

Install from Source (for development)

# Clone the repository
git clone https://github.com/yourusername/pdf-to-xls-vision.git
cd pdf-to-xls-vision

# Install in development mode
pip install -e .

Configuration

Set up your configuration:

Copy .env.sample to .env:
```
cp .env.sample .env
```
Get your API key from: https://console.anthropic.com/
Edit the .env file and replace your-api-key-here with your actual API key:
```
ANTHROPIC_API_KEY=sk-ant-your-actual-key-here
```
(Optional) Choose a different Claude model:
```
CLAUDE_MODEL=claude-sonnet-4-5-20250929
```
Available models:
- claude-sonnet-4-5-20250929 (default, most accurate)
- claude-3-5-sonnet-20241022 (fast, cost-effective)
- claude-3-5-sonnet-20240620 (balanced)
- claude-3-opus-20240229 (highest quality, slower)
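
The settings above could be read from Python roughly as follows. This is a minimal sketch, not the package's config.py: the load_settings helper is hypothetical, and only the variable names ANTHROPIC_API_KEY and CLAUDE_MODEL and the default model come from this README.

```python
import os

DEFAULT_MODEL = "claude-sonnet-4-5-20250929"  # default model named in this README


def load_settings(env=None, require_api_key=True):
    """Return {'api_key', 'model'} read from an environment mapping.

    Hypothetical helper for illustration; the real package may load
    its configuration differently (for example via a .env loader).
    """
    env = os.environ if env is None else env
    api_key = (env.get("ANTHROPIC_API_KEY") or "").strip()
    if require_api_key and not api_key:
        raise ValueError("ANTHROPIC_API_KEY is not set")
    # Fall back to the documented default when CLAUDE_MODEL is unset or blank.
    model = (env.get("CLAUDE_MODEL") or "").strip() or DEFAULT_MODEL
    return {"api_key": api_key or None, "model": model}
```

Passing require_api_key=False models the text-based path, which this README describes as needing no API key.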

Usage

As a Python Library

from pdf_to_xls import convert_pdf_to_excel, batch_convert_directory

# Convert a single PDF
# Outputs: output.xlsx, plus output_validation.md for text-based PDFs
convert_pdf_to_excel('input.pdf', output_path='output.xlsx')

# Batch convert a directory
batch_convert_directory('pdfs/', output_dir='excel_files/', recursive=True)

# Force Vision API for complex tables
convert_pdf_to_excel('complex_table.pdf', force_vision=True)

# Convert image files directly
convert_pdf_to_excel('scanned_table.jpg', output_path='output.xlsx')

# Use custom API key and model
convert_pdf_to_excel(
    'input.pdf',
    api_key='your-api-key',
    model_name='claude-3-5-sonnet-20241022'
)
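
For scripts that call the library, it can help to handle the outcomes listed in the API Reference (FileNotFoundError, ValueError, and a None return when no tables are found). The wrapper below is a sketch under that assumption; convert_with_report is a hypothetical name, and the conversion function is passed in as an argument so the example runs without the package installed.

```python
from pathlib import Path


def convert_with_report(convert, source, **kwargs):
    """Call `convert` (e.g. convert_pdf_to_excel) and describe the outcome.

    `convert` is injected so this sketch is self-contained; in real code
    it would be imported from pdf_to_xls.
    """
    try:
        result = convert(source, **kwargs)
    except FileNotFoundError:
        return f"missing input: {source}"
    except ValueError as exc:  # documented for a required but absent API key
        return f"configuration problem: {exc}"
    if result is None:  # documented return value when no tables are found
        return f"no tables found in {source}"
    return f"wrote {Path(result).name}"
```

With the package installed, the call would look like convert_with_report(convert_pdf_to_excel, 'input.pdf').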

Output Files

Each conversion writes an Excel file; text-based PDFs also get a validation report:

{filename}.xlsx - Excel file with extracted tables
{filename}_validation.md - Markdown validation report (for text-based PDFs)

See the examples/ directory for more usage examples:

basic_usage.py - Simple conversion examples
batch_processing.py - Batch processing examples
advanced_usage.py - Advanced features and error handling

As a Command-Line Tool

After installation, you can use the pdf-to-xls command:

Convert a Single PDF File

pdf-to-xls input.pdf

Output will be saved as input.xlsx in the same directory.

Specify Output Path

pdf-to-xls input.pdf -o output.xlsx

Convert All PDFs in a Directory

pdf-to-xls /path/to/pdfs

Batch Convert with Recursive Scanning

pdf-to-xls /path/to/pdfs -r -o /path/to/output

Force Vision API

pdf-to-xls input.pdf --force-vision

CLI Examples

Convert all PDFs in a directory:

pdf-to-xls "pdfs/OpStmts" -r

Convert a single file:

pdf-to-xls "pdfs/OpStmts/1206.pdf"

How It Works

Detection Phase: Analyzes the PDF to determine if it's text-based or image-based
Text-based PDFs: Uses fast, free pdfplumber extraction with quality validation
Image-based PDFs:
- Converts each page to high-resolution image (4x zoom)
- Detects rotation using Tesseract OSD
- Corrects rotation if needed
- Extracts tables using Claude Vision API with accuracy-focused prompts
- Saves progress every 10 pages
Quality Check: If text extraction has quality issues, automatically retries with Vision API
Multi-page Merging: Automatically detects and merges tables spanning multiple pages
Validation: Compares extracted numbers with source PDF and generates detailed Markdown report
Output: Creates an Excel file with merged tables and validation report
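
The routing between the two extraction modes described above can be sketched as a small decision function. The name choose_extraction_mode and its parameters are hypothetical and only mirror the listed steps; the package's own logic may be more involved.

```python
def choose_extraction_mode(has_text, force_vision=False, quality_issues=()):
    """Return 'text' or 'vision' for one document.

    Hypothetical decision function mirroring the steps listed above.
    """
    if force_vision:
        return "vision"  # caller asked for the Vision API explicitly
    if not has_text:
        return "vision"  # image-based PDF or image file
    if quality_issues:
        return "vision"  # text extraction looked poor, so retry with Vision
    return "text"        # fast, free path
```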

Rotation Detection

The converter automatically detects and corrects rotated pages:

Supports 90°, 180°, 270° rotations
Uses Tesseract OSD (Orientation and Script Detection)
Corrects only when Tesseract's orientation confidence exceeds 1.0
Logs each rotation correction

Example output:

Processing page 2/31 with Claude Vision...
  Detected rotation 270° (confidence: 5.3) - correcting
  ✓ Extracted table: 23 rows x 15 columns
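
Tesseract's OSD report is plain text with fields such as Rotate and Orientation confidence. The sketch below parses a sample of that text and applies the confidence threshold mentioned above. The helper names are hypothetical, the sample string only imitates an OSD report, and which field the package actually reads is an assumption.

```python
import re


def parse_osd(osd_text):
    """Extract the rotation angle and orientation confidence from OSD text."""
    rotate = re.search(r"Rotate:\s*(\d+)", osd_text)
    conf = re.search(r"Orientation confidence:\s*([\d.]+)", osd_text)
    if not rotate or not conf:
        raise ValueError("unrecognized OSD output")
    return int(rotate.group(1)), float(conf.group(1))


def rotation_to_apply(osd_text, min_confidence=1.0):
    """Return the detected rotation in degrees, or 0 to leave the page as is."""
    angle, confidence = parse_osd(osd_text)
    if angle in (90, 180, 270) and confidence > min_confidence:
        return angle
    return 0


# Sample text in the layout of a Tesseract OSD report (illustrative values).
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 5.30
Script: Latin
Script confidence: 2.00
"""
```

In a real pipeline the text would come from an OSD call on the rendered page image rather than from a constant.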

Large PDF Support

For PDFs with 30+ pages:

Progress is saved incrementally every 10 pages
If interrupted, partial results are preserved
Visual progress indicators show completion status

Example:

Processing page 10/31...
💾 Saving progress... (10/31 pages processed)
✓ Progress saved: 10 tables
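
Incremental saving of this kind can be sketched as a loop that checkpoints every N pages. The function below is illustrative only: process_with_checkpoints, extract, and save are hypothetical names, not part of the package's API.

```python
def process_with_checkpoints(pages, extract, save, save_every=10):
    """Extract a table from each page, saving partial results periodically.

    `extract` and `save` are caller-supplied callables (hypothetical here).
    """
    tables = []
    for index, page in enumerate(pages, start=1):
        table = extract(page)
        if table is not None:
            tables.append(table)
        if index % save_every == 0:
            save(list(tables))  # checkpoint: pass a copy of the results so far
    # Final save covers trailing pages; it repeats the last checkpoint
    # when the page count is an exact multiple of save_every.
    save(list(tables))
    return tables
```

If the run is interrupted between checkpoints, at most save_every - 1 pages of work are lost.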

Data Validation Report

For text-based PDFs, a validation report is automatically generated to help verify accuracy:

# Data Validation Report

## Summary
| Metric | Count |
|--------|-------|
| Total numbers in PDF | 1,214 |
| Total numbers in tables | 1,382 |
| Matching numbers | 901 |
| **Accuracy** | **74.22%** |

## ⚠️ Numbers in PDF but Missing/Undercounted in Tables
| Number | PDF Count | Table Count | Difference |
|--------|-----------|-------------|------------|
|  6100.0 |         1 |           0 |          1 |
...

What it tells you:

Overall accuracy percentage
Numbers that may have been misread during extraction
Numbers that appear a different number of times in the PDF than in the tables
Which values to prioritize for manual review

How to use:

Check the accuracy percentage
Review flagged numbers in the Excel output
Cross-reference with source PDF
Correct any discrepancies
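
In the sample report, the accuracy figure is consistent with matching numbers divided by total numbers in the PDF (901 / 1,214 ≈ 74.22%). The sketch below reproduces that kind of comparison with multisets. Counting a match as the smaller of the two occurrence counts is an assumption, and compare_numbers is a hypothetical name rather than the package's function.

```python
from collections import Counter


def compare_numbers(pdf_numbers, table_numbers):
    """Compare two collections of numbers as multisets."""
    pdf_counts = Counter(pdf_numbers)
    table_counts = Counter(table_numbers)
    # Counter & Counter keeps the minimum count of each shared value.
    matching = sum((pdf_counts & table_counts).values())
    total_pdf = sum(pdf_counts.values())
    accuracy = 100.0 * matching / total_pdf if total_pdf else 0.0
    # Counter - Counter keeps only values undercounted in the tables.
    missing = pdf_counts - table_counts
    return {
        "total_pdf": total_pdf,
        "total_tables": sum(table_counts.values()),
        "matching": matching,
        "accuracy": round(accuracy, 2),
        "missing": dict(missing),
    }
```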

API Reference

Main Functions

`convert_pdf_to_excel(pdf_path, output_path=None, output_dir=None, save_every=10, force_vision=False, api_key=None, model_name=None)`

Convert a single PDF or image file to Excel.

Parameters:

pdf_path (str|Path): Path to PDF or image file (.pdf, .jpg, .jpeg, .png, .tiff, .tif)
output_path (str|Path, optional): Output Excel file path
output_dir (str|Path, optional): Output directory
save_every (int): Save progress every N pages (default: 10)
force_vision (bool): Force Vision API even for text PDFs (default: False)
api_key (str, optional): Anthropic API key (uses env var if not provided)
model_name (str, optional): Claude model name (uses env var if not provided)

Returns: Path to created Excel file, or None if no tables found

Outputs:

{filename}.xlsx - Excel file with extracted tables
{filename}_validation.md - Validation report (text-based PDFs only)

Raises:

FileNotFoundError: If file does not exist
ValueError: If API key is required but not found

`batch_convert_directory(input_dir, output_dir=None, recursive=False, force_vision=False, api_key=None, model_name=None)`

Batch convert PDFs in a directory.

Parameters:

input_dir (str|Path): Directory containing PDF files
output_dir (str|Path, optional): Output directory
recursive (bool): Recursively search subdirectories (default: False)
force_vision (bool): Force Vision API for all PDFs (default: False)
api_key (str, optional): Anthropic API key
model_name (str, optional): Claude model name

Returns: Dictionary with 'success' and 'failed' lists of file paths

Raises:

FileNotFoundError: If input directory does not exist
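
The returned dictionary can be turned into a short summary for logs. This sketch assumes only the documented 'success' and 'failed' keys; summarize_batch is a hypothetical helper.

```python
def summarize_batch(result):
    """Format the dict returned by a batch run as a short text summary."""
    succeeded = list(result.get("success", []))
    failed = list(result.get("failed", []))
    lines = [f"{len(succeeded)} converted, {len(failed)} failed"]
    lines += [f"  failed: {path}" for path in failed]
    return "\n".join(lines)
```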

Utility Functions

`pdf_is_image_based(pdf_path)`

Check if PDF is image-based (contains images).

Parameters:

pdf_path (str|Path): Path to PDF file

Returns: bool - True if PDF is image-based

`pdf_has_text(pdf_path)`

Check if PDF has extractable text.

Parameters:

pdf_path (str|Path): Path to PDF file

Returns: bool - True if PDF has extractable text

`detect_quality_issues(table_data)`

Detect quality issues in extracted table data.

Parameters:

table_data: DataFrame or raw table data

Returns: list - List of quality issue descriptions

Cost Information

Text-based PDFs: Free (no API calls)
Image-based PDFs: ~$0.01-0.05 per page with Claude Vision API
The tool automatically chooses the most cost-effective method
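
A rough budget for an image-based PDF follows directly from the per-page range above. The helper below is a hypothetical sketch; the default rates are the approximate figures quoted in this README, not published pricing.

```python
def estimate_vision_cost(page_count, low_per_page=0.01, high_per_page=0.05):
    """Return a (low, high) USD estimate for sending pages to the Vision API."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    low = round(page_count * low_per_page, 2)
    high = round(page_count * high_per_page, 2)
    return (low, high)
```

For the 31-page example used elsewhere in this README, that gives a range of about $0.31 to $1.55.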

Troubleshooting

PDF not converting properly?

The tool automatically detects and uses the best method
Check that your .env file has a valid API key for image-based PDFs
Make sure the PDF isn't password-protected
Try the --force-vision flag for complex table layouts

Process taking too long?

Large image-based PDFs (30+ pages) may take 15-25 minutes
Progress is saved every 10 pages
Check for incremental save messages

Rotation issues?

Rotation detection requires Tesseract OCR to be installed
Install via: brew install tesseract (macOS) or apt-get install tesseract-ocr (Debian/Ubuntu)

Import errors?

Make sure the package is installed: pip install pdf-to-xls-vision (or pip install -e . from a source checkout)
In a source checkout, check that all dependencies are installed: pip install -r requirements.txt

Development

Project Structure

pdf-to-xls-vision/
├── pdf_to_xls/              # Main package
│   ├── __init__.py         # Public API
│   ├── config.py           # Configuration management
│   ├── converter.py        # Main conversion functions
│   ├── data_cleaning.py    # Data cleaning utilities
│   ├── excel_writer.py     # Excel generation
│   ├── image_processing.py # Image conversion and rotation
│   ├── pdf_detection.py    # PDF type detection
│   ├── quality_check.py    # Quality validation
│   └── table_extraction.py # Table extraction (vision & text)
├── examples/               # Usage examples
│   ├── basic_usage.py
│   ├── batch_processing.py
│   └── advanced_usage.py
├── pdf_to_xls_cli.py      # CLI entry point
├── setup.py               # Package setup
├── pyproject.toml         # Modern Python packaging
├── requirements.txt       # Dependencies
├── README.md             # This file
└── LICENSE               # License file

Running Tests

# Install development dependencies
pip install -e ".[dev]"

# Run tests (when test suite is added)
pytest

Table Structure

Extracted tables use a simple, consistent structure:

Row_Type	Category	2020	2019
HEADER	REVENUES
DETAIL	Gross rental income	458,963	452,477
DETAIL	Vacancy loss	(21,862)	(18,065)
ROLLUP	Total revenues	421,934	408,059

Row Types:

HEADER - Section/category headers
DETAIL - Individual line items
ROLLUP - Total/summary rows
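
For downstream processing, cells such as (21,862) usually need converting to numbers. The sketch below assumes the common accounting convention that parentheses mark a negative value; parse_amount is a hypothetical helper and the rows are copied from the example above.

```python
def parse_amount(text):
    """Convert a cell such as '458,963' or '(21,862)' to a float, or None if blank."""
    cleaned = text.strip().replace(",", "")
    if not cleaned:
        return None
    # Assumption: parentheses denote a negative amount (accounting style).
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = float(cleaned)
    return -value if negative else value


ROWS = [
    ("HEADER", "REVENUES", "", ""),
    ("DETAIL", "Gross rental income", "458,963", "452,477"),
    ("DETAIL", "Vacancy loss", "(21,862)", "(18,065)"),
    ("ROLLUP", "Total revenues", "421,934", "408,059"),
]

# Keep only the line items, keyed by category, for the 2020 column.
detail_2020 = {
    category: parse_amount(y2020)
    for row_type, category, y2020, _ in ROWS
    if row_type == "DETAIL"
}
```

Filtering on Row_Type this way avoids double counting when summing a column, since ROLLUP rows already contain totals.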

Multi-page Tables: Tables that span multiple pages are automatically detected and merged into a single continuous table.
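
One simple way to merge continuation tables is to join consecutive tables whose header rows are identical. The sketch below uses that stand-in heuristic; it is not necessarily the rule the package applies, and merge_continuations is a hypothetical name.

```python
def merge_continuations(page_tables):
    """Merge consecutive tables that share a header row.

    Each table is a list of rows, with row 0 as the header.
    """
    merged = []
    for table in page_tables:
        if not table:
            continue  # skip pages where nothing was extracted
        header, body = table[0], table[1:]
        if merged and merged[-1][0] == header:
            merged[-1].extend(body)  # continuation: append rows, drop repeated header
        else:
            merged.append([header] + body)  # new table, built as a fresh list
    return merged
```

A header-equality test will miss continuation pages that omit the header, so a real implementation would likely need additional signals such as column count.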

Technical Details

Uses pdfplumber for text extraction
Uses pytesseract for rotation detection
Uses Claude Vision API (Sonnet 4.5 by default) for image-based extraction
Uses openpyxl for Excel file generation
4x resolution rendering (for example, 3368x2380 pixels, which corresponds to a landscape A4 page at 72 points per inch) for optimal OCR accuracy
Automatic quality validation and retry logic
Automatic multi-page table continuation detection and merging
Post-extraction number validation and discrepancy reporting

License

MIT License - see LICENSE file for details

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Accuracy and Limitations

Expected Accuracy:

Text-based PDFs with simple tables: ~95-99%
Image-based PDFs with complex tables: ~85-95%
Wide tables (12+ columns) with small text: ~70-90%

Known Limitations:

OCR Errors: Vision API may misread similar characters (6 vs 8, O vs 0)
Complex Layouts: Tables with merged cells or irregular structures may not extract perfectly
Image Quality: Low-resolution source PDFs reduce accuracy
Text-only Validation: Validation reports only work for text-based PDFs

Best Practices:

✅ Always review the validation report
✅ Manually verify critical numbers (especially financial data)
✅ Use high-quality source PDFs when possible
✅ For mission-critical accuracy, consider human verification of flagged numbers

Changelog

Version 1.0.4

Multi-page table merging - Automatically detects and merges continuation tables
Data validation reports - Generates Markdown reports comparing PDF vs extracted numbers
Improved OCR accuracy - 4x resolution rendering, enhanced Vision API prompts
Single Category column - Simplified table structure for easier downstream processing
Generic header detection - Supports both "Col1" and "Column1" header patterns
Debug logging - Added image size tracking for troubleshooting
Image file support - Process .jpg, .jpeg, .png, .tiff, .tif files directly

Version 1.0.3

Fix image size limit error for Claude API

Version 1.0.2

Add support for image file inputs

Version 1.0.1

Bug fixes and improvements

Version 1.0.0

Initial release with library structure
Modular package design
Python library API
Command-line interface
Automatic PDF type detection
Vision API with rotation correction
Quality validation and auto-retry
Batch processing support
Comprehensive examples

Release history

1.0.6 - Nov 1, 2025 (this version)
1.0.5 - Oct 29, 2025
1.0.4 - Oct 29, 2025
1.0.3 - Oct 29, 2025
1.0.2 - Oct 28, 2025
1.0.1 - Oct 27, 2025
1.0.0 - Oct 25, 2025
