
tabuparse

extract, transform, and export PDF tabular data

Python 3.10+ · asyncio support · License: MIT

About

tabuparse is a Python CLI tool and library for extracting, normalizing, and merging tabular data from PDF documents.

Installation

[!WARNING] This project is still in alpha; interfaces and behavior may change without notice.

From source

git clone https://github.com/lupeke/tabuparse.git && \
cd tabuparse && \
python3 -m venv .venv && source .venv/bin/activate && \
pip install -e .

Run a health check

python tests/check_install.py

Quick start

CLI usage

# Process single PDF with default settings
tabuparse process example.pdf

# Process multiple PDFs with configuration
tabuparse process *.pdf --config settings.toml --output data.csv

# Export to SQLite with summary statistics
tabuparse process documents/*.pdf --format sqlite --summary

# Preview processing without extraction
tabuparse preview *.pdf --config settings.toml

# Extract from single PDF for testing
tabuparse extract document.pdf --pages "1-3" --flavor stream

Library usage

import asyncio
from tabuparse import process_pdfs

async def main():
    # Process PDFs and get merged DataFrame
    result_df = await process_pdfs(
        pdf_paths=['invoice1.pdf', 'invoice2.pdf'],
        config_path='schema.toml',
        output_format='csv'
    )

    print(f"Extracted {len(result_df)} rows")
    print(result_df.head())

asyncio.run(main())

Configuration

tabuparse uses TOML configuration files to define extraction parameters and expected schemas.

generate sample configuration

tabuparse init-config settings.toml --columns "Invoice ID,Date,Amount,Description"

configuration structure

# settings.toml
[table_structure]
expected_columns = [
    "Invoice ID",
    "Date",
    "Item Description",
    "Quantity",
    "Unit Price",
    "Total Amount"
]

[settings]
output_format = "csv"
strict_schema = false

[default_extraction]
flavor = "lattice"
pages = "all"

# PDF-specific extraction parameters
[[extraction_parameters]]
pdf_path = "invoice_batch_1.pdf"
pages = "1-5"
flavor = "lattice"

[[extraction_parameters]]
pdf_path = "statements.pdf"
pages = "all"
flavor = "stream"
table_areas = ["72,72,432,648"]  # left,bottom,right,top in points

Configuration Options

table structure

  • expected_columns: List of column names for schema normalization

settings

  • output_format: "csv" or "sqlite"
  • strict_schema: Enable strict schema validation (fail on mismatches)

extraction parameters

  • pages: Page selection ("all", "1", "1,3,5", "1-3")
  • flavor: Camelot extraction method ("lattice" or "stream")
  • table_areas: Specific table regions to extract
  • pdf_path: Apply parameters to specific PDF files
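
A `pages` value combines single pages, comma-separated lists, and ranges. As an illustration only (the function name and behavior below are assumptions, not tabuparse's actual parser), such a spec could be expanded like this:

```python
def expand_pages(spec: str, total_pages: int) -> list[int]:
    """Expand a page spec like "all", "1", "1,3,5", or "1-3" into page numbers.

    Illustrative sketch; not tabuparse's internal implementation.
    """
    if spec == "all":
        return list(range(1, total_pages + 1))
    pages: list[int] = []
    for part in spec.split(","):
        if "-" in part:
            # A range like "1-3" expands to every page in it, inclusive
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages

print(expand_pages("1-3", 10))    # [1, 2, 3]
print(expand_pages("1,3,5", 10))  # [1, 3, 5]
```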

CLI Commands

process

Extract and merge tables from multiple PDF files.

tabuparse process file1.pdf file2.pdf [OPTIONS]

Options:
  -c, --config PATH       TOML configuration file
  -o, --output PATH       Output file path
  --format [csv|sqlite]   Output format (default: csv)
  --max-concurrent INT    Max concurrent extractions (default: 5)
  --summary               Export summary statistics
  --no-clean              Disable data cleaning
  --strict                Enable strict schema validation

extract

Extract tables from a single PDF (for testing).

tabuparse extract document.pdf [OPTIONS]

Options:
  -c, --config PATH              Configuration file
  --pages TEXT                   Pages to extract
  --flavor [lattice|stream]      Extraction method
  --show-info                    Show detailed table information

preview

Preview processing statistics without extraction.

tabuparse preview file1.pdf file2.pdf [OPTIONS]

Options:
  -c, --config PATH       Configuration file

init-config

Generate sample configuration file.

tabuparse init-config config.toml [OPTIONS]

Options:
  --columns TEXT                 Expected column names (comma-separated)
  --format [csv|sqlite]          Default output format
  --flavor [lattice|stream]      Default extraction flavor

validate

Validate PDF file compatibility.

tabuparse validate document.pdf

Library API

core functions

import asyncio
from tabuparse import process_pdfs, extract_from_single_pdf

async def main():
    # Process multiple PDFs
    result_df = await process_pdfs(
        pdf_paths=['file1.pdf', 'file2.pdf'],
        config_path='settings.toml',
        output_path='output.csv',
        output_format='csv',
        max_concurrent=5
    )

    # Extract from a single PDF
    tables = await extract_from_single_pdf(
        'document.pdf',
        config_path='settings.toml'
    )

asyncio.run(main())

configuration management

from tabuparse.config_parser import parse_config, TabuparseConfig

# Load configuration
config = parse_config('settings.toml')

# Create programmatic configuration
config = TabuparseConfig(
    expected_columns=['ID', 'Name', 'Amount'],
    output_format='sqlite'
)

data processing

from tabuparse.data_processor import normalize_schema, merge_dataframes

# Normalize DataFrame schema
normalized_df = normalize_schema(
    df,
    expected_columns=['ID', 'Name', 'Amount'],
    strict_mode=False
)

# Merge multiple DataFrames
merged_df = merge_dataframes([df1, df2, df3])
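
Conceptually, schema normalization projects each extracted table onto the expected column list: missing columns are filled with nulls and unexpected columns are dropped (with strict mode, mismatches fail instead). A dependency-free sketch of that idea, using plain dicts rather than tabuparse's DataFrame-based API:

```python
def normalize_row(row: dict, expected_columns: list[str]) -> dict:
    """Project a row onto expected_columns: fill missing keys with None,
    drop keys outside the expected schema. Illustration only; tabuparse's
    normalize_schema operates on pandas DataFrames."""
    return {col: row.get(col) for col in expected_columns}

rows = [
    {"ID": 1, "Name": "Widget", "Extra": "x"},  # has an unexpected column
    {"ID": 2, "Amount": 9.5},                   # missing "Name"
]
normalized = [normalize_row(r, ["ID", "Name", "Amount"]) for r in rows]
print(normalized)
```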

Examples

basic invoice processing

# Process invoice PDFs with predefined schema
tabuparse process invoices/*.pdf --config invoice_schema.toml --output invoices.csv

multi-format export

import asyncio
from tabuparse import process_pdfs

async def process_financial_data():
    # Extract data
    df = await process_pdfs(
        pdf_paths=['q1_report.pdf', 'q2_report.pdf'],
        config_path='financial_schema.toml'
    )

    # Export to multiple formats
    df.to_csv('financial_data.csv', index=False)
    df.to_excel('financial_data.xlsx', index=False)  # requires openpyxl

    return df

asyncio.run(process_financial_data())

custom processing pipeline

import asyncio

import pandas as pd

from tabuparse.pdf_extractor import extract_tables_from_pdf
from tabuparse.data_processor import normalize_schema
from tabuparse.output_writer import write_sqlite

async def custom_pipeline():
    # Extract tables
    tables = await extract_tables_from_pdf('document.pdf')

    # Normalize each table to the expected schema
    processed_tables = []
    for table in tables:
        normalized = normalize_schema(
            table,
            expected_columns=['ID', 'Date', 'Amount']
        )
        processed_tables.append(normalized)

    # Merge and export to SQLite
    merged = pd.concat(processed_tables, ignore_index=True)
    write_sqlite(merged, 'output.sqlite', table_name='extracted_data')

asyncio.run(custom_pipeline())



Samplings icons by Afian Rochmah Afif - Flaticon



Download files

Download the file for your platform.

Source Distribution

tabuparse-0.1.0.tar.gz (27.4 kB)

Uploaded Source

Built Distribution


tabuparse-0.1.0-py3-none-any.whl (26.5 kB)

Uploaded Python 3

File details

Details for the file tabuparse-0.1.0.tar.gz.

File metadata

  • Download URL: tabuparse-0.1.0.tar.gz
  • Upload date:
  • Size: 27.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for tabuparse-0.1.0.tar.gz

  • SHA256: 69ea91b2707a709d9ccf9df2fc2479a3d58b33cb0188883f3801400a4d4e6247
  • MD5: 3ce6bcea55fc4ce11832920bfd36a7cd
  • BLAKE2b-256: a79abd698be5772364838f45088a4df2b831f46a50952c05c746d0ee7663dca3


File details

Details for the file tabuparse-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tabuparse-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 26.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for tabuparse-0.1.0-py3-none-any.whl

  • SHA256: 43659b6aefe330ba367e59a1a38c3d0185b7bc633d6439f6022b157e7d74a21f
  • MD5: 6654b0c9a621bf021c4cf2c77e0bd168
  • BLAKE2b-256: 36f84e268a367b948410b5e56c2b3dc8d55538e5f74d28107b74608efb0e090d

