
tabuparse

extract, transform, and export PDF tabular data

Python 3.10+ · asyncio support · License: MIT

About

tabuparse is a Python CLI tool and library for extracting, normalizing, and merging tabular data from PDF documents.

Installation

[!WARNING] This project is still in alpha; interfaces and behavior may change without notice.

From source

git clone https://github.com/lupeke/tabuparse.git && \
cd tabuparse && \
python3 -m venv .venv && source .venv/bin/activate && \
pip install -e .

Run a health check

python tests/check_install.py

Quick start

CLI usage

# Process single PDF with default settings
tabuparse process example.pdf

# Process multiple PDFs with configuration
tabuparse process *.pdf --config settings.toml --output data.csv

# Export to SQLite with summary statistics
tabuparse process documents/*.pdf --format sqlite --summary

# Preview processing without extraction
tabuparse preview *.pdf --config settings.toml

# Extract from single PDF for testing
tabuparse extract document.pdf --pages "1-3" --flavor stream

Library usage

import asyncio
from tabuparse import process_pdfs

async def main():
    # Process PDFs and get merged DataFrame
    result_df = await process_pdfs(
        pdf_paths=['invoice1.pdf', 'invoice2.pdf'],
        config_path='schema.toml',
        output_format='csv'
    )

    print(f"Extracted {len(result_df)} rows")
    print(result_df.head())

asyncio.run(main())

Configuration

tabuparse uses TOML configuration files to define extraction parameters and expected schemas.

generate sample configuration

tabuparse init-config settings.toml --columns "Invoice ID,Date,Amount,Description"

configuration structure

# settings.toml
[table_structure]
expected_columns = [
    "Invoice ID",
    "Date",
    "Item Description",
    "Quantity",
    "Unit Price",
    "Total Amount"
]

[settings]
output_format = "csv"
strict_schema = false

[default_extraction]
flavor = "lattice"
pages = "all"

# PDF-specific extraction parameters
[[extraction_parameters]]
pdf_path = "invoice_batch_1.pdf"
pages = "1-5"
flavor = "lattice"

[[extraction_parameters]]
pdf_path = "statements.pdf"
pages = "all"
flavor = "stream"
table_areas = ["72,72,432,648"]  # left,bottom,right,top in points

Configuration Options

table structure

  • expected_columns: List of column names for schema normalization

settings

  • output_format: "csv" or "sqlite"
  • strict_schema: Enable strict schema validation (fail on mismatches)

extraction parameters

  • pages: Page selection ("all", "1", "1,3,5", "1-3")
  • flavor: Camelot extraction method ("lattice" or "stream")
  • table_areas: Specific table regions to extract
  • pdf_path: Apply parameters to specific PDF files
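
A `pages` value combines single pages, comma-separated lists, and ranges. As an illustration only (the function name and behavior below are assumptions, not tabuparse's actual parser), such a spec could be expanded like this:

```python
def expand_pages(spec: str, total_pages: int) -> list[int]:
    """Expand a page spec like "all", "1", "1,3,5", or "1-3" into page numbers.

    Illustrative sketch; not tabuparse's internal implementation.
    """
    if spec == "all":
        return list(range(1, total_pages + 1))
    pages: list[int] = []
    for part in spec.split(","):
        if "-" in part:
            # A range like "1-3" expands to every page in it, inclusive
            start, end = part.split("-")
            pages.extend(range(int(start), int(end) + 1))
        else:
            pages.append(int(part))
    return pages

print(expand_pages("1-3", 10))    # [1, 2, 3]
print(expand_pages("1,3,5", 10))  # [1, 3, 5]
```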

CLI Commands

process

Extract and merge tables from multiple PDF files.

tabuparse process file1.pdf file2.pdf [OPTIONS]

Options:
  -c, --config PATH       TOML configuration file
  -o, --output PATH       Output file path
  --format [csv|sqlite]   Output format (default: csv)
  --max-concurrent INT    Max concurrent extractions (default: 5)
  --summary               Export summary statistics
  --no-clean              Disable data cleaning
  --strict                Enable strict schema validation

extract

Extract tables from a single PDF (for testing).

tabuparse extract document.pdf [OPTIONS]

Options:
  -c, --config PATH              Configuration file
  --pages TEXT                   Pages to extract
  --flavor [lattice|stream]      Extraction method
  --show-info                    Show detailed table information

preview

Preview processing statistics without extraction.

tabuparse preview file1.pdf file2.pdf [OPTIONS]

Options:
  -c, --config PATH       Configuration file

init-config

Generate sample configuration file.

tabuparse init-config config.toml [OPTIONS]

Options:
  --columns TEXT                 Expected column names (comma-separated)
  --format [csv|sqlite]          Default output format
  --flavor [lattice|stream]      Default extraction flavor

validate

Validate PDF file compatibility.

tabuparse validate document.pdf

Library API

core functions

import asyncio
from tabuparse import process_pdfs, extract_from_single_pdf

async def main():
    # Process multiple PDFs
    result_df = await process_pdfs(
        pdf_paths=['file1.pdf', 'file2.pdf'],
        config_path='settings.toml',
        output_path='output.csv',
        output_format='csv',
        max_concurrent=5
    )

    # Extract from a single PDF
    tables = await extract_from_single_pdf(
        'document.pdf',
        config_path='settings.toml'
    )

asyncio.run(main())

configuration management

from tabuparse.config_parser import parse_config, TabuparseConfig

# Load configuration
config = parse_config('settings.toml')

# Create programmatic configuration
config = TabuparseConfig(
    expected_columns=['ID', 'Name', 'Amount'],
    output_format='sqlite'
)

data processing

from tabuparse.data_processor import normalize_schema, merge_dataframes

# Normalize DataFrame schema
normalized_df = normalize_schema(
    df,
    expected_columns=['ID', 'Name', 'Amount'],
    strict_mode=False
)

# Merge multiple DataFrames
merged_df = merge_dataframes([df1, df2, df3])
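
Conceptually, schema normalization projects each extracted table onto the expected column list: missing columns are filled with nulls and unexpected columns are dropped (with strict mode, mismatches fail instead). A dependency-free sketch of that idea, using plain dicts rather than tabuparse's DataFrame-based API:

```python
def normalize_row(row: dict, expected_columns: list[str]) -> dict:
    """Project a row onto expected_columns: fill missing keys with None,
    drop keys outside the expected schema. Illustration only; tabuparse's
    normalize_schema operates on pandas DataFrames."""
    return {col: row.get(col) for col in expected_columns}

rows = [
    {"ID": 1, "Name": "Widget", "Extra": "x"},  # has an unexpected column
    {"ID": 2, "Amount": 9.5},                   # missing "Name"
]
normalized = [normalize_row(r, ["ID", "Name", "Amount"]) for r in rows]
print(normalized)
```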

Examples

basic invoice processing

# Process invoice PDFs with predefined schema
tabuparse process invoices/*.pdf --config invoice_schema.toml --output invoices.csv

multi-format export

import asyncio
from tabuparse import process_pdfs

async def process_financial_data():
    # Extract data
    df = await process_pdfs(
        pdf_paths=['q1_report.pdf', 'q2_report.pdf'],
        config_path='financial_schema.toml'
    )

    # Export to multiple formats
    df.to_csv('financial_data.csv', index=False)
    df.to_excel('financial_data.xlsx', index=False)  # requires openpyxl

    return df

asyncio.run(process_financial_data())

custom processing pipeline

import asyncio

import pandas as pd

from tabuparse.pdf_extractor import extract_tables_from_pdf
from tabuparse.data_processor import normalize_schema
from tabuparse.output_writer import write_sqlite

async def custom_pipeline():
    # Extract tables
    tables = await extract_tables_from_pdf('document.pdf')

    # Normalize each table to the expected schema
    processed_tables = []
    for table in tables:
        normalized = normalize_schema(
            table,
            expected_columns=['ID', 'Date', 'Amount']
        )
        processed_tables.append(normalized)

    # Merge and export to SQLite
    merged = pd.concat(processed_tables, ignore_index=True)
    write_sqlite(merged, 'output.sqlite', table_name='extracted_data')

asyncio.run(custom_pipeline())



Samplings icons by Afian Rochmah Afif - Flaticon



Download files

Download the file for your platform.

Source Distribution

tabuparse-0.1.0.tar.gz (27.4 kB)

Uploaded Source

Built Distribution


tabuparse-0.1.0-py3-none-any.whl (26.5 kB)

Uploaded Python 3

File details

Details for the file tabuparse-0.1.0.tar.gz.

File metadata

  • Download URL: tabuparse-0.1.0.tar.gz
  • Upload date:
  • Size: 27.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for tabuparse-0.1.0.tar.gz

  • SHA256: 69ea91b2707a709d9ccf9df2fc2479a3d58b33cb0188883f3801400a4d4e6247
  • MD5: 3ce6bcea55fc4ce11832920bfd36a7cd
  • BLAKE2b-256: a79abd698be5772364838f45088a4df2b831f46a50952c05c746d0ee7663dca3


File details

Details for the file tabuparse-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: tabuparse-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 26.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for tabuparse-0.1.0-py3-none-any.whl

  • SHA256: 43659b6aefe330ba367e59a1a38c3d0185b7bc633d6439f6022b157e7d74a21f
  • MD5: 6654b0c9a621bf021c4cf2c77e0bd168
  • BLAKE2b-256: 36f84e268a367b948410b5e56c2b3dc8d55538e5f74d28107b74608efb0e090d

