PDFStract - Unified PDF Extraction & Conversion CLI + Web UI with 10+ extraction libraries

PDFStract CLI - Command-Line Interface Guide

PDFStract now includes a powerful command-line interface for PDF extraction and conversion with support for batch processing, multi-library comparison, and production automation.

Installation

From PyPI (Recommended)

pip install pdfstract

From Source

git clone https://github.com/aksarav/pdfstract.git
cd pdfstract
pip install -e .
# or
uv sync

Verify Installation

pdfstract --help

Quick Start

1. List Available Libraries

pdfstract libs

Output shows which PDF extraction libraries are installed:

✓ Available: unstructured, marker, pymupdf4llm, docling, ...
✗ Unavailable: (with error reasons)

2. Convert a Single PDF

# Convert to markdown (default)
pdfstract convert document.pdf --library unstructured

# Save to file
pdfstract convert document.pdf --library unstructured --output result.md

# Different formats
pdfstract convert document.pdf --library marker --format json --output result.json
pdfstract convert document.pdf --library pymupdf4llm --format text

3. Test Multiple Libraries

pdfstract compare sample.pdf \
  -l unstructured \
  -l marker \
  -l pymupdf4llm \
  --format markdown \
  --output ./comparison_results

Results in:

comparison_results/
├─ unstructured_result.md
├─ marker_result.md
├─ pymupdf4llm_result.md
└─ comparison_summary.json

Commands

pdfstract libs

List all available PDF extraction libraries and their status.

pdfstract libs

Output:

  • Shows 10+ libraries (PyMuPDF4LLM, MarkItDown, Marker, Docling, etc.)
  • Displays availability status (✓ Available / ✗ Unavailable)
  • Shows error messages for unavailable libraries

pdfstract convert

Convert a single PDF file with a specified library.

pdfstract convert INPUT_FILE [OPTIONS]

Options:

  • -l, --library TEXT (required) - Extraction library to use
  • -f, --format [markdown|json|text] - Output format (default: markdown)
  • -o, --output PATH - Output file path (optional, prints to stdout if not specified)

Examples:

# Print to terminal
pdfstract convert sample.pdf --library unstructured

# Save to file
pdfstract convert sample.pdf --library marker --output result.md

# JSON format
pdfstract convert sample.pdf --library docling --format json --output result.json

pdfstract compare

Compare multiple extraction libraries on a single PDF to find the best one.

pdfstract compare INPUT_FILE [OPTIONS]

Options:

  • -l, --libraries TEXT (required, multiple) - Libraries to compare
  • -f, --format [markdown|json|text] - Output format (default: markdown)
  • -o, --output PATH (required) - Output directory for results

Examples:

# Compare 3 libraries
pdfstract compare sample.pdf \
  -l unstructured \
  -l marker \
  -l pymupdf4llm \
  --output ./test_results

# Compare with JSON output
pdfstract compare invoice.pdf \
  -l marker \
  -l docling \
  --format json \
  --output ./compare

Output:

  • Individual result files for each library
  • comparison_summary.json with metadata and stats
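
If you script around compare, the per-library files follow the `<library>_result.<ext>` naming shown in the Quick Start layout above. A minimal Python sketch to collect them (the demo directory and file contents are illustrative):

```python
import tempfile
from pathlib import Path

def compare_outputs(results_dir: str, ext: str = "md") -> list[str]:
    """List the per-library result files that `pdfstract compare` writes."""
    return sorted(p.name for p in Path(results_dir).glob(f"*_result.{ext}"))

# Demo against a throwaway directory mimicking the documented layout
with tempfile.TemporaryDirectory() as d:
    for lib in ("unstructured", "marker", "pymupdf4llm"):
        Path(d, f"{lib}_result.md").write_text("demo")
    print(compare_outputs(d))
    # ['marker_result.md', 'pymupdf4llm_result.md', 'unstructured_result.md']
```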

pdfstract batch

Batch convert multiple PDFs in a directory with parallel processing.

pdfstract batch INPUT_DIRECTORY [OPTIONS]

Options:

  • -l, --library TEXT (required) - Extraction library to use
  • -f, --format [markdown|json|text] - Output format (default: markdown)
  • -o, --output PATH (required) - Output directory
  • -p, --parallel INTEGER - Number of parallel workers (default: 2)
  • --pattern TEXT - File pattern to match (default: *.pdf)
  • --skip-errors - Skip PDFs that fail conversion

Examples:

# Basic batch conversion
pdfstract batch ./documents \
  --library unstructured \
  --output ./converted

# With parallel processing
pdfstract batch ./documents \
  --library marker \
  --output ./converted \
  --parallel 4

# With error handling
pdfstract batch ./pdfs \
  --library docling \
  --format json \
  --output ./converted \
  --parallel 8 \
  --skip-errors

# Custom file pattern
pdfstract batch ./invoices \
  --library unstructured \
  --pattern "invoice_*.pdf" \
  --output ./structured

Output:

output_directory/
├─ file1.md
├─ file2.md
├─ file3.md
├─ ... (more files)
└─ batch_report.json

Batch Report (batch_report.json):

{
  "input_directory": "/path/to/pdfs",
  "output_directory": "/path/to/output",
  "library": "unstructured",
  "format": "markdown",
  "total_files": 150,
  "statistics": {
    "success": 147,
    "failed": 2,
    "skipped": 1
  },
  "files": {
    "document1.pdf": {
      "status": "success",
      "size_bytes": 45230
    },
    "document2.pdf": {
      "status": "failed",
      "error": "Invalid PDF format"
    }
  }
}
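
The documented fields above are enough to post-process the report programmatically. A minimal Python sketch:

```python
import json

def summarize_report(report: dict) -> dict:
    """Compute the success rate and failed files from a batch_report.json dict."""
    rate = report["statistics"]["success"] / report["total_files"] * 100
    failed = [name for name, info in report["files"].items()
              if info["status"] == "failed"]
    return {"success_rate": round(rate, 1), "failed": failed}

# Using the example report shown above
report = {
    "total_files": 150,
    "statistics": {"success": 147, "failed": 2, "skipped": 1},
    "files": {
        "document1.pdf": {"status": "success", "size_bytes": 45230},
        "document2.pdf": {"status": "failed", "error": "Invalid PDF format"},
    },
}
print(summarize_report(report))
# {'success_rate': 98.0, 'failed': ['document2.pdf']}
```

In practice you would `json.load()` the real batch_report.json instead of the inline example dict.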

pdfstract batch-compare

Compare multiple extraction libraries across an entire corpus of PDFs.

pdfstract batch-compare INPUT_DIRECTORY [OPTIONS]

Options:

  • -l, --libraries TEXT (required, multiple) - Libraries to compare
  • -f, --format [markdown|json|text] - Output format (default: markdown)
  • -o, --output PATH (required) - Output directory
  • --max-files INTEGER - Limit number of files to process

Examples:

# Compare on all PDFs
pdfstract batch-compare ./papers \
  -l marker \
  -l unstructured \
  -l pymupdf4llm \
  --output ./library_comparison

# Quick test on sample
pdfstract batch-compare ./large_corpus \
  -l marker \
  -l unstructured \
  --max-files 50 \
  --output ./sample_test

Output:

  • batch_comparison_report.json with per-library success rates
  • Per-file results for all PDFs tested
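
To pick a winner programmatically you can rank libraries by their reported success rates. The field layout below (a `libraries` mapping to per-library `success_rate` values) is an assumption about batch_comparison_report.json, not a documented schema — inspect your generated report's actual keys first:

```python
def rank_libraries(report: dict) -> list[str]:
    """Sort libraries best-first by success rate (assumed report schema)."""
    rates = report["libraries"]  # assumed key; verify against your report
    return sorted(rates, key=lambda lib: rates[lib]["success_rate"], reverse=True)

demo = {"libraries": {
    "marker": {"success_rate": 96.0},
    "unstructured": {"success_rate": 88.0},
    "pymupdf4llm": {"success_rate": 92.0},
}}
print(rank_libraries(demo))  # ['marker', 'pymupdf4llm', 'unstructured']
```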

Batch Processing

When to Use Batch Processing

Batch processing is perfect for:

  • Converting 100+ PDFs with one library
  • Testing multiple libraries on an entire corpus
  • Production automation jobs
  • Legacy archive digitization
  • Enterprise migrations

Parallel Processing Guidelines

Choose workers based on library and hardware:

Library                  CPU Usage   Recommended Workers
PyMuPDF4LLM              Low         8-16
MarkItDown               Medium      4-8
Unstructured             Medium      4-6
Marker (ML)              High        2-4
OCR (Paddle/Tesseract)   Very High   1-2

# Fast library, beefy server
pdfstract batch ./docs --library pymupdf4llm --output ./out --parallel 16

# Slow ML library
pdfstract batch ./docs --library marker --output ./out --parallel 2

# Medium library, balanced
pdfstract batch ./docs --library unstructured --output ./out --parallel 6
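
The table above can be turned into a small lookup that also respects the machine's core count. The caps below mirror the upper end of each recommendation and are heuristics, not measured limits; the dictionary keys are the library names used in this guide:

```python
import os

# Upper-end worker counts from the table above (heuristic, not measured)
RECOMMENDED_WORKERS = {
    "pymupdf4llm": 16,
    "markitdown": 8,
    "unstructured": 6,
    "marker": 4,
    "paddleocr": 2,
    "pytesseract": 2,
}

def workers_for(library: str) -> int:
    """Cap the table's recommendation at this machine's CPU count."""
    return min(RECOMMENDED_WORKERS.get(library, 2), os.cpu_count() or 2)
```

For example, `workers_for("marker")` returns at most 4, fewer on small machines; unknown libraries fall back to a conservative 2.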

Error Handling

Without --skip-errors (default):

  • Stops on first error
  • Exit code 1 if failures occur
  • Best for strict pipelines

With --skip-errors:

  • Continues processing all files
  • Failed files marked in report
  • Exit code 0 even when some files fail
  • Best for best-effort processing

# Strict mode (fail on errors)
pdfstract batch ./docs --library unstructured --output ./result

# Best-effort mode (skip errors)
pdfstract batch ./docs --library unstructured --output ./result --skip-errors

Parsing Batch Reports

Use jq to analyze batch reports:

# Overall statistics
jq '.statistics' batch_report.json

# Success rate
jq '.statistics.success / .total_files * 100' batch_report.json

# Failed files only
jq '.files | to_entries[] | select(.value.status=="failed")' batch_report.json

# Average output size
jq '.files | to_entries[] | select(.value.status=="success") | .value.size_bytes' batch_report.json | \
  awk '{sum+=$1; count++} END {print sum/count/1024 " KB"}'

Real-World Examples

Example 1: Law Firm Document Digitization

Scenario: Convert 5,000 case files to searchable markdown

# Step 1: Test candidate libraries on a few sample cases
# (compare takes a single PDF, so loop over the samples)
for f in case_1.pdf case_2.pdf case_3.pdf; do
  pdfstract compare "$f" \
    -l marker \
    -l unstructured \
    -l docling \
    --output "./test_results/${f%.pdf}"
done

# Step 2: Review outputs, pick best library (e.g., docling)

# Step 3: Full batch conversion (5,000 cases)
pdfstract batch ./all_cases \
  --library docling \
  --format markdown \
  --output ./converted_cases \
  --parallel 8 \
  --skip-errors

# Step 4: Monitor results
jq '.statistics' ./converted_cases/batch_report.json

Estimated results for this scenario: roughly two months of manual work reduced to about 12 hours of automated processing, and around $100k in labor replaced by roughly $500 in compute.


Example 2: Research Paper Quality Testing

Scenario: Find best extractor for 1,000 research papers

# Test on sample
pdfstract batch-compare ./papers \
  -l marker \
  -l unstructured \
  -l pymupdf4llm \
  --max-files 50 \
  --output ./library_test

# Review success rates, pick best

# Full batch with chosen library
pdfstract batch ./papers \
  --library marker \
  --format json \
  --output ./extracted_papers \
  --parallel 4

Example 3: Invoice Processing Pipeline

Scenario: Daily automated invoice conversion

# Create batch scheduler config
pdfstract-scheduler create daily_invoices \
  ./daily_invoices_input \
  ./daily_invoices_output \
  --library unstructured \
  --parallel 4

# Run job
pdfstract-scheduler run daily_invoices

# View results
cat ./daily_invoices_output/batch_report.json

Example 4: Legacy Archive Migration

Scenario: Modernize 50,000 legacy PDFs to JSON

pdfstract batch ./legacy_archive \
  --library marker \
  --format json \
  --output ./modern_archive \
  --parallel 16 \
  --skip-errors

# Review the report once the job finishes (it is written at the end of the run)
jq '.statistics' ./modern_archive/batch_report.json

Integration Examples

Bash Script: Nightly Batch Job

#!/bin/bash
DATE=$(date +%Y%m%d)
OUTPUT_DIR="./converted/$DATE"

pdfstract batch ./daily_pdfs \
  --library unstructured \
  --format markdown \
  --output "$OUTPUT_DIR" \
  --parallel 8 \
  --skip-errors

# Alert if failures
FAILED=$(jq '.statistics.failed' "$OUTPUT_DIR/batch_report.json")
if [ "$FAILED" -gt 0 ]; then
  echo "⚠️  $FAILED conversions failed on $DATE" | mail -s "PDF batch failures" admin@company.com
fi

Python: Programmatic Usage

import subprocess
import json
from pathlib import Path

def batch_convert(pdf_dir: str, library: str, output_dir: str) -> dict:
    """Run `pdfstract batch` and return the parsed batch report."""
    result = subprocess.run([
        'pdfstract', 'batch', pdf_dir,
        '--library', library,
        '--output', output_dir,
        '--parallel', '4'
    ])
    if result.returncode != 0:
        print(f"pdfstract exited with code {result.returncode}; "
              "check the report for failed files")

    # Load and parse the report written into the output directory
    report_file = Path(output_dir) / 'batch_report.json'
    with open(report_file) as f:
        report = json.load(f)

    success_rate = (report['statistics']['success'] /
                    report['total_files'] * 100)
    print(f"Success Rate: {success_rate:.1f}%")
    return report

Docker: Containerized Processing

FROM python:3.13-slim
RUN pip install pdfstract

ENTRYPOINT ["pdfstract"]
CMD ["batch", "/data/input", "--library", "unstructured", "--output", "/data/output"]

Build and run (bind mounts need absolute paths, hence $(pwd)):

# Build
docker build -t pdfstract .

# Run
docker run -v "$(pwd)/pdfs:/data/input" -v "$(pwd)/converted:/data/output" pdfstract

CI/CD: GitHub Actions Example

name: PDF Extraction

on: [push]

jobs:
  extract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Install PDFStract
        run: pip install pdfstract
      
      - name: Extract PDFs
        run: |
          pdfstract batch ./source_pdfs \
            --library unstructured \
            --format json \
            --output ./extracted
      
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: extracted-pdfs
          path: extracted/

Performance Tips

1. Test Before Large Batches

# Always test the library on a single file first
pdfstract convert sample.pdf --library CHOSEN_LIB

# Then run the batch
pdfstract batch ./1000_files --library CHOSEN_LIB --output ./out --parallel 4

2. Choose Library Based on Speed vs Quality

  • Speed: pymupdf4llm (--parallel 16)
  • Balanced: unstructured (--parallel 6)
  • Quality: marker (--parallel 2)

3. Monitor Long-Running Jobs

pdfstract batch ./files --library marker --output ./out --parallel 2 2>&1 | tee job.log

# In another terminal
tail -f job.log

4. Retry Failed Conversions

# Extract failed files from report
jq -r '.files | to_entries[] | select(.value.status=="failed") | .key' \
  batch_report.json > failed.txt

# Retry with a different library (report keys are bare file names,
# so prefix them with the original input directory)
mkdir -p retry_results
while read -r file; do
  pdfstract convert "./pdfs/$file" --library marker --output "retry_results/${file%.pdf}.md"
done < failed.txt
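
The same retry can be driven from the report directly in Python. The input directory (`./pdfs`), output directory, and retry library here are placeholders for your own values:

```python
import json
import subprocess
from pathlib import Path

def retry_plan(report: dict, input_dir: str, out_dir: str) -> list[tuple[str, str]]:
    """Pair each failed source PDF with a markdown output path."""
    return [(str(Path(input_dir) / name),
             str(Path(out_dir) / f"{Path(name).stem}.md"))
            for name, info in report["files"].items()
            if info["status"] == "failed"]

# Only runs when a report is present in the current directory
if Path("batch_report.json").exists():
    report = json.loads(Path("batch_report.json").read_text())
    Path("retry_results").mkdir(exist_ok=True)
    for src, dst in retry_plan(report, "./pdfs", "retry_results"):
        subprocess.run(["pdfstract", "convert", src,
                        "--library", "marker", "--output", dst], check=True)
```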

Troubleshooting

"Library 'X' not available"

Solution:

pdfstract libs  # See what's available

# Install missing library
uv add LIBRARY_NAME
# or
pip install LIBRARY_NAME

"File is not a PDF"

Only PDF files are supported. Check:

  • File extension is .pdf
  • File is actually a PDF (not renamed)
  • File is not corrupted

Batch Job Very Slow

Reduce parallel workers:

# Instead of --parallel 8
pdfstract batch ./docs --library marker --output ./out --parallel 2

Or distribute job across multiple machines.

Memory Running Out

Reduce workers or process in smaller batches:

pdfstract batch ./docs --library marker --output ./out --parallel 1

DeepSeek-OCR Not Working

Requires CUDA GPU. Alternatives:

  • PaddleOCR (no GPU needed)
  • Pytesseract (no GPU needed)
  • Unstructured (CPU-based)

Batch Job Scheduler

For recurring jobs, use the batch scheduler:

# Create scheduled job
pdfstract-scheduler create daily_job \
  ./input_dir \
  ./output_dir \
  --library unstructured \
  --parallel 4

# Run job
pdfstract-scheduler run daily_job

# View execution history
pdfstract-scheduler history daily_job

# List all jobs
pdfstract-scheduler list

Add to cron for automated scheduling (this entry runs daily at 02:00):

0 2 * * * pdfstract-scheduler run daily_job


Happy extracting! 🚀📄✨
