PDFStract - Unified PDF Extraction & Conversion CLI + Web UI with 10+ extraction libraries
PDFStract CLI - Command-Line Interface Guide
PDFStract now includes a powerful command-line interface for PDF extraction and conversion with support for batch processing, multi-library comparison, and production automation.
Table of Contents
- Installation
- Performance & Startup Time
- Quick Start
- Commands
- Batch Processing
- Real-World Examples
- Integration Examples
- Performance Tips
- Troubleshooting
Installation
From PyPI (Recommended)
pip install pdfstract
From Source
git clone https://github.com/aksarav/pdfstract.git
cd pdfstract
pip install -e .
# or
uv sync
Verify Installation
pdfstract --help
⚡ Performance & Startup Time
Understanding CLI Startup
PDFStract CLI uses lazy loading to keep startup times fast:
| Command | Startup Time | Why |
|---|---|---|
| pdfstract --help | < 1s ⚡ | Only loads the Click framework, no PDF libraries |
| pdfstract convert --help | < 1s ⚡ | No library checking, just shows usage |
| pdfstract libs | 8-10s | Checks every supported library (one-time) |
| pdfstract convert file.pdf -l X | 3-8s | Loads only library X, not the others |
| pdfstract batch ./docs -l X | 3-8s | Loads only library X for bulk processing |
Why Is First Use Slow?
The 3-8s delay on first conversion is NOT a PDFStract issue - it's because the PDF libraries themselves are heavy:
- torch (~2-3s) - Machine learning framework
- paddleocr/marker/docling (~3-5s) - Load ML models on first use
- Python imports (~1-2s) - Module loading overhead
This is a one-time cost per Python session. After first use, conversions run faster because libraries stay loaded.
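The lazy-loading idea can be sketched with a minimal module cache. This is an illustration of the general technique, not PDFStract's actual internals; the LazyLoader class below is hypothetical:

```python
import importlib

class LazyLoader:
    """Caches imported modules so the heavy import cost is paid once per process."""

    def __init__(self):
        self._cache = {}

    def get(self, module_name):
        # Import on first access only; later calls hit the cache.
        if module_name not in self._cache:
            self._cache[module_name] = importlib.import_module(module_name)
        return self._cache[module_name]

loader = LazyLoader()
json_mod = loader.get("json")    # first call: real import
json_again = loader.get("json")  # second call: cached, effectively free
```

This is why the second and later conversions in one session are much faster than the first: the expensive import happens exactly once.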
How to Keep CLI Fast
1. Help commands are instant:

pdfstract --help            # Instant
pdfstract convert --help    # Instant
pdfstract batch --help      # Instant

2. Use fast libraries for quick conversions:

pdfstract convert file.pdf --library pymupdf4llm   # Fastest (~3s)
pdfstract convert file.pdf --library markitdown    # Fast (~4s)
pdfstract convert file.pdf --library marker        # Quality but slower (~8s)

3. Batch processing amortizes the startup cost:

# First file: 8s (includes startup)
# Files 2-100: ~1-2s each (library already loaded)
pdfstract batch ./docs --library unstructured --parallel 4

4. For production scripts, keep the process alive:

# Bad: multiple 8s startups
for file in *.pdf; do
  pdfstract convert "$file" -l unstructured
done

# Better: single 8s startup, then fast conversions
python -c "
from glob import glob
from services.cli_factory import CLILazyFactory
factory = CLILazyFactory()
for file in glob('*.pdf'):
    factory.convert('unstructured', file)
"
Quick Start
1. List Available Libraries
pdfstract libs
Output shows which PDF extraction libraries are installed:
✓ Available: unstructured, marker, pymupdf4llm, docling, ...
✗ Unavailable: (with error reasons)
2. Convert a Single PDF
# Convert to markdown (default)
pdfstract convert document.pdf --library unstructured
# Save to file
pdfstract convert document.pdf --library unstructured --output result.md
# Different formats
pdfstract convert document.pdf --library marker --format json --output result.json
pdfstract convert document.pdf --library pymupdf4llm --format text
3. Test Multiple Libraries
pdfstract compare sample.pdf \
-l unstructured \
-l marker \
-l pymupdf4llm \
--format markdown \
--output ./comparison_results
Results in:
comparison_results/
├─ unstructured_result.md
├─ marker_result.md
├─ pymupdf4llm_result.md
└─ comparison_summary.json
Commands
pdfstract libs
List all available PDF extraction libraries and their status.
pdfstract libs
Output:
- Shows 10+ libraries (PyMuPDF4LLM, MarkItDown, Marker, Docling, etc.)
- Displays availability status (✓ Available / ✗ Unavailable)
- Shows error messages for unavailable libraries
pdfstract convert
Convert a single PDF file with a specified library.
pdfstract convert INPUT_FILE [OPTIONS]
Options:
- -l, --library TEXT (required) - Extraction library to use
- -f, --format [markdown|json|text] - Output format (default: markdown)
- -o, --output PATH - Output file path (optional; prints to stdout if not specified)
Examples:
# Print to terminal
pdfstract convert sample.pdf --library unstructured
# Save to file
pdfstract convert sample.pdf --library marker --output result.md
# JSON format
pdfstract convert sample.pdf --library docling --format json --output result.json
pdfstract compare
Compare multiple extraction libraries on a single PDF to find the best one.
pdfstract compare INPUT_FILE [OPTIONS]
Options:
- -l, --libraries TEXT (required, multiple) - Libraries to compare
- -f, --format [markdown|json|text] - Output format (default: markdown)
- -o, --output PATH (required) - Output directory for results
Examples:
# Compare 3 libraries
pdfstract compare sample.pdf \
-l unstructured \
-l marker \
-l pymupdf4llm \
--output ./test_results
# Compare with JSON output
pdfstract compare invoice.pdf \
-l marker \
-l docling \
--format json \
--output ./compare
Output:
- Individual result files for each library
- comparison_summary.json with metadata and stats
pdfstract batch
Batch convert multiple PDFs in a directory with parallel processing.
pdfstract batch INPUT_DIRECTORY [OPTIONS]
Options:
- -l, --library TEXT (required) - Extraction library to use
- -f, --format [markdown|json|text] - Output format (default: markdown)
- -o, --output PATH (required) - Output directory
- -p, --parallel INTEGER - Number of parallel workers (default: 2)
- --pattern TEXT - File pattern to match (default: *.pdf)
- --skip-errors - Skip PDFs that fail conversion
Examples:
# Basic batch conversion
pdfstract batch ./documents \
--library unstructured \
--output ./converted
# With parallel processing
pdfstract batch ./documents \
--library marker \
--output ./converted \
--parallel 4
# With error handling
pdfstract batch ./pdfs \
--library docling \
--format json \
--output ./converted \
--parallel 8 \
--skip-errors
# Custom file pattern
pdfstract batch ./invoices \
--library unstructured \
--pattern "invoice_*.pdf" \
--output ./structured
Output:
output_directory/
├─ file1.md
├─ file2.md
├─ file3.md
├─ ... (more files)
└─ batch_report.json
Batch Report (batch_report.json):
{
"input_directory": "/path/to/pdfs",
"output_directory": "/path/to/output",
"library": "unstructured",
"format": "markdown",
"total_files": 150,
"statistics": {
"success": 147,
"failed": 2,
"skipped": 1
},
"files": {
"document1.pdf": {
"status": "success",
"size_bytes": 45230
},
"document2.pdf": {
"status": "failed",
"error": "Invalid PDF format"
}
}
}
pdfstract batch-compare
Compare multiple extraction libraries across an entire corpus of PDFs.
pdfstract batch-compare INPUT_DIRECTORY [OPTIONS]
Options:
- -l, --libraries TEXT (required, multiple) - Libraries to compare
- -f, --format [markdown|json|text] - Output format (default: markdown)
- -o, --output PATH (required) - Output directory
- --max-files INTEGER - Limit number of files to process
Examples:
# Compare on all PDFs
pdfstract batch-compare ./papers \
-l marker \
-l unstructured \
-l pymupdf4llm \
--output ./library_comparison
# Quick test on sample
pdfstract batch-compare ./large_corpus \
-l marker \
-l unstructured \
--max-files 50 \
--output ./sample_test
Output:
- batch_comparison_report.json with per-library success rates
- Per-file results for all PDFs tested
Batch Processing
When to Use Batch Processing
Batch processing is perfect for:
- Converting 100+ PDFs with one library
- Testing multiple libraries on entire corpus
- Production automation jobs
- Legacy archive digitization
- Enterprise migrations
Parallel Processing Guidelines
Choose workers based on library and hardware:
| Library | CPU Usage | Recommended Workers |
|---|---|---|
| PyMuPDF4LLM | Low | 8-16 |
| MarkItDown | Medium | 4-8 |
| Unstructured | Medium | 4-6 |
| Marker (ML) | High | 2-4 |
| OCR (Paddle/Tesseract) | Very High | 1-2 |
# Fast library, beefy server
pdfstract batch ./docs --library pymupdf4llm --parallel 16
# Slow ML library
pdfstract batch ./docs --library marker --parallel 2
# Medium library, balanced
pdfstract batch ./docs --library unstructured --parallel 6
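The worker recommendations above can be encoded as a small helper. The pick_workers function below is a hypothetical convenience that mirrors the table (upper bound of each range) and caps the result at the machine's CPU count:

```python
import os

# Recommended worker counts from the table above (upper bound of each range).
RECOMMENDED_WORKERS = {
    "pymupdf4llm": 16,
    "markitdown": 8,
    "unstructured": 6,
    "marker": 4,
    "paddleocr": 2,
    "pytesseract": 2,
}

def pick_workers(library: str) -> int:
    """Return a worker count for --parallel, capped by available CPUs."""
    recommended = RECOMMENDED_WORKERS.get(library, 2)  # default: conservative
    return min(recommended, os.cpu_count() or 1)

print(pick_workers("pymupdf4llm"))
```

The returned value can be passed straight to the --parallel flag.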
Error Handling
Without --skip-errors (default):
- Stops on first error
- Exit code 1 if failures occur
- Best for strict pipelines
With --skip-errors:
- Continues processing all files
- Failed files marked in report
- Exit code 0 (always succeeds)
- Best for best-effort processing
# Strict mode (fail on errors)
pdfstract batch ./docs --library unstructured --output ./result
# Best-effort mode (skip errors)
pdfstract batch ./docs --library unstructured --output ./result --skip-errors
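In a wrapper script it helps to build the command line explicitly and branch on the exit code. The build_batch_cmd helper below is a hypothetical sketch; the flags simply mirror the options documented above:

```python
import subprocess

def build_batch_cmd(input_dir: str, output_dir: str, library: str,
                    strict: bool = True) -> list[str]:
    """Assemble a pdfstract batch command line.

    In strict mode the process exits non-zero on failure, which lets
    shell pipelines and CI jobs stop; --skip-errors always exits 0.
    """
    cmd = ["pdfstract", "batch", input_dir,
           "--library", library, "--output", output_dir]
    if not strict:
        cmd.append("--skip-errors")
    return cmd

cmd = build_batch_cmd("./docs", "./result", "unstructured", strict=False)
# subprocess.run(cmd, check=True)  # check=True raises on a non-zero exit
print(cmd)
```

With strict=True and subprocess.run(..., check=True), a failed conversion raises CalledProcessError, which is usually what a production pipeline wants.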
Parsing Batch Reports
Use jq to analyze batch reports:
# Overall statistics
jq '.statistics' batch_report.json
# Success rate
jq '.statistics.success / .total_files * 100' batch_report.json
# Failed files only
jq '.files | to_entries[] | select(.value.status=="failed")' batch_report.json
# Average output size
jq '.files | to_entries[] | select(.value.status=="success") | .value.size_bytes' batch_report.json | \
awk '{sum+=$1; count++} END {print sum/count/1024 " KB"}'
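If jq is not available, the same analysis takes a few lines of Python. The report below is a minimal in-memory example matching the schema shown earlier; in practice you would json.load the real batch_report.json:

```python
import json

# A minimal report matching the schema shown earlier.
report = json.loads("""
{
  "total_files": 3,
  "statistics": {"success": 2, "failed": 1, "skipped": 0},
  "files": {
    "a.pdf": {"status": "success", "size_bytes": 2048},
    "b.pdf": {"status": "success", "size_bytes": 4096},
    "c.pdf": {"status": "failed", "error": "Invalid PDF format"}
  }
}
""")

success_rate = report["statistics"]["success"] / report["total_files"] * 100
failed = [name for name, info in report["files"].items()
          if info["status"] == "failed"]
sizes = [info["size_bytes"] for info in report["files"].values()
         if info["status"] == "success"]
avg_kb = sum(sizes) / len(sizes) / 1024

print(f"Success rate: {success_rate:.1f}%")  # 66.7%
print(f"Failed: {failed}")                   # ['c.pdf']
print(f"Average output: {avg_kb:.1f} KB")    # 3.0 KB
```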
Real-World Examples
Example 1: Law Firm Document Digitization
Scenario: Convert 5,000 case files to searchable markdown
# Step 1: Test libraries on a representative case file
pdfstract compare sample_case.pdf \
-l marker \
-l unstructured \
-l docling \
--output ./test_results
# Step 2: Review outputs, pick best library (e.g., docling)
# Step 3: Full batch conversion (5,000 cases)
pdfstract batch ./all_cases \
--library docling \
--format markdown \
--output ./converted_cases \
--parallel 8 \
--skip-errors
# Step 4: Monitor results
jq '.statistics' ./converted_cases/batch_report.json
Illustrative outcome: roughly 2 months of manual work reduced to about 12 hours of automated processing, and an estimated $100k in labor replaced by about $500 of compute.
Example 2: Research Paper Quality Testing
Scenario: Find best extractor for 1,000 research papers
# Test on sample
pdfstract batch-compare ./papers \
-l marker \
-l unstructured \
-l pymupdf4llm \
--max-files 50 \
--output ./library_test
# Review success rates, pick best
# Full batch with chosen library
pdfstract batch ./papers \
--library marker \
--format json \
--output ./extracted_papers \
--parallel 4
Example 3: Invoice Processing Pipeline
Scenario: Daily automated invoice conversion
# Create batch scheduler config
pdfstract-scheduler create daily_invoices \
./daily_invoices_input \
./daily_invoices_output \
--library unstructured \
--parallel 4
# Run job
pdfstract-scheduler run daily_invoices
# View results
cat ./daily_invoices_output/batch_report.json
Example 4: Legacy Archive Migration
Scenario: Modernize 50,000 legacy PDFs to JSON
pdfstract batch ./legacy_archive \
--library marker \
--format json \
--output ./modern_archive \
--parallel 16 \
--skip-errors
# Check the summary once the run completes
# (batch_report.json is written at the end of the job)
jq '.statistics' ./modern_archive/batch_report.json
Integration Examples
Bash Script: Nightly Batch Job
#!/bin/bash
DATE=$(date +%Y%m%d)
OUTPUT_DIR="./converted/$DATE"
pdfstract batch ./daily_pdfs \
--library unstructured \
--format markdown \
--output "$OUTPUT_DIR" \
--parallel 8 \
--skip-errors
# Alert if failures
FAILED=$(jq '.statistics.failed' "$OUTPUT_DIR/batch_report.json")
if [ "$FAILED" -gt 0 ]; then
echo "⚠️ $FAILED conversions failed on $DATE" | mail -s "PDF batch failures ($DATE)" admin@company.com
fi
Python: Programmatic Usage
import subprocess
import json
from pathlib import Path
def batch_convert(pdf_dir: str, library: str, output_dir: str) -> dict:
    """Wrapper around the pdfstract CLI."""
    result = subprocess.run([
        'pdfstract', 'batch', pdf_dir,
        '--library', library,
        '--output', output_dir,
        '--parallel', '4'
    ])
    if result.returncode != 0:
        print("Warning: batch exited non-zero; some conversions may have failed")
    # Load and parse the report written by the batch command
    report_file = Path(output_dir) / 'batch_report.json'
    with open(report_file) as f:
        report = json.load(f)
    success_rate = (report['statistics']['success'] /
                    report['total_files'] * 100)
    print(f"Success Rate: {success_rate:.1f}%")
    return report
Docker: Containerized Processing
FROM python:3.13-slim
RUN pip install pdfstract
ENTRYPOINT ["pdfstract"]
CMD ["batch", "/data/input", "--library", "unstructured", "--output", "/data/output"]
# Build
docker build -t pdfstract .
# Run
docker run -v ./pdfs:/data/input -v ./converted:/data/output pdfstract
CI/CD: GitHub Actions Example
name: PDF Extraction
on: [push]
jobs:
  extract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install PDFStract
        run: pip install pdfstract
      - name: Extract PDFs
        run: |
          pdfstract batch ./source_pdfs \
            --library unstructured \
            --format json \
            --output ./extracted
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: extracted-pdfs
          path: extracted/
Performance Tips
1. Test Before Large Batches
# Always test library first
pdfstract convert sample.pdf --library CHOSEN_LIB
# Then run batch
pdfstract batch ./1000_files --library CHOSEN_LIB --parallel 4
2. Choose Library Based on Speed vs Quality
- Speed: pymupdf4llm (--parallel 16)
- Balanced: unstructured (--parallel 6)
- Quality: marker (--parallel 2)
3. Monitor Long-Running Jobs
pdfstract batch ./files --library marker --parallel 2 2>&1 | tee job.log
# In another terminal
tail -f job.log
4. Retry Failed Conversions
# Extract failed files from report
jq -r '.files | to_entries[] | select(.value.status=="failed") | .key' \
batch_report.json > failed.txt
# Retry with different library
while read file; do
pdfstract convert "$file" --library marker --output "retry_results/${file%.pdf}.md"
done < failed.txt
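The same retry loop can be written in Python against the report schema shown earlier. The retry_commands helper below is hypothetical, and the retry_results directory and fallback library are illustrative:

```python
import json
from pathlib import Path

def retry_commands(report_path: str, fallback_library: str,
                   out_dir: str = "retry_results") -> list[list[str]]:
    """Build one pdfstract convert command per failed file in the report."""
    report = json.loads(Path(report_path).read_text())
    cmds = []
    for name, info in report["files"].items():
        if info["status"] != "failed":
            continue  # only retry files that actually failed
        out_file = Path(out_dir) / (Path(name).stem + ".md")
        cmds.append(["pdfstract", "convert", name,
                     "--library", fallback_library,
                     "--output", str(out_file)])
    return cmds
```

Each command list can then be executed with subprocess.run, or written out as a shell script for later review.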
Troubleshooting
CLI Takes Long to Start (8-10 seconds)
This is normal and expected for the first conversion. Here's why:
What's happening:
- The first time you run pdfstract convert, Python loads the PDF libraries
- Libraries like torch, paddleocr, and marker are large (~500MB-1GB)
- They initialize machine learning models on first use
- This happens only once per Python session
Solutions:
1. Use fast libraries for quick results:

# Fast: ~3-4 seconds
pdfstract convert file.pdf --library pymupdf4llm
# Slow: ~8-10 seconds (but better quality)
pdfstract convert file.pdf --library marker

2. Help commands are instant (no startup delay):

pdfstract --help           # Instant!
pdfstract convert --help   # Instant!

3. Batch processing hides the delay:

# First file: ~10s total (includes library loading)
# Files 2-100: ~1-2s each
# Total for 100 files: ~120s instead of 1000s
pdfstract batch ./pdfs --library unstructured --parallel 4

4. Keep the Python process alive for multiple conversions:

# Each call: 10s (wasteful!)
pdfstract convert 1.pdf -l unstructured
pdfstract convert 2.pdf -l unstructured

# Better: keep Python alive
python -c "
from services.cli_factory import CLILazyFactory
factory = CLILazyFactory()
factory.convert('unstructured', '1.pdf')
factory.convert('unstructured', '2.pdf')  # ~1s each!
"
"Library 'X' not available"
Solution:
pdfstract libs # See what's available
# Install missing library
uv add LIBRARY_NAME
# or
pip install LIBRARY_NAME
"File is not a PDF"
Only PDF files are supported. Check:
- File extension is .pdf
- File is actually a PDF (not just renamed)
- File is not corrupted
Batch Job Very Slow
Reduce parallel workers:
# Instead of --parallel 8
pdfstract batch ./docs --library marker --parallel 2
Or distribute job across multiple machines.
Memory Running Out
Reduce workers or process in smaller batches:
pdfstract batch ./docs --library marker --parallel 1
DeepSeek-OCR Not Working
Requires CUDA GPU. Alternatives:
- PaddleOCR (no GPU needed)
- Pytesseract (no GPU needed)
- Unstructured (CPU-based)
Batch Job Scheduler
For recurring jobs, use the batch scheduler:
# Create scheduled job
pdfstract-scheduler create daily_job \
./input_dir \
./output_dir \
--library unstructured \
--parallel 4
# Run job
pdfstract-scheduler run daily_job
# View execution history
pdfstract-scheduler history daily_job
# List all jobs
pdfstract-scheduler list
Add to cron for automated scheduling:
0 2 * * * pdfstract-scheduler run daily_job
Next Steps
- Read main README.md for Web UI guide
- Check batch processing use cases
- Join GitHub discussions for questions
- Open issues for bugs or feature requests
Support
- GitHub: https://github.com/aksarav/pdfstract
- Issues: https://github.com/aksarav/pdfstract/issues
- Discussions: https://github.com/aksarav/pdfstract/discussions
Happy extracting! 🚀📄✨