PDFStract - Unified PDF Extraction & Conversion CLI + Web UI with 10+ extraction libraries

PDFStract - PDF Extraction & Conversion

A modern web application and CLI for converting PDFs to Markdown, JSON, and plain text using a range of extraction libraries. It is built with a FastAPI backend and a responsive React frontend.

✨ Features

  • 🚀 10+ Conversion Libraries: PyMuPDF4LLM, MarkItDown, Marker, Docling, PaddleOCR, DeepSeek-OCR, Tesseract, MinerU, Unstructured, and more
  • 📱 Modern React UI: Beautiful, responsive design with Tailwind CSS
  • 💻 Command-Line Interface: Full CLI with batch processing, multi-library comparison, and automation
  • 🎯 Multiple Output Formats: Markdown, JSON, and Plain Text
  • ⏱️ Performance Benchmarking: Real-time timer shows conversion speed for each library
  • 👁️ Live Preview: View converted content with syntax highlighting
  • 🔄 Library Status Dashboard: See which libraries are available/unavailable with error messages
  • 💾 Easy Download: Download results in your preferred format
  • 🐳 Docker Support: One-command deployment
  • 🔗 REST API: Programmatic access to conversion features
  • ⚡ Batch Processing: Parallel conversion of 100+ PDFs with detailed reporting
  • 🌙 Dark Mode Ready: Works seamlessly in light and dark themes

📚 Supported Libraries

| Library | Version | Type | Status | Notes |
|---|---|---|---|---|
| pymupdf4llm | >=0.0.26 | Text Extraction | Fast | Best for simple PDFs |
| markitdown | >=0.1.2 | Markdown | Balanced | Microsoft's conversion tool |
| marker | >=1.8.1 | Advanced ML | High Quality | Excellent results, slower |
| docling | >=2.41.0 | Document Intelligence | Advanced | IBM's document platform |
| paddleocr | >=3.3.2 | OCR | Accurate | Great for scanned PDFs |
| unstructured | >=0.15.0 | Document Parsing | Smart | Intelligent element extraction |
| deepseekocr | Latest | GPU OCR | Fast (GPU only) | Requires CUDA GPU |
| pytesseract | >=0.3.10 | OCR | Classic | Tesseract-based (requires system binary) |

🚀 Quick Start

Prerequisites

  • Python: 3.13+
  • UV: Fast Python package manager
  • Node.js: 20+ (for frontend development)
  • Docker (optional): For containerized deployment

Installation

  1. Clone the repository:
git clone https://github.com/aksarav/pdfstract.git
cd pdfstract
  2. Install Python dependencies:
uv sync
  3. Install frontend dependencies:
cd frontend
npm install
cd ..

Running Locally

Terminal 1: Start the FastAPI Backend

uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload

Terminal 2: Start the React Frontend (Development)

cd frontend
npm run dev

Access the Application: open the frontend at http://localhost:5173; the backend API listens on http://localhost:8000.

Note: The frontend development server proxies API calls to the backend at port 8000 (configured in frontend/vite.config.js)

Production Build

To build the React app for production:

cd frontend
npm run build

This creates an optimized build in frontend/dist/ which gets copied to /static by the Docker build process.

Running with Docker

docker-compose up --build

The application will be available at http://localhost:8000

Running with VS Code Debugger

  1. Press F5 or go to Run → Start Debugging
  2. The debugger will use the configuration in .vscode/launch.json
  3. Set breakpoints and debug your FastAPI backend

๐Ÿ–ฅ๏ธ Command-Line Interface (CLI)

PDFStract includes a powerful CLI for batch processing and automation.

Quick CLI Examples

# List available libraries
pdfstract libs

# Convert a single PDF
pdfstract convert document.pdf --library unstructured --output result.md

# Compare multiple libraries on one PDF
pdfstract compare sample.pdf -l unstructured -l marker -l pymupdf4llm --output ./comparison

# Batch convert 100+ PDFs in parallel
pdfstract batch ./documents --library unstructured --output ./converted --parallel 4

# Test which library works best on your corpus
pdfstract batch-compare ./papers -l marker -l unstructured --max-files 50 --output ./test

CLI Features

✨ Full Features:

  • Single file conversion
  • Multi-library comparison
  • Parallel batch processing (1-16 workers)
  • Batch quality testing across corpus
  • JSON reporting with detailed statistics
  • Error handling and retry options
  • Progress indicators and rich formatting

📊 Batch Processing:

  • Convert 1000+ PDFs with parallel workers
  • Detailed JSON reports (success rate, per-file status)
  • Automatic error handling and logging
  • Perfect for production jobs and legacy migrations

→ Full CLI Documentation: see CLI_README.md for the complete guide with real-world examples

📖 Usage

Web Interface (React Frontend)

Single Conversion:

  1. Upload PDF: Drag & drop or click to select a PDF file
  2. Select Library: Choose your preferred conversion library from the dropdown
  3. Choose Format: Select output format (Markdown, JSON, or Plain Text)
  4. Convert: Click "Convert PDF" button
  5. View Results:
    • See original PDF on the left
    • View converted content on the right
    • Switch between "Source" and "Preview" tabs
  6. Download: Click "Download" to save the results
  7. Performance: Real-time timer shows conversion speed

Compare Multiple Models (New Feature):

  1. Upload PDF: Select a PDF file
  2. Click "Compare Models": Opens library selection modal
  3. Select Libraries: Choose 1-3 converters to compare
  4. Watch Progress: Real-time progress bar shows which models are running
  5. View Results Grid: See all conversions in a table with:
    • Time taken for each
    • Output file size
    • Success/Failed/Timeout status
  6. Expand Details: Click a row to see full content
  7. Download: Download individual or all results
  8. History: Recent comparisons shown in left sidebar

API Usage

Check available libraries:

curl http://localhost:8000/libraries

Response:

{
  "libraries": [
    {
      "name": "pymupdf4llm",
      "available": true,
      "error": null
    },
    {
      "name": "deepseekocr",
      "available": false,
      "error": "GPU required but not available"
    }
  ]
}
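
The same check can be scripted. The following is a minimal Python sketch using only the standard library; it assumes the response shape shown above, and the helper names `fetch_libraries` and `available_libraries` are illustrative, not part of PDFStract.

```python
import json
import urllib.request


def available_libraries(payload: dict) -> list[str]:
    """Return the names of libraries the backend reports as available."""
    return [
        lib["name"]
        for lib in payload.get("libraries", [])
        if lib.get("available")
    ]


def fetch_libraries(base_url: str = "http://localhost:8000") -> dict:
    """GET /libraries and decode the JSON body (assumes the backend is running)."""
    with urllib.request.urlopen(f"{base_url}/libraries", timeout=10) as resp:
        return json.load(resp)
```

With the backend running locally, `available_libraries(fetch_libraries())` returns only the entries whose `available` flag is true, which is useful before choosing a `library` value for `/convert`.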

Convert a PDF:

curl -X POST \
  -F "file=@sample.pdf" \
  -F "library=unstructured" \
  -F "output_format=markdown" \
  http://localhost:8000/convert

Response:

{
  "success": true,
  "library_used": "unstructured",
  "filename": "sample.pdf",
  "format": "markdown",
  "content": "# Document Title\n\n... extracted markdown ..."
}
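
For callers that prefer Python over curl, the request above can be reproduced with the standard library alone. This is a sketch under the assumption that `/convert` accepts the three form fields shown in the curl call; `build_multipart` and `convert_pdf` are hypothetical helper names, and a third-party HTTP client would shorten this considerably.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(fields: dict[str, str], file_path: Path) -> tuple[bytes, str]:
    """Encode text fields plus one PDF (form field "file") as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts: list[bytes] = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    # File part: headers first, then the raw bytes of the PDF
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{file_path.name}"\r\nContent-Type: application/pdf\r\n\r\n'.encode()
    )
    parts.append(file_path.read_bytes())
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def convert_pdf(file_path: str, library: str, output_format: str = "markdown",
                base_url: str = "http://localhost:8000") -> dict:
    """POST the PDF to /convert and return the decoded JSON response."""
    body, content_type = build_multipart(
        {"library": library, "output_format": output_format}, Path(file_path)
    )
    request = urllib.request.Request(
        f"{base_url}/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Calling `convert_pdf("sample.pdf", "unstructured")` against a running backend should yield a dictionary like the response above, with the extracted text under `content`.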

For Batch Processing: Use the CLI instead

pdfstract batch ./documents --library unstructured --output ./converted --parallel 4

Advantages of CLI for batch jobs:

  • Parallel processing with configurable workers
  • JSON report with statistics (success rate, per-file status)
  • Error handling and retry options
  • Perfect for production automation
  • See CLI_README.md for full batch documentation
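
When the batch command is launched from a scheduler or another Python program, a thin wrapper keeps the arguments in one place. The sketch below uses only the flags shown in the command above and the 1-16 worker range listed under CLI Features; `build_batch_command` and `run_batch` are illustrative names, and the meaning of the exit code is an assumption since this README does not document it.

```python
import subprocess


def build_batch_command(input_dir: str, library: str, output_dir: str,
                        parallel: int = 4) -> list[str]:
    """Assemble the `pdfstract batch` argument list shown in this README."""
    if not 1 <= parallel <= 16:
        raise ValueError("parallel must be between 1 and 16")
    return [
        "pdfstract", "batch", str(input_dir),
        "--library", library,
        "--output", str(output_dir),
        "--parallel", str(parallel),
    ]


def run_batch(input_dir: str, library: str, output_dir: str, parallel: int = 4) -> int:
    """Run the batch job and return the process exit code.

    Assumption: a non-zero exit code signals failure; check CLI_README.md.
    """
    completed = subprocess.run(
        build_batch_command(input_dir, library, output_dir, parallel), check=False
    )
    return completed.returncode
```

Passing the arguments as a list rather than a shell string avoids quoting problems with directory names that contain spaces.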

API Endpoints

| Endpoint | Method | Description | Parameters |
|---|---|---|---|
| / | GET | Web interface | - |
| /health | GET | Health check | - |
| /libraries | GET | List available libraries | - |
| /convert | POST | Convert PDF | file, library, output_format |
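
Scripts that start the backend and then call it often need to wait until `/health` responds. The following sketch polls that endpoint; it assumes only that a healthy server answers with HTTP 200 (the response body is not documented here), and `wait_until_healthy` is a hypothetical helper name.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url: str = "http://localhost:8000", attempts: int = 10,
                       delay: float = 1.0, opener=urllib.request.urlopen) -> bool:
    """Poll GET /health until it returns HTTP 200 or the attempts run out."""
    for attempt in range(attempts):
        try:
            with opener(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused or an HTTP error: the server is not ready yet
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

The `opener` parameter exists only so the function can be exercised without a live server; in normal use the default is sufficient.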

๐Ÿ—๏ธ Project Structure

pdfstract/
โ”œโ”€โ”€ main.py                          # FastAPI application with endpoints
โ”œโ”€โ”€ pyproject.toml                   # Python dependencies (uv)
โ”œโ”€โ”€ uv.lock                          # Locked dependencies
โ”œโ”€โ”€ Dockerfile                       # Docker configuration
โ”œโ”€โ”€ docker-compose.yml               # Docker compose setup
โ”œโ”€โ”€ README.md                        # This file
โ”‚
โ”œโ”€โ”€ frontend/                        # React application (Vite + Tailwind)
โ”‚   โ”œโ”€โ”€ src/
โ”‚   โ”‚   โ”œโ”€โ”€ App.jsx                 # Main React component & routes
โ”‚   โ”‚   โ”œโ”€โ”€ components/
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ CompareModal.jsx           # Library selection modal
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ RecentComparisons.jsx      # History sidebar
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ ComparisonResults.jsx      # Results display grid
โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ ui/                       # UI components (button, card, etc.)
โ”‚   โ”‚   โ”œโ”€โ”€ index.css               # Global styles
โ”‚   โ”‚   โ””โ”€โ”€ main.jsx                # React entry point
โ”‚   โ”œโ”€โ”€ dist/                       # Built frontend (production)
โ”‚   โ”œโ”€โ”€ vite.config.js              # Vite configuration & proxy setup
โ”‚   โ”œโ”€โ”€ tailwind.config.js          # Tailwind CSS config
โ”‚   โ”œโ”€โ”€ package.json                # Node dependencies
โ”‚   โ””โ”€โ”€ index.html                  # HTML entry point
โ”‚
โ”œโ”€โ”€ services/                        # Backend services
โ”‚   โ”œโ”€โ”€ db_service.py               # SQLite database operations
โ”‚   โ”œโ”€โ”€ queue_manager.py            # Parallel execution (max 3)
โ”‚   โ”œโ”€โ”€ results_manager.py          # File storage for results
โ”‚   โ”œโ”€โ”€ ocrfactory.py               # Converter factory & registry
โ”‚   โ”œโ”€โ”€ base.py                     # Base converter class
โ”‚   โ”œโ”€โ”€ logger.py                   # Logging configuration
โ”‚   โ””โ”€โ”€ converters/                 # Converter implementations
โ”‚       โ”œโ”€โ”€ pymupdf4llm_converter.py
โ”‚       โ”œโ”€โ”€ unstructured_converter.py
โ”‚       โ”œโ”€โ”€ mineru_converter.py
โ”‚       โ”œโ”€โ”€ marker_converter.py
โ”‚       โ”œโ”€โ”€ paddleocr_converter.py
โ”‚       โ””โ”€โ”€ ... (more converters)
โ”‚
โ”œโ”€โ”€ scripts/
โ”‚   โ””โ”€โ”€ setup-mineru.sh             # MinerU separate venv setup
โ”‚
โ”œโ”€โ”€ data/
โ”‚   โ””โ”€โ”€ tasks.db                    # SQLite database (auto-created)
โ”‚
โ”œโ”€โ”€ results/                        # Conversion results storage
โ”‚   โ””โ”€โ”€ task_*/                     # Per-task directories
โ”‚
โ””โ”€โ”€ .vscode/
    โ””โ”€โ”€ launch.json                 # VS Code debugger config

🔧 Configuration

Environment Variables

Currently, no environment variables are required. The application is configured via:

  • main.py: Core FastAPI setup
  • pyproject.toml: Python dependencies
  • docker-compose.yml: Docker configuration

Frontend Configuration

The React frontend is configured via:

  • frontend/vite.config.js: Vite build config with API proxy
  • frontend/tailwind.config.js: Tailwind CSS theming
  • frontend/package.json: Node dependencies

API Proxy Setup

The frontend development server proxies API calls to the backend:

// frontend/vite.config.js
server: {
  proxy: {
    '/libraries': { target: 'http://localhost:8000' },
    '/convert': { target: 'http://localhost:8000' },
    '/compare': { target: 'http://localhost:8000' },
    '/history': { target: 'http://localhost:8000' },
    '/health': { target: 'http://localhost:8000' },
  }
}

Customization

Add a new converter:

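  1. Create a new file in services/converters/ (for example, myconverter.py, matching the import used when registering it):
from services.base import PDFConverter

class MyConverter(PDFConverter):
    """Example converter; replace the body of convert_to_md with real extraction logic."""

    @property
    def name(self) -> str:
        # Unique identifier for this converter
        return "myconverter"

    @property
    def available(self) -> bool:
        # Return False here if a required dependency cannot be loaded
        return True

    async def convert_to_md(self, file_path: str) -> str:
        # Extract the PDF at file_path and return its Markdown text
        raise NotImplementedError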
  2. Register in services/ocrfactory.py:
from services.converters.myconverter import MyConverter

# In _register_default_converters():
converters.append(MyConverter())

# In list_all_converters():
all_converters.append("myconverter")
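
To illustrate how a converter class and a registry fit together, here is a self-contained sketch of the pattern. It is not the code in services/base.py or services/ocrfactory.py: `PDFConverter` below is a stand-in with the same three members as the example above, and `EchoConverter` and `ConverterRegistry` are hypothetical names used only for illustration.

```python
import asyncio
from abc import ABC, abstractmethod


class PDFConverter(ABC):
    """Stand-in for services.base.PDFConverter; the real base class may differ."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def available(self) -> bool: ...

    @abstractmethod
    async def convert_to_md(self, file_path: str) -> str: ...


class EchoConverter(PDFConverter):
    """Toy converter that returns a heading instead of parsing the PDF."""

    @property
    def name(self) -> str:
        return "echo"

    @property
    def available(self) -> bool:
        return True

    async def convert_to_md(self, file_path: str) -> str:
        return f"# {file_path}\n"


class ConverterRegistry:
    """Hypothetical registry: maps converter names to instances."""

    def __init__(self) -> None:
        self._converters: dict[str, PDFConverter] = {}

    def register(self, converter: PDFConverter) -> None:
        self._converters[converter.name] = converter

    def get(self, name: str) -> PDFConverter:
        converter = self._converters.get(name)
        if converter is None:
            raise KeyError(f"unknown converter: {name}")
        if not converter.available:
            raise RuntimeError(f"converter unavailable: {name}")
        return converter

    def list_available(self) -> list[str]:
        return sorted(n for n, c in self._converters.items() if c.available)
```

For example, after `registry.register(EchoConverter())`, calling `asyncio.run(registry.get("echo").convert_to_md("a.pdf"))` returns the Markdown string. Keeping `available` as a property lets the registry skip converters whose dependencies are missing without failing at import time.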

๐Ÿ› Troubleshooting

Common Issues

Issue: Library shows as unavailable

  • Solution: Check dependencies with uv sync and verify system requirements

Issue: DeepSeek-OCR unavailable

  • Solution: Requires CUDA GPU. Install CUDA toolkit or use CPU-only alternatives

Issue: Docker container can't find dependencies

  • Solution: Rebuild the image without the cache (docker-compose build --no-cache), then start it with docker-compose up

Issue: Large PDF timeout

  • Solution: Some libraries (marker, unstructured) are slower. Try pymupdf4llm for faster processing

System Requirements

For OCR libraries (PaddleOCR, Tesseract, DeepSeek-OCR):

  • macOS/Linux: System libraries may be needed
  • Windows: May require Visual C++ build tools
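
A quick way to confirm that a required system binary is installed is to look it up on the PATH. The sketch below does this for `tesseract`, the executable that pytesseract wraps; `missing_binaries` is an illustrative helper and is not how PDFStract itself decides availability.

```python
import shutil


def missing_binaries(names: list[str], which=shutil.which) -> list[str]:
    """Return the executables from `names` that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]
```

Running `missing_binaries(["tesseract"])` returns an empty list when the binary is installed and `["tesseract"]` otherwise. The same call with `"nvidia-smi"` gives a rough hint about whether NVIDIA drivers are present, though it does not prove that CUDA is usable.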

📊 Performance Comparison

Use the built-in timer feature to benchmark:

| Library | Speed | Quality | Best For |
|---|---|---|---|
| pymupdf4llm | ⚡⚡⚡ | ⭐⭐ | Simple text extraction |
| unstructured | ⚡⚡ | ⭐⭐⭐ | Complex layouts |
| markitdown | ⚡⚡ | ⭐⭐⭐ | Balanced performance |
| marker | ⚡ | ⭐⭐⭐⭐ | Highest quality (ML-based) |
| docling | ⚡ | ⭐⭐⭐⭐ | Document intelligence |
| paddleocr | ⚡ | ⭐⭐⭐ | Scanned PDFs |
| deepseekocr | ⚡ | ⭐⭐⭐ | Scanned PDFs |
| pytesseract | ⚡ | ⭐⭐⭐ | Scanned PDFs |

NOTE: These ratings reflect each library's behavior with the application's default settings. Actual performance varies with the complexity of the PDF and the library's configuration.
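
To reproduce a comparison like this on your own documents outside the web UI, a small timing harness is enough. This is a generic sketch, not the application's built-in timer: `benchmark` is a hypothetical helper that takes zero-argument callables, each of which would wrap one conversion.

```python
import time
from typing import Callable


def benchmark(converters: dict[str, Callable[[], object]]) -> list[tuple[str, float]]:
    """Time each callable once and return (name, seconds) pairs, fastest first."""
    timings: list[tuple[str, float]] = []
    for name, run in converters.items():
        start = time.perf_counter()
        run()
        timings.append((name, time.perf_counter() - start))
    return sorted(timings, key=lambda item: item[1])
```

A single run per library is noisy, especially for ML-based converters that load models on first use, so repeating the measurement and discarding the first run gives a fairer picture.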

๐Ÿ” Security

  • File uploads are stored temporarily and deleted after conversion
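  • Uploaded PDFs are not kept after conversion; comparison tasks and their results are stored locally (data/tasks.db and results/, see Project Structure)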
  • Use HTTPS in production
  • API endpoints are not authenticated (add authentication for production)
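
As a starting point for the authentication mentioned above, a shared API key can be checked with a constant-time comparison. The sketch below is framework-agnostic and is not part of PDFStract; `is_authorized` and the `x-api-key` header name are assumptions, and wiring the check into the FastAPI app (for example as a dependency) is left to the deployment.

```python
import hmac


def is_authorized(headers: dict[str, str], expected_key: str,
                  header_name: str = "x-api-key") -> bool:
    """Return True only if the request carries the expected API key."""
    # Header names are case-insensitive, so normalize before the lookup
    normalized = {key.lower(): value for key, value in headers.items()}
    supplied = normalized.get(header_name.lower(), "")
    # An empty expected key would let everyone in; reject it outright
    if not expected_key:
        return False
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

The key itself should come from configuration outside the repository, and the check only makes sense when the service is reached over HTTPS.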

๐Ÿ“ Development

Frontend Development (Hot Reload)

cd frontend
npm run dev

Frontend will be available at http://localhost:5173 with hot-reload enabled.

Backend Development (With Debugger)

Use VS Code's Run & Debug feature:

  1. Press F5 or go to Run → Start Debugging
  2. Breakpoints and debugging work via .vscode/launch.json
  3. Backend reloads on file changes via --reload flag

Adding Frontend Dependencies

cd frontend
npm install <package-name>

Building Frontend for Production

cd frontend
npm run build

Output: frontend/dist/ → Gets copied to /app/static in Docker

๐Ÿค Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

📄 License

This project is provided as-is for educational and development purposes.

🌟 Features Roadmap

  • Batch PDF conversion
  • Convert and Compare multiple PDFs and Generate a Report
  • Conversion history and Task Management
  • Cloud storage integration - Read from and write to cloud storage
  • REST API documentation (Swagger UI)

📞 Support

If you encounter issues or have questions:

  1. Check the Troubleshooting section
  2. Review converter-specific documentation
  3. Open an issue on GitHub

🌟 Please leave a star if you find this project useful

๐Ÿ™ Acknowledgments

  • FastAPI: Modern Python web framework
  • React: UI library
  • Tailwind CSS: Utility-first CSS framework
  • Lucide Icons: Beautiful icon library
  • All the amazing PDF extraction libraries (PyMuPDF, Marker, Docling, etc.)

**Made with ❤️ for PDF enthusiasts**
