PDFStract - Unified PDF Extraction & Conversion CLI + Web UI with 10+ extraction libraries

PDFStract - PDF Extraction & Conversion

A web application and command-line tool for converting PDFs to Markdown, JSON, or plain text using a choice of extraction libraries. It is built with a FastAPI backend and a responsive React frontend.

✨ Features

🚀 10+ Conversion Libraries: PyMuPDF4LLM, MarkItDown, Marker, Docling, PaddleOCR, DeepSeek-OCR, Tesseract, MinerU, Unstructured, and more
📱 Modern React UI: Beautiful, responsive design with Tailwind CSS
💻 Command-Line Interface: Full CLI with batch processing, multi-library comparison, and automation
🎯 Multiple Output Formats: Markdown, JSON, and Plain Text
⏱️ Performance Benchmarking: Real-time timer shows conversion speed for each library
👁️ Live Preview: View converted content with syntax highlighting
🔄 Library Status Dashboard: See which libraries are available/unavailable with error messages
💾 Easy Download: Download results in your preferred format
🐳 Docker Support: One-command deployment
🔗 REST API: Programmatic access to conversion features
⚡ Batch Processing: Parallel conversion of 100+ PDFs with detailed reporting
🌙 Dark Mode Ready: Works seamlessly in light and dark themes

📚 Supported Libraries

Library	Version	Type	Status	Notes
pymupdf4llm	>=0.0.26	Text Extraction	Fast	Best for simple PDFs
markitdown	>=0.1.2	Markdown	Balanced	Microsoft's conversion tool
marker	>=1.8.1	Advanced ML	High Quality	Excellent results, slower
docling	>=2.41.0	Document Intelligence	Advanced	IBM's document platform
paddleocr	>=3.3.2	OCR	Accurate	Great for scanned PDFs
unstructured	>=0.15.0	Document Parsing	Smart	Intelligent element extraction
deepseekocr	Latest	GPU OCR	Fast (GPU only)	Requires CUDA GPU
pytesseract	>=0.3.10	OCR	Classic	Tesseract-based (requires system binary)

🚀 Quick Start

Prerequisites

Python: 3.13+
UV: Fast Python package manager
Node.js: 20+ (for frontend development)
Docker (optional): For containerized deployment

Installation

Clone the repository:

git clone https://github.com/aksarav/pdfstract.git
cd pdfstract

Install Python dependencies:

uv sync

Install frontend dependencies:

cd frontend
npm install
cd ..

Running Locally

Terminal 1: Start the FastAPI Backend

uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload

Terminal 2: Start the React Frontend (Development)

cd frontend
npm run dev

Access the Application:

Frontend: http://localhost:5173 (with hot-reload)
Backend API: http://localhost:8000

Note: The frontend development server proxies API calls to the backend at port 8000 (configured in frontend/vite.config.js)

Production Build

To build the React app for production:

cd frontend
npm run build

This creates an optimized build in frontend/dist/ which gets copied to /static by the Docker build process.

Running with Docker

docker-compose up --build

The application will be available at http://localhost:8000

Running with VS Code Debugger

Press F5 or go to Run → Start Debugging
The debugger will use the configuration in .vscode/launch.json
Set breakpoints and debug your FastAPI backend

🖥️ Command-Line Interface (CLI)

PDFStract includes a powerful CLI for batch processing and automation.

Quick CLI Examples

# List available libraries
pdfstract libs

# Convert a single PDF
pdfstract convert document.pdf --library unstructured --output result.md

# Compare multiple libraries on one PDF
pdfstract compare sample.pdf -l unstructured -l marker -l pymupdf4llm --output ./comparison

# Batch convert 100+ PDFs in parallel
pdfstract batch ./documents --library unstructured --output ./converted --parallel 4

# Test which library works best on your corpus
pdfstract batch-compare ./papers -l marker -l unstructured --max-files 50 --output ./test

CLI Features

✨ Full Features:

Single file conversion
Multi-library comparison
Parallel batch processing (1-16 workers)
Batch quality testing across corpus
JSON reporting with detailed statistics
Error handling and retry options
Progress indicators and rich formatting

📊 Batch Processing:

Convert 1000+ PDFs with parallel workers
Detailed JSON reports (success rate, per-file status)
Automatic error handling and logging
Perfect for production jobs and legacy migrations
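
For scheduled or scripted jobs, the batch command can also be assembled and launched from Python. The following is a minimal sketch, assuming the pdfstract executable is on PATH and accepts the flags exactly as shown in the CLI examples; build_batch_command and run_batch are hypothetical helper names, not part of PDFStract.

```python
import subprocess


def build_batch_command(input_dir: str, library: str, output_dir: str,
                        parallel: int = 4) -> list[str]:
    """Assemble the argument list for `pdfstract batch` (flags as in the CLI examples)."""
    # The feature list states 1-16 parallel workers, so reject other values early.
    if not 1 <= parallel <= 16:
        raise ValueError("parallel must be between 1 and 16")
    return [
        "pdfstract", "batch", input_dir,
        "--library", library,
        "--output", output_dir,
        "--parallel", str(parallel),
    ]


def run_batch(input_dir: str, library: str, output_dir: str, parallel: int = 4) -> int:
    """Run the batch job and return the process exit code."""
    command = build_batch_command(input_dir, library, output_dir, parallel)
    # Arguments are passed as a list (no shell), so paths with spaces need no quoting.
    return subprocess.run(command, check=False).returncode
```

Calling run_batch("./documents", "unstructured", "./converted", parallel=4) would launch the same command as the batch example above. How the exit code maps to partial failures is not documented here, so treat it as a coarse signal.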

→ Full CLI documentation: see CLI_README.md for the complete guide with real-world examples

📖 Usage

Web Interface (React Frontend)

Single Conversion:

Upload PDF: Drag & drop or click to select a PDF file
Select Library: Choose your preferred conversion library from the dropdown
Choose Format: Select output format (Markdown, JSON, or Plain Text)
Convert: Click "Convert PDF" button
View Results:
- See original PDF on the left
- View converted content on the right
- Switch between "Source" and "Preview" tabs
Download: Click "Download" to save the results
Performance: Real-time timer shows conversion speed

Compare Multiple Models (New Feature):

Upload PDF: Select a PDF file
Click "Compare Models": Opens library selection modal
Select Libraries: Choose 1-3 converters to compare
Watch Progress: Real-time progress bar shows which models are running
View Results Grid: See all conversions in a table with:
- Time taken for each
- Output file size
- Success/Failed/Timeout status
Expand Details: Click a row to see full content
Download: Download individual or all results
History: Recent comparisons shown in left sidebar
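
The comparison flow above (several converters run on one PDF, each reported with its time, output size, and a Success/Failed/Timeout status) can be sketched with asyncio. This is an illustrative sketch, not the project's queue manager: compare, ComparisonRow, the semaphore-based limit of three concurrent runs, and the timeout value are assumptions, and converters are modelled as plain async callables that take a file path and return text.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable

Converter = Callable[[str], Awaitable[str]]


@dataclass
class ComparisonRow:
    library: str
    status: str       # "success", "failed", or "timeout"
    seconds: float    # wall-clock time spent on this converter
    size_bytes: int   # size of the UTF-8 encoded output, 0 when no output was produced


async def compare(file_path: str, converters: dict[str, Converter],
                  max_parallel: int = 3, timeout_s: float = 60.0) -> list[ComparisonRow]:
    """Run every converter on the same file, at most `max_parallel` at a time."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def run_one(name: str, convert: Converter) -> ComparisonRow:
        async with semaphore:
            start = time.perf_counter()
            try:
                text = await asyncio.wait_for(convert(file_path), timeout=timeout_s)
                status, size = "success", len(text.encode("utf-8"))
            except asyncio.TimeoutError:
                status, size = "timeout", 0
            except Exception:
                # Any other converter error is recorded rather than aborting the comparison.
                status, size = "failed", 0
            return ComparisonRow(name, status, time.perf_counter() - start, size)

    # gather preserves input order, so rows line up with the converters dict.
    return await asyncio.gather(*(run_one(n, c) for n, c in converters.items()))
```

One failing or slow converter does not affect the others: each row carries its own status, which is what makes a side-by-side results grid possible.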

API Usage

Check available libraries:

curl http://localhost:8000/libraries

Response:

{
  "libraries": [
    {
      "name": "pymupdf4llm",
      "available": true,
      "error": null
    },
    {
      "name": "deepseekocr",
      "available": false,
      "error": "GPU required but not available"
    }
  ]
}
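
The same check can be done from Python using only the standard library. This is a small sketch that assumes the response has exactly the shape shown above (a "libraries" list whose items carry "name", "available", and "error"); the function names are illustrative.

```python
import json
from urllib.request import urlopen


def available_libraries(payload: dict) -> list[str]:
    """Return the names of libraries the server reports as available."""
    return [lib["name"] for lib in payload.get("libraries", []) if lib.get("available")]


def unavailable_reasons(payload: dict) -> dict[str, str]:
    """Map each unavailable library to the error message the server returned."""
    return {
        lib["name"]: lib.get("error") or "unknown"
        for lib in payload.get("libraries", [])
        if not lib.get("available")
    }


def fetch_libraries(base_url: str = "http://localhost:8000") -> dict:
    """GET /libraries from a running backend and decode the JSON body."""
    with urlopen(f"{base_url}/libraries", timeout=10) as response:
        return json.load(response)
```

With the backend running, available_libraries(fetch_libraries()) gives the list of names that can be passed as the library field when converting.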

Convert a PDF:

curl -X POST \
  -F "file=@sample.pdf" \
  -F "library=unstructured" \
  -F "output_format=markdown" \
  http://localhost:8000/convert

Response:

{
  "success": true,
  "library_used": "unstructured",
  "filename": "sample.pdf",
  "format": "markdown",
  "content": "# Document Title\n\n... extracted markdown ..."
}
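
For callers that prefer Python over curl, the upload can be built with the standard library alone. The sketch below assumes the form fields (file, library, output_format) and the "success"/"content" response keys shown above; build_convert_request and convert_pdf are hypothetical helper names, and the shape of an error response is not documented here, so the failure branch simply raises with whatever the server returned.

```python
import json
import os
import uuid
from urllib.request import Request, urlopen


def build_convert_request(pdf_bytes: bytes, filename: str, library: str,
                          output_format: str = "markdown",
                          base_url: str = "http://localhost:8000") -> Request:
    """Build a multipart/form-data POST for /convert using the documented form fields."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields first, one multipart section each.
    for field, value in (("library", library), ("output_format", output_format)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{field}"\r\n\r\n{value}\r\n'.encode()
        )
    # The PDF itself goes in the "file" field with a filename and content type.
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/pdf\r\n\r\n").encode() + pdf_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return Request(f"{base_url}/convert", data=b"".join(parts), headers=headers, method="POST")


def convert_pdf(path: str, library: str, output_format: str = "markdown") -> str:
    """Upload a PDF and return the extracted content, raising if the server reports failure."""
    with open(path, "rb") as handle:
        request = build_convert_request(handle.read(), os.path.basename(path), library, output_format)
    with urlopen(request, timeout=300) as response:
        result = json.load(response)
    if not result.get("success"):
        raise RuntimeError(f"conversion failed: {result}")
    return result["content"]
```

With the backend running, convert_pdf("sample.pdf", "unstructured") would return the Markdown string from the "content" field. The 300-second timeout is an arbitrary choice to leave room for the slower libraries.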

For Batch Processing: Use the CLI instead

pdfstract batch ./documents --library unstructured --output ./converted --parallel 4

Advantages of CLI for batch jobs:

Parallel processing with configurable workers
JSON report with statistics (success rate, per-file status)
Error handling and retry options
Perfect for production automation
See CLI_README.md for full batch documentation

API Endpoints

Endpoint	Method	Description	Parameters
`/`	GET	Web interface	-
`/health`	GET	Health check	-
`/libraries`	GET	List available libraries	-
`/convert`	POST	Convert PDF	`file`, `library`, `output_format`

🏗️ Project Structure

pdfstract/
├── main.py                          # FastAPI application with endpoints
├── pyproject.toml                   # Python dependencies (uv)
├── uv.lock                          # Locked dependencies
├── Dockerfile                       # Docker configuration
├── docker-compose.yml               # Docker compose setup
├── README.md                        # This file
│
├── frontend/                        # React application (Vite + Tailwind)
│   ├── src/
│   │   ├── App.jsx                 # Main React component & routes
│   │   ├── components/
│   │   │   ├── CompareModal.jsx           # Library selection modal
│   │   │   ├── RecentComparisons.jsx      # History sidebar
│   │   │   ├── ComparisonResults.jsx      # Results display grid
│   │   │   └── ui/                       # UI components (button, card, etc.)
│   │   ├── index.css               # Global styles
│   │   └── main.jsx                # React entry point
│   ├── dist/                       # Built frontend (production)
│   ├── vite.config.js              # Vite configuration & proxy setup
│   ├── tailwind.config.js          # Tailwind CSS config
│   ├── package.json                # Node dependencies
│   └── index.html                  # HTML entry point
│
├── services/                        # Backend services
│   ├── db_service.py               # SQLite database operations
│   ├── queue_manager.py            # Parallel execution (max 3)
│   ├── results_manager.py          # File storage for results
│   ├── ocrfactory.py               # Converter factory & registry
│   ├── base.py                     # Base converter class
│   ├── logger.py                   # Logging configuration
│   └── converters/                 # Converter implementations
│       ├── pymupdf4llm_converter.py
│       ├── unstructured_converter.py
│       ├── mineru_converter.py
│       ├── marker_converter.py
│       ├── paddleocr_converter.py
│       └── ... (more converters)
│
├── scripts/
│   └── setup-mineru.sh             # MinerU separate venv setup
│
├── data/
│   └── tasks.db                    # SQLite database (auto-created)
│
├── results/                        # Conversion results storage
│   └── task_*/                     # Per-task directories
│
└── .vscode/
    └── launch.json                 # VS Code debugger config

🔧 Configuration

Environment Variables

Currently, no environment variables are required. The application is configured via:

main.py: Core FastAPI setup
pyproject.toml: Python dependencies
docker-compose.yml: Docker configuration

Frontend Configuration

The React frontend is configured via:

frontend/vite.config.js: Vite build config with API proxy
frontend/tailwind.config.js: Tailwind CSS theming
frontend/package.json: Node dependencies

API Proxy Setup

The frontend development server proxies API calls to the backend:

// frontend/vite.config.js
server: {
  proxy: {
    '/libraries': { target: 'http://localhost:8000' },
    '/convert': { target: 'http://localhost:8000' },
    '/compare': { target: 'http://localhost:8000' },
    '/history': { target: 'http://localhost:8000' },
    '/health': { target: 'http://localhost:8000' },
  }
}

Customization

Add a new converter:

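Create a new file in services/converters/ (for example, myconverter.py):

from services.base import PDFConverter

class MyConverter(PDFConverter):
    @property
    def name(self) -> str:
        # Identifier the converter is registered under
        return "myconverter"

    @property
    def available(self) -> bool:
        # Return False when a required dependency is missing
        return True

    async def convert_to_md(self, file_path: str) -> str:
        # Replace with the actual extraction logic; must return Markdown text
        raise NotImplementedError

Register the converter in the converter factory (services/ocrfactory.py in the project structure above):

from services.converters.myconverter import MyConverter

# In _register_default_converters():
converters.append(MyConverter())

# In list_all_converters():
all_converters.append("myconverter")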
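
A converter's available property is a natural place to detect a missing optional dependency. The following self-contained sketch shows one way to do that; the PDFConverter class here is a stand-in reduced to the three members used above (not the real services.base module), and OptionalDependencyConverter is a hypothetical name.

```python
import importlib.util
from abc import ABC, abstractmethod


class PDFConverter(ABC):
    """Stand-in for services.base.PDFConverter, reduced to the members used above."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def available(self) -> bool: ...

    @abstractmethod
    async def convert_to_md(self, file_path: str) -> str: ...


class OptionalDependencyConverter(PDFConverter):
    """Hypothetical converter that is available only when its backing module is installed."""

    def __init__(self, name: str, module_name: str) -> None:
        self._name = name
        self._module_name = module_name  # top-level module name, e.g. "pymupdf4llm"

    @property
    def name(self) -> str:
        return self._name

    @property
    def available(self) -> bool:
        # find_spec locates a top-level module without importing it, so the check is cheap.
        return importlib.util.find_spec(self._module_name) is not None

    async def convert_to_md(self, file_path: str) -> str:
        if not self.available:
            raise RuntimeError(f"{self._module_name} is not installed")
        # Placeholder output; a real converter would call into the library here.
        return f"# Converted {file_path} with {self._name}\n"
```

Checking with find_spec rather than importing keeps startup fast when many heavy libraries are registered, at the cost of not catching import-time failures such as a missing GPU runtime.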

🐛 Troubleshooting

Common Issues

Issue: Library shows as unavailable

Solution: Check dependencies with uv sync and verify system requirements

Issue: DeepSeek-OCR unavailable

Solution: DeepSeek-OCR requires a CUDA-capable GPU. Install the CUDA toolkit, or use an OCR library that runs on CPU, such as PaddleOCR or Tesseract

Issue: Docker container can't find dependencies

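Solution: Rebuild the image with docker-compose up --build. Because --build still reuses cached layers, run docker-compose build --no-cache first if the problem persists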

Issue: Large PDF timeout

Solution: Some libraries (marker, unstructured) are slower. Try pymupdf4llm for faster processing

System Requirements

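For OCR libraries (PaddleOCR, Tesseract, DeepSeek-OCR), extra system-level dependencies may be required; pytesseract, for example, needs the Tesseract binary installed: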

macOS/Linux: System libraries may be needed
Windows: May require Visual C++ build tools

📊 Performance Comparison

Use the built-in timer feature to benchmark:

Library	Speed	Quality	Best For
pymupdf4llm	⚡⚡⚡	⭐⭐	Simple text extraction
unstructured	⚡⚡	⭐⭐⭐	Complex layouts
markitdown	⚡⚡	⭐⭐⭐	Balanced performance
marker	⚡	⭐⭐⭐⭐	Highest quality (ML-based)
docling	⚡	⭐⭐⭐⭐	Document intelligence
paddleocr	⚡	⭐⭐⭐	Scanned PDFs
deepseekocr	⚡	⭐⭐⭐	Scanned PDFs
pytesseract	⚡	⭐⭐⭐	Scanned PDFs

NOTE: These ratings reflect each library running with the application's default settings. Actual speed and quality vary with the complexity of the PDF and with library configuration.
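
To reproduce a comparison like this on your own documents outside the UI, a small timing harness is enough. This is a generic sketch, not PDFStract code: benchmark and rank are hypothetical helpers, and convert stands for any callable that takes a file path and returns the extracted text.

```python
import statistics
import time
from typing import Callable


def benchmark(convert: Callable[[str], str], file_path: str, runs: int = 3) -> dict[str, float]:
    """Time `convert(file_path)` over several runs and summarise the wall-clock durations."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        convert(file_path)
        timings.append(time.perf_counter() - start)
    # The median is less sensitive than the mean to a slow first run (model loading, caches).
    return {"median_s": statistics.median(timings), "min_s": min(timings), "max_s": max(timings)}


def rank(results: dict[str, dict[str, float]]) -> list[str]:
    """Order library names from fastest to slowest by median time."""
    return sorted(results, key=lambda name: results[name]["median_s"])
```

Timing alone says nothing about output quality, so it is worth reading the converted text for a few representative files before settling on a library.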

🔐 Security

File uploads are stored temporarily and deleted after conversion
No data is persisted or logged
Use HTTPS in production
API endpoints are not authenticated (add authentication for production)
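
The first point (uploads are stored temporarily and deleted after conversion) follows a common pattern that can be sketched as below. This illustrates the pattern only and is not the code in main.py; convert_upload is a hypothetical name, and convert stands for any callable that takes a file path and returns text.

```python
import os
import tempfile
from typing import Callable


def convert_upload(pdf_bytes: bytes, convert: Callable[[str], str]) -> str:
    """Write the upload to a temporary file, convert it, and always remove the file."""
    fd, temp_path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(pdf_bytes)
        return convert(temp_path)
    finally:
        # Runs on success and on error, so the upload does not outlive the request.
        os.remove(temp_path)
```

Putting the cleanup in a finally block matters most for failing conversions, which are otherwise the usual source of leftover files.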

📝 Development

Frontend Development (Hot Reload)

cd frontend
npm run dev

Frontend will be available at http://localhost:5173 with hot-reload enabled.

Backend Development (With Debugger)

Use VS Code's Run & Debug feature:

Press F5 or go to Run → Start Debugging
Breakpoints and debugging work via .vscode/launch.json
Backend reloads on file changes via --reload flag

Adding Frontend Dependencies

cd frontend
npm install <package-name>

Building Frontend for Production

cd frontend
npm run build

Output: frontend/dist/ → Gets copied to /app/static in Docker

🤝 Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Submit a pull request

📄 License

This project is provided as-is for educational and development purposes.

🌟 Features Roadmap

Batch PDF conversion
Convert and Compare multiple PDFs and Generate a Report
Conversion history and Task Management
Cloud storage integration - Read from and write to cloud storage
REST API documentation (Swagger UI)

📞 Support

If you encounter issues or have questions:

Check the Troubleshooting section
Review converter-specific documentation
Open an issue on GitHub

🌟 Please leave a star if you find this project useful

🙏 Acknowledgments

FastAPI: Modern Python web framework
React: UI library
Tailwind CSS: Utility-first CSS framework
Lucide Icons: Beautiful icon library
All the amazing PDF extraction libraries (PyMuPDF, Marker, Docling, etc.)

Made with ❤️ for PDF enthusiasts

Project details

Release history

Version	Release date
1.1.1	Mar 17, 2026
1.1.0	Feb 12, 2026
1.0.4	Dec 21, 2025
1.0.3	Dec 21, 2025
1.0.2	Dec 21, 2025
1.0.1	Dec 20, 2025
1.0.0 (this version)	Dec 20, 2025

Download files

Source Distribution

pdfstract-1.0.0.tar.gz (35.2 kB)

Uploaded Dec 20, 2025 Source

Built Distribution

pdfstract-1.0.0-py3-none-any.whl (37.0 kB)

Uploaded Dec 20, 2025 Python 3

File details

Details for the file pdfstract-1.0.0.tar.gz.

File metadata

Download URL: pdfstract-1.0.0.tar.gz
Upload date: Dec 20, 2025
Size: 35.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pdfstract-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`742cf3366ad4cfeda31dfafe8d92faec608abd6636179a64bbe4eafa948223f9`
MD5	`3d36b7572b12d5d496eaf102d2e51635`
BLAKE2b-256	`ea24fb06021cb123dd017a04482f210f88071aa1db5f9bfc380ae202d1c12e90`

File details

Details for the file pdfstract-1.0.0-py3-none-any.whl.

File metadata

Download URL: pdfstract-1.0.0-py3-none-any.whl
Upload date: Dec 20, 2025
Size: 37.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pdfstract-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c3da8b0e764e4f38dd0e45f9c09b3afc86e1ae2f557341a2d93dd2a0e2a0abec`
MD5	`798966888cdff515d686aa66afeb1144`
BLAKE2b-256	`8b12a4e123eb62a9aa9171901e003d746a0fac2e2899899a405018b81d363ed2`
