PDFStract - Unified PDF Extraction & Conversion CLI + Web UI with 10+ extraction libraries
PDFStract - PDF Extraction & Conversion
A modern web application for converting PDFs to multiple formats using a range of state-of-the-art extraction libraries. Built with a FastAPI backend and a React frontend with a responsive UI.
Features
- 10+ Conversion Libraries: PyMuPDF4LLM, MarkItDown, Marker, Docling, PaddleOCR, DeepSeek-OCR, Tesseract, MinerU, Unstructured, and more
- Modern React UI: Beautiful, responsive design with Tailwind CSS
- Command-Line Interface: Full CLI with batch processing, multi-library comparison, and automation
- Multiple Output Formats: Markdown, JSON, and Plain Text
- Performance Benchmarking: Real-time timer shows conversion speed for each library
- Live Preview: View converted content with syntax highlighting
- Library Status Dashboard: See which libraries are available/unavailable with error messages
- Easy Download: Download results in your preferred format
- Docker Support: One-command deployment
- REST API: Programmatic access to conversion features
- Batch Processing: Parallel conversion of 100+ PDFs with detailed reporting
- Dark Mode Ready: Works seamlessly in light and dark themes
Supported Libraries
| Library | Version | Type | Status | Notes |
|---|---|---|---|---|
| pymupdf4llm | >=0.0.26 | Text Extraction | Fast | Best for simple PDFs |
| markitdown | >=0.1.2 | Markdown | Balanced | Microsoft's conversion tool |
| marker | >=1.8.1 | Advanced ML | High Quality | Excellent results, slower |
| docling | >=2.41.0 | Document Intelligence | Advanced | IBM's document platform |
| paddleocr | >=3.3.2 | OCR | Accurate | Great for scanned PDFs |
| unstructured | >=0.15.0 | Document Parsing | Smart | Intelligent element extraction |
| deepseekocr | Latest | GPU OCR | Fast (GPU only) | Requires CUDA GPU |
| pytesseract | >=0.3.10 | OCR | Classic | Tesseract-based (requires system binary) |
Quick Start
Prerequisites
- Python: 3.13+
- UV: Fast Python package manager
- Node.js: 20+ (for frontend development)
- Docker (optional): For containerized deployment
Installation
- Clone the repository:
git clone https://github.com/aksarav/pdfstract.git
cd pdfstract
- Install Python dependencies:
uv sync
- Install frontend dependencies:
cd frontend
npm install
cd ..
Running Locally
Terminal 1: Start the FastAPI Backend
uv run uvicorn main:app --host 0.0.0.0 --port 8000 --reload
Terminal 2: Start the React Frontend (Development)
cd frontend
npm run dev
Access the Application:
- Frontend: http://localhost:5173 (with hot-reload)
- Backend API: http://localhost:8000
Note: The frontend development server proxies API calls to the backend at port 8000 (configured in frontend/vite.config.js)
Production Build
To build the React app for production:
cd frontend
npm run build
This creates an optimized build in frontend/dist/ which gets copied to /static by the Docker build process.
Running with Docker
docker-compose up --build
The application will be available at http://localhost:8000
Running with VS Code Debugger
- Press F5 or go to Run → Start Debugging
- The debugger will use the configuration in .vscode/launch.json
- Set breakpoints and debug your FastAPI backend
Command-Line Interface (CLI)
PDFStract includes a powerful CLI for batch processing and automation.
Quick CLI Examples
# List available libraries
pdfstract libs
# Convert a single PDF
pdfstract convert document.pdf --library unstructured --output result.md
# Compare multiple libraries on one PDF
pdfstract compare sample.pdf -l unstructured -l marker -l pymupdf4llm --output ./comparison
# Batch convert 100+ PDFs in parallel
pdfstract batch ./documents --library unstructured --output ./converted --parallel 4
# Test which library works best on your corpus
pdfstract batch-compare ./papers -l marker -l unstructured --max-files 50 --output ./test
CLI Features
Full Features:
- Single file conversion
- Multi-library comparison
- Parallel batch processing (1-16 workers)
- Batch quality testing across corpus
- JSON reporting with detailed statistics
- Error handling and retry options
- Progress indicators and rich formatting
Batch Processing:
- Convert 1000+ PDFs with parallel workers
- Detailed JSON reports (success rate, per-file status)
- Automatic error handling and logging
- Perfect for production jobs and legacy migrations
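For automation, the batch command can be driven from a script. The following is a minimal sketch, assuming the `pdfstract` executable is on `PATH`; it uses only the `batch` flags shown in the examples above (`--library`, `--output`, `--parallel`). The function names `build_batch_command` and `run_batch` are hypothetical helpers, and rejecting worker counts outside 1-16 is this sketch's own choice, based on the range quoted in the feature list.

```python
import shlex
import subprocess


def build_batch_command(input_dir, output_dir, library="unstructured", parallel=4):
    """Build the argv list for `pdfstract batch` from the documented flags."""
    # The feature list quotes 1-16 workers; enforcing it here is an assumption.
    if not 1 <= parallel <= 16:
        raise ValueError("parallel must be between 1 and 16")
    return [
        "pdfstract", "batch", str(input_dir),
        "--library", library,
        "--output", str(output_dir),
        "--parallel", str(parallel),
    ]


def run_batch(input_dir, output_dir, library="unstructured", parallel=4):
    """Run the batch job and return the process exit code."""
    cmd = build_batch_command(input_dir, output_dir, library, parallel)
    print("Running:", shlex.join(cmd))
    # check=False leaves the decision about non-zero exit codes to the caller.
    return subprocess.run(cmd, check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.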
Full CLI Documentation: see CLI_README.md for the complete guide with real-world examples
Usage
Web Interface (React Frontend)
Single Conversion:
- Upload PDF: Drag & drop or click to select a PDF file
- Select Library: Choose your preferred conversion library from the dropdown
- Choose Format: Select output format (Markdown, JSON, or Plain Text)
- Convert: Click "Convert PDF" button
- View Results:
- See original PDF on the left
- View converted content on the right
- Switch between "Source" and "Preview" tabs
- Download: Click "Download" to save the results
- Performance: Real-time timer shows conversion speed
Compare Multiple Models (New Feature):
- Upload PDF: Select a PDF file
- Click "Compare Models": Opens library selection modal
- Select Libraries: Choose 1-3 converters to compare
- Watch Progress: Real-time progress bar shows which models are running
- View Results Grid: See all conversions in a table with:
- Time taken for each
- Output file size
- Success/Failed/Timeout status
- Expand Details: Click a row to see full content
- Download: Download individual or all results
- History: Recent comparisons shown in left sidebar
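The comparison flow above (several converters run side by side, each reporting time taken, output size, and a Success/Failed/Timeout status) can be sketched with `asyncio`. This is an illustration under stated assumptions, not the project's actual implementation: `run_one` and `compare` are hypothetical names, converters are modeled as plain async callables that take a file path, and the default timeout of 300 seconds is made up. The limit of three parallel runs mirrors the 1-3 library limit described above.

```python
import asyncio
import time


async def run_one(name, convert, file_path, timeout_s):
    """Run one converter coroutine and record its status and elapsed time."""
    start = time.perf_counter()
    try:
        content = await asyncio.wait_for(convert(file_path), timeout=timeout_s)
        status = "success"
    except asyncio.TimeoutError:
        content, status = None, "timeout"
    except Exception:
        # Any other error from the converter counts as a failed run.
        content, status = None, "failed"
    return {
        "library": name,
        "status": status,
        "seconds": round(time.perf_counter() - start, 3),
        "size_bytes": len(content.encode("utf-8")) if content else 0,
    }


async def compare(converters, file_path, timeout_s=300.0, max_parallel=3):
    """Run at most `max_parallel` converters at once; return one row per library."""
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(name, convert):
        async with gate:
            return await run_one(name, convert, file_path, timeout_s)

    # gather() preserves the order of the input mapping in its result list.
    return await asyncio.gather(*(guarded(n, c) for n, c in converters.items()))
```

Each row carries the same fields the results grid displays, so a caller could serialize the list directly to JSON.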
API Usage
Check available libraries:
curl http://localhost:8000/libraries
Response:
{
  "libraries": [
    {
      "name": "pymupdf4llm",
      "available": true,
      "error": null
    },
    {
      "name": "deepseekocr",
      "available": false,
      "error": "GPU required but not available"
    }
  ]
}
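A script can use this response to pick a library before submitting a job. The sketch below assumes only the response shape shown above (`libraries`, `name`, `available`, `error`); the helper names `split_by_availability` and `fetch_libraries` are hypothetical, and the 10-second timeout is arbitrary.

```python
import json
import urllib.request


def split_by_availability(payload):
    """Return (available_names, {unavailable_name: error}) from a /libraries payload."""
    available, unavailable = [], {}
    for entry in payload.get("libraries", []):
        if entry.get("available"):
            available.append(entry["name"])
        else:
            unavailable[entry["name"]] = entry.get("error")
    return available, unavailable


def fetch_libraries(base_url="http://localhost:8000"):
    """GET /libraries from a running backend and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/libraries", timeout=10) as resp:
        return json.load(resp)
```

With the backend running, `split_by_availability(fetch_libraries())` yields the usable library names along with the reason each of the others is unavailable.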
Convert a PDF:
curl -X POST \
  -F "file=@sample.pdf" \
  -F "library=unstructured" \
  -F "output_format=markdown" \
  http://localhost:8000/convert
Response:
{
  "success": true,
  "library_used": "unstructured",
  "filename": "sample.pdf",
  "format": "markdown",
  "content": "# Document Title\n\n... extracted markdown ..."
}
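The same upload can be made from Python with the standard library alone. This is a rough sketch that mirrors the curl call above: it sends the form fields `file`, `library`, and `output_format` to `/convert`. The helper names `build_multipart` and `convert_pdf` are hypothetical, and treating a response without a truthy `success` field as an error is an assumption, since the failure response shape is not documented here.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, filename, file_bytes):
    """Encode text fields plus one PDF (form name 'file') as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for key, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{key}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/pdf\r\n\r\n'.encode()
    )
    parts.append(file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def convert_pdf(path, library="unstructured", output_format="markdown",
                base_url="http://localhost:8000"):
    """POST a PDF to /convert and return the extracted content string."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"library": library, "output_format": output_format},
            os.path.basename(path),
            fh.read(),
        )
    req = urllib.request.Request(
        f"{base_url}/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # Assumption: failures are reported with a falsy "success" field.
    if not result.get("success"):
        raise RuntimeError(f"conversion failed: {result}")
    return result["content"]
```

A third-party HTTP client would shorten this considerably; the manual encoding is used here only to keep the sketch dependency-free.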
For Batch Processing: Use the CLI instead
pdfstract batch ./documents --library unstructured --output ./converted --parallel 4
Advantages of CLI for batch jobs:
- Parallel processing with configurable workers
- JSON report with statistics (success rate, per-file status)
- Error handling and retry options
- Perfect for production automation
- See CLI_README.md for full batch documentation
API Endpoints
| Endpoint | Method | Description | Parameters |
|---|---|---|---|
| / | GET | Web interface | - |
| /health | GET | Health check | - |
| /libraries | GET | List available libraries | - |
| /convert | POST | Convert PDF | file, library, output_format |
Project Structure
pdfstract/
├── main.py                            # FastAPI application with endpoints
├── pyproject.toml                     # Python dependencies (uv)
├── uv.lock                            # Locked dependencies
├── Dockerfile                         # Docker configuration
├── docker-compose.yml                 # Docker compose setup
├── README.md                          # This file
│
├── frontend/                          # React application (Vite + Tailwind)
│   ├── src/
│   │   ├── App.jsx                    # Main React component & routes
│   │   ├── components/
│   │   │   ├── CompareModal.jsx       # Library selection modal
│   │   │   ├── RecentComparisons.jsx  # History sidebar
│   │   │   ├── ComparisonResults.jsx  # Results display grid
│   │   │   └── ui/                    # UI components (button, card, etc.)
│   │   ├── index.css                  # Global styles
│   │   └── main.jsx                   # React entry point
│   ├── dist/                          # Built frontend (production)
│   ├── vite.config.js                 # Vite configuration & proxy setup
│   ├── tailwind.config.js             # Tailwind CSS config
│   ├── package.json                   # Node dependencies
│   └── index.html                     # HTML entry point
│
├── services/                          # Backend services
│   ├── db_service.py                  # SQLite database operations
│   ├── queue_manager.py               # Parallel execution (max 3)
│   ├── results_manager.py             # File storage for results
│   ├── ocrfactory.py                  # Converter factory & registry
│   ├── base.py                        # Base converter class
│   ├── logger.py                      # Logging configuration
│   └── converters/                    # Converter implementations
│       ├── pymupdf4llm_converter.py
│       ├── unstructured_converter.py
│       ├── mineru_converter.py
│       ├── marker_converter.py
│       ├── paddleocr_converter.py
│       └── ... (more converters)
│
├── scripts/
│   └── setup-mineru.sh                # MinerU separate venv setup
│
├── data/
│   └── tasks.db                       # SQLite database (auto-created)
│
├── results/                           # Conversion results storage
│   └── task_*/                        # Per-task directories
│
└── .vscode/
    └── launch.json                    # VS Code debugger config
Configuration
Environment Variables
Currently, no environment variables are required. The application is configured via:
- main.py: Core FastAPI setup
- pyproject.toml: Python dependencies
- docker-compose.yml: Docker configuration
Frontend Configuration
The React frontend is configured via:
- frontend/vite.config.js: Vite build config with API proxy
- frontend/tailwind.config.js: Tailwind CSS theming
- frontend/package.json: Node dependencies
API Proxy Setup
The frontend development server proxies API calls to the backend:
// frontend/vite.config.js
server: {
  proxy: {
    '/libraries': { target: 'http://localhost:8000' },
    '/convert': { target: 'http://localhost:8000' },
    '/compare': { target: 'http://localhost:8000' },
    '/history': { target: 'http://localhost:8000' },
    '/health': { target: 'http://localhost:8000' },
  }
}
Customization
Add a new converter:
- Create a new file in services/converters/:
from services.base import PDFConverter

class MyConverter(PDFConverter):
    @property
    def name(self) -> str:
        # Identifier used to select this converter
        return "myconverter"

    @property
    def available(self) -> bool:
        # Report whether the converter's dependencies are usable
        return True

    async def convert_to_md(self, file_path: str) -> str:
        # Implementation: return the extracted Markdown for file_path
        raise NotImplementedError
- Register in services/ocrfactory.py:
from services.converters.myconverter import MyConverter

# In _register_default_converters():
converters.append(MyConverter())

# In list_all_converters():
all_converters.append("myconverter")
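To make the converter steps above more concrete, here is a self-contained sketch of a converter whose `available` property checks for an optional dependency instead of always returning `True`. Everything outside the three member names (`name`, `available`, `convert_to_md`) is an assumption: the `PDFConverter` class below is a stand-in so the example runs on its own (the real base class lives in services/base.py and may differ), `PlainTextConverter` is a hypothetical name, and wrapping `pymupdf4llm.to_markdown` is just one possible backend.

```python
import asyncio
import importlib
import importlib.util
from abc import ABC, abstractmethod


class PDFConverter(ABC):
    """Stand-in for services.base.PDFConverter (assumed interface)."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def available(self) -> bool: ...

    @abstractmethod
    async def convert_to_md(self, file_path: str) -> str: ...


class PlainTextConverter(PDFConverter):
    """Hypothetical converter backed by an optional third-party module."""

    module_name = "pymupdf4llm"  # assumption: the dependency this converter wraps

    @property
    def name(self) -> str:
        return "plaintext"

    @property
    def available(self) -> bool:
        # find_spec() locates the module without importing it.
        return importlib.util.find_spec(self.module_name) is not None

    async def convert_to_md(self, file_path: str) -> str:
        if not self.available:
            raise RuntimeError(f"{self.module_name} is not installed")
        module = importlib.import_module(self.module_name)
        # Run the blocking conversion in a worker thread so the event loop stays free.
        return await asyncio.to_thread(module.to_markdown, file_path)
```

Checking availability lazily like this lets a status endpoint report a missing dependency without the application failing at import time.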
Troubleshooting
Common Issues
Issue: Library shows as unavailable
- Solution: Check dependencies with uv sync and verify system requirements
Issue: DeepSeek-OCR unavailable
- Solution: Requires a CUDA GPU. Install the CUDA toolkit or use a CPU-only alternative
Issue: Docker container can't find dependencies
- Solution: Rebuild with docker-compose up --build. To bypass the layer cache, run docker-compose build --no-cache first
Issue: Large PDF timeout
- Solution: Some libraries (marker, unstructured) are slower. Try pymupdf4llm for faster processing
System Requirements
For OCR libraries (PaddleOCR, Tesseract, DeepSeek-OCR):
- macOS/Linux: System libraries may be needed
- Windows: May require Visual C++ build tools
Performance Comparison
Use the built-in timer feature to benchmark:
| Library | Speed | Quality | Best For |
|---|---|---|---|
| pymupdf4llm | ⚡⚡⚡ | ⭐⭐ | Simple text extraction |
| unstructured | ⚡⚡ | ⭐⭐⭐ | Complex layouts |
| markitdown | ⚡⚡ | ⭐⭐⭐ | Balanced performance |
| marker | ⚡ | ⭐⭐⭐⭐ | Highest quality (ML-based) |
| docling | ⚡ | ⭐⭐⭐⭐ | Document intelligence |
| paddleocr | ⚡ | ⭐⭐⭐ | Scanned PDFs |
| deepseekocr | ⚡ | ⭐⭐⭐ | Scanned PDFs |
| pytesseract | ⚡ | ⭐⭐⭐ | Scanned PDFs |
NOTE: These ratings reflect each library running with the application's default settings. Actual performance varies with the complexity of the PDF and the library's configuration.
Security
- File uploads are stored temporarily and deleted after conversion
- No data is persisted or logged
- Use HTTPS in production
- API endpoints are not authenticated (add authentication for production)
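The "stored temporarily and deleted after conversion" behavior in the list above can be sketched as follows. This is an illustration only, not the application's actual upload code: `process_upload` is a hypothetical helper, and `handler` stands in for whatever conversion function receives the file path.

```python
import os
import tempfile


def process_upload(data: bytes, handler):
    """Write uploaded bytes to a temp file, run handler(path), then always delete it."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        return handler(path)
    finally:
        # Runs on success and on error, so no upload outlives its conversion.
        if os.path.exists(path):
            os.remove(path)
```

Putting the cleanup in a `finally` block means a converter that raises still leaves no uploaded file behind.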
Development
Frontend Development (Hot Reload)
cd frontend
npm run dev
Frontend will be available at http://localhost:5173 with hot-reload enabled.
Backend Development (With Debugger)
Use VS Code's Run & Debug feature:
- Press F5 or go to Run → Start Debugging
- Breakpoints and debugging work via .vscode/launch.json
- Backend reloads on file changes via the --reload flag
Adding Frontend Dependencies
cd frontend
npm install <package-name>
Building Frontend for Production
cd frontend
npm run build
Output: frontend/dist/ → Gets copied to /app/static in Docker
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
License
This project is provided as-is for educational and development purposes.
Features Roadmap
- Batch PDF conversion
- Convert and Compare multiple PDFs and Generate a Report
- Conversion history and Task Management
- Cloud storage integration - Read from and write to cloud storage
- REST API documentation (Swagger UI)
Support
If you encounter issues or have questions:
- Check the Troubleshooting section
- Review converter-specific documentation
- Open an issue on GitHub
Please leave a star if you find this project useful
Acknowledgments
- FastAPI: Modern Python web framework
- React: UI library
- Tailwind CSS: Utility-first CSS framework
- Lucide Icons: Beautiful icon library
- All the amazing PDF extraction libraries (PyMuPDF, Marker, Docling, etc.)
**Made with ❤️ for PDF enthusiasts**