Convert PDF files to EPUB format via Markdown with intelligent layout detection
PDF2EPUB
A powerful Python package for converting PDF files to EPUB format via Markdown with intelligent layout detection, AI-powered postprocessing, and seamless CLI/API integration.
Features
- Smart Layout Detection - Handles books, academic papers, and complex documents
- Advanced PDF Processing - OCR, table detection, and image extraction
- AI Postprocessing - Enhance quality with Anthropic Claude integration
- Clean Markdown Output - Structured, readable Markdown with preserved formatting
- Professional EPUB - High-quality EPUB 3.0 output with customizable styling
- Multi-language Support - Process documents in multiple languages
- GPU Acceleration - NVIDIA CUDA and AMD ROCm support for faster processing
- Apple Silicon Support - Optimized performance on Apple Silicon devices
- Flexible API - Use as a CLI tool or import as a Python library
- Plugin Architecture - Extensible AI provider system
Quick Start
Installation
# Basic installation
pip install pdf2epub
# Full installation with all features (the quotes keep shells such as zsh from expanding the brackets)
pip install "pdf2epub[full]"
Command Line Usage
# Convert a PDF to EPUB
pdf2epub document.pdf
# Advanced options
pdf2epub book.pdf --start-page 10 --max-pages 50 --langs "English,German"
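When the CLI is driven from a script, the flags in the example above can be assembled programmatically. The following is a minimal sketch, not part of the package: the helper names `build_pdf2epub_command` and `run_pdf2epub` are hypothetical, and only the flags shown above (`--start-page`, `--max-pages`, `--langs`) are used.

```python
import subprocess
from typing import List, Optional, Sequence


def build_pdf2epub_command(
    pdf_path: str,
    start_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    langs: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble a pdf2epub CLI invocation as an argument list (hypothetical helper)."""
    cmd = ["pdf2epub", pdf_path]
    if start_page is not None:
        cmd += ["--start-page", str(start_page)]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if langs:
        # The CLI example passes languages as one comma-separated string.
        cmd += ["--langs", ",".join(langs)]
    return cmd


def run_pdf2epub(cmd: List[str]) -> int:
    """Run the command and return its exit code; requires pdf2epub on PATH."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    print(build_pdf2epub_command("book.pdf", 10, 50, ["English", "German"]))
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.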
Python API
import pdf2epub
# Simple conversion
pdf2epub.convert_pdf_to_markdown("document.pdf", "output/")
pdf2epub.convert_markdown_to_epub("output/", "final/")
# Advanced usage with AI enhancement
processor = pdf2epub.AIPostprocessor("output/")
processor.run_postprocessing("document.md", "anthropic")
- For Apple Silicon, install PyTorch with MPS support:
pip3 uninstall torch torchvision torchaudio
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
- Verify GPU support:
import torch
print(torch.__version__)  # PyTorch version
print(torch.cuda.is_available())  # Should return True for NVIDIA
print(torch.backends.mps.is_available())  # Should return True for Apple Silicon
print(torch.version.hip)  # Should print ROCm version for AMD
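The GPU verification checks listed in this section can be folded into a single helper that reports which backend is usable. This is a sketch under stated assumptions: `classify_backend` and `detect_backend` are hypothetical names, not part of pdf2epub, and the PyTorch attributes used (`torch.cuda.is_available()`, `torch.backends.mps.is_available()`, `torch.version.hip`) assume a reasonably recent PyTorch release.

```python
from typing import Optional


def classify_backend(
    cuda_available: bool, mps_available: bool, hip_version: Optional[str]
) -> str:
    """Map the results of the PyTorch checks to a backend label."""
    if cuda_available and hip_version:
        # ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API.
        return "rocm"
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_backend() -> str:
    """Run the checks against the installed PyTorch; report 'cpu' if it is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return classify_backend(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
        torch.version.hip,
    )


if __name__ == "__main__":
    print(detect_backend())
```

Keeping the decision logic in `classify_backend` separate from the PyTorch calls makes it testable on machines without a GPU.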
Installation Options
Basic Installation
pip install pdf2epub
Includes core functionality with minimal dependencies.
Full Installation
pip install "pdf2epub[full]"
Includes all features: PDF processing, AI postprocessing, and GPU acceleration.
Development Installation
pip install "pdf2epub[dev]"
Includes development tools: testing, linting, and formatting.
GPU Support
NVIDIA CUDA:
pip install "pdf2epub[full]"
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
AMD ROCm:
pip install "pdf2epub[full]"
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
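For setup scripts that support more than one GPU vendor, the two PyTorch install commands above can be selected by backend name. This is only an illustrative sketch: `torch_install_command` is a hypothetical helper, and the index URLs are copied from the commands above rather than looked up dynamically.

```python
from typing import List

# Index URLs copied from the install commands above.
TORCH_INDEX_URLS = {
    "cuda": "https://download.pytorch.org/whl/cu121",
    "rocm": "https://download.pytorch.org/whl/rocm6.2",
}


def torch_install_command(backend: str) -> List[str]:
    """Return the pip command for the given GPU backend (hypothetical helper)."""
    if backend not in TORCH_INDEX_URLS:
        raise ValueError(f"No PyTorch index URL known for backend {backend!r}")
    return [
        "pip", "install", "torch", "torchvision", "torchaudio",
        "--index-url", TORCH_INDEX_URLS[backend],
    ]


if __name__ == "__main__":
    print(" ".join(torch_install_command("cuda")))
```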
Documentation
- Quick Tutorial - Convert your first PDF in 5 minutes
- Installation Guide - Detailed setup instructions
- CLI Reference - Complete command-line documentation
- Python API - Library usage and examples
- Advanced Features - GPU acceleration, batch processing
- AI Integration - Enhance quality with AI postprocessing
- Plugin Development - Create custom AI providers
Use Cases
Academic Research
- Convert research papers to readable EPUB format
- Extract and preserve mathematical equations
- Maintain citation formatting and structure
Digital Publishing
- Transform print-ready PDFs into distribution-ready EPUBs
- Preserve complex layouts and formatting
- Optimize for e-reader compatibility
Document Archival
- Convert legacy documents to modern formats
- Batch process document collections
- Enhance readability with AI postprocessing
Accessibility
- Create screen-reader compatible versions
- Improve text structure and navigation
- Add semantic markup for better accessibility
Configuration
Environment Variables
# Required for AI postprocessing
export ANTHROPIC_API_KEY="your-anthropic-api-key"
# Optional: Control GPU usage
export CUDA_VISIBLE_DEVICES="0" # Use specific GPU
export CUDA_VISIBLE_DEVICES="" # Force CPU-only mode
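Before starting a long conversion, a script can check how these variables are set. The sketch below is an assumption-laden illustration, not package code: `describe_environment` is a hypothetical helper, and it only interprets the two variables named above.

```python
import os
from typing import List, Mapping


def describe_environment(env: Mapping[str, str]) -> List[str]:
    """Summarize how the variables above will affect a run (hypothetical helper)."""
    notes = []
    if not env.get("ANTHROPIC_API_KEY"):
        notes.append("ANTHROPIC_API_KEY is not set: AI postprocessing is unavailable")
    devices = env.get("CUDA_VISIBLE_DEVICES")
    if devices is None:
        notes.append("CUDA_VISIBLE_DEVICES is not set: all GPUs are visible")
    elif devices.strip() == "":
        notes.append("CUDA_VISIBLE_DEVICES is empty: CPU-only mode")
    else:
        notes.append(f"CUDA_VISIBLE_DEVICES restricts processing to GPU(s) {devices}")
    return notes


if __name__ == "__main__":
    for note in describe_environment(os.environ):
        print(note)
```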
API Configuration
import pdf2epub
# Configure default settings
pdf2epub.config.set_default_batch_multiplier(3)
pdf2epub.config.set_default_ai_provider("anthropic")
Testing
Run the test suite:
pytest # Run all tests
pytest --cov=pdf2epub # Run with coverage
pytest tests/test_pdf2md.py # Run specific test file
Current test coverage: 49%, with all 41 tests passing.
Plugin System
Create custom AI postprocessing providers:
from pdf2epub.postprocessing.ai import AIPostprocessor
class CustomAIProvider:
    @staticmethod
    def getjsonparams(system_prompt: str, request: str) -> str:
        # Implement your AI API integration; process_with_custom_ai is a placeholder for your own client code
        return process_with_custom_ai(system_prompt, request)
# Register and use your provider (work_dir and markdown_file are your own paths)
processor = AIPostprocessor(work_dir)
processor.register_provider("custom", CustomAIProvider)
processor.run_postprocessing(markdown_file, "custom")
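To try the provider interface without any network access, a stand-in provider can clean up text locally. This is a sketch only: `WhitespaceCleanupProvider` is a hypothetical class, it follows the static `getjsonparams(system_prompt, request) -> str` signature shown above, and what pdf2epub expects inside the returned string is an assumption not confirmed here.

```python
import re


class WhitespaceCleanupProvider:
    """Stand-in provider that tidies text locally instead of calling an AI API."""

    @staticmethod
    def getjsonparams(system_prompt: str, request: str) -> str:
        # system_prompt is accepted for interface compatibility but unused here.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in request.splitlines()]
        text = "\n".join(lines)
        # Collapse runs of three or more newlines into a single blank line.
        return re.sub(r"\n{3,}", "\n\n", text).strip()


if __name__ == "__main__":
    print(WhitespaceCleanupProvider.getjsonparams("", "Hello   world  \n\n\n\nNext\tline"))
```

Following the registration pattern above, such a class would be passed to `register_provider` under a name of your choice, for example `"offline"`.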
Performance
Benchmarks
| Document Type | Pages | Processing Time | Memory Usage |
|---|---|---|---|
| Research Paper | 20 | 45 seconds | 2.1 GB |
| Technical Book | 200 | 6 minutes | 4.8 GB |
| Magazine | 50 | 2 minutes | 1.9 GB |
Results measured on an NVIDIA RTX 3080 with 16 GB RAM.
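The benchmark rows are easier to compare as throughput. The short calculation below converts each row to pages per minute; the figures are taken from the table above, and the helper name `pages_per_minute` is only illustrative.

```python
BENCHMARKS = [
    # (document type, pages, processing time in seconds, memory in GB)
    ("Research Paper", 20, 45, 2.1),
    ("Technical Book", 200, 6 * 60, 4.8),
    ("Magazine", 50, 2 * 60, 1.9),
]


def pages_per_minute(pages: int, seconds: float) -> float:
    """Convert a page count and a duration in seconds to pages per minute."""
    return pages / (seconds / 60)


for name, pages, seconds, _memory_gb in BENCHMARKS:
    print(f"{name}: {pages_per_minute(pages, seconds):.1f} pages/min")
# Research Paper: 26.7 pages/min
# Technical Book: 33.3 pages/min
# Magazine: 25.0 pages/min
```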
Optimization Tips
- Use GPU acceleration for 3-5x speed improvement
- Adjust batch multiplier based on available memory
- Process in chunks for very large documents
- Enable AI postprocessing for best quality (slower)
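The advice to process very large documents in chunks can be sketched as a generator of page ranges, each of which maps onto a start page and a page limit such as the CLI's `--start-page` and `--max-pages` options. This is a minimal sketch: `page_chunks` is a hypothetical helper, and whether pdf2epub numbers pages from 0 or 1 is an assumption, so the first page is a parameter.

```python
from typing import Iterator, Tuple


def page_chunks(
    total_pages: int, chunk_size: int, first_page: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield (start_page, max_pages) pairs that together cover total_pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    start = first_page
    end = first_page + total_pages
    while start < end:
        # The final chunk may be shorter than chunk_size.
        yield start, min(chunk_size, end - start)
        start += chunk_size


if __name__ == "__main__":
    for start_page, max_pages in page_chunks(120, 50):
        print(f"--start-page {start_page} --max-pages {max_pages}")
```

Smaller chunks lower peak memory at the cost of more invocations; the right size depends on the available RAM and GPU memory.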
Comparison
| Feature | PDF2EPUB | calibre | pandoc |
|---|---|---|---|
| AI Enhancement | ✅ | ❌ | ❌ |
| Layout Detection | ✅ | ⚠️ | ⚠️ |
| GPU Acceleration | ✅ | ❌ | ❌ |
| Python API | ✅ | ⚠️ | ⚠️ |
| Plugin System | โ | โ | โ |
| CLI Interface | ✅ | ✅ | ✅ |
Deployment
Docker
FROM python:3.11-slim
RUN pip install pdf2epub[full]
WORKDIR /workspace
ENTRYPOINT ["pdf2epub"]
GitHub Actions
- name: Convert PDFs
  run: |
    pip install "pdf2epub[full]"
    pdf2epub documents/*.pdf
Production Deployment
import time
import pdf2epub
def production_converter(pdf_path: str) -> dict:
    """Production-ready PDF conversion with error handling."""
    start_time = time.time()  # Start the timer before any work is done
    try:
        output_dir = pdf2epub.convert_pdf_to_markdown(
            pdf_path,
            batch_multiplier=2,  # Conservative memory usage
            max_pages=1000,  # Prevent runaway processing
        )
        epub_path = pdf2epub.convert_to_epub(output_dir)
        return {
            "status": "success",
            "markdown_path": output_dir,
            "epub_path": epub_path,
            "processing_time": time.time() - start_time,
        }
    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
        }
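A converter that returns a status dictionary, like the one above, lends itself to batch processing of a document collection. The sketch below is illustrative: `convert_batch` is a hypothetical helper, and it takes the converter as a parameter so it can be exercised without pdf2epub installed.

```python
from typing import Callable, Dict, Iterable, List


def convert_batch(
    pdf_paths: Iterable[str], convert: Callable[[str], dict]
) -> Dict[str, List[str]]:
    """Run convert on each path and group the paths by outcome."""
    summary: Dict[str, List[str]] = {"succeeded": [], "failed": []}
    for pdf_path in pdf_paths:
        result = convert(pdf_path)
        # Anything other than an explicit "success" status counts as a failure.
        key = "succeeded" if result.get("status") == "success" else "failed"
        summary[key].append(pdf_path)
    return summary


if __name__ == "__main__":
    def fake_convert(path: str) -> dict:
        # Stand-in converter for demonstration; fails on names containing "bad".
        if "bad" in path:
            return {"status": "error", "error": "simulated failure"}
        return {"status": "success"}

    print(convert_batch(["a.pdf", "bad.pdf"], fake_convert))
```

In practice the second argument would be a function such as `production_converter`, and the paths could come from `pathlib.Path("documents").glob("*.pdf")`.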
Contributing
We welcome contributions! Please see our Contributing Guide for details.
Quick Contributing Steps
- Fork the repository
- Create a feature branch: git checkout -b feature-name
- Make your changes and add tests
- Test your changes: pytest
- Format code: black .
- Submit a pull request
See CONTRIBUTING.md for detailed guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
This project builds upon excellent open-source libraries:
- marker-pdf - PDF processing engine
- mark2epub - Markdown to EPUB conversion
- PyTorch - GPU acceleration framework
- Transformers - AI/ML text processing
- Anthropic - AI API for text enhancement
Project Status
- Version: 0.1.0 (Beta)
- Status: Active development
- Python: 3.9+ supported
- Testing: 49% coverage, 100% pass rate
- CI/CD: GitHub Actions
- Documentation: Comprehensive
Support
- GitHub Issues: Report bugs or request features
- GitHub Discussions: Ask questions and get help
- Documentation: Browse the docs
Transform your PDFs into beautiful, accessible EPUBs with AI-powered enhancement!