Convert PDF files to EPUB format via Markdown with intelligent layout detection

These details have not been verified by PyPI

Project description

PDF2EPUB 📚

A powerful Python package for converting PDF files to EPUB format via Markdown with intelligent layout detection, AI-powered postprocessing, and seamless CLI/API integration.

✨ Features

📖 Smart Layout Detection - Handles books, academic papers, and complex documents
🔍 Advanced PDF Processing - OCR, table detection, and image extraction
🤖 AI Postprocessing - Enhance quality with Anthropic Claude integration
📝 Clean Markdown Output - Structured, readable markdown with preserved formatting
📱 Professional EPUB - High-quality EPUB 3.0 output with customizable styling
🌍 Multi-language Support - Process documents in multiple languages
🚀 GPU Acceleration - NVIDIA CUDA and AMD ROCm support for faster processing
🍎 Apple Silicon Support - Optimized performance on Apple Silicon devices
🛠️ Flexible API - Use as CLI tool or import as Python library
🔌 Plugin Architecture - Extensible AI provider system

🚀 Quick Start

Installation

# Basic installation
pip install pdf2epub

# Full installation with all features
pip install pdf2epub[full]

Command Line Usage

# Convert a PDF to EPUB
pdf2epub document.pdf

# Advanced options
pdf2epub book.pdf --start-page 10 --max-pages 50 --langs "English,German"
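
When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below is not part of pdf2epub: `build_pdf2epub_command` and `run_pdf2epub` are hypothetical helper names, and only the flags shown above (`--start-page`, `--max-pages`, `--langs`) are used.

```python
import shlex
import subprocess
from typing import List, Optional, Sequence


def build_pdf2epub_command(
    pdf_path: str,
    start_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    langs: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build an argument list using only the flags shown in this README."""
    cmd = ["pdf2epub", pdf_path]
    if start_page is not None:
        cmd += ["--start-page", str(start_page)]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if langs:
        # The README passes languages as one comma-separated value.
        cmd += ["--langs", ",".join(langs)]
    return cmd


def run_pdf2epub(cmd: List[str]) -> int:
    """Run the command and return its exit code (needs pdf2epub on PATH)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    command = build_pdf2epub_command("book.pdf", 10, 50, ["English", "German"])
    print(shlex.join(command))
```

Passing a list to `subprocess.run` avoids shell quoting issues with file names that contain spaces.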

Python API

For GPU acceleration on Apple Silicon, first reinstall PyTorch with MPS support:

pip3 uninstall torch torchvision torchaudio
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu

Verify GPU support:

import torch
print(torch.__version__)  # PyTorch version
print(torch.cuda.is_available())  # Should return True for NVIDIA
print(torch.backends.mps.is_available())  # Should return True for Apple Silicon
print(torch.version.hip)  # Should print ROCm version for AMD
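
The same checks can be folded into a small helper that picks a device name. This is a standalone sketch, not pdf2epub's own device selection (which is not documented here); `pick_device` is a hypothetical name, and it falls back to CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate

    # ROCm builds of PyTorch also report their GPUs through torch.cuda.
    if torch.cuda.is_available():
        return "cuda"

    # torch.backends.mps is missing on very old PyTorch versions.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


print(pick_device())
```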

import pdf2epub

# Simple conversion
pdf2epub.convert_pdf_to_markdown("document.pdf", "output/")
pdf2epub.convert_markdown_to_epub("output/", "final/")

# Advanced usage with AI enhancement
processor = pdf2epub.AIPostprocessor("output/")
processor.run_postprocessing("document.md", "anthropic")

📦 Installation Options

Basic Installation

pip install pdf2epub

Includes core functionality with minimal dependencies.

Full Installation

pip install pdf2epub[full]

Includes all features: PDF processing, AI postprocessing, and GPU acceleration.

Development Installation

pip install pdf2epub[dev]

Includes development tools: testing, linting, and formatting.

GPU Support

NVIDIA CUDA:

pip install pdf2epub[full]
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

AMD ROCm:

pip install pdf2epub[full]
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2

📚 Documentation

Quick Tutorial - Convert your first PDF in 5 minutes
Installation Guide - Detailed setup instructions
CLI Reference - Complete command-line documentation
Python API - Library usage and examples
Advanced Features - GPU acceleration, batch processing
AI Integration - Enhance quality with AI postprocessing
Plugin Development - Create custom AI providers

🎯 Use Cases

Academic Research

Convert research papers to readable EPUB format
Extract and preserve mathematical equations
Maintain citation formatting and structure

Digital Publishing

Transform print-ready PDFs into distribution-ready EPUBs
Preserve complex layouts and formatting
Optimize for e-reader compatibility

Document Archival

Convert legacy documents to modern formats
Batch process document collections
Enhance readability with AI postprocessing

Accessibility

Create screen-reader compatible versions
Improve text structure and navigation
Add semantic markup for better accessibility

🔧 Configuration

Environment Variables

# Required for AI postprocessing
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# Optional: Control GPU usage
export CUDA_VISIBLE_DEVICES="0"  # Use specific GPU
export CUDA_VISIBLE_DEVICES=""   # Force CPU-only mode
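
Before a long conversion it can be useful to confirm what these variables will do. The following is a minimal sketch that only reads the two variables named above; `describe_environment` is a hypothetical helper and is not part of pdf2epub.

```python
import os
from typing import Dict, Mapping


def describe_environment(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Summarize the effect of the environment variables described above."""
    if env.get("ANTHROPIC_API_KEY"):
        ai = "enabled"
    else:
        ai = "disabled (ANTHROPIC_API_KEY not set)"

    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        gpu = "all GPUs visible"   # variable unset: CUDA sees every device
    elif visible.strip() == "":
        gpu = "CPU only"           # empty string hides every CUDA device
    else:
        gpu = f"GPU(s) {visible}"  # e.g. "0" selects the first GPU

    return {"ai_postprocessing": ai, "gpu": gpu}


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```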

API Configuration

import pdf2epub

# Configure default settings
pdf2epub.config.set_default_batch_multiplier(3)
pdf2epub.config.set_default_ai_provider("anthropic")

🧪 Testing

Run the test suite:

pytest                       # Run all tests
pytest --cov=pdf2epub        # Run with coverage
pytest tests/test_pdf2md.py  # Run specific test file

Current test coverage: 49% with 100% pass rate (41/41 tests)

🔌 Plugin System

Create custom AI postprocessing providers:

from pdf2epub.postprocessing.ai import AIPostprocessor

class CustomAIProvider:
    @staticmethod
    def getjsonparams(system_prompt: str, request: str) -> str:
        # Implement your AI API integration here.
        # process_with_custom_ai is a placeholder for your own function.
        return process_with_custom_ai(system_prompt, request)

# Register and use your provider.
# work_dir and markdown_file are placeholders for your own paths.
processor = AIPostprocessor(work_dir)
processor.register_provider("custom", CustomAIProvider)
processor.run_postprocessing(markdown_file, "custom")
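
For trying out the registration flow without network access, a provider can simply hand its inputs back. The sketch below assumes only the `getjsonparams(system_prompt, request) -> str` shape shown above. `EchoProvider` is a hypothetical class, and the JSON layout it returns is an assumption, since the response format pdf2epub expects is not specified here.

```python
import json


class EchoProvider:
    """Hypothetical offline provider that makes no API call."""

    @staticmethod
    def getjsonparams(system_prompt: str, request: str) -> str:
        # Wrap the inputs in JSON so a caller can inspect what was sent.
        # The key names are illustrative, not a documented pdf2epub format.
        return json.dumps({"system_prompt": system_prompt, "text": request})


if __name__ == "__main__":
    print(EchoProvider.getjsonparams("Fix OCR errors.", "Teh quick brown fox"))
```

A stub like this is mainly useful in tests, where a real provider would add cost and nondeterminism.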

📊 Performance

Benchmarks

Document Type	Pages	Processing Time	Memory Usage
Research Paper	20	45 seconds	2.1 GB
Technical Book	200	6 minutes	4.8 GB
Magazine	50	2 minutes	1.9 GB

Results on NVIDIA RTX 3080 with 16GB RAM
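
As a rough reading of the table, the per-page cost follows directly from the published figures. The snippet uses only the numbers above; `seconds_per_page` is a helper name chosen for illustration.

```python
# (pages, processing time in seconds), copied from the benchmark table
benchmarks = {
    "Research Paper": (20, 45),
    "Technical Book": (200, 6 * 60),
    "Magazine": (50, 2 * 60),
}


def seconds_per_page(pages: int, seconds: float) -> float:
    """Average processing time per page."""
    return seconds / pages


for name, (pages, seconds) in benchmarks.items():
    print(f"{name}: {seconds_per_page(pages, seconds):.2f} s/page")
```

This works out to roughly 1.8 to 2.4 seconds per page on the stated hardware.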

Optimization Tips

Use GPU acceleration for 3-5x speed improvement
Adjust batch multiplier based on available memory
Process in chunks for very large documents
Enable AI postprocessing for best quality (slower)
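
Chunked processing can be sketched as a generator of page ranges that map onto the `--start-page` and `--max-pages` options. This is an illustration rather than a pdf2epub feature: `page_chunks` is a hypothetical helper, and whether `--start-page` counts from 0 or 1 is an assumption you should check, which is why the first page is a parameter.

```python
from typing import Iterator, Tuple


def page_chunks(
    total_pages: int, chunk_size: int, first_page: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield (start_page, max_pages) pairs that cover total_pages pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    start = first_page
    end = first_page + total_pages
    while start < end:
        # The last chunk may be shorter than chunk_size.
        yield start, min(chunk_size, end - start)
        start += chunk_size


if __name__ == "__main__":
    for start, count in page_chunks(total_pages=200, chunk_size=50):
        print(f"pdf2epub book.pdf --start-page {start} --max-pages {count}")
```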

🆚 Comparison

Feature	PDF2EPUB	calibre	pandoc
AI Enhancement	✅	❌	❌
Layout Detection	✅	⚠️	⚠️
GPU Acceleration	✅	❌	❌
Python API	✅	⚠️	⚠️
Plugin System	✅	✅	❌
CLI Interface	✅	✅	✅

🚢 Deployment

Docker

FROM python:3.11-slim

RUN pip install pdf2epub[full]

WORKDIR /workspace
ENTRYPOINT ["pdf2epub"]

GitHub Actions

- name: Convert PDFs
  run: |
    pip install pdf2epub[full]
    pdf2epub documents/*.pdf
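
If you prefer one invocation per file, for example to collect an exit code for each document, a small driver script can loop over the directory. This is a sketch under stated assumptions: `convert_directory` is a hypothetical helper, and it relies only on the `pdf2epub <file>` form shown in Quick Start.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, List


def convert_directory(
    directory: str,
    run: Callable[[List[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
) -> Dict[str, int]:
    """Invoke `pdf2epub <file>` once per PDF and collect exit codes by file name."""
    results: Dict[str, int] = {}
    for pdf in sorted(Path(directory).glob("*.pdf")):
        results[pdf.name] = run(["pdf2epub", str(pdf)])
    return results


if __name__ == "__main__":
    outcome = convert_directory("documents")
    failed = [name for name, code in outcome.items() if code != 0]
    print(f"{len(outcome) - len(failed)} converted, {len(failed)} failed")
```

The `run` parameter exists so the loop can be tested without pdf2epub installed.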

Production Deployment

import time

import pdf2epub

def production_converter(pdf_path: str) -> dict:
    """Production-ready PDF conversion with error handling."""
    start_time = time.time()  # Used to report total processing time
    try:
        output_dir = pdf2epub.convert_pdf_to_markdown(
            pdf_path,
            batch_multiplier=2,  # Conservative memory usage
            max_pages=1000       # Prevent runaway processing
        )

        epub_path = pdf2epub.convert_to_epub(output_dir)

        return {
            "status": "success",
            "markdown_path": output_dir,
            "epub_path": epub_path,
            "processing_time": time.time() - start_time
        }

    except Exception as e:
        # Report the failure to the caller instead of raising
        return {
            "status": "error",
            "error": str(e)
        }

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Contributing Steps

Fork the repository
Create a feature branch: git checkout -b feature-name
Make your changes and add tests
Test your changes: pytest
Format code: black .
Submit a pull request

See CONTRIBUTING.md for detailed guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

This project builds upon excellent open-source libraries:

marker-pdf - PDF processing engine
mark2epub - Markdown to EPUB conversion
PyTorch - GPU acceleration framework
Transformers - AI/ML text processing
Anthropic - AI API for text enhancement

📈 Project Status

Version: 0.1.0 (Beta)
Status: Active development
Python: 3.9+ supported
Testing: 49% coverage, 100% pass rate
CI/CD: GitHub Actions
Documentation: Comprehensive

📞 Support

GitHub Issues: Report bugs or request features
GitHub Discussions: Ask questions and get help
Documentation: Browse the docs

Transform your PDFs into beautiful, accessible EPUBs with AI-powered enhancement! 🚀📚

Release history

0.1.2 (Aug 9, 2025)
0.1.1 (Aug 9, 2025)
0.1.0 (Aug 8, 2025), this version

Download files

Source Distribution

pdf2epub-0.1.0.tar.gz (54.0 kB)

Uploaded Aug 8, 2025 Source

Built Distribution

pdf2epub-0.1.0-py3-none-any.whl (38.5 kB)

Uploaded Aug 8, 2025 Python 3

File details

Details for the file pdf2epub-0.1.0.tar.gz.

File metadata

Download URL: pdf2epub-0.1.0.tar.gz
Upload date: Aug 8, 2025
Size: 54.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pdf2epub-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6052eb35c9a2b18e6cff9d5e15f665669b825bb82c414764c0b36afa3101793c`
MD5	`d55d6347bf28f0187e6eab1051675726`
BLAKE2b-256	`e96e602b4b2722206d677ed0204ee0723551be51b2f21a6bedea930a51b9ecbd`
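
A downloaded file can be checked against the SHA256 digest listed above using Python's standard `hashlib`. This is a minimal sketch; `sha256_of` and `matches_published_hash` are hypothetical helper names, and the file is assumed to be in the current directory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with the value published on the index page."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    expected = "6052eb35c9a2b18e6cff9d5e15f665669b825bb82c414764c0b36afa3101793c"
    print(matches_published_hash("pdf2epub-0.1.0.tar.gz", expected))
```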

Provenance

The following attestation bundles were made for pdf2epub-0.1.0.tar.gz:

Publisher: ci.yml on porfanid/pdf2epub

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf2epub-0.1.0.tar.gz
- Subject digest: 6052eb35c9a2b18e6cff9d5e15f665669b825bb82c414764c0b36afa3101793c
- Sigstore transparency entry: 368119675
- Sigstore integration time: Aug 8, 2025
Source repository:
- Permalink: porfanid/pdf2epub@9a3176168c957daca1c74d8263b9b1032c716d46
- Branch / Tag: refs/heads/main
- Owner: https://github.com/porfanid
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@9a3176168c957daca1c74d8263b9b1032c716d46
- Trigger Event: push

File details

Details for the file pdf2epub-0.1.0-py3-none-any.whl.

File metadata

Download URL: pdf2epub-0.1.0-py3-none-any.whl
Upload date: Aug 8, 2025
Size: 38.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pdf2epub-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`23fa1561e51d6aa6e6f929d45e4a43d4693113d2c65f627ba8f83a47970a6144`
MD5	`f11b05ac5724e55d77f5b8cbe505d7cf`
BLAKE2b-256	`1cef0f8bb03fd88e7204bfa0151dc97650e471427e1700a34923566c065bdb5e`

Provenance

The following attestation bundles were made for pdf2epub-0.1.0-py3-none-any.whl:

Publisher: ci.yml on porfanid/pdf2epub

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: pdf2epub-0.1.0-py3-none-any.whl
- Subject digest: 23fa1561e51d6aa6e6f929d45e4a43d4693113d2c65f627ba8f83a47970a6144
- Sigstore transparency entry: 368119702
- Sigstore integration time: Aug 8, 2025
Source repository:
- Permalink: porfanid/pdf2epub@9a3176168c957daca1c74d8263b9b1032c716d46
- Branch / Tag: refs/heads/main
- Owner: https://github.com/porfanid
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: ci.yml@9a3176168c957daca1c74d8263b9b1032c716d46
- Trigger Event: push
