Convert PDF files to EPUB format via Markdown with intelligent layout detection

PDF2EPUB 📚

PyPI version | CI/CD Pipeline | Python 3.9+ | License: MIT

A powerful Python package for converting PDF files to EPUB format via Markdown with intelligent layout detection, AI-powered postprocessing, and seamless CLI/API integration.

✨ Features

  • 📖 Smart Layout Detection - Handles books, academic papers, and complex documents
  • 🔍 Advanced PDF Processing - OCR, table detection, and image extraction
  • 🤖 AI Postprocessing - Enhance quality with Anthropic Claude integration
  • 📝 Clean Markdown Output - Structured, readable markdown with preserved formatting
  • 📱 Professional EPUB - High-quality EPUB 3.0 output with customizable styling
  • 🌍 Multi-language Support - Process documents in multiple languages
  • 🚀 GPU Acceleration - NVIDIA CUDA and AMD ROCm support for faster processing
  • 🍎 Apple Silicon Support - Optimized performance on Apple Silicon devices
  • 🛠️ Flexible API - Use as CLI tool or import as Python library
  • 🔌 Plugin Architecture - Extensible AI provider system

🚀 Quick Start

Installation

# Basic installation
pip install pdf2epub

# Full installation with all features
pip install "pdf2epub[full]"  # quotes keep shells such as zsh from expanding the brackets

Command Line Usage

# Convert a PDF to EPUB
pdf2epub document.pdf

# Advanced options
pdf2epub book.pdf --start-page 10 --max-pages 50 --langs "English,German"
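
When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below is a hypothetical helper (`build_pdf2epub_command` is not part of the package); it uses only the flags shown in the example above and assumes `--langs` takes one comma-separated string, as that example suggests.

```python
from typing import List, Optional, Sequence


def build_pdf2epub_command(
    pdf_path: str,
    start_page: Optional[int] = None,
    max_pages: Optional[int] = None,
    langs: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build an argument list for the pdf2epub CLI (hypothetical helper)."""
    cmd = ["pdf2epub", pdf_path]
    if start_page is not None:
        cmd += ["--start-page", str(start_page)]
    if max_pages is not None:
        cmd += ["--max-pages", str(max_pages)]
    if langs:
        # Assumption: languages are passed as a single comma-separated value.
        cmd += ["--langs", ",".join(langs)]
    return cmd


# Only prints the command; nothing is executed here.
print(build_pdf2epub_command("book.pdf", 10, 50, ["English", "German"]))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once `pdf2epub` is installed and on the PATH.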

Python API

Optional GPU setup:

  1. On Apple Silicon, install PyTorch with MPS support:
pip3 uninstall torch torchvision torchaudio
pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
  2. Verify GPU support:
import torch
print(torch.__version__)  # PyTorch version
print(torch.cuda.is_available())  # Should return True for NVIDIA
print(torch.backends.mps.is_available())  # Should return True for Apple Silicon
print(torch.version.hip)  # Should print ROCm version for AMD (None on other builds)

Then convert documents from Python:

import pdf2epub

# Simple conversion
pdf2epub.convert_pdf_to_markdown("document.pdf", "output/")
pdf2epub.convert_markdown_to_epub("output/", "final/")

# Advanced usage with AI enhancement
processor = pdf2epub.AIPostprocessor("output/")
processor.run_postprocessing("document.md", "anthropic")
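
For a folder of PDFs, the two conversion steps can be wrapped in a small loop. This is a minimal sketch, not part of the package: `convert_folder` is a hypothetical helper, and the converters are passed in as arguments so the loop can be tried without pdf2epub installed. It assumes both functions accept an input path and an output directory positionally, as in the simple conversion example above.

```python
from pathlib import Path
from typing import Callable, Dict


def convert_folder(
    pdf_dir: str,
    out_root: str,
    to_markdown: Callable[[str, str], object],
    to_epub: Callable[[str, str], object],
) -> Dict[str, str]:
    """Convert every PDF in pdf_dir, using one output folder per document."""
    results: Dict[str, str] = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        md_dir = Path(out_root) / pdf.stem / "markdown"
        epub_dir = Path(out_root) / pdf.stem / "epub"
        md_dir.mkdir(parents=True, exist_ok=True)
        epub_dir.mkdir(parents=True, exist_ok=True)
        try:
            to_markdown(str(pdf), str(md_dir))
            to_epub(str(md_dir), str(epub_dir))
            results[pdf.name] = "ok"
        except Exception as exc:
            # One failing document should not stop the rest of the batch.
            results[pdf.name] = f"failed: {exc}"
    return results
```

With the package installed, a call might look like `convert_folder("pdfs/", "out/", pdf2epub.convert_pdf_to_markdown, pdf2epub.convert_markdown_to_epub)`, under the signature assumption stated above.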

📦 Installation Options

Basic Installation

pip install pdf2epub

Includes core functionality with minimal dependencies.

Full Installation

pip install "pdf2epub[full]"

Includes all features: PDF processing, AI postprocessing, and GPU acceleration.

Development Installation

pip install "pdf2epub[dev]"

Includes development tools: testing, linting, and formatting.

GPU Support

NVIDIA CUDA:

pip install "pdf2epub[full]"
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

AMD ROCm:

pip install "pdf2epub[full]"
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2
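
After installing a GPU build, a script can check at runtime which backend PyTorch actually sees. The following is a minimal sketch; `pick_device` is a hypothetical helper and not something pdf2epub exposes. It relies only on `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and falls back to CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so only CPU is possible
    # ROCm builds of PyTorch also report their GPUs through torch.cuda.
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```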

📚 Documentation

🎯 Use Cases

Academic Research

  • Convert research papers to readable EPUB format
  • Extract and preserve mathematical equations
  • Maintain citation formatting and structure

Digital Publishing

  • Transform print-ready PDFs into distribution-ready EPUBs
  • Preserve complex layouts and formatting
  • Optimize for e-reader compatibility

Document Archival

  • Convert legacy documents to modern formats
  • Batch process document collections
  • Enhance readability with AI postprocessing

Accessibility

  • Create screen-reader compatible versions
  • Improve text structure and navigation
  • Add semantic markup for better accessibility

🔧 Configuration

Environment Variables

# Required for AI postprocessing
export ANTHROPIC_API_KEY="your-anthropic-api-key"

# Optional: Control GPU usage
export CUDA_VISIBLE_DEVICES="0"  # Use specific GPU
export CUDA_VISIBLE_DEVICES=""   # Force CPU-only mode
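
The same settings can be checked or applied from Python before a conversion starts. This is a sketch under stated assumptions: `missing_settings` and `force_cpu_only` are hypothetical helper names, and the only variables involved are the two shown above.

```python
import os
from typing import List, Mapping, Optional


def missing_settings(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    required = ["ANTHROPIC_API_KEY"]  # needed only for AI postprocessing
    return [name for name in required if not env.get(name, "").strip()]


def force_cpu_only() -> None:
    """Hide all CUDA devices from this process.

    This must run before CUDA is initialized; calling it before the first
    `import torch` is the safe choice.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


if missing_settings():
    print("AI postprocessing is unavailable: set ANTHROPIC_API_KEY first.")
```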

API Configuration

import pdf2epub

# Configure default settings
pdf2epub.config.set_default_batch_multiplier(3)
pdf2epub.config.set_default_ai_provider("anthropic")

🧪 Testing

Run the test suite:

pytest                    # Run all tests
pytest --cov=pdf2epub   # Run with coverage
pytest tests/test_pdf2md.py  # Run specific test file

Current test coverage: 49% with 100% pass rate (41/41 tests)

🔌 Plugin System

Create custom AI postprocessing providers:

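from pdf2epub.postprocessing.ai import AIPostprocessor

def process_with_custom_ai(system_prompt: str, request: str) -> str:
    # Placeholder: replace the body with a call to your own AI service
    raise NotImplementedError("connect your AI API here")

class CustomAIProvider:
    @staticmethod
    def getjsonparams(system_prompt: str, request: str) -> str:
        # Implement your AI API integration
        return process_with_custom_ai(system_prompt, request)

# Example values; adjust to your own paths
work_dir = "output/"
markdown_file = "document.md"

# Register and use your provider
processor = AIPostprocessor(work_dir)
processor.register_provider("custom", CustomAIProvider)
processor.run_postprocessing(markdown_file, "custom")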
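
To exercise the registration wiring without calling a real AI service, a dry-run provider with the same method signature can be useful. The sketch below is hypothetical: `EchoProvider` is an illustrative name, and the JSON shape it returns is an assumption, since the format that the postprocessor expects back is not documented here.

```python
import json


class EchoProvider:
    """Dry-run provider that makes no network calls (illustrative only)."""

    @staticmethod
    def getjsonparams(system_prompt: str, request: str) -> str:
        # Assumption: the caller accepts a JSON string. The real expected
        # schema may differ; check the package documentation before relying on it.
        return json.dumps({
            "system_prompt_chars": len(system_prompt),
            "text": request,
        })


print(EchoProvider.getjsonparams("Fix OCR errors.", "Teh quick brown fox"))
```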

📊 Performance

Benchmarks

| Document Type | Pages | Processing Time | Memory Usage |
| --- | --- | --- | --- |
| Research Paper | 20 | 45 seconds | 2.1 GB |
| Technical Book | 200 | 6 minutes | 4.8 GB |
| Magazine | 50 | 2 minutes | 1.9 GB |

Results on NVIDIA RTX 3080 with 16GB RAM
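
The benchmark figures are easier to compare as throughput. The short calculation below converts the page counts and times quoted above into pages per minute; it adds no new measurements.

```python
# (document type, pages, processing time in seconds), taken from the benchmarks above
benchmarks = [
    ("Research Paper", 20, 45),
    ("Technical Book", 200, 6 * 60),
    ("Magazine", 50, 2 * 60),
]


def pages_per_minute(pages: int, seconds: float) -> float:
    """Convert a page count and a duration in seconds to pages per minute."""
    return pages / seconds * 60


for name, pages, seconds in benchmarks:
    print(f"{name}: {pages_per_minute(pages, seconds):.1f} pages/min")
# Research Paper: 26.7 pages/min
# Technical Book: 33.3 pages/min
# Magazine: 25.0 pages/min
```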

Optimization Tips

  • Use GPU acceleration for 3-5x speed improvement
  • Adjust batch multiplier based on available memory
  • Process in chunks for very large documents
  • Enable AI postprocessing for best quality (slower)
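
Chunked processing from the tips above can be planned with a small generator that yields a start page and a page count for each run. This is a sketch under stated assumptions: `page_chunks` is a hypothetical helper, and whether the tool's start page is 0-based or 1-based is not stated here, so the first page number is left as a parameter.

```python
from typing import Iterator, Tuple


def page_chunks(
    total_pages: int, chunk_size: int, first_page: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield (start_page, page_count) pairs that cover total_pages pages."""
    if total_pages < 0 or chunk_size <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_size must be > 0")
    start = first_page
    end = first_page + total_pages
    while start < end:
        count = min(chunk_size, end - start)  # the last chunk may be shorter
        yield start, count
        start += count


print(list(page_chunks(200, 80)))  # [(0, 80), (80, 80), (160, 40)]
```

Each pair maps onto the `--start-page` and `--max-pages` options from the command line example, assuming `--max-pages` counts pages from the start page.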

🆚 Comparison

| Feature | PDF2EPUB | calibre | pandoc |
| --- | --- | --- | --- |
| AI Enhancement | ✅ | ❌ | ❌ |
| Layout Detection | ✅ | ⚠️ | ⚠️ |
| GPU Acceleration | ✅ | ❌ | ❌ |
| Python API | ✅ | ⚠️ | ⚠️ |
| Plugin System | ✅ | ✅ | ❌ |
| CLI Interface | ✅ | ✅ | ✅ |

🚢 Deployment

Docker

FROM python:3.11-slim

RUN pip install "pdf2epub[full]"

WORKDIR /workspace
ENTRYPOINT ["pdf2epub"]

GitHub Actions

- name: Convert PDFs
  run: |
    pip install "pdf2epub[full]"
    pdf2epub documents/*.pdf

Production Deployment

import time

import pdf2epub


def production_converter(pdf_path: str) -> dict:
    """Production-ready PDF conversion with error handling."""
    start_time = time.time()  # Record the start so elapsed time can be reported
    try:
        output_dir = pdf2epub.convert_pdf_to_markdown(
            pdf_path,
            batch_multiplier=2,  # Conservative memory usage
            max_pages=1000,  # Prevent runaway processing
        )

        epub_path = pdf2epub.convert_to_epub(output_dir)

        return {
            "status": "success",
            "markdown_path": output_dir,
            "epub_path": epub_path,
            "processing_time": time.time() - start_time,
        }

    except Exception as e:
        return {
            "status": "error",
            "error": str(e),
        }
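
When a converter like the one above runs over many files, the returned dictionaries can be rolled up into a short report. This is a minimal sketch; `summarize_results` is a hypothetical helper that relies only on the `status` and `error` keys used above.

```python
from collections import Counter
from typing import Dict, Iterable, List, Union


def summarize_results(results: Iterable[dict]) -> Dict[str, Union[int, List[str]]]:
    """Count successes and failures and collect the error messages."""
    items = list(results)  # materialize once so generators are supported
    counts = Counter(item.get("status", "unknown") for item in items)
    errors = [
        item["error"]
        for item in items
        if item.get("status") == "error" and "error" in item
    ]
    return {
        "success": counts.get("success", 0),
        "error": counts.get("error", 0),
        "errors": errors,
    }


print(summarize_results([
    {"status": "success", "epub_path": "a.epub"},
    {"status": "error", "error": "file not found"},
]))
```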

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Quick Contributing Steps

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make your changes and add tests
  4. Test your changes: pytest
  5. Format code: black .
  6. Submit a pull request

See CONTRIBUTING.md for detailed guidelines.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

This project builds upon excellent open-source libraries:

📈 Project Status

  • Version: 0.1.0 (Beta)
  • Status: Active development
  • Python: 3.9+ supported
  • Testing: 49% coverage, 100% pass rate
  • CI/CD: GitHub Actions
  • Documentation: Comprehensive

🔗 Links

📞 Support


Transform your PDFs into beautiful, accessible EPUBs with AI-powered enhancement! 🚀📚

Download files

Source Distribution

pdf2epub-0.1.0.tar.gz (54.0 kB view details)

Uploaded Source

Built Distribution

pdf2epub-0.1.0-py3-none-any.whl (38.5 kB view details)

Uploaded Python 3

File details

Details for the file pdf2epub-0.1.0.tar.gz.

File metadata

  • Download URL: pdf2epub-0.1.0.tar.gz
  • Upload date:
  • Size: 54.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pdf2epub-0.1.0.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 6052eb35c9a2b18e6cff9d5e15f665669b825bb82c414764c0b36afa3101793c |
| MD5 | d55d6347bf28f0187e6eab1051675726 |
| BLAKE2b-256 | e96e602b4b2722206d677ed0204ee0723551be51b2f21a6bedea930a51b9ecbd |
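
A downloaded archive can be checked against the published SHA256 digest with the standard library. The sketch below uses hypothetical helper names (`sha256_of`, `verify_download`) and reads the file in blocks so that large archives do not need to fit in memory.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with the published value."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, `verify_download("pdf2epub-0.1.0.tar.gz", "6052eb35c9a2b18e6cff9d5e15f665669b825bb82c414764c0b36afa3101793c")` should return True for an intact download.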

Provenance

The following attestation bundles were made for pdf2epub-0.1.0.tar.gz:

Publisher: ci.yml on porfanid/pdf2epub

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pdf2epub-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pdf2epub-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 38.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for pdf2epub-0.1.0-py3-none-any.whl

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | 23fa1561e51d6aa6e6f929d45e4a43d4693113d2c65f627ba8f83a47970a6144 |
| MD5 | f11b05ac5724e55d77f5b8cbe505d7cf |
| BLAKE2b-256 | 1cef0f8bb03fd88e7204bfa0151dc97650e471427e1700a34923566c065bdb5e |

Provenance

The following attestation bundles were made for pdf2epub-0.1.0-py3-none-any.whl:

Publisher: ci.yml on porfanid/pdf2epub

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
