paper2epub

A powerful academic PDF to EPUB converter with AI-powered layout detection and LaTeX math support.

Features

  • Academic-First Design: Optimized for scientific papers, research documents, and technical publications
  • LaTeX Math Support: Preserves mathematical equations using Nougat's neural OCR
  • Complex Layout Handling: AI-powered detection of multi-column layouts, tables, and figures
  • GPU Acceleration: Optional CUDA/MPS (Apple Silicon) support for faster processing
  • Figure Extraction: Automatic extraction and embedding of figures using PyMuPDF
  • Multiple Output Formats: EPUB3 with optional intermediate Markdown
  • Easy to Use: Both CLI and Python API available

Installation

Basic Installation

pip install paper2epub

From Source

git clone https://github.com/MAXNORM8650/paper2epub.git
cd paper2epub
pip install -e .

Development Installation

pip install -e ".[dev]"

Requirements

  • Python 3.9+
  • PyTorch 2.0+
  • For GPU acceleration:
    • NVIDIA GPU: CUDA-enabled PyTorch
    • Apple Silicon (M1/M2/M3): MPS-enabled PyTorch (included by default)
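
To see which of these backends the local PyTorch build reports, a short check along the following lines can help. It is a minimal sketch, not paper2epub's own detection logic: the function names `pick_device` and `python_ok` are hypothetical, and the CUDA, then MPS, then CPU preference order is an assumption.

```python
import sys


def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is only present in newer PyTorch builds
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def python_ok(minimum=(3, 9)) -> bool:
    """Check the interpreter against the documented Python 3.9+ requirement."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    print(f"Python 3.9+: {python_ok()}, preferred device: {pick_device()}")
```

The returned string matches the values accepted by the `-d` option, so it can be passed straight through when "auto" is not wanted.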

Quick Start

Command Line

# Basic conversion
paper2epub paper.pdf

# Specify output and metadata
paper2epub paper.pdf -o output.epub -t "My Paper" -a "John Doe"

# Use larger model with GPU
paper2epub paper.pdf -m base -d cuda

# Save intermediate markdown
paper2epub paper.pdf --save-markdown

# Skip figure extraction
paper2epub paper.pdf --no-figures

# Set minimum figure size (filter small images)
paper2epub paper.pdf --figure-min-size 150

Python API

from paper2epub import Paper2EpubConverter

# Initialize converter
converter = Paper2EpubConverter(
    model_tag="0.1.0-small",  # or "0.1.0-base" for better quality
    device="auto",             # auto-detect GPU/CPU
    extract_figures=True,      # enable figure extraction
    figure_min_size=100,       # minimum figure size in pixels
)

# Convert PDF to EPUB
output_path = converter.convert(
    pdf_path="paper.pdf",
    title="My Academic Paper",
    author="John Doe",
    save_markdown=True,        # optionally save .md file
)

print(f"Created: {output_path}")

CLI Options

Usage: paper2epub [OPTIONS] PDF_PATH

Options:
  -o, --output PATH          Output EPUB file path
  -t, --title TEXT           Book title
  -a, --author TEXT          Author name
  -l, --language TEXT        Language code (default: en)
  -m, --model [small|base]   Nougat model size (default: small)
  -d, --device [auto|cuda|mps|cpu]  Device to use
  -b, --batch-size INT       Batch size for processing
  --save-markdown            Save intermediate markdown file
  --no-figures               Skip figure extraction from PDF
  --figure-min-size INT      Minimum figure size in pixels (default: 100)
  -v, --verbose              Enable verbose logging
  --version                  Show version
  --help                     Show this message and exit
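
When the CLI is driven from a script, the options above can be assembled into an argument list rather than a shell string. The sketch below uses only flags listed in the help text; `build_command` is a hypothetical helper, not part of paper2epub.

```python
import subprocess
from typing import List, Optional


def build_command(
    pdf_path: str,
    output: Optional[str] = None,
    title: Optional[str] = None,
    author: Optional[str] = None,
    model: Optional[str] = None,
    device: Optional[str] = None,
    batch_size: Optional[int] = None,
    save_markdown: bool = False,
    no_figures: bool = False,
) -> List[str]:
    """Build a paper2epub argument list from the documented CLI options."""
    cmd = ["paper2epub", pdf_path]
    valued = (("-o", output), ("-t", title), ("-a", author), ("-m", model), ("-d", device))
    for flag, value in valued:
        if value is not None:
            cmd += [flag, value]
    if batch_size is not None:
        cmd += ["-b", str(batch_size)]
    if save_markdown:
        cmd.append("--save-markdown")
    if no_figures:
        cmd.append("--no-figures")
    return cmd


if __name__ == "__main__":
    # Requires paper2epub to be installed and available on PATH.
    subprocess.run(build_command("paper.pdf", model="base", device="cuda"), check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with titles or file names that contain spaces.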

How It Works

paper2epub uses a multi-stage pipeline:

  1. PDF Extraction: Nougat (Meta's neural OCR) extracts text, tables, and LaTeX equations
  2. Figure Extraction: PyMuPDF extracts embedded images from the PDF
  3. Markdown Generation: Content is converted to Markdown with preserved structure
  4. EPUB Creation: Markdown and images are transformed into EPUB3 with MathML/MathJax support
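
Step 2 can be illustrated with PyMuPDF directly. This is a rough sketch of embedded-image extraction with a minimum-size filter, not paper2epub's actual implementation: the helper names, the output file naming, and the rule that both width and height must reach the minimum are all assumptions.

```python
from pathlib import Path
from typing import List


def is_large_enough(width: int, height: int, min_size: int = 100) -> bool:
    """Assumed rule: keep an image only if both sides reach min_size pixels."""
    return width >= min_size and height >= min_size


def extract_figures(pdf_path: str, out_dir: str, min_size: int = 100) -> List[Path]:
    """Save embedded images that pass the size filter and return their paths."""
    import fitz  # PyMuPDF; imported lazily so the filter above is usable alone

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    seen = set()  # the same image xref can be referenced from several pages
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            for img in page.get_images(full=True):
                xref = img[0]
                if xref in seen:
                    continue
                seen.add(xref)
                info = doc.extract_image(xref)
                if not is_large_enough(info["width"], info["height"], min_size):
                    continue
                target = out / f"page{page_index + 1}_xref{xref}.{info['ext']}"
                target.write_bytes(info["image"])
                saved.append(target)
    return saved
```

Filtering by size is what keeps icons, logos, and rule lines out of the output; raising the threshold trades missed small figures for less clutter.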

Why Nougat?

Nougat (Neural Optical Understanding for Academic Documents) is a neural OCR model from Meta designed specifically for academic papers. It excels at:

  • Recognizing complex mathematical notation
  • Handling multi-column layouts
  • Preserving table structures
  • Recognizing figure captions (figure images are extracted separately with PyMuPDF)

Model Sizes

Model   Size     Speed      Quality   Use Case
small   ~350MB   Fast       Good      Quick conversions, testing
base    ~1.2GB   Moderate   Better    Production use, complex papers

Performance

  • CPU: 1-3 pages/minute (small model)
  • GPU (CUDA): 10-20 pages/minute
  • Apple Silicon (MPS): 5-15 pages/minute
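
As a worked example, the ranges above translate into rough conversion times for a given page count. The sketch below only does the arithmetic; the throughput figures are the ones listed here (the CPU range is for the small model), and `estimate_minutes` is a hypothetical helper.

```python
# Documented throughput ranges in pages per minute: (low, high).
THROUGHPUT = {
    "cpu": (1, 3),
    "cuda": (10, 20),
    "mps": (5, 15),
}


def estimate_minutes(pages: int, device: str):
    """Return (best_case, worst_case) conversion time in minutes."""
    slow, fast = THROUGHPUT[device]
    return pages / fast, pages / slow


if __name__ == "__main__":
    for device in THROUGHPUT:
        best, worst = estimate_minutes(30, device)
        print(f"{device}: {best:.1f} to {worst:.1f} minutes for 30 pages")
```

For a 30-page paper this gives 10 to 30 minutes on CPU, 1.5 to 3 minutes with CUDA, and 2 to 6 minutes with MPS.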

Examples

Convert Multiple PDFs

for pdf in *.pdf; do
    paper2epub "$pdf" -a "Author Name"
done

Batch Processing in Python

from pathlib import Path
from paper2epub import Paper2EpubConverter

converter = Paper2EpubConverter()

pdf_dir = Path("papers")
for pdf_file in pdf_dir.glob("*.pdf"):
    print(f"Converting {pdf_file.name}...")
    converter.convert(pdf_file)
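
For larger batches it may be useful to keep going when a single PDF fails. The following sketch wraps the loop above with error handling; `convert_all` is a hypothetical helper that takes the conversion function as an argument, so it makes no assumptions about which exceptions paper2epub raises.

```python
from pathlib import Path
from typing import Callable, Dict, List


def convert_all(pdf_dir: Path, convert: Callable[[Path], object]) -> Dict[str, List[str]]:
    """Run convert on every PDF in pdf_dir and record successes and failures."""
    results: Dict[str, List[str]] = {"ok": [], "failed": []}
    for pdf_file in sorted(pdf_dir.glob("*.pdf")):
        try:
            convert(pdf_file)
        except Exception as exc:  # keep going; failures are reported at the end
            print(f"Failed on {pdf_file.name}: {exc}")
            results["failed"].append(pdf_file.name)
        else:
            results["ok"].append(pdf_file.name)
    return results
```

With paper2epub installed, it could be called as `convert_all(Path("papers"), Paper2EpubConverter().convert)`, reusing one converter so the model is loaded only once.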

Limitations

  • Scanned PDFs may need the larger base model (-m base) for acceptable OCR quality
  • Very complex equations might need manual review
  • Image quality depends on source PDF resolution
  • Math rendering support varies between EPUB readers; a reader with MathJax support is recommended

Troubleshooting

Dependency Conflicts

Issue 1: albumentations

If you get an error about albumentations or ImageCompression:

# Install compatible version
pip install 'albumentations<1.4.0'

Issue 2: pypdfium2 (PdfDocument has no attribute 'render')

If you get an error about 'PdfDocument' object has no attribute 'render':

# Install compatible version
pip install 'pypdfium2>=4.0.0,<5.0.0'

Or reinstall with all fixes:

pip install --upgrade paper2epub
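
Before reinstalling, it can help to check which versions are actually installed. The sketch below compares installed versions against the two pins above using only the standard library. The helper names are hypothetical and the version parsing is deliberately simplified (it reads up to three leading numeric components and ignores pre-release and post-release suffixes).

```python
import re
from importlib import metadata

# Version windows taken from the pins above: (inclusive lower, exclusive upper).
PINS = {
    "albumentations": (None, (1, 4, 0)),
    "pypdfium2": ((4, 0, 0), (5, 0, 0)),
}


def parse_version(text: str):
    """Turn "1.3.1" or "4.30.0.post1" into a three-integer tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    parts = [int(n) for n in match.group().split(".")[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def in_window(version: str, lower, upper) -> bool:
    """Check lower <= version < upper, where either bound may be None."""
    v = parse_version(version)
    return (lower is None or v >= lower) and (upper is None or v < upper)


def check_pins():
    """Report "ok", "conflict (<version>)", or "not installed" per package."""
    report = {}
    for name, (lower, upper) in PINS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        report[name] = "ok" if in_window(installed, lower, upper) else f"conflict ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_pins().items():
        print(f"{package}: {status}")
```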

Out of Memory

# Reduce batch size
paper2epub paper.pdf -b 1

# Use CPU instead of GPU
paper2epub paper.pdf -d cpu
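
The two remedies can also be tried automatically in sequence. This is a sketch under stated assumptions: it retries on any non-zero exit status, because how paper2epub reports an out-of-memory failure is not documented here, and `convert_with_fallback` is a hypothetical helper.

```python
import subprocess
from typing import Callable, List, Sequence

# Attempts in order: default settings, batch size 1, then CPU with batch size 1.
FALLBACKS: Sequence[List[str]] = ([], ["-b", "1"], ["-b", "1", "-d", "cpu"])


def convert_with_fallback(pdf_path: str, run: Callable = subprocess.run) -> List[str]:
    """Return the first command that exits with status 0."""
    for extra in FALLBACKS:
        cmd = ["paper2epub", pdf_path, *extra]
        if run(cmd).returncode == 0:
            return cmd
    raise RuntimeError(f"all attempts failed for {pdf_path}")


if __name__ == "__main__":
    # Requires paper2epub to be installed and available on PATH.
    print("Succeeded with:", " ".join(convert_with_fallback("paper.pdf")))
```

The `run` parameter exists so the retry order can be exercised without launching the real converter.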

Poor Quality Output

# Use larger model
paper2epub paper.pdf -m base

# Enable verbose logging to debug
paper2epub paper.pdf -v

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - see LICENSE file for details.

Citation

If you use paper2epub in academic work, please cite:

@software{paper2epub,
  title = {paper2epub: Academic PDF to EPUB Converter},
  author = {Komal Kumar},
  year = {2026},
  url = {https://github.com/MAXNORM8650/paper2epub}
}

For Nougat:

@article{blecher2023nougat,
  title={Nougat: Neural Optical Understanding for Academic Documents},
  author={Blecher, Lukas and Cucurull, Guillem and Scialom, Thomas and Stojnic, Robert},
  journal={arXiv preprint arXiv:2308.13418},
  year={2023}
}

Roadmap

  • GROBID integration for better metadata extraction
  • Support for more input formats (DOCX, LaTeX)
  • Batch processing UI
  • Cloud/API deployment option
  • Enhanced equation rendering options
  • Custom styling templates
