paper2epub
A powerful academic PDF to EPUB converter with AI-powered layout detection and LaTeX math support.
Features
- Academic-First Design: Optimized for scientific papers, research documents, and technical publications
- LaTeX Math Support: Preserves mathematical equations using Nougat's neural OCR
- Complex Layout Handling: AI-powered detection of multi-column layouts, tables, and figures
- GPU Acceleration: Optional CUDA/MPS (Apple Silicon) support for faster processing
- Figure Extraction: Automatic extraction and embedding of figures using PyMuPDF
- Multiple Output Formats: EPUB3 with optional intermediate Markdown
- Easy to Use: Both CLI and Python API available
Installation
Basic Installation
pip install paper2epub
From Source
git clone https://github.com/MAXNORM8650/paper2epub.git
cd paper2epub
pip install -e .
Development Installation
pip install -e ".[dev]"
Requirements
- Python 3.9+
- PyTorch 2.0+
- For GPU acceleration:
  - NVIDIA GPU: CUDA-enabled PyTorch
  - Apple Silicon (M1/M2/M3): MPS-enabled PyTorch (included by default)
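To check which of these backends the local PyTorch build can actually use, a short probe along the following lines may help. It is only a sketch: `pick_device` is a hypothetical helper, not part of paper2epub, and the tool's own `auto` detection may use a different order or different checks.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch  # imported lazily; without PyTorch we simply report "cpu"
    except ImportError:
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps exists in recent PyTorch releases; guard for older builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The returned string matches the values accepted by the `-d/--device` option, so it can be passed straight through when a fixed device is preferred over `auto`.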
Quick Start
Command Line
# Basic conversion
paper2epub paper.pdf
# Specify output and metadata
paper2epub paper.pdf -o output.epub -t "My Paper" -a "John Doe"
# Use larger model with GPU
paper2epub paper.pdf -m base -d cuda
# Save intermediate markdown
paper2epub paper.pdf --save-markdown
# Skip figure extraction
paper2epub paper.pdf --no-figures
# Set minimum figure size (filter small images)
paper2epub paper.pdf --figure-min-size 150
Python API
from paper2epub import Paper2EpubConverter
# Initialize converter
converter = Paper2EpubConverter(
    model_tag="0.1.0-small",  # or "0.1.0-base" for better quality
    device="auto",            # auto-detect GPU/CPU
    extract_figures=True,     # enable figure extraction
    figure_min_size=100,      # minimum figure size in pixels
)
# Convert PDF to EPUB
output_path = converter.convert(
    pdf_path="paper.pdf",
    title="My Academic Paper",
    author="John Doe",
    save_markdown=True,  # optionally save .md file
)
print(f"Created: {output_path}")
CLI Options
Usage: paper2epub [OPTIONS] PDF_PATH
Options:
  -o, --output PATH                 Output EPUB file path
  -t, --title TEXT                  Book title
  -a, --author TEXT                 Author name
  -l, --language TEXT               Language code (default: en)
  -m, --model [small|base]          Nougat model size (default: small)
  -d, --device [auto|cuda|mps|cpu]  Device to use
  -b, --batch-size INT              Batch size for processing
  --save-markdown                   Save intermediate markdown file
  --no-figures                      Skip figure extraction from PDF
  --figure-min-size INT             Minimum figure size in pixels (default: 100)
  -v, --verbose                     Enable verbose logging
  --version                         Show version
  --help                            Show this message and exit
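When the CLI is driven from a script, the options above can be assembled programmatically. The sketch below is an illustration, not part of paper2epub: `build_command` and `run_conversion` are hypothetical helpers, and they use only flags listed in the option table.

```python
import subprocess
from typing import List, Optional


def build_command(
    pdf_path: str,
    output: Optional[str] = None,
    title: Optional[str] = None,
    author: Optional[str] = None,
    model: Optional[str] = None,
    device: Optional[str] = None,
    batch_size: Optional[int] = None,
    save_markdown: bool = False,
    no_figures: bool = False,
) -> List[str]:
    """Assemble a paper2epub argument list from the documented options."""
    cmd = ["paper2epub", pdf_path]
    valued_flags = (
        ("-o", output),
        ("-t", title),
        ("-a", author),
        ("-m", model),
        ("-d", device),
        ("-b", batch_size),
    )
    # Only emit flags the caller actually set, so CLI defaults still apply.
    for flag, value in valued_flags:
        if value is not None:
            cmd += [flag, str(value)]
    if save_markdown:
        cmd.append("--save-markdown")
    if no_figures:
        cmd.append("--no-figures")
    return cmd


def run_conversion(**kwargs) -> int:
    """Run the CLI and return its exit code (requires paper2epub on PATH)."""
    return subprocess.run(build_command(**kwargs)).returncode


print(build_command("paper.pdf", model="base", device="cuda"))
```

Passing the arguments as a list avoids shell quoting problems with titles or file names that contain spaces.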
How It Works
paper2epub uses a multi-stage pipeline:
- PDF Extraction: Nougat (Meta's neural OCR) extracts text, tables, and LaTeX equations
- Figure Extraction: PyMuPDF extracts embedded images from the PDF
- Markdown Generation: Content is converted to Markdown with preserved structure
- EPUB Creation: Markdown and images are transformed into EPUB3 with MathML/MathJax support
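The figure extraction stage is affected by the minimum figure size setting, which filters out small images such as logos or rules. How paper2epub applies the threshold internally is not documented here, so the sketch below assumes that an image is kept only when both its width and its height reach the minimum. `ImageInfo` and `filter_figures` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ImageInfo:
    """Hypothetical record for one image found in the PDF."""
    name: str
    width: int   # pixels
    height: int  # pixels


def filter_figures(images: Iterable[ImageInfo], min_size: int = 100) -> List[ImageInfo]:
    """Keep images whose width and height are both at least min_size pixels.

    Assumption: the threshold applies to both dimensions.
    """
    return [img for img in images if img.width >= min_size and img.height >= min_size]


candidates = [
    ImageInfo("plot.png", 640, 480),
    ImageInfo("logo.png", 48, 48),
    ImageInfo("rule.png", 800, 4),
]
kept = filter_figures(candidates, min_size=100)
print([img.name for img in kept])  # ['plot.png']
```

Under this assumption, raising the threshold (for example to 150) removes more decorative images at the risk of dropping small but genuine figures.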
Why Nougat?
Nougat (Neural Optical Understanding for Academic Documents) is a neural OCR model from Meta AI designed specifically for academic papers. It is well suited to:
- Recognizing complex mathematical notation
- Handling multi-column layouts
- Preserving table structures
- Transcribing figure captions (the figure images themselves are extracted separately with PyMuPDF)
Model Sizes
| Model | Size | Speed | Quality | Use Case |
|---|---|---|---|---|
| small | ~350MB | Fast | Good | Quick conversions, testing |
| base | ~1.2GB | Moderate | Better | Production use, complex papers |
Performance
- CPU: 1-3 pages/minute (small model)
- GPU (CUDA): 10-20 pages/minute
- Apple Silicon (MPS): 5-15 pages/minute
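These ranges translate into rough conversion times. As a worked example, the sketch below divides a page count by the slow and fast ends of each range. `estimate_minutes` is a hypothetical helper, and the figures are simply the ones listed above, so real timings will vary with model size and hardware.

```python
import math
from typing import Tuple

# Throughput ranges in pages per minute, copied from the list above.
PAGES_PER_MINUTE = {
    "cpu": (1, 3),
    "cuda": (10, 20),
    "mps": (5, 15),
}


def estimate_minutes(pages: int, device: str) -> Tuple[int, int]:
    """Return (best_case, worst_case) conversion time in whole minutes."""
    slow, fast = PAGES_PER_MINUTE[device]
    return math.ceil(pages / fast), math.ceil(pages / slow)


print(estimate_minutes(30, "cpu"))   # (10, 30)
print(estimate_minutes(30, "cuda"))  # (2, 3)
```

For a 30-page paper this gives roughly 10 to 30 minutes on CPU and 2 to 3 minutes on a CUDA GPU.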
Examples
Convert Multiple PDFs
for pdf in *.pdf; do
    paper2epub "$pdf" -a "Author Name"
done
Batch Processing in Python
from pathlib import Path
from paper2epub import Paper2EpubConverter
converter = Paper2EpubConverter()
pdf_dir = Path("papers")
for pdf_file in pdf_dir.glob("*.pdf"):
    print(f"Converting {pdf_file.name}...")
    converter.convert(pdf_file)
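For large folders it can be useful to skip papers that were already converted. The sketch below assumes the EPUB is written next to the PDF with the same base name. That default is an assumption, not something stated above, so adjust it if an explicit output path is used. `pending_pdfs` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def pending_pdfs(pdf_dir: Path) -> List[Path]:
    """Return PDFs in pdf_dir that have no same-named .epub beside them."""
    return sorted(
        pdf
        for pdf in pdf_dir.glob("*.pdf")
        if not pdf.with_suffix(".epub").exists()
    )
```

The returned list can replace `pdf_dir.glob("*.pdf")` in the loop above, so an interrupted batch run resumes where it stopped.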
Limitations
- Scanned PDFs may need the higher-quality base model (-m base)
- Very complex equations might need manual review
- Image quality depends on source PDF resolution
- EPUB readers vary in math rendering support (MathJax recommended)
Troubleshooting
Dependency Conflicts
Issue 1: albumentations
If you get an error about albumentations or ImageCompression:
# Install compatible version
pip install 'albumentations<1.4.0'
Issue 2: pypdfium2 (PdfDocument has no attribute 'render')
If you get an error about 'PdfDocument' object has no attribute 'render':
# Install compatible version
pip install 'pypdfium2>=4.0.0,<5.0.0'
Or reinstall with all fixes:
pip install --upgrade paper2epub
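Before reinstalling, it may help to confirm which versions are actually installed. The sketch below uses only the standard library and checks the two version ranges mentioned above. `version_tuple`, `installed_version`, and `check_pins` are hypothetical helpers, and the version parsing is deliberately simplified (pre-release and post-release suffixes are ignored).

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse leading numeric components, padded to three: "1.4" -> (1, 4, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins() -> Dict[str, Optional[bool]]:
    """Map each requirement to True/False, or None if the package is absent."""
    report: Dict[str, Optional[bool]] = {}
    alb = installed_version("albumentations")
    report["albumentations<1.4.0"] = (
        None if alb is None else version_tuple(alb) < (1, 4, 0)
    )
    pdfium = installed_version("pypdfium2")
    report["pypdfium2>=4.0.0,<5.0.0"] = (
        None if pdfium is None else (4, 0, 0) <= version_tuple(pdfium) < (5, 0, 0)
    )
    return report


if __name__ == "__main__":
    for requirement, ok in check_pins().items():
        print(requirement, "->", ok)
```

A `False` entry points at the package to reinstall with the matching `pip install` command above.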
Out of Memory
# Reduce batch size
paper2epub paper.pdf -b 1
# Use CPU instead of GPU
paper2epub paper.pdf -d cpu
Poor Quality Output
# Use larger model
paper2epub paper.pdf -m base
# Enable verbose logging to debug
paper2epub paper.pdf -v
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - see LICENSE file for details.
Citation
If you use paper2epub in academic work, please cite:
@software{paper2epub,
  title = {paper2epub: Academic PDF to EPUB Converter},
  author = {Komal Kumar},
  year = {2026},
  url = {https://github.com/MAXNORM8650/paper2epub}
}
For Nougat:
@article{blecher2023nougat,
  title={Nougat: Neural Optical Understanding for Academic Documents},
  author={Blecher, Lukas and Cucurull, Guillem and Scialom, Thomas and Stojnic, Robert},
  journal={arXiv preprint arXiv:2308.13418},
  year={2023}
}
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Roadmap
- GROBID integration for better metadata extraction
- Support for more input formats (DOCX, LaTeX)
- Batch processing UI
- Cloud/API deployment option
- Enhanced equation rendering options
- Custom styling templates