Python CLI for turning PDF books into clean, readable EPUB files
pdf2epub
pdf2epub is a Python CLI for converting PDF books into readable EPUB files.
The scope is intentionally narrow for now: one solid pipeline, PDF -> EPUB.
Requires Python 3.12+.
What it does
- reads a PDF and extracts text page by page
- supports both text-layer PDFs and scanned or image-only PDFs
- can call a local or self-hosted OpenAI-compatible model for OCR cleanup and Markdown structuring
- builds an EPUB from the cleaned and reflowed content instead of dumping raw text
Install
Published package name on PyPI: pdf2epub-cli
Installed command name: pdf2epub
If you only want the CLI on your machine, prefer uv tool or pipx:
uv tool install pdf2epub-cli
pdf2epub --help
pipx install pdf2epub-cli
pdf2epub --help
If you want it inside an existing Python environment:
pip install pdf2epub-cli
pdf2epub --help
Run from source
Sync dependencies:
uv sync
Run the CLI in the repo:
uv run pdf2epub --help
If you want a local editable command while developing in the repo:
uv tool install -e .
pdf2epub --help
You can also install it directly from GitHub:
uv tool install git+https://github.com/ahpxex/pdf2epub
pdf2epub --help
The PyPI distribution is called pdf2epub-cli because the plain
pdf2epub package name is already occupied on PyPI. The executable command
remains pdf2epub.
CLI usage
uv run pdf2epub <pdf_path> [options]
Common options:
- -o, --output <path>: output EPUB path, default is <pdf_name>.epub
- --page <n>: print text from a single 1-based page and exit
- --title <text>: override EPUB title
- --author <text>: override EPUB author
- --language <code>: override language code such as en or zh
- --text-only: print extracted text only, do not generate EPUB
- --extract-mode <auto|native|llm>: choose the extraction strategy
- --batch-size <n>: pages per LLM batch, default 5
- --max-workers <n>: concurrent LLM requests, default 4
- --llm-model <name>: model name
- --llm-base-url <url>: OpenAI-compatible base URL
- --llm-api-key <key>: API key
- --llm-timeout <seconds>: per-request timeout, default 120
- --llm-temperature <value>: sampling temperature, default 0.0
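The --batch-size and --max-workers options imply that pages are grouped into batches and that several batches are in flight at once. The following is a minimal Python sketch of that pattern, not the tool's actual implementation: chunk_pages, process_batch, and run_batches are hypothetical names, and process_batch stands in for a real model request.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_pages(page_numbers, batch_size=5):
    """Split page numbers into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [
        page_numbers[i:i + batch_size]
        for i in range(0, len(page_numbers), batch_size)
    ]


def process_batch(batch):
    """Hypothetical stand-in for one LLM request covering a batch of pages."""
    return [f"page {n} text" for n in batch]


def run_batches(page_numbers, batch_size=5, max_workers=4):
    """Process batches concurrently and return results in page order."""
    batches = chunk_pages(page_numbers, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so pages stay sorted
        # even when batches finish out of order.
        results = list(pool.map(process_batch, batches))
    return [text for batch in results for text in batch]
```

With the documented defaults, a 12-page book would be split into three batches (5, 5, and 2 pages), all of which fit within four workers.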
Most common workflows
Convert a text-based PDF
If the PDF already has a selectable text layer, prefer native mode:
uv run pdf2epub ./book.pdf --extract-mode native
This writes ./book.epub by default.
Set output path and metadata
uv run pdf2epub ./book.pdf \
--extract-mode native \
--output ./out/book.epub \
--title "Custom Title" \
--author "Author Name" \
--language en
Inspect one page
uv run pdf2epub ./book.pdf --page 12
This is useful when you want to debug extraction quality on a specific page.
Print cleaned text only
uv run pdf2epub ./book.pdf --text-only --extract-mode native
Convert a scanned PDF
For a local Ollama server or any other OpenAI-compatible endpoint:
export PDF2EPUB_LLM_BASE_URL=http://127.0.0.1:11434/v1
export PDF2EPUB_LLM_API_KEY=dummy
export PDF2EPUB_LLM_MODEL=glm-ocr
uv run pdf2epub ./scanned-book.pdf --extract-mode llm
You can also pass the same config via flags:
uv run pdf2epub ./scanned-book.pdf \
--extract-mode llm \
--llm-model glm-ocr \
--llm-base-url http://127.0.0.1:11434/v1 \
--llm-api-key dummy
Extraction modes
native
Reads the embedded PDF text layer directly.
- best for normal text PDFs
- fast
- does not require a model
- does not work for image-only scanned PDFs
uv run pdf2epub ./book.pdf --extract-mode native
llm
Uses the LLM OCR / cleanup / Markdown structuring pipeline.
- best for scanned PDFs, image-heavy PDFs, or noisy extraction
- more robust on image pages
- slower than native mode
- requires model configuration
uv run pdf2epub ./scan.pdf --extract-mode llm
auto
Chooses between native and llm based on text density.
- in --text-only mode, it can fall back automatically
- in normal EPUB generation mode, you should still configure the LLM in advance, because the document may be classified as OCR-needed
If you already know the PDF has a clean text layer, --extract-mode native is
usually the safer choice.
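To illustrate what choosing by text density could look like, here is a rough sketch. It is an assumption, not the project's real classifier: the function name choose_mode, the use of average characters per page, and the 200-character threshold are all hypothetical.

```python
def choose_mode(page_texts, min_chars_per_page=200):
    """Pick an extraction mode from natively extracted page text.

    Hypothetical heuristic: if the text layer yields few characters per
    page on average, treat the document as scanned and use the LLM path.
    """
    if not page_texts:
        # No extractable pages at all: assume OCR is needed.
        return "llm"
    total_chars = sum(len(text.strip()) for text in page_texts)
    average = total_chars / len(page_texts)
    return "native" if average >= min_chars_per_page else "llm"
```

A heuristic like this can misclassify sparse documents such as poetry or slide decks, which is one reason to pass --extract-mode native explicitly when the text layer is known to be good.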
LLM configuration
pdf2epub looks for these environment variables first:
- PDF2EPUB_LLM_MODEL
- PDF2EPUB_LLM_BASE_URL
- PDF2EPUB_LLM_API_KEY
It also accepts these standard fallbacks:
- OPENAI_MODEL
- OPENAI_BASE_URL
- OPENAI_API_KEY
Example .env:
PDF2EPUB_LLM_BASE_URL=http://127.0.0.1:11434/v1
PDF2EPUB_LLM_API_KEY=dummy
PDF2EPUB_LLM_MODEL=glm-ocr
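The lookup order described above (PDF2EPUB_* first, then the OPENAI_* fallbacks) can be sketched as follows. This is an illustration only: resolve_llm_setting and the key names are hypothetical, and treating an empty value as unset is an assumption rather than documented behavior.

```python
import os

# Variable names come from the README; the short keys are hypothetical.
_LOOKUP_ORDER = {
    "model": ("PDF2EPUB_LLM_MODEL", "OPENAI_MODEL"),
    "base_url": ("PDF2EPUB_LLM_BASE_URL", "OPENAI_BASE_URL"),
    "api_key": ("PDF2EPUB_LLM_API_KEY", "OPENAI_API_KEY"),
}


def resolve_llm_setting(key, env=None):
    """Return the first non-empty value for a setting, or None if unset."""
    env = os.environ if env is None else env
    for variable in _LOOKUP_ORDER[key]:
        value = env.get(variable)
        if value:  # assumption: an empty string counts as "not set"
            return value
    return None
```

Passing a plain dict as env keeps the function easy to test without touching the real process environment.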
Output behavior
- default output path: same directory as the source PDF, with an .epub extension
- if --title, --author, or --language are omitted, the CLI tries to infer them from PDF metadata and extracted text
- if text extraction succeeds but no chapters can be built, the CLI exits with an error instead of silently generating a broken EPUB
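The default output path rule is simple enough to express in one line with pathlib. This is a sketch of the described behavior, not the project's code; default_output_path is a hypothetical name.

```python
from pathlib import Path


def default_output_path(pdf_path):
    """Same directory and stem as the source PDF, with an .epub suffix."""
    return Path(pdf_path).with_suffix(".epub")
```

For example, a source of ./out/book.pdf maps to out/book.epub, matching the "This writes ./book.epub by default" behavior shown earlier for ./book.pdf.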
Testing
Run the unit test suite:
uv run pytest
Run the live local-model integration test:
PDF2EPUB_RUN_LIVE_LLM=1 uv run pytest -m live_llm -s
Benchmarks and local book fixtures
- manual end-to-end artifacts live under benchmarks/artifacts/e2e/
- local linked book fixtures live under benchmarks/local/downloads_books/files/
- automated quality regression script:
uv run python scripts/run_quality_regression.py
Naming and scope
The project used to be called any2epub. It is now intentionally narrowed to
pdf2epub: the current goal is not "convert anything to EPUB", but to make the
PDF-to-EPUB pipeline stable and high quality first.
For package distribution, PyPI uses pdf2epub-cli while the executable command
stays pdf2epub.
License
This project is licensed under AGPL-3.0-or-later. See LICENSE.
Download files
File details
Details for the file pdf2epub_cli-0.1.0.tar.gz.
File metadata
- Download URL: pdf2epub_cli-0.1.0.tar.gz
- Upload date:
- Size: 54.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6a96d28052d40da88cb48e9b0b7c54ee9422a18b4f8729c95162d83c0af7c6ed |
| MD5 | db048e49bf3af694bc1661e3cff0cb2c |
| BLAKE2b-256 | 3c3404ebbbe140a0ddb8c0cc3267ad92dbd5e826a76b256f32fd173ed747d270 |
File details
Details for the file pdf2epub_cli-0.1.0-py3-none-any.whl.
File metadata
- Download URL: pdf2epub_cli-0.1.0-py3-none-any.whl
- Upload date:
- Size: 58.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2b36ebf9865570f1d6b08bc6f6a5998b797d78c93c2ffb55db5bc2152ca2c402 |
| MD5 | 70232320d5f60db9855fe347589f5939 |
| BLAKE2b-256 | 032828a4ca810431f5156d2f20276d62b75c54d902801ac42e28ad908800834c |
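If you download one of these files manually, you can compare its SHA256 digest against the value listed above. The helper below is a generic sketch using Python's standard hashlib module; sha256_of_file and matches_digest are hypothetical names and are not part of pdf2epub.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path, expected_hex):
    """Return True if the file's SHA256 equals the expected hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, this would be called as matches_digest("pdf2epub_cli-0.1.0.tar.gz", "<SHA256 value from the table>").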