Python CLI for turning PDF books into clean, readable EPUB files

These details have not been verified by PyPI

Project links

Project description

pdf2epub

pdf2epub is a Python CLI for converting PDF books into readable EPUB files. The scope is intentionally narrow for now: one solid pipeline, PDF -> EPUB.

Requires Python 3.12+.

What it does

reads a PDF and extracts text page by page
supports both text-layer PDFs and scanned or image-only PDFs
can call a local or self-hosted OpenAI-compatible model for OCR cleanup and Markdown structuring
builds an EPUB from the cleaned and reflowed content instead of dumping raw text

Install

Published package name on PyPI: pdf2epub-cli

Installed command name: pdf2epub

If you only want the CLI on your machine, prefer uv tool or pipx:

uv tool install pdf2epub-cli
pdf2epub --help

pipx install pdf2epub-cli
pdf2epub --help

If you want it inside an existing Python environment:

pip install pdf2epub-cli
pdf2epub --help

Run from source

Sync dependencies:

uv sync

Run the CLI in the repo:

uv run pdf2epub --help

If you want a local editable command while developing in the repo:

uv tool install -e .
pdf2epub --help

You can also install it directly from GitHub:

uv tool install git+https://github.com/ahpxex/pdf2epub
pdf2epub --help

The PyPI distribution is called pdf2epub-cli because the plain pdf2epub package name is already occupied on PyPI. The executable command remains pdf2epub.

CLI usage

uv run pdf2epub <pdf_path> [options]

Common options:

-o, --output <path>: output EPUB path, default is <pdf_name>.epub
--page <n>: print text from a single 1-based page and exit
--title <text>: override EPUB title
--author <text>: override EPUB author
--language <code>: override language code such as en or zh
--text-only: print extracted text only, do not generate EPUB
--extract-mode <auto|native|llm>: choose the extraction strategy
--batch-size <n>: pages per LLM batch, default 5
--max-workers <n>: concurrent LLM requests, default 4
--llm-model <name>: model name
--llm-base-url <url>: OpenAI-compatible base URL
--llm-api-key <key>: API key
--llm-timeout <seconds>: per-request timeout, default 120
--llm-temperature <value>: sampling temperature, default 0.0

Most common workflows

Convert a text-based PDF

If the PDF already has a selectable text layer, prefer native mode:

uv run pdf2epub ./book.pdf --extract-mode native

This writes ./book.epub by default.

Set output path and metadata

uv run pdf2epub ./book.pdf \
  --extract-mode native \
  --output ./out/book.epub \
  --title "Custom Title" \
  --author "Author Name" \
  --language en

Inspect one page

uv run pdf2epub ./book.pdf --page 12

This is useful when you want to debug extraction quality on a specific page.

Print cleaned text only

uv run pdf2epub ./book.pdf --text-only --extract-mode native

Convert a scanned PDF

For a local Ollama server or any other OpenAI-compatible endpoint:

export PDF2EPUB_LLM_BASE_URL=http://127.0.0.1:11434/v1
export PDF2EPUB_LLM_API_KEY=dummy
export PDF2EPUB_LLM_MODEL=glm-ocr

uv run pdf2epub ./scanned-book.pdf --extract-mode llm

You can also pass the same config via flags:

uv run pdf2epub ./scanned-book.pdf \
  --extract-mode llm \
  --llm-model glm-ocr \
  --llm-base-url http://127.0.0.1:11434/v1 \
  --llm-api-key dummy

Extraction modes

`native`

Reads the embedded PDF text layer directly.

best for normal text PDFs
fast
does not require a model
does not work for image-only scanned PDFs

uv run pdf2epub ./book.pdf --extract-mode native

`llm`

Uses the LLM OCR / cleanup / Markdown structuring pipeline.

best for scanned PDFs, image-heavy PDFs, or noisy extraction
more robust on image pages
slower than native mode
requires model configuration

uv run pdf2epub ./scan.pdf --extract-mode llm

`auto`

Chooses between native and llm based on text density.

in --text-only mode, it can fall back automatically
in normal EPUB generation mode, you should still configure the LLM in advance, because the document may be classified as OCR-needed

If you already know the PDF has a clean text layer, --extract-mode native is usually the safer choice.

LLM configuration

pdf2epub looks for these environment variables first:

PDF2EPUB_LLM_MODEL
PDF2EPUB_LLM_BASE_URL
PDF2EPUB_LLM_API_KEY

It also accepts these standard fallbacks:

OPENAI_MODEL
OPENAI_BASE_URL
OPENAI_API_KEY

Example .env:

PDF2EPUB_LLM_BASE_URL=http://127.0.0.1:11434/v1
PDF2EPUB_LLM_API_KEY=dummy
PDF2EPUB_LLM_MODEL=glm-ocr

Output behavior

default output path: same directory as the source PDF, with an .epub extension
if --title, --author, or --language are omitted, the CLI tries to infer them from PDF metadata and extracted text
if text extraction succeeds but no chapters can be built, the CLI exits with an error instead of silently generating a broken EPUB

Testing

Run the unit test suite:

uv run pytest

Run the live local-model integration test:

PDF2EPUB_RUN_LIVE_LLM=1 uv run pytest -m live_llm -s

Benchmarks and local book fixtures

manual end-to-end artifacts live under benchmarks/artifacts/e2e/
local linked book fixtures live under benchmarks/local/downloads_books/files/
automated quality regression script:

uv run python scripts/run_quality_regression.py

Naming and scope

The project used to be called any2epub. It is now intentionally narrowed to pdf2epub: the current goal is not "convert anything to EPUB", but to make the PDF-to-EPUB pipeline stable and high quality first.

For package distribution, PyPI uses pdf2epub-cli while the executable command stays pdf2epub.

License

This project is licensed under AGPL-3.0-or-later. See LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Mar 22, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pdf2epub_cli-0.1.0.tar.gz (54.9 kB view details)

Uploaded Mar 22, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pdf2epub_cli-0.1.0-py3-none-any.whl (58.3 kB view details)

Uploaded Mar 22, 2026 Python 3

File details

Details for the file pdf2epub_cli-0.1.0.tar.gz.

File metadata

Download URL: pdf2epub_cli-0.1.0.tar.gz
Upload date: Mar 22, 2026
Size: 54.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pdf2epub_cli-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6a96d28052d40da88cb48e9b0b7c54ee9422a18b4f8729c95162d83c0af7c6ed`
MD5	`db048e49bf3af694bc1661e3cff0cb2c`
BLAKE2b-256	`3c3404ebbbe140a0ddb8c0cc3267ad92dbd5e826a76b256f32fd173ed747d270`

See more details on using hashes here.

File details

Details for the file pdf2epub_cli-0.1.0-py3-none-any.whl.

File metadata

Download URL: pdf2epub_cli-0.1.0-py3-none-any.whl
Upload date: Mar 22, 2026
Size: 58.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for pdf2epub_cli-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2b36ebf9865570f1d6b08bc6f6a5998b797d78c93c2ffb55db5bc2152ca2c402`
MD5	`70232320d5f60db9855fe347589f5939`
BLAKE2b-256	`032828a4ca810431f5156d2f20276d62b75c54d902801ac42e28ad908800834c`

See more details on using hashes here.

pdf2epub-cli 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

pdf2epub

What it does

Install

Run from source

CLI usage

Most common workflows

Convert a text-based PDF

Set output path and metadata

Inspect one page

Print cleaned text only

Convert a scanned PDF

Extraction modes

native

llm

auto

LLM configuration

Output behavior

Testing

Benchmarks and local book fixtures

Naming and scope

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

`native`

`llm`

`auto`