
Python CLI for turning PDF books into clean, readable EPUB files


pdf2epub

pdf2epub is a Python CLI for converting PDF books into readable EPUB files. The scope is intentionally narrow for now: one solid pipeline, PDF -> EPUB.

Requires Python 3.12+.

What it does

  • reads a PDF and extracts text page by page
  • supports both text-layer PDFs and scanned or image-only PDFs
  • can call a local or self-hosted OpenAI-compatible model for OCR cleanup and Markdown structuring
  • builds an EPUB from the cleaned and reflowed content instead of dumping raw text

Install

Published package name on PyPI: pdf2epub-cli

Installed command name: pdf2epub

If you only want the CLI on your machine, prefer uv tool or pipx:

uv tool install pdf2epub-cli
pdf2epub --help

pipx install pdf2epub-cli
pdf2epub --help

If you want it inside an existing Python environment:

pip install pdf2epub-cli
pdf2epub --help

Run from source

Sync dependencies:

uv sync

Run the CLI in the repo:

uv run pdf2epub --help

If you want a local editable command while developing in the repo:

uv tool install -e .
pdf2epub --help

You can also install it directly from GitHub:

uv tool install git+https://github.com/ahpxex/pdf2epub
pdf2epub --help

The PyPI distribution is called pdf2epub-cli because the plain pdf2epub package name is already occupied on PyPI. The executable command remains pdf2epub.

CLI usage

uv run pdf2epub <pdf_path> [options]

Common options:

  • -o, --output <path>: output EPUB path, default is <pdf_name>.epub
  • --page <n>: print text from a single 1-based page and exit
  • --title <text>: override EPUB title
  • --author <text>: override EPUB author
  • --language <code>: override language code such as en or zh
  • --text-only: print extracted text only, do not generate EPUB
  • --extract-mode <auto|native|llm>: choose the extraction strategy
  • --batch-size <n>: pages per LLM batch, default 5
  • --max-workers <n>: concurrent LLM requests, default 4
  • --llm-model <name>: model name
  • --llm-base-url <url>: OpenAI-compatible base URL
  • --llm-api-key <key>: API key
  • --llm-timeout <seconds>: per-request timeout, default 120
  • --llm-temperature <value>: sampling temperature, default 0.0
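The --batch-size and --max-workers options describe pages grouped into batches and sent as concurrent LLM requests. The following is a minimal sketch of that idea, not pdf2epub's actual implementation: make_batches, process_batches, and the handle_batch callback are hypothetical names, and only the defaults (5 pages per batch, 4 workers) come from the option list above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator


def make_batches(page_count: int, batch_size: int = 5) -> Iterator[range]:
    """Yield ranges of 1-based page numbers, batch_size pages at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, page_count + 1, batch_size):
        # The last batch may be shorter than batch_size.
        yield range(start, min(start + batch_size, page_count + 1))


def process_batches(
    page_count: int,
    handle_batch: Callable[[range], str],  # hypothetical: one LLM request per batch
    batch_size: int = 5,
    max_workers: int = 4,
) -> list[str]:
    """Run handle_batch on every batch with at most max_workers in flight."""
    batches = list(make_batches(page_count, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, even if requests
        # complete out of order, so page order is preserved.
        return list(pool.map(handle_batch, batches))
```

Under these assumptions, a 12-page PDF with the default batch size yields three batches: pages 1-5, 6-10, and 11-12.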

Most common workflows

Convert a text-based PDF

If the PDF already has a selectable text layer, prefer native mode:

uv run pdf2epub ./book.pdf --extract-mode native

This writes ./book.epub by default.

Set output path and metadata

uv run pdf2epub ./book.pdf \
  --extract-mode native \
  --output ./out/book.epub \
  --title "Custom Title" \
  --author "Author Name" \
  --language en

Inspect one page

uv run pdf2epub ./book.pdf --page 12

This is useful when you want to debug extraction quality on a specific page.

Print cleaned text only

uv run pdf2epub ./book.pdf --text-only --extract-mode native

Convert a scanned PDF

For a local Ollama server or any other OpenAI-compatible endpoint:

export PDF2EPUB_LLM_BASE_URL=http://127.0.0.1:11434/v1
export PDF2EPUB_LLM_API_KEY=dummy
export PDF2EPUB_LLM_MODEL=glm-ocr

uv run pdf2epub ./scanned-book.pdf --extract-mode llm

You can also pass the same config via flags:

uv run pdf2epub ./scanned-book.pdf \
  --extract-mode llm \
  --llm-model glm-ocr \
  --llm-base-url http://127.0.0.1:11434/v1 \
  --llm-api-key dummy

Extraction modes

native

Reads the embedded PDF text layer directly.

  • best for normal text PDFs
  • fast
  • does not require a model
  • does not work for image-only scanned PDFs

uv run pdf2epub ./book.pdf --extract-mode native

llm

Uses the LLM OCR / cleanup / Markdown structuring pipeline.

  • best for scanned PDFs, image-heavy PDFs, or noisy extraction
  • more robust on image pages
  • slower than native mode
  • requires model configuration

uv run pdf2epub ./scan.pdf --extract-mode llm

auto

Chooses between native and llm based on text density.

  • in --text-only mode, it can fall back automatically
  • in normal EPUB generation mode, you should still configure the LLM in advance, because the document may be classified as OCR-needed

If you already know the PDF has a clean text layer, --extract-mode native is usually the safer choice.
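To illustrate what a choice "based on text density" could look like, here is a small sketch. It is an assumption for illustration only: the function name choose_mode, the characters-per-page measure, and the 200-character threshold are hypothetical, and the real heuristic in pdf2epub may differ.

```python
def choose_mode(page_texts: list[str], min_chars_per_page: int = 200) -> str:
    """Pick "native" or "llm" from the text extracted by the native reader.

    page_texts holds one string per page. The threshold is a hypothetical
    value, not one taken from pdf2epub.
    """
    if not page_texts:
        # Nothing extractable at all: treat the document as OCR-needed.
        return "llm"
    # Count non-whitespace characters so blank or padded pages score zero.
    total = sum(len("".join(text.split())) for text in page_texts)
    density = total / len(page_texts)
    return "native" if density >= min_chars_per_page else "llm"
```

A scanned book usually yields little or no text per page from the native reader, so a heuristic of this kind would route it to llm mode, which is why the LLM needs to be configured before running auto mode on unknown input.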

LLM configuration

pdf2epub looks for these environment variables first:

  • PDF2EPUB_LLM_MODEL
  • PDF2EPUB_LLM_BASE_URL
  • PDF2EPUB_LLM_API_KEY

It also accepts these standard fallbacks:

  • OPENAI_MODEL
  • OPENAI_BASE_URL
  • OPENAI_API_KEY

Example .env:

PDF2EPUB_LLM_BASE_URL=http://127.0.0.1:11434/v1
PDF2EPUB_LLM_API_KEY=dummy
PDF2EPUB_LLM_MODEL=glm-ocr
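The lookup order described above can be sketched as follows. This is a hedged illustration rather than the project's code: resolve_setting is a hypothetical helper, and the assumption that an explicit CLI flag wins over both environment variables is not stated in this README.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    name: str,
    flag_value: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Resolve one LLM setting; name is "MODEL", "BASE_URL", or "API_KEY"."""
    env = os.environ if env is None else env
    # Assumption: a value passed on the command line takes precedence.
    if flag_value:
        return flag_value
    # PDF2EPUB_LLM_* is checked first, then the OPENAI_* fallback.
    for key in (f"PDF2EPUB_LLM_{name}", f"OPENAI_{name}"):
        value = env.get(key)
        if value:
            return value
    return None
```

For example, with only OPENAI_MODEL set, resolve_setting("MODEL") returns its value; once PDF2EPUB_LLM_MODEL is also set, that value is returned instead.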

Output behavior

  • default output path: same directory as the source PDF, with an .epub extension
  • if --title, --author, or --language are omitted, the CLI tries to infer them from PDF metadata and extracted text
  • if text extraction succeeds but no chapters can be built, the CLI exits with an error instead of silently generating a broken EPUB
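The default output rule (same directory, .epub extension) is easy to express with pathlib. The helper below is a hypothetical sketch of that rule, not code from the project.

```python
from pathlib import Path
from typing import Optional


def default_output_path(pdf_path: str, output: Optional[str] = None) -> Path:
    """Return the EPUB path: the explicit --output value if given,
    otherwise the source PDF path with its suffix swapped to .epub."""
    if output:
        return Path(output)
    # with_suffix() replaces only the final extension and keeps the directory.
    return Path(pdf_path).with_suffix(".epub")
```

So books/novel.pdf maps to books/novel.epub, while an explicit output such as ./out/book.epub is used unchanged.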

Testing

Run the unit test suite:

uv run pytest

Run the live local-model integration test:

PDF2EPUB_RUN_LIVE_LLM=1 uv run pytest -m live_llm -s

Benchmarks and local book fixtures

  • manual end-to-end artifacts live under benchmarks/artifacts/e2e/
  • local linked book fixtures live under benchmarks/local/downloads_books/files/
  • automated quality regression script:

uv run python scripts/run_quality_regression.py

Naming and scope

The project used to be called any2epub. It is now intentionally narrowed to pdf2epub: the current goal is not "convert anything to EPUB", but to make the PDF-to-EPUB pipeline stable and high quality first.

For package distribution, PyPI uses pdf2epub-cli while the executable command stays pdf2epub.

License

This project is licensed under AGPL-3.0-or-later. See LICENSE.
