Automated pipeline for Ebook metadata enrichment, conversion, and cloud upload.

📚 Epub Pipeline

The ultimate automated tool for curating your Ebook library.

This pipeline extracts metadata from your EPUB files, attempts to find better metadata online (Google Books, OpenLibrary), standardizes filenames, converts to KEPUB (for Kobo e-readers), and uploads the results to Google Drive or a local folder.

Key Features

  • Smart Metadata Enrichment:
    • Waterfall Search Strategy: Prioritizes ISBN lookups (high precision) but falls back to a "relaxed" text search (Title/Author/Publisher) if no ISBN is found.
    • Confidence Scoring: Calculates a reliability score (0-100%) for each match based on title similarity, author overlap, and result uniqueness.
  • Safety First:
    • Interactive Review: By default, low-confidence matches require your confirmation.
    • Granular Control (-i): Optionally review every single field change (Title, Author, Description, etc.) before applying.
    • Non-Destructive: Processes files in a temporary workspace; originals are modified only if the output directory is the same as the input directory.
  • Media Management:
    • High-Res Covers: Automatically downloads covers and optimizes them for e-ink screens (resized to at most 1600x2400 and saved as grayscale-optimized JPEG).
  • Kobo Optimization:
    • Native integration with kepubify to convert EPUBs to KEPUB for faster page turns and better formatting on Kobo devices.
  • Cloud Sync:
    • Direct upload to Google Drive (ideal for use with KoboCloud).
    • Resumable uploads for large files.
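
The waterfall search and confidence score described above could be sketched roughly as follows. This is not the project's code: the function names, the dict-based record layout, the 0.5/0.3/0.2 weights, and the fallback when an ISBN lookup returns nothing are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical record layout: a dict with "title", "authors" and optional "isbn".
Record = dict
SearchFn = Callable[[str], list[Record]]


def title_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def author_overlap(a: list[str], b: list[str]) -> float:
    """Jaccard overlap of the two author sets, in [0, 1]."""
    set_a = {name.casefold() for name in a}
    set_b = {name.casefold() for name in b}
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def confidence(local: Record, candidate: Record, n_results: int) -> float:
    """Blend the three signals into a 0-100 score (weights are illustrative)."""
    uniqueness = 1.0 / n_results if n_results > 0 else 0.0
    score = (
        0.5 * title_similarity(local["title"], candidate["title"])
        + 0.3 * author_overlap(local["authors"], candidate["authors"])
        + 0.2 * uniqueness
    )
    return round(100 * score, 1)


def waterfall_search(local: Record, by_isbn: SearchFn, by_text: SearchFn) -> list[Record]:
    """Try the precise ISBN lookup first, then a relaxed text query."""
    isbn = local.get("isbn")
    if isbn:
        results = by_isbn(isbn)
        if results:
            return results
        # Assumption: an empty ISBN result also triggers the text fallback.
    query = " ".join([local["title"], *local["authors"]])
    return by_text(query)
```

Passing the two search functions in as arguments keeps the sketch independent of any particular API client, so the ordering logic can be exercised with stubs.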

Installation

1. Prerequisites

  • Python 3.12+
  • Kepubify: Required for Kobo conversion.
    1. Download the binary from pgaskin/kepubify.
    2. Rename it to kepubify (Windows: kepubify.exe) and ensure it is executable.
    3. Place it in a directory on your system PATH (recommended).
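
To confirm that the binary is discoverable before running the pipeline, a small standalone check like the one below can help. It only uses the standard library; the function name find_kepubify is hypothetical and not part of the package.

```python
import shutil
from typing import Optional


def find_kepubify() -> Optional[str]:
    """Return the full path to kepubify if it is on PATH, else None."""
    # shutil.which only reports files that are executable, and on Windows it
    # consults PATHEXT, so "kepubify" also matches kepubify.exe.
    return shutil.which("kepubify")


if __name__ == "__main__":
    path = find_kepubify()
    print(path or "kepubify not found on PATH")
```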

2. Install Package

Clone the repository and install it in editable mode:

git clone https://github.com/your-repo/epub-pipeline.git
cd epub-pipeline
pip install -e .

This installs the epubpipe command into your active Python environment.

3. Configuration (.env)

Copy the template and edit your settings:

cp .env.example .env

Note: The tool looks for .env in the directory where you run the command.
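
The lookup rule in the note can be illustrated with a minimal sketch. The parser below is an assumption for illustration only (the package may well use a library such as python-dotenv); load_env_file is a hypothetical name.

```python
from pathlib import Path


def load_env_file(directory: Path) -> dict[str, str]:
    """Parse KEY=VALUE lines from <directory>/.env; return {} if it is missing."""
    env_path = directory / ".env"
    if not env_path.is_file():
        return {}
    settings: dict[str, str] = {}
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and malformed lines
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


if __name__ == "__main__":
    # Mirrors the note: only the current working directory is consulted.
    print(load_env_file(Path.cwd()))
```

Because only the current working directory is checked, running the command from a different folder would silently yield no settings in this sketch.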

4. Google Drive (Optional)

To enable Cloud Upload:

  1. Create a project in Google Cloud Console.
  2. Enable the Google Drive API.
  3. Create OAuth 2.0 Client IDs (Desktop App).
  4. Download the JSON, rename it to credentials.json, and place it in your working directory.
  5. Set GOOGLE_CREDENTIALS_PATH=credentials.json in .env.
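
A quick sanity check of the downloaded file can catch a wrong client type before the first upload. The sketch below assumes the usual layout of Google OAuth client files, where Desktop App clients wrap their fields in a top-level "installed" object and web clients use "web"; the function name is hypothetical.

```python
import json
from pathlib import Path


def check_oauth_client_file(path: Path) -> str:
    """Return a short diagnosis of an OAuth client JSON file."""
    if not path.is_file():
        return "missing"
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return "not valid JSON"
    if not isinstance(data, dict):
        return "unexpected layout"
    # Assumed layout: "installed" for Desktop App clients, "web" for web apps.
    if "installed" in data:
        return "ok (desktop client)"
    if "web" in data:
        return "web client; a Desktop App client is expected"
    return "unexpected layout"


if __name__ == "__main__":
    print(check_oauth_client_file(Path("credentials.json")))
```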

Usage

Basic Usage

Process a single file or an entire directory using the CLI command:

# Process all .epub files in the data/ folder
epubpipe data/

# Process a specific file
epubpipe data/dune.epub

CLI Options

Flag                 Description
-i, --interactive    Granular Review Mode: Ask for confirmation for each field (Title, Date, Cover...) that differs.
--auto               Batch Mode: Automatically accept changes if confidence > 80%, skip others.
--no-kepub           Disable KEPUB conversion for this run.
--no-rename          Keep original filenames.
--no-upload          Process locally only (files remain in output/ or temp).
--isbn <ISBN>        Force a specific ISBN for the search (works only with a single file).
-v, --verbose        Enable debug logs.
-s <source>          Limit search to google or openlibrary.
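
How the review flags might combine into a per-book decision is sketched below. Only the 80% threshold for --auto comes from the table; the cutoff for "low confidence" in the default mode, the precedence of -i over --auto, and all names are assumptions for illustration.

```python
from enum import Enum

AUTO_THRESHOLD = 80.0  # from the --auto description: accept only if confidence > 80%


class Action(Enum):
    ACCEPT = "accept"
    SKIP = "skip"
    ASK = "ask"                      # one confirmation for the whole match
    ASK_PER_FIELD = "ask-per-field"  # -i: confirm each differing field


def decide(
    confidence: float,
    auto: bool = False,
    interactive: bool = False,
    low_confidence_cutoff: float = AUTO_THRESHOLD,  # assumed, not documented
) -> Action:
    """Map a confidence score (0-100) and the CLI flags to a review action."""
    if interactive:
        # Assumption: -i wins if both flags are given.
        return Action.ASK_PER_FIELD
    if auto:
        return Action.ACCEPT if confidence > AUTO_THRESHOLD else Action.SKIP
    # Default mode: low-confidence matches require confirmation.
    return Action.ASK if confidence <= low_confidence_cutoff else Action.ACCEPT
```

Note the strict comparison: under this reading a score of exactly 80 is skipped in batch mode rather than accepted.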

Examples

1. Interactive Review (Recommended for new books)

epubpipe data/new_books/ -i

2. Force specific ISBN

Useful if the automatic search finds the wrong edition.

epubpipe data/unknown_book.epub --isbn 9780441172719
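
Before forcing an ISBN, it can be worth verifying that it was typed correctly. The standard ISBN-13 check weights the digits alternately by 1 and 3 and requires the total to be a multiple of 10. The helper below is a standalone sketch, not part of the epubpipe CLI.

```python
def is_valid_isbn13(isbn: str) -> bool:
    """Check the ISBN-13 checksum, ignoring hyphens and spaces."""
    digits = [c for c in isbn if c not in "- "]
    if len(digits) != 13 or any(c not in "0123456789" for c in digits):
        return False
    # Weights alternate 1, 3, 1, 3, ... across all 13 digits.
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(digits))
    return total % 10 == 0


if __name__ == "__main__":
    print(is_valid_isbn13("9780441172719"))  # True
    print(is_valid_isbn13("9780441172718"))  # False (wrong check digit)
```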

3. Offline / Local Only

Clean metadata and rename files locally; --no-upload skips the upload and --no-kepub skips KEPUB conversion.

epubpipe data/ --no-upload --no-kepub

Debugging Tools

The tools/ directory contains standalone scripts to diagnose issues. You can run them as modules from the project root:

  • Inspector: See exactly what metadata exists inside a file.
    python -m tools.inspect data/book.epub --full
    
  • Search Tester: Test the search logic and see confidence scores without changing files.
    python -m tools.search data/book.epub
    
  • Dry Run: Simulate the whole process (including renaming/conversion logic) without writing to disk.
    python -m tools.dry_run data/
    
  • Manual Upload: Upload a file or folder to Google Drive immediately.
    python -m tools.upload data/book.epub
    

Development

Setup

# Install in editable mode with dev dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

Running Tests

pytest

Manual Linting

ruff check .
mypy .

Credits

  • kepubify by pgaskin.
  • Google Books API & OpenLibrary API.
