Clean and standardize messy book filenames using LLM + Google Books

📚 CleanMyBooks

Clean and standardize messy book filenames using LLM + Google Books API.

CleanMyBooks takes chaotic ebook filenames like python.crash.course.2ndEd_FINAL_v2.pdf and renames them to a clean, consistent format:

Eric Matthes - Python Crash Course (2019).pdf

Features

  • 🧠 LLM-powered parsing via OpenRouter (GPT-4o-mini by default)
  • 🔍 Google Books verification for authoritative metadata
  • 📊 Confidence scoring — falls back to LLM if Google match is weak
  • ⚡ Parallel processing with configurable thread workers
  • 💾 JSON caching — avoids re-processing the same file twice
  • 🛡️ Safe renaming — dry-run mode, collision-safe, no overwrites
  • 📝 Supports: .pdf, .epub, .mobi, .azw, .azw3
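
The "safe renaming" feature above can be illustrated with a minimal sketch. This is not CleanMyBooks' actual code: the helper names `unique_target` and `safe_rename`, and the ` (n)` suffix scheme for collisions, are assumptions made for illustration.

```python
from pathlib import Path


def unique_target(target: Path) -> Path:
    """Return target, or a ' (n)'-suffixed variant if that name is taken."""
    if not target.exists():
        return target
    n = 1
    while True:
        candidate = target.with_name(f"{target.stem} ({n}){target.suffix}")
        if not candidate.exists():
            return candidate
        n += 1


def safe_rename(src: Path, new_name: str, dry_run: bool = False) -> Path:
    """Rename src inside its own folder without overwriting another file.

    Hypothetical helper: with dry_run=True nothing on disk changes and the
    name that would be used is only reported.
    """
    target = src.with_name(new_name)
    if target == src:
        return src  # already has the clean name
    target = unique_target(target)
    if not dry_run:
        src.rename(target)
    return target
```

The existence check and the rename are two separate steps, so this sketch is not safe against another process creating the same file in between; a real tool would need to decide whether that matters.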

Installation

From source

git clone https://github.com/yourusername/cleanmybooks.git
cd cleanmybooks
pip install -e .

From PyPI (once published)

pip install cleanmybooks

Setup

  1. Copy the example env file:
cp .env.example .env
  2. Add your OpenRouter API key to .env:
OPENROUTER_API_KEY=sk-or-v1-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Get a free API key at openrouter.ai/keys.


Usage

Basic — rename all books in a folder

cleanmybooks /path/to/my/ebooks

Dry run — preview changes without renaming

cleanmybooks /path/to/my/ebooks --dry-run

Verbose output with more workers

cleanmybooks /path/to/my/ebooks --workers 8 --verbose

Adjust confidence threshold

cleanmybooks /path/to/my/ebooks --confidence-threshold 0.75

A lower threshold accepts weaker Google Books matches, so the Google result is used more often. A higher threshold requires a stronger match and falls back to the LLM result more often.

Clear the cache

cleanmybooks --clear-cache
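
For readers curious what a JSON cache with a "clear" operation might look like, here is a minimal sketch. The function names and the cache layout (original filename mapped to cleaned filename) are assumptions, not the tool's documented internals.

```python
import json
from pathlib import Path


def load_cache(path: Path) -> dict:
    """Load the cache, treating a missing or corrupt file as empty."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_cache(path: Path, cache: dict) -> None:
    """Write the whole cache back to disk as readable JSON."""
    path.write_text(
        json.dumps(cache, indent=2, ensure_ascii=False), encoding="utf-8"
    )


def clear_cache(path: Path) -> None:
    """Delete the cache file; do nothing if it does not exist."""
    path.unlink(missing_ok=True)
```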

CLI Options

Option                        Default                     Description
folder                        (required)                  Directory containing book files
--dry-run                     False                       Preview changes without renaming
--workers N                   4                           Number of parallel threads
--confidence-threshold FLOAT  0.6                         Min similarity score to use Google result
--verbose                     False                       Enable debug logging
--cache-file PATH             ~/.cleanmybooks_cache.json  Custom cache file location
--clear-cache                                             Clear cache and exit
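
The options above map naturally onto an `argparse` parser. The sketch below mirrors the documented flags and defaults; it is an illustration rather than the package's actual entry point, and `build_parser` is a hypothetical name. `folder` is made optional here only because `cleanmybooks --clear-cache` is shown running without it.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented CLI options."""
    parser = argparse.ArgumentParser(prog="cleanmybooks")
    # Assumption: optional so that --clear-cache can be used on its own.
    parser.add_argument("folder", nargs="?", type=Path,
                        help="Directory containing book files")
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview changes without renaming")
    parser.add_argument("--workers", type=int, default=4, metavar="N",
                        help="Number of parallel threads")
    parser.add_argument("--confidence-threshold", type=float, default=0.6,
                        metavar="FLOAT",
                        help="Min similarity score to use Google result")
    parser.add_argument("--verbose", action="store_true",
                        help="Enable debug logging")
    # Left unexpanded; call .expanduser() on the value before using it.
    parser.add_argument("--cache-file", type=Path, metavar="PATH",
                        default=Path("~/.cleanmybooks_cache.json"),
                        help="Custom cache file location")
    parser.add_argument("--clear-cache", action="store_true",
                        help="Clear cache and exit")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```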

Example Input/Output

Original Filename                    Cleaned Filename
python.crash.course.2ndEd.pdf        Eric Matthes - Python Crash Course (2019).pdf
DUNE_frank_herbert_scanned.epub      Frank Herbert - Dune (1965).epub
clean_code_uncle_bob.pdf             Robert C. Martin - Clean Code (2008).pdf
atomic_habits_james_clear_2018.epub  James Clear - Atomic Habits (2018).epub
unknown_book_v3_FINAL.pdf            Unknown Author - Unknown Title.pdf (graceful fallback)

Output Format

Author - Title (Year).ext

Multi-author books are collapsed to:

First Author et al. - Title (Year).ext
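
As a rough illustration of the format rules above, the following sketch builds a filename from parsed metadata. `format_filename` and its parameters are hypothetical, and the set of stripped characters is an assumption about what "sanitize" means here.

```python
import re


def format_filename(authors, title, year, ext):
    """Build 'Author - Title (Year).ext' from parsed metadata.

    Missing fields fall back to the 'Unknown Author' / 'Unknown Title'
    placeholders shown in the examples; the year is omitted if unknown.
    """
    author = authors[0] if authors else "Unknown Author"
    if len(authors) > 1:
        author += " et al."
    name = f"{author} - {title or 'Unknown Title'}"
    if year:
        name += f" ({year})"
    # Assumption: drop characters that are invalid in filenames on
    # common platforms, then collapse any leftover runs of whitespace.
    name = re.sub(r'[<>:"/\\|?*]', "", name)
    name = re.sub(r"\s+", " ", name).strip()
    return name + ext


print(format_filename(["Eric Matthes"], "Python Crash Course", 2019, ".pdf"))
```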

How It Works

filename.pdf
    │
    ▼
[LLM via OpenRouter]
    │  Parse: title, authors, year
    ▼
[Google Books API]
    │  Verify and enrich metadata
    ▼
[Confidence Score]
    │  Token-overlap similarity (Jaccard)
    │  ≥ threshold → use Google result
    │  < threshold → fall back to LLM result
    ▼
[Rename]
    │  Sanitize characters
    │  Resolve collisions
    │  Author - Title (Year).ext
    ▼
[Cache] → skip on next run
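
The confidence step in the diagram can be sketched as a token-overlap (Jaccard) comparison. Which fields CleanMyBooks actually compares is not stated here, so this example assumes the LLM-parsed title is compared against the Google Books title; `tokens`, `jaccard`, and `choose_metadata` are hypothetical names.

```python
import re


def tokens(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def choose_metadata(llm: dict, google, threshold: float = 0.6) -> dict:
    """Use the Google result at or above the threshold, else the LLM result."""
    if google is None:
        return llm
    score = jaccard(llm["title"], google["title"])
    return google if score >= threshold else llm


# 2 shared tokens out of 8 distinct tokens gives a score of 0.25.
print(jaccard("Clean Code",
              "Clean Code: A Handbook of Agile Software Craftsmanship"))
```

In this sketch a long subtitle on the Google side pulls the score down sharply, which is one reason a user might want to lower the threshold.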

Environment Variables

Variable            Description
OPENROUTER_API_KEY  Required. Your OpenRouter API key

License

MIT
