A professional EPUB structure and content extraction library with Dublin Core metadata support.

EpubSage

EpubSage is a Python library and CLI tool for extracting structured content, metadata, and images from EPUB files. It handles the differences between publisher formats and exposes a single, unified API.

Why EpubSage?

EPUB files vary significantly between publishers: headers can be nested in <span> tags, chapters can be split across files, and metadata formats differ widely. EpubSage abstracts this complexity:

from epub_sage import process_epub

result = process_epub("book.epub")
print(f"Title: {result.title}")
print(f"Chapters: {result.total_chapters}")

That's it. One function call to extract everything.

Features

| Feature | Description |
|---|---|
| Publisher-Agnostic | Works with O'Reilly, Packt, Manning, and more |
| Complete Extraction | Chapters, metadata, images, word counts, reading time |
| TOC-Based Extraction | Precise section splitting using TOC anchor boundaries |
| Smart Image Handling | Discovers and validates all referenced images |
| Content Classification | Identifies front matter, chapters, back matter, parts |
| Dublin Core Metadata | Full standards-compliant metadata extraction |
| TOC Parsing | Supports NCX (EPUB 2) and NAV (EPUB 3) |
| Full-Text Search | Search across all book content |
| CLI Tool | 13 commands for complete EPUB analysis |
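
For readers unfamiliar with the NCX format mentioned in the table, here is a minimal sketch of how an EPUB 2 table of contents can be read with the standard library alone. It is not EpubSage's implementation; the function name `parse_ncx` and the returned keys (`title`, `href`, `children`) are assumptions made for this illustration. The `#fragment` part of each `href` is the kind of anchor a TOC-based splitter can use as a section boundary.

```python
import xml.etree.ElementTree as ET

# Namespace used by EPUB 2 NCX documents.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def parse_ncx(ncx_xml):
    """Return the nested navPoint tree of an NCX document (hypothetical helper)."""
    root = ET.fromstring(ncx_xml)

    def walk(parent):
        entries = []
        # Only direct navPoint children, so nesting is preserved.
        for point in parent.findall("ncx:navPoint", NCX_NS):
            label = point.findtext(
                "ncx:navLabel/ncx:text", default="", namespaces=NCX_NS
            )
            content = point.find("ncx:content", NCX_NS)
            entries.append(
                {
                    "title": label.strip(),
                    "href": content.get("src", "") if content is not None else "",
                    "children": walk(point),
                }
            )
        return entries

    nav_map = root.find("ncx:navMap", NCX_NS)
    return walk(nav_map) if nav_map is not None else []
```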

Requirements

  • Python 3.10+
  • Dependencies: beautifulsoup4, lxml, pydantic, typer, rich

Installation

pip install epubsage

Or with uv:

uv add epubsage

Quick Start

Python

from epub_sage import process_epub

result = process_epub("book.epub")

print(f"Title: {result.title}")
print(f"Author: {result.author}")
print(f"Words: {result.total_words:,}")
print(f"Reading time: {result.estimated_reading_time}")

for chapter in result.chapters[:3]:
    print(f"  {chapter['chapter_id']}: {chapter['title']}")

Command Line

epub-sage info book.epub

Command Line Interface

EpubSage includes 13 commands for complete EPUB analysis.

epub-sage --help

| Command | Description |
|---|---|
| info | Quick book summary |
| stats | Detailed statistics |
| chapters | List chapters with word counts |
| metadata | Dublin Core metadata |
| toc | Table of contents |
| images | Image distribution |
| search | Full-text search |
| validate | Validate EPUB structure |
| spine | Reading order |
| manifest | All EPUB resources |
| extract | Export to JSON |
| list | Raw EPUB contents |
| cover | Extract cover image |

View full CLI documentation →

Key Commands

chapters

epub-sage chapters book.epub

search

epub-sage search book.epub "machine learning"

extract

epub-sage extract book.epub -o output.json

Python Library

Basic Processing

from epub_sage import process_epub

result = process_epub("book.epub")

if result.success:
    print(f"Title: {result.title}")
    print(f"Author: {result.author}")
    print(f"Chapters: {result.total_chapters}")
else:
    print(f"Errors: {result.errors}")
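
When processing many books, the success check above can be wrapped in a small helper. The sketch below is one possible shape; `summarize` is a hypothetical name, and it assumes `result.errors` is an iterable of messages, which this README does not specify. A `SimpleNamespace` stands in for the real result so the snippet runs without an EPUB on disk.

```python
from types import SimpleNamespace


def summarize(result):
    """Build a one-line summary from a process_epub-style result object."""
    if not result.success:
        # Assumption: errors is an iterable of strings or exceptions.
        return "FAILED: " + "; ".join(str(e) for e in result.errors)
    return f"{result.title} by {result.author} ({result.total_chapters} chapters)"


# Stand-in object exposing the attribute names used in the example above.
demo = SimpleNamespace(
    success=True,
    title="Example Book",
    author="A. Writer",
    total_chapters=12,
    errors=[],
)
print(summarize(demo))
```

With a real file, `summarize(process_epub("book.epub"))` would be the equivalent call.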

Iterate Chapters

for chapter in result.chapters:
    print(f"{chapter['chapter_id']}: {chapter['title']}")
    print(f"  Words: {chapter['word_count']}")
    print(f"  Images: {len(chapter['images'])}")
    print(f"  Type: {chapter['content_type']}")
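
Because each chapter is a plain dictionary, aggregate statistics need only the standard library. The following sketch totals word counts per `content_type`; `words_by_type` is a hypothetical helper, and the sample list only mimics the keys shown in the loop above.

```python
from collections import Counter


def words_by_type(chapters):
    """Sum word_count for each content_type found in a chapter list."""
    totals = Counter()
    for chapter in chapters:
        totals[chapter["content_type"]] += chapter["word_count"]
    return dict(totals)


# Sample data shaped like result.chapters (values are made up).
sample_chapters = [
    {"content_type": "front_matter", "word_count": 300},
    {"content_type": "chapter", "word_count": 4200},
    {"content_type": "chapter", "word_count": 3900},
]
print(words_by_type(sample_chapters))
```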

Access Metadata

metadata = result.full_metadata

print(f"Title: {metadata.title}")
print(f"Publisher: {metadata.publisher}")
print(f"ISBN: {metadata.get_isbn()}")
print(f"Publication Date: {metadata.get_publication_date()}")

Extract Images

for chapter in result.chapters:
    if chapter['images']:
        print(f"Chapter: {chapter['title']}")
        for img in chapter['images']:
            print(f"  - {img}")
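
If the same image is referenced from several chapters, the loop above prints it more than once. A minimal sketch for collecting each path once, together with the first chapter that uses it, could look like this. `unique_images` is a hypothetical helper, and it assumes image entries are path strings.

```python
def unique_images(chapters):
    """Map each image path to the title of the first chapter referencing it."""
    first_seen = {}
    for chapter in chapters:
        for img in chapter["images"]:
            # setdefault keeps the earliest chapter for a repeated path.
            first_seen.setdefault(img, chapter["title"])
    return first_seen


# Sample data shaped like result.chapters (values are made up).
sample = [
    {"title": "Intro", "images": ["images/fig1.png", "images/logo.png"]},
    {"title": "Basics", "images": ["images/logo.png", "images/fig2.png"]},
]
for path, title in unique_images(sample).items():
    print(f"{path} (first used in {title})")
```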

Content Blocks

chapter = result.chapters[0]

for block in chapter['content']:
    print(f"[{block['tag']}] {block['text'][:100]}...")
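
Content blocks also make simple custom queries easy to write. The sketch below is a naive case-insensitive substring search over the `tag` and `text` keys used above. It is not the library's own search implementation, and `search_blocks` is a hypothetical name.

```python
def search_blocks(chapters, query):
    """Return (chapter title, tag, text) for every block containing query."""
    needle = query.casefold()
    hits = []
    for chapter in chapters:
        for block in chapter["content"]:
            if needle in block["text"].casefold():
                hits.append((chapter["title"], block["tag"], block["text"]))
    return hits


# Sample data shaped like result.chapters (values are made up).
sample = [
    {
        "title": "Intro",
        "content": [
            {"tag": "h1", "text": "Welcome"},
            {"tag": "p", "text": "Machine Learning basics"},
        ],
    }
]
print(search_blocks(sample, "machine learning"))
```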

View full API documentation →

View real-world examples →

Output Format

SimpleEpubResult

| Field | Type | Description |
|---|---|---|
| title | str | Book title |
| author | str | Primary author |
| publisher | str | Publisher name |
| chapters | list[dict] | Chapter data |
| total_chapters | int | Chapter count |
| total_words | int | Word count |
| estimated_reading_time | dict | `{'hours': N, 'minutes': N}` |
| success | bool | Processing status |
| full_metadata | DublinCoreMetadata | Complete metadata |
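
Since `estimated_reading_time` is a dictionary rather than a string, it usually needs formatting before display. A minimal sketch, assuming both values are integers; `format_reading_time` is a hypothetical helper, not part of the library.

```python
def format_reading_time(reading_time):
    """Turn {'hours': N, 'minutes': N} into a short label such as '2h 05m'."""
    hours = reading_time.get("hours", 0)
    minutes = reading_time.get("minutes", 0)
    if hours:
        return f"{hours}h {minutes:02d}m"
    return f"{minutes}m"


print(format_reading_time({"hours": 2, "minutes": 5}))
```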

Chapter Dictionary

| Field | Type | Description |
|---|---|---|
| chapter_id | int | Sequential ID |
| title | str | Chapter title |
| word_count | int | Words in chapter |
| images | list[str] | Image paths |
| content | list[dict] | Content blocks |
| sections | list[dict] | TOC-based sections with nested subsections |
| content_type | str | chapter, front_matter, back_matter, part |
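
For code that consumes chapter dictionaries, the table can be mirrored as a `TypedDict` so a type checker flags misspelled keys. This is only a sketch derived from the table; `ChapterDict`, `ContentType`, and `missing_fields` are hypothetical names and not the library's own models.

```python
from typing import Literal, TypedDict

ContentType = Literal["chapter", "front_matter", "back_matter", "part"]


class ChapterDict(TypedDict):
    """Typed view of the chapter dictionary fields listed in the table."""

    chapter_id: int
    title: str
    word_count: int
    images: list[str]
    content: list[dict]
    sections: list[dict]
    content_type: ContentType


def missing_fields(chapter):
    """Return the documented field names that are absent from a dict."""
    return set(ChapterDict.__annotations__) - set(chapter)


print(sorted(missing_fields({"chapter_id": 1, "title": "Intro"})))
```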

View complete data models →

Architecture

epub_sage/
├── core/           # Parsers (Dublin Core, Structure, TOC)
├── extractors/     # EPUB handling, content extraction
├── processors/     # Processing pipelines
├── models/         # Pydantic data models
├── services/       # Search, export services
└── cli.py          # Command-line interface

Processing Pipeline:

  1. EpubExtractor → Unzips EPUB
  2. DublinCoreParser → Extracts metadata
  3. EpubStructureParser → Analyzes structure
  4. ContentExtractor → Extracts text & images
  5. SimpleEpubProcessor → Orchestrates all steps
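
To make the first two pipeline steps concrete, the sketch below opens an EPUB as a ZIP archive, follows `META-INF/container.xml` to the OPF package file, and reads two Dublin Core fields. It uses only the standard library and is not the code behind `EpubExtractor` or `DublinCoreParser`; `read_basic_metadata` and its return keys are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
import zipfile

# Standard namespaces for the container file, the OPF package, and Dublin Core.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_basic_metadata(epub):
    """Read title and creator from an EPUB given as a path or file-like object."""
    with zipfile.ZipFile(epub) as zf:
        # Step 1: the container file points at the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        if rootfile is None or not rootfile.get("full-path"):
            raise ValueError("container.xml does not name an OPF file")
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    # Step 2: Dublin Core elements live inside the OPF <metadata> element.
    metadata = package.find("opf:metadata", NS)
    if metadata is None:
        return {"opf_path": opf_path, "title": "", "creator": ""}
    return {
        "opf_path": opf_path,
        "title": metadata.findtext("dc:title", default="", namespaces=NS),
        "creator": metadata.findtext("dc:creator", default="", namespaces=NS),
    }
```

Calling `read_basic_metadata("book.epub")` on a valid file returns a small dictionary. The library's own pipeline goes further, with structure analysis and content extraction.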

Development

Setup

git clone https://github.com/Abdullah-Wex/epubsage.git
cd epubsage
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

Commands

make test      # Run 60+ tests
make format    # Format code
make lint      # Check quality

Running Tests

PYTHONPATH="$PWD" .venv/bin/python -m pytest tests/ -v

Documentation

| Document | Description |
|---|---|
| CLI Reference | Complete CLI documentation |
| API Reference | Python API documentation |
| Examples | Real-world use cases |

License

MIT License. See LICENSE for details.

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.


EpubSage — Extract. Analyze. Build.
