A professional EPUB structure and content extraction library with Dublin Core metadata support.

EpubSage

EpubSage is a powerful Python library and CLI tool for extracting structured content, metadata, and images from EPUB files. It handles the complexity of diverse publisher formats and provides a clean, unified API.

Why EpubSage?

EPUB files vary significantly between publishers. Headings can be nested in <span> tags, chapters can be split across several files, and metadata formats differ widely. EpubSage abstracts this complexity:

from epub_sage import process_epub

result = process_epub("book.epub")
print(f"Title: {result.title}")
print(f"Chapters: {result.total_chapters}")

That's it. One function call to extract everything.

Features

Feature	Description
Publisher-Agnostic	Works with O'Reilly, Packt, Manning, and more
Complete Extraction	Chapters, metadata, images, word counts, reading time
TOC-Based Extraction	Precise section splitting using TOC anchor boundaries
Smart Image Handling	Discovers and validates all referenced images
Content Classification	Identifies front matter, chapters, back matter, parts
Dublin Core Metadata	Full standards-compliant metadata extraction
TOC Parsing	Supports NCX (EPUB 2) and NAV (EPUB 3)
Full-Text Search	Search across all book content
CLI Tool	13 commands for complete EPUB analysis

Requirements

Python 3.10+
Dependencies: beautifulsoup4, lxml, pydantic, typer, rich

Installation

pip install epubsage

Or with uv:

uv add epubsage

Quick Start

Python

from epub_sage import process_epub

result = process_epub("book.epub")

print(f"Title: {result.title}")
print(f"Author: {result.author}")
print(f"Words: {result.total_words:,}")
print(f"Reading time: {result.estimated_reading_time}")

for chapter in result.chapters[:3]:
    print(f"  {chapter['chapter_id']}: {chapter['title']}")

Command Line

epub-sage info book.epub

Command Line Interface

EpubSage includes 13 commands for complete EPUB analysis.

epub-sage --help

Command	Description
`info`	Quick book summary
`stats`	Detailed statistics
`chapters`	List chapters with word counts
`metadata`	Dublin Core metadata
`toc`	Table of contents
`images`	Image distribution
`search`	Full-text search
`validate`	Validate EPUB structure
`spine`	Reading order
`manifest`	All EPUB resources
`extract`	Export to JSON
`list`	Raw EPUB contents
`cover`	Extract cover image

View full CLI documentation →

Key Commands

chapters

epub-sage chapters book.epub

search

epub-sage search book.epub "machine learning"
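
To give a feel for what a full-text search over extracted content involves, here is a minimal sketch that scans chapter content blocks for a case-insensitive substring. It is not EpubSage's search implementation; `search_chapters`, its `context` parameter, and the sample data are hypothetical and only follow the chapter and block shapes documented in this README.

```python
def search_chapters(chapters, query, context=30):
    """Return (chapter title, snippet) pairs for blocks containing the query."""
    needle = query.lower()
    hits = []
    for chapter in chapters:
        for block in chapter["content"]:
            text = block["text"]
            pos = text.lower().find(needle)
            if pos == -1:
                continue
            # Keep a little text on both sides of the match for readability.
            start = max(0, pos - context)
            end = min(len(text), pos + len(needle) + context)
            hits.append((chapter["title"], text[start:end]))
    return hits


# Made-up sample data, not real EpubSage output.
sample_chapters = [
    {"title": "Introduction",
     "content": [{"tag": "p", "text": "This book covers machine learning basics."}]},
    {"title": "Appendix",
     "content": [{"tag": "p", "text": "Installation notes."}]},
]

hits = search_chapters(sample_chapters, "Machine Learning", context=10)
for title, snippet in hits:
    print(f"{title}: ...{snippet}...")
```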

extract

epub-sage extract book.epub -o output.json

Python Library

Basic Processing

from epub_sage import process_epub

result = process_epub("book.epub")

if result.success:
    print(f"Title: {result.title}")
    print(f"Author: {result.author}")
    print(f"Chapters: {result.total_chapters}")
else:
    print(f"Errors: {result.errors}")

Iterate Chapters

for chapter in result.chapters:
    print(f"{chapter['chapter_id']}: {chapter['title']}")
    print(f"  Words: {chapter['word_count']}")
    print(f"  Images: {len(chapter['images'])}")
    print(f"  Type: {chapter['content_type']}")
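
Because each chapter is a plain dictionary, ordinary Python is enough to aggregate over them. The sketch below totals word counts per `content_type`. The sample list is made up and only mirrors the documented fields; with a real book you would pass `result.chapters` instead. `words_by_content_type` is a hypothetical helper, not part of EpubSage.

```python
from collections import defaultdict

# Made-up sample data shaped like the chapter dictionaries described here.
sample_chapters = [
    {"chapter_id": 0, "title": "Preface", "word_count": 850,
     "images": [], "content_type": "front_matter"},
    {"chapter_id": 1, "title": "Getting Started", "word_count": 4200,
     "images": ["images/fig1.png"], "content_type": "chapter"},
    {"chapter_id": 2, "title": "Going Deeper", "word_count": 5100,
     "images": [], "content_type": "chapter"},
    {"chapter_id": 3, "title": "Index", "word_count": 1300,
     "images": [], "content_type": "back_matter"},
]


def words_by_content_type(chapters):
    """Sum word_count for each content_type value."""
    totals = defaultdict(int)
    for chapter in chapters:
        totals[chapter["content_type"]] += chapter["word_count"]
    return dict(totals)


totals = words_by_content_type(sample_chapters)
for content_type, words in sorted(totals.items()):
    print(f"{content_type}: {words:,} words")
```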

Access Metadata

metadata = result.full_metadata

print(f"Title: {metadata.title}")
print(f"Publisher: {metadata.publisher}")
print(f"ISBN: {metadata.get_isbn()}")
print(f"Publication Date: {metadata.get_publication_date()}")

Extract Images

for chapter in result.chapters:
    if chapter['images']:
        print(f"Chapter: {chapter['title']}")
        for img in chapter['images']:
            print(f"  - {img}")
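
If the same image is referenced from several chapters, the loop above prints it more than once. A small sketch of one way to build a deduplicated index from image path to the chapters that use it follows. `image_usage` is a hypothetical helper and the sample data is invented; only the `title` and `images` fields come from the chapter dictionary documented in this README.

```python
def image_usage(chapters):
    """Map each image path to the titles of the chapters that reference it."""
    usage = {}
    for chapter in chapters:
        for img in chapter["images"]:
            titles = usage.setdefault(img, [])
            if chapter["title"] not in titles:
                titles.append(chapter["title"])
    return usage


# Made-up sample data, not real EpubSage output.
sample_chapters = [
    {"title": "Getting Started", "images": ["images/logo.png", "images/fig1.png"]},
    {"title": "Going Deeper", "images": ["images/logo.png"]},
    {"title": "Index", "images": []},
]

usage = image_usage(sample_chapters)
for path, titles in usage.items():
    print(f"{path}: used in {', '.join(titles)}")
```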

Content Blocks

chapter = result.chapters[0]

for block in chapter['content']:
    print(f"[{block['tag']}] {block['text'][:100]}...")
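
Content blocks can also be flattened back into plain text, for example before feeding a chapter to another tool. The sketch below assumes that `tag` holds an HTML element name such as `h1` or `p`, which this README does not state explicitly; `blocks_to_text` and the sample blocks are hypothetical.

```python
# Assumption: block["tag"] is an HTML element name such as "h1" or "p".
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def blocks_to_text(blocks):
    """Join non-empty blocks with blank lines, upper-casing headings."""
    lines = []
    for block in blocks:
        text = block["text"].strip()
        if not text:
            continue  # skip blocks that carry no visible text
        if block["tag"] in HEADING_TAGS:
            text = text.upper()
        lines.append(text)
    return "\n\n".join(lines)


# Made-up sample blocks, not real EpubSage output.
sample_blocks = [
    {"tag": "h1", "text": "Getting Started"},
    {"tag": "p", "text": "  EPUB is a ZIP-based format. "},
    {"tag": "p", "text": ""},
]

plain_text = blocks_to_text(sample_blocks)
print(plain_text)
```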

View full API documentation →

View real-world examples →

Output Format

SimpleEpubResult

Field	Type	Description
`title`	`str`	Book title
`author`	`str`	Primary author
`publisher`	`str`	Publisher name
`chapters`	`list[dict]`	Chapter data
`total_chapters`	`int`	Chapter count
`total_words`	`int`	Word count
`estimated_reading_time`	`dict`	`{'hours': N, 'minutes': N}`
`success`	`bool`	Processing status
`full_metadata`	`DublinCoreMetadata`	Complete metadata
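
Since `estimated_reading_time` is a dictionary rather than a string, printing it directly shows the raw `{'hours': N, 'minutes': N}` form. A minimal sketch of a formatter follows; `format_reading_time` is a hypothetical helper, not part of EpubSage, and the values passed to it are made up.

```python
def format_reading_time(reading_time):
    """Render {'hours': N, 'minutes': N} as a short human-readable string."""
    hours = int(reading_time.get("hours", 0))
    minutes = int(reading_time.get("minutes", 0))
    parts = []
    if hours:
        parts.append(f"{hours} h")
    # Always show minutes when there are no hours, so the result is never empty.
    if minutes or not parts:
        parts.append(f"{minutes} min")
    return " ".join(parts)


print(format_reading_time({"hours": 5, "minutes": 42}))
```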

Chapter Dictionary

Field	Type	Description
`chapter_id`	`int`	Sequential ID
`title`	`str`	Chapter title
`word_count`	`int`	Words in chapter
`images`	`list[str]`	Image paths
`content`	`list[dict]`	Content blocks
`sections`	`list[dict]`	TOC-based sections with nested `subsections`
`content_type`	`str`	`chapter`, `front_matter`, `back_matter`, `part`
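
The nested `subsections` lists lend themselves to a recursive walk. The sketch below builds an indented outline from a `sections` list. Only the `subsections` key is taken from the table above; the `title` key on each section, the `outline` helper, and the sample data are assumptions made for illustration.

```python
def outline(sections, depth=0):
    """Return indented outline lines for sections and their subsections."""
    lines = []
    for section in sections:
        # Assumption: each section dictionary carries a "title" key.
        lines.append("  " * depth + section.get("title", "(untitled)"))
        lines.extend(outline(section.get("subsections", []), depth + 1))
    return lines


# Made-up sample data, not real EpubSage output.
sample_sections = [
    {"title": "Installation", "subsections": [
        {"title": "Using pip", "subsections": []},
        {"title": "Using uv", "subsections": []},
    ]},
    {"title": "First Steps", "subsections": []},
]

outline_lines = outline(sample_sections)
print("\n".join(outline_lines))
```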

View complete data models →

Architecture

epub_sage/
├── core/           # Parsers (Dublin Core, Structure, TOC)
├── extractors/     # EPUB handling, content extraction
├── processors/     # Processing pipelines
├── models/         # Pydantic data models
├── services/       # Search, export services
└── cli.py          # Command-line interface

Processing Pipeline:

EpubExtractor → Unzips EPUB
DublinCoreParser → Extracts metadata
EpubStructureParser → Analyzes structure
ContentExtractor → Extracts text & images
SimpleEpubProcessor → Orchestrates all steps
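
To make the first few steps concrete, here is a standard-library sketch of what unzipping an EPUB and reading its Dublin Core metadata and spine involves. It does not use or reproduce EpubSage's classes; `read_epub_basics`, `build_sample_epub`, and the tiny in-memory book are hypothetical and exist only so the example runs on its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

CONTAINER_XML = """<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF_XML = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="id">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Sample Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
  </metadata>
  <manifest>
    <item id="ch1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
  </spine>
</package>"""


def build_sample_epub():
    """Create a minimal in-memory EPUB so the example needs no file on disk."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", OPF_XML)
        zf.writestr("OEBPS/ch1.xhtml", "<html><body><h1>One</h1></body></html>")
    buf.seek(0)
    return buf


def read_epub_basics(source):
    """Return title, creator and spine hrefs from an EPUB path or file object."""
    with zipfile.ZipFile(source) as zf:
        # Step 1: container.xml points at the package (OPF) document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).attrib["full-path"]

        # Step 2: Dublin Core elements live in the OPF metadata block.
        package = ET.fromstring(zf.read(opf_path))
        title = package.findtext(".//dc:title", default="", namespaces=NS)
        creator = package.findtext(".//dc:creator", default="", namespaces=NS)

        # Step 3: the spine lists manifest item ids in reading order.
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in package.findall(".//opf:manifest/opf:item", NS)
        }
        spine = [
            manifest[ref.attrib["idref"]]
            for ref in package.findall(".//opf:spine/opf:itemref", NS)
        ]
    return {"title": title, "creator": creator, "spine": spine}


basics = read_epub_basics(build_sample_epub())
print(basics)
```

Real books add many complications that this sketch ignores, such as multiple creators, missing metadata, and content split across files, which is the kind of variation the pipeline above is meant to absorb.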

Development

Setup

git clone https://github.com/Abdullah-Wex/epubsage.git
cd epubsage
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

Commands

make test      # Run 60+ tests
make format    # Format code
make lint      # Check quality

Running Tests

PYTHONPATH="$PWD" .venv/bin/python -m pytest tests/ -v

Documentation

Document	Description
CLI Reference	Complete CLI documentation
API Reference	Python API documentation
Examples	Real-world use cases

License

MIT License. See LICENSE for details.

Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

EpubSage — Extract. Analyze. Build.

Release history

Version	Release date
0.3.0 (this version)	Jan 9, 2026
0.2.0	Jan 3, 2026
0.1.1	Dec 30, 2025
0.1.0	Dec 30, 2025

Download files

Source Distribution

epubsage-0.3.0.tar.gz (51.3 kB)

Uploaded Jan 9, 2026 Source

Built Distribution

epubsage-0.3.0-py3-none-any.whl (75.3 kB)

Uploaded Jan 9, 2026 Python 3

File details

Details for the file epubsage-0.3.0.tar.gz.

File metadata

Download URL: epubsage-0.3.0.tar.gz
Upload date: Jan 9, 2026
Size: 51.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for epubsage-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`9f8abd4bfb38bdf251348a4432ad5ff6c2412c2230258e712840d203b74e541f`
MD5	`b4bf54ba350eefc2f784f2dfd04850c6`
BLAKE2b-256	`6eae913cab9b9d8c66385c4202bc771170d5d9f38ab4fa5b27dffbd811bd1832`

Provenance

The following attestation bundles were made for epubsage-0.3.0.tar.gz:

Publisher: publish.yml on Abdullah-Wex/epubsage

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: epubsage-0.3.0.tar.gz
- Subject digest: 9f8abd4bfb38bdf251348a4432ad5ff6c2412c2230258e712840d203b74e541f
- Sigstore transparency entry: 809314841
- Sigstore integration time: Jan 9, 2026
Source repository:
- Permalink: Abdullah-Wex/epubsage@91e8ca0ee35acc63977e94829f5726af3faa7209
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/Abdullah-Wex
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@91e8ca0ee35acc63977e94829f5726af3faa7209
- Trigger Event: release

File details

Details for the file epubsage-0.3.0-py3-none-any.whl.

File metadata

Download URL: epubsage-0.3.0-py3-none-any.whl
Upload date: Jan 9, 2026
Size: 75.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for epubsage-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`129c42e81b8c8a48b70e22e046f0ed316d97d9948d7d1a57e6f7b2206e99e20e`
MD5	`c675be6b703d4a0be62b50562239e8c0`
BLAKE2b-256	`5016c052affab5544c686db38b999970ac3e6d84871bf0093c0d8303c7ec0071`

Provenance

The following attestation bundles were made for epubsage-0.3.0-py3-none-any.whl:

Publisher: publish.yml on Abdullah-Wex/epubsage

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: epubsage-0.3.0-py3-none-any.whl
- Subject digest: 129c42e81b8c8a48b70e22e046f0ed316d97d9948d7d1a57e6f7b2206e99e20e
- Sigstore transparency entry: 809314845
- Sigstore integration time: Jan 9, 2026
Source repository:
- Permalink: Abdullah-Wex/epubsage@91e8ca0ee35acc63977e94829f5726af3faa7209
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/Abdullah-Wex
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@91e8ca0ee35acc63977e94829f5726af3faa7209
- Trigger Event: release
