A professional EPUB structure and content extraction library with Dublin Core metadata support.
EpubSage
EpubSage is a powerful Python library and CLI tool for extracting structured content, metadata, and images from EPUB files. It handles the complexity of diverse publisher formats and provides a clean, unified API.
Why EpubSage?
EPUB files vary significantly between publishers: headings may be nested in <span> tags, chapters may be split across several files, and metadata formats differ widely. EpubSage abstracts this complexity:
from epub_sage import process_epub
result = process_epub("book.epub")
print(f"Title: {result.title}")
print(f"Chapters: {result.total_chapters}")
That's it. One function call to extract everything.
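To make the publisher variation concrete, here is a minimal standard-library sketch of one such case: recovering clean heading text when the title is wrapped in nested `<span>` and `<em>` tags. The `HeadingCollector` class and `extract_headings` function are hypothetical names for illustration only and are not part of EpubSage, whose internal approach may differ.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the visible text of h1-h6 elements, ignoring inline wrappers."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []    # finished (tag, text) pairs
        self._current = None  # heading tag currently open, if any
        self._parts = []      # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        # Only track the outermost heading; spans and em tags are skipped.
        if tag in self.HEADING_TAGS and self._current is None:
            self._current = tag
            self._parts = []

    def handle_data(self, data):
        if self._current is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self._parts).split())
            self.headings.append((tag, text))
            self._current = None


def extract_headings(html):
    """Return a list of (tag, text) pairs for every heading in `html`."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = (
    '<h1><span class="num">1</span> '
    "<span>Getting <em>Started</em></span></h1><p>Body text</p>"
)
print(extract_headings(sample))
```

The body paragraph is ignored because text is only collected while a heading element is open.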
Features
| Feature | Description |
|---|---|
| Publisher-Agnostic | Works with O'Reilly, Packt, Manning, and more |
| Complete Extraction | Chapters, metadata, images, word counts, reading time |
| TOC-Based Extraction | Precise section splitting using TOC anchor boundaries |
| Smart Image Handling | Discovers and validates all referenced images |
| Content Classification | Identifies front matter, chapters, back matter, parts |
| Dublin Core Metadata | Full standards-compliant metadata extraction |
| TOC Parsing | Supports NCX (EPUB 2) and NAV (EPUB 3) |
| Full-Text Search | Search across all book content |
| CLI Tool | 13 commands for complete EPUB analysis |
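As a rough illustration of what NCX-based TOC parsing involves, the sketch below walks the `navPoint` elements of an EPUB 2 `toc.ncx` document with the standard library and splits each target into a file reference and an anchor. The `parse_ncx` function and the sample document are assumptions made for this example; they do not show EpubSage's own parser.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the NCX format used in EPUB 2.
NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def parse_ncx(ncx_xml):
    """Flatten an NCX navMap into a list of TOC entries with depth."""
    root = ET.fromstring(ncx_xml)
    entries = []

    def walk(parent, depth):
        # Visit direct navPoint children, then recurse into nested ones.
        for point in parent.findall("ncx:navPoint", NCX_NS):
            label = point.findtext(
                "ncx:navLabel/ncx:text", default="", namespaces=NCX_NS
            )
            content = point.find("ncx:content", NCX_NS)
            src = content.get("src", "") if content is not None else ""
            href, _, anchor = src.partition("#")
            entries.append(
                {
                    "title": label.strip(),
                    "href": href,
                    "anchor": anchor or None,
                    "depth": depth,
                }
            )
            walk(point, depth + 1)

    nav_map = root.find("ncx:navMap", NCX_NS)
    if nav_map is not None:
        walk(nav_map, 0)
    return entries


SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="p1" playOrder="1">
      <navLabel><text>Chapter 1</text></navLabel>
      <content src="ch01.xhtml"/>
      <navPoint id="p2" playOrder="2">
        <navLabel><text>Section 1.1</text></navLabel>
        <content src="ch01.xhtml#sec-1-1"/>
      </navPoint>
    </navPoint>
  </navMap>
</ncx>"""

for entry in parse_ncx(SAMPLE_NCX):
    print("  " * entry["depth"] + entry["title"], entry["href"], entry["anchor"])
```

The anchor part of each target is what makes splitting a chapter file into sections possible, since consecutive anchors mark section boundaries.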
Requirements
- Python 3.10+
- Dependencies: beautifulsoup4, lxml, pydantic, typer, rich
Installation
pip install epubsage
Or with uv:
uv add epubsage
Quick Start
Python
from epub_sage import process_epub
result = process_epub("book.epub")
print(f"Title: {result.title}")
print(f"Author: {result.author}")
print(f"Words: {result.total_words:,}")
print(f"Reading time: {result.estimated_reading_time}")
for chapter in result.chapters[:3]:
    print(f" {chapter['chapter_id']}: {chapter['title']}")
Command Line
epub-sage info book.epub
Command Line Interface
EpubSage includes 13 commands for complete EPUB analysis.
epub-sage --help
| Command | Description |
|---|---|
| `info` | Quick book summary |
| `stats` | Detailed statistics |
| `chapters` | List chapters with word counts |
| `metadata` | Dublin Core metadata |
| `toc` | Table of contents |
| `images` | Image distribution |
| `search` | Full-text search |
| `validate` | Validate EPUB structure |
| `spine` | Reading order |
| `manifest` | All EPUB resources |
| `extract` | Export to JSON |
| `list` | Raw EPUB contents |
| `cover` | Extract cover image |
Key Commands
chapters
epub-sage chapters book.epub
search
epub-sage search book.epub "machine learning"
extract
epub-sage extract book.epub -o output.json
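For readers who want search behavior inside their own Python code, a case-insensitive search can be approximated over the chapter dictionaries described under Output Format. This is a minimal sketch, not the implementation behind `epub-sage search`; the `search_chapters` helper and the sample data are hypothetical, and only the `chapter_id`, `title`, `content`, and `text` keys come from the documented chapter structure.

```python
def search_chapters(chapters, query, context=30):
    """Return one hit per content block whose text contains `query`."""
    needle = query.lower()
    hits = []
    for chapter in chapters:
        for block in chapter.get("content", []):
            text = block.get("text", "")
            pos = text.lower().find(needle)
            if pos == -1:
                continue
            # Keep a little surrounding text so the hit is readable.
            start = max(0, pos - context)
            end = min(len(text), pos + len(needle) + context)
            hits.append(
                {
                    "chapter_id": chapter["chapter_id"],
                    "title": chapter["title"],
                    "snippet": text[start:end],
                }
            )
    return hits


# Hypothetical data shaped like the documented chapter dictionaries.
sample_chapters = [
    {
        "chapter_id": 1,
        "title": "Intro",
        "content": [{"tag": "p", "text": "An overview of the book."}],
    },
    {
        "chapter_id": 2,
        "title": "Models",
        "content": [{"tag": "p", "text": "Machine learning models need data."}],
    },
]

for hit in search_chapters(sample_chapters, "machine learning", context=10):
    print(hit["chapter_id"], hit["title"], hit["snippet"])
```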
Python Library
Basic Processing
from epub_sage import process_epub
result = process_epub("book.epub")
if result.success:
    print(f"Title: {result.title}")
    print(f"Author: {result.author}")
    print(f"Chapters: {result.total_chapters}")
else:
    print(f"Errors: {result.errors}")
Iterate Chapters
for chapter in result.chapters:
    print(f"{chapter['chapter_id']}: {chapter['title']}")
    print(f" Words: {chapter['word_count']}")
    print(f" Images: {len(chapter['images'])}")
    print(f" Type: {chapter['content_type']}")
Access Metadata
metadata = result.full_metadata
print(f"Title: {metadata.title}")
print(f"Publisher: {metadata.publisher}")
print(f"ISBN: {metadata.get_isbn()}")
print(f"Publication Date: {metadata.get_publication_date()}")
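EPUB identifiers are not always ISBNs (UUIDs and URLs are common), so code that consumes identifier values often needs to recognize and validate them. The sketch below shows one way to do that with the standard ISBN-10 and ISBN-13 checksums. `normalize_isbn` is a hypothetical helper for illustration; it does not describe how `get_isbn()` works internally.

```python
import re


def _valid_isbn13(digits):
    # Weights alternate 1, 3, 1, 3, ...; the total must be divisible by 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def _valid_isbn10(digits):
    # Weights run 10 down to 1; a trailing "X" stands for the value 10.
    total = sum(
        (10 - i) * (10 if d == "X" else int(d)) for i, d in enumerate(digits)
    )
    return total % 11 == 0


def normalize_isbn(identifier):
    """Return a bare ISBN-10/13 string if `identifier` is a valid one, else None."""
    # Drop an optional "urn:isbn:" or "ISBN" prefix, then spaces and hyphens.
    value = re.sub(r"^(urn:)?isbn:?", "", identifier.strip(), flags=re.IGNORECASE)
    digits = re.sub(r"[\s-]", "", value).upper()
    if re.fullmatch(r"\d{13}", digits) and _valid_isbn13(digits):
        return digits
    if re.fullmatch(r"\d{9}[\dX]", digits) and _valid_isbn10(digits):
        return digits
    return None


print(normalize_isbn("urn:isbn:978-0-306-40615-7"))
print(normalize_isbn("ISBN 0-306-40615-2"))
print(normalize_isbn("urn:uuid:123e4567-e89b-12d3-a456-426614174000"))
```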
Extract Images
for chapter in result.chapters:
    if chapter['images']:
        print(f"Chapter: {chapter['title']}")
        for img in chapter['images']:
            print(f" - {img}")
Content Blocks
chapter = result.chapters[0]
for block in chapter['content']:
    print(f"[{block['tag']}] {block['text'][:100]}...")
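Content blocks can also be flattened into plain text for downstream processing. The following sketch assumes, without confirmation from the documentation above, that `tag` holds HTML element names such as `h2` or `p`; `chapter_to_text` and the sample chapter are hypothetical and not part of EpubSage.

```python
def chapter_to_text(chapter):
    """Join a chapter's content blocks into Markdown-like plain text."""
    lines = []
    for block in chapter.get("content", []):
        # Collapse internal whitespace and skip blocks that are empty.
        text = " ".join(block.get("text", "").split())
        if not text:
            continue
        tag = block.get("tag", "")
        # Assumption: heading blocks carry tags "h1" through "h6".
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            lines.append("#" * int(tag[1]) + " " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)


# Hypothetical chapter shaped like the documented chapter dictionary.
sample_chapter = {
    "title": "Intro",
    "content": [
        {"tag": "h2", "text": "Why  EPUB?"},
        {"tag": "p", "text": "EPUB is a\n container format."},
        {"tag": "p", "text": "   "},
    ],
}

print(chapter_to_text(sample_chapter))
```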
Output Format
SimpleEpubResult
| Field | Type | Description |
|---|---|---|
| `title` | `str` | Book title |
| `author` | `str` | Primary author |
| `publisher` | `str` | Publisher name |
| `chapters` | `list[dict]` | Chapter data |
| `total_chapters` | `int` | Chapter count |
| `total_words` | `int` | Word count |
| `estimated_reading_time` | `dict` | `{'hours': N, 'minutes': N}` |
| `success` | `bool` | Processing status |
| `full_metadata` | `DublinCoreMetadata` | Complete metadata |
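Because `estimated_reading_time` is a dictionary rather than a string, it usually needs formatting before display. Here is a small sketch that relies only on the `{'hours': N, 'minutes': N}` shape listed in the table; `format_reading_time` is a hypothetical helper, not a library function.

```python
def format_reading_time(reading_time):
    """Turn {'hours': N, 'minutes': N} into a short human-readable string."""
    hours = int(reading_time.get("hours", 0))
    minutes = int(reading_time.get("minutes", 0))
    # Normalize defensively in case minutes is 60 or more.
    hours, minutes = hours + minutes // 60, minutes % 60
    if hours and minutes:
        return f"{hours} h {minutes} min"
    if hours:
        return f"{hours} h"
    return f"{minutes} min"


print(format_reading_time({"hours": 3, "minutes": 25}))
print(format_reading_time({"hours": 0, "minutes": 45}))
```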
Chapter Dictionary
| Field | Type | Description |
|---|---|---|
| `chapter_id` | `int` | Sequential ID |
| `title` | `str` | Chapter title |
| `word_count` | `int` | Words in chapter |
| `images` | `list[str]` | Image paths |
| `content` | `list[dict]` | Content blocks |
| `sections` | `list[dict]` | TOC-based sections with nested subsections |
| `content_type` | `str` | `chapter`, `front_matter`, `back_matter`, `part` |
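The `content_type` field makes it straightforward to separate body chapters from front and back matter. As a sketch built on the fields in the table, the hypothetical `words_by_content_type` helper below totals word counts per type; the sample list is invented for illustration.

```python
from collections import defaultdict


def words_by_content_type(chapters):
    """Sum `word_count` per `content_type` across chapter dictionaries."""
    totals = defaultdict(int)
    for chapter in chapters:
        # Treat a missing type as an ordinary chapter (an assumption).
        kind = chapter.get("content_type", "chapter")
        totals[kind] += chapter.get("word_count", 0)
    return dict(totals)


# Hypothetical data shaped like the documented chapter dictionaries.
sample_chapters = [
    {"chapter_id": 1, "title": "Preface", "word_count": 800, "content_type": "front_matter"},
    {"chapter_id": 2, "title": "Basics", "word_count": 5200, "content_type": "chapter"},
    {"chapter_id": 3, "title": "Advanced", "word_count": 6100, "content_type": "chapter"},
    {"chapter_id": 4, "title": "Index", "word_count": 1500, "content_type": "back_matter"},
]

print(words_by_content_type(sample_chapters))
```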
Architecture
epub_sage/
├── core/ # Parsers (Dublin Core, Structure, TOC)
├── extractors/ # EPUB handling, content extraction
├── processors/ # Processing pipelines
├── models/ # Pydantic data models
├── services/ # Search, export services
└── cli.py # Command-line interface
Processing Pipeline:
1. `EpubExtractor` → Unzips EPUB
2. `DublinCoreParser` → Extracts metadata
3. `EpubStructureParser` → Analyzes structure
4. `ContentExtractor` → Extracts text & images
5. `SimpleEpubProcessor` → Orchestrates all steps
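The first stages of a pipeline like this can be sketched with the standard library alone: open the EPUB as a ZIP archive, read `META-INF/container.xml` to find the package (OPF) document, then pull Dublin Core fields and the spine reading order from it. The `read_package` function and the in-memory sample book are assumptions for this example and do not reproduce the classes listed above.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_package(epub_file):
    """Locate the OPF file and return title, creators, and spine paths."""
    with zipfile.ZipFile(epub_file) as zf:
        # container.xml points at the package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    # Dublin Core elements live inside the package metadata.
    metadata = package.find("opf:metadata", OPF_NS)
    title = metadata.findtext("dc:title", default="", namespaces=OPF_NS)
    creators = [el.text or "" for el in metadata.findall("dc:creator", OPF_NS)]

    # The manifest maps ids to files; the spine lists ids in reading order.
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    base = posixpath.dirname(opf_path)
    spine = [
        posixpath.join(base, manifest[ref.get("idref")])
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]
    return {"opf_path": opf_path, "title": title, "creators": creators, "spine": spine}


def build_sample_epub():
    """Create a tiny, hypothetical EPUB in memory for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr(
            "META-INF/container.xml",
            '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">'
            '<rootfiles><rootfile full-path="OEBPS/content.opf" '
            'media-type="application/oebps-package+xml"/></rootfiles></container>',
        )
        zf.writestr(
            "OEBPS/content.opf",
            '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
            '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
            "<dc:title>Sample Book</dc:title><dc:creator>A. Author</dc:creator>"
            "</metadata>"
            '<manifest><item id="c1" href="ch01.xhtml" '
            'media-type="application/xhtml+xml"/></manifest>'
            '<spine><itemref idref="c1"/></spine></package>',
        )
    buf.seek(0)
    return buf


print(read_package(build_sample_epub()))
```

Spine paths are resolved relative to the OPF file's directory, which is why the chapter file is reported under `OEBPS/`.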
Development
Setup
git clone https://github.com/Abdullah-Wex/epubsage.git
cd epubsage
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"
Commands
make test # Run 60+ tests
make format # Format code
make lint # Check quality
Running Tests
PYTHONPATH="$PWD" .venv/bin/python -m pytest tests/ -v
Documentation
| Document | Description |
|---|---|
| CLI Reference | Complete CLI documentation |
| API Reference | Python API documentation |
| Examples | Real-world use cases |
License
MIT License. See LICENSE for details.
Contributing
Contributions welcome! See CONTRIBUTING.md for guidelines.
EpubSage — Extract. Analyze. Build.