A professional EPUB structure and content extraction library with Dublin Core metadata support.
EpubSage
EpubSage is a Python library for extracting structured content and metadata from EPUB files. It handles the complexity of diverse publisher formats (Manning, O'Reilly, Packt, etc.) and provides a clean, unified API for accessing book data.
Why EpubSage?
EPUB files vary significantly between publishers. Headings can be wrapped in <span> tags, chapters can be split across files, and metadata formats differ. EpubSage abstracts this complexity:
- Publisher-Agnostic: Tested against real-world books from major technical publishers.
- Complete Extraction: Returns metadata, chapters, word counts, and reading time estimates.
- Modular: Use the high-level process_epub() function, or access individual parsers directly.
- CLI Included: Extract books from the command line without writing code.
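To make the <span> problem concrete, here is a minimal standard-library sketch of collecting heading text regardless of inline wrappers. It is an illustration only, not EpubSage's implementation; `HeadingCollector` and `extract_headings` are hypothetical names.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of h1-h6 elements, ignoring inline wrappers such as <span>."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []   # finished heading strings
        self._parts = None   # text fragments of the heading being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._parts = []

    def handle_data(self, data):
        # Text inside nested <span> elements still arrives here.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._parts is not None:
            # Join the fragments and collapse runs of whitespace.
            self.headings.append(" ".join("".join(self._parts).split()))
            self._parts = None


def extract_headings(html):
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


sample = '<h1><span class="num">1</span> <span>Getting   Started</span></h1><p>Body</p>'
print(extract_headings(sample))
```

The heading is reassembled from its text fragments, so the markup a publisher puts around the chapter number or title does not affect the result.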
Table of Contents
- Installation
- Quick Start
- API Reference
- Command Line Interface
- Architecture
- Error Handling
- Development
- License
Installation
pip install epubsage
For development with uv:
uv add epubsage
Quick Start
Basic Usage
from epub_sage import process_epub
result = process_epub("my_book.epub")
if result.success:
    print(f"Title: {result.title}")
    print(f"Author: {result.author}")
    print(f"Chapters: {result.total_chapters}")
    print(f"Words: {result.total_words}")
else:
    print(f"Errors: {result.errors}")
Export to JSON
from epub_sage import process_epub, save_to_json
result = process_epub("my_book.epub")
output = {
    "title": result.title,
    "author": result.author,
    "chapters": result.chapters
}
save_to_json(output, "book_data.json")
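This README describes save_to_json as saving JSON "with datetime support". The standard json module rejects datetime values by default, so a helper of this kind typically supplies a fallback serializer. The sketch below shows that idea with the standard library only; `to_jsonable` and `dump_json` are hypothetical names, and the ISO 8601 output format is an assumption, not a statement about what EpubSage writes.

```python
import json
from datetime import date, datetime


def to_jsonable(value):
    """Fallback for json.dumps: turn date/datetime values into ISO 8601 strings."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def dump_json(data):
    # `default` is only called for values json cannot serialize on its own.
    return json.dumps(data, default=to_jsonable, indent=2, ensure_ascii=False)


record = {"title": "Example Book", "published": datetime(2024, 1, 15, 9, 30)}
text = dump_json(record)
print(text)
```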
API Reference
High-Level Functions
| Function | Description |
|---|---|
| process_epub(path) | Process an EPUB file. Returns SimpleEpubResult. |
| quick_extract(path) | Extract EPUB to a directory. Returns path string. |
| get_epub_info(path) | Get file info without extraction. Returns dict. |
| save_to_json(data, path) | Save data to JSON with datetime support. |
| parse_content_opf(path) | Parse .epub or .opf file. Returns ParsedContentOpf. |
Note: All functions accept .epub files directly. No need to extract first!
Classes
SimpleEpubProcessor
Main processing class for full control over the extraction pipeline.
from epub_sage import SimpleEpubProcessor
processor = SimpleEpubProcessor(temp_dir="/tmp/work")
result = processor.process_epub("book.epub", cleanup=True)
# Or process a pre-extracted directory
result = processor.process_directory("/path/to/extracted/")
Methods:
| Method | Description |
|---|---|
| process_epub(path, cleanup=True) | Full pipeline: extract, parse, return result. |
| process_directory(path) | Process already-extracted EPUB contents. |
| quick_info(path) | Return metadata only, minimal processing. |
EpubExtractor
Low-level ZIP handling and file management.
from epub_sage import EpubExtractor
extractor = EpubExtractor(base_dir="/tmp/epubs")
path = extractor.extract_epub("book.epub")
opf = extractor.find_content_opf(path)
extractor.cleanup_extraction(path)
Methods:
| Method | Description |
|---|---|
| extract_epub(path) | Extract ZIP to managed directory. |
| get_epub_info(path) | File stats without extraction. |
| find_content_opf(dir) | Locate content.opf in extracted tree. |
| validate_epub_structure(path) | Check EPUB spec compliance. |
| cleanup_extraction(dir) | Delete extracted files. |
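For background on what locating the package document involves: the EPUB container format names the .opf file in META-INF/container.xml, so it can be found without assuming a fixed path. The following is a standard-library sketch of that lookup, not a description of how find_content_opf works internally (it may simply search the extracted tree). `find_opf_path` is a hypothetical name, and the demo archive is built in memory.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(epub_file):
    """Return the archive-internal path of the package (.opf) document."""
    with zipfile.ZipFile(epub_file) as archive:
        container = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml declares no rootfile")
    return rootfile.get("full-path")


# Build a tiny in-memory EPUB-like archive for demonstration.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("META-INF/container.xml", """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>""")
    archive.writestr("OEBPS/content.opf", "<package/>")

demo.seek(0)
print(find_opf_path(demo))
```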
DublinCoreService
High-level service for metadata extraction. Accepts both .epub and .opf files.
from epub_sage import create_service
service = create_service()
# Works with .epub files directly!
metadata = service.extract_basic_metadata("book.epub")
print(f"Title: {metadata['title']}")
print(f"Author: {metadata['author']}")
# Also works with .opf files for backward compatibility
metadata = service.extract_basic_metadata("/path/to/content.opf")
DublinCoreParser
Low-level parser for content.opf files (requires extracted EPUB).
from epub_sage import DublinCoreParser
parser = DublinCoreParser()
result = parser.parse_file("/path/to/content.opf")
print(result.metadata.title)
print(result.metadata.get_primary_author())
print(result.metadata.get_isbn())
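For readers unfamiliar with the file format, Dublin Core fields live in the metadata element of the package document, in the dc namespace. The sketch below reads a few of them with xml.etree.ElementTree. It only illustrates the underlying data and is not DublinCoreParser's implementation; `read_dublin_core` and the sample document are made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

OPF_SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:isbn:9780000000002</dc:identifier>
    <dc:title>Example Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:creator>John Roe</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
</package>"""


def read_dublin_core(opf_xml):
    """Return a small dict of Dublin Core values from package-document XML."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("opf:metadata", NS)
    if metadata is None:
        raise ValueError("package document has no <metadata> element")

    def texts(name):
        # All non-empty values of one Dublin Core element, in document order.
        return [el.text.strip() for el in metadata.findall(f"dc:{name}", NS) if el.text]

    titles, creators, languages = texts("title"), texts("creator"), texts("language")
    return {
        "title": titles[0] if titles else None,
        "author": creators[0] if creators else None,  # first creator as primary author
        "creators": creators,
        "language": languages[0] if languages else None,
        "identifiers": texts("identifier"),
    }


info = read_dublin_core(OPF_SAMPLE)
print(info["title"], "by", info["author"])
```

Treating the first dc:creator as the primary author is a simplification made here; real files may carry role information that a full parser would consult.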
EpubStructureParser
Full structure analysis: chapters, parts, images, navigation.
from epub_sage import EpubStructureParser, DublinCoreParser
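# Placeholder paths into an already-extracted EPUB
opf_path = "/path/to/content.opf"
epub_dir = "/path/to/extracted/"
dc_parser = DublinCoreParser()
opf_data = dc_parser.parse_file(opf_path)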
struct_parser = EpubStructureParser()
structure = struct_parser.parse_structure(opf_data, epub_dir)
print(f"Chapters: {len(structure.chapters)}")
print(f"Images: {len(structure.images)}")
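Because chapters can be split across files, reading order has to come from the package document rather than from file names: the manifest maps ids to files, and the spine lists ids in order. The sketch below resolves that order with the standard library. It is an illustration of the file format, not EpubStructureParser's code; `reading_order` and the sample document are hypothetical.

```python
import posixpath
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

OPF_SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="ch02" href="text/ch02.xhtml" media-type="application/xhtml+xml"/>
    <item id="cover" href="images/cover.jpg" media-type="image/jpeg"/>
    <item id="ch01" href="text/ch01.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch01"/>
    <itemref idref="ch02"/>
  </spine>
</package>"""


def reading_order(opf_xml, opf_path):
    """Return content-document paths in spine order, relative to the archive root."""
    root = ET.fromstring(opf_xml)
    # Manifest: id -> href (hrefs are relative to the .opf file's directory).
    manifest = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    opf_dir = posixpath.dirname(opf_path)
    ordered = []
    for itemref in root.findall("opf:spine/opf:itemref", OPF_NS):
        href = manifest.get(itemref.get("idref"))
        if href is not None:  # skip spine entries with no manifest item
            ordered.append(posixpath.normpath(posixpath.join(opf_dir, href)))
    return ordered


print(reading_order(OPF_SAMPLE, "OEBPS/content.opf"))
```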
Data Models
SimpleEpubResult
Returned by process_epub().
| Field | Type | Description |
|---|---|---|
| title | str | Book title |
| author | str | Primary author |
| publisher | str | Publisher name |
| language | str | Language code |
| chapters | list[dict] | Chapter data with content |
| total_chapters | int | Chapter count |
| total_words | int | Word count |
| estimated_reading_time | dict | {'hours': N, 'minutes': N} |
| success | bool | Processing status |
| errors | list[str] | Error messages |
DublinCoreMetadata
Pydantic model for metadata.
| Field | Type | Description |
|---|---|---|
| title | str | Book title |
| creators | list | Author objects with roles |
| publisher | str | Publisher name |
| language | str | ISO language code |
| identifiers | list | ISBN, UUID, etc. |
| dates | list | Publication dates |
| description | str | Book description |
Helper Methods: get_primary_author(), get_isbn(), get_publication_date()
Command Line Interface
Extract to JSON
epub-sage extract book.epub -o output.json
Display Metadata
epub-sage info book.epub
Output:
----------------------------------------
Title: Build a Large Language Model (From Scratch)
Author: Sebastian Raschka
Publisher: Manning Publications Co.
Words: 84287
Est. Time: 5 hours, 37 min
Chapters: 21
----------------------------------------
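The sample output above is consistent with a reading speed of 250 words per minute (84287 words gives 337 whole minutes, or 5 hours 37 minutes). The sketch below reproduces that arithmetic. The 250 wpm rate and the use of floor division are inferences from this one example, not documented behavior, and `estimate_reading_time` is a hypothetical name.

```python
def estimate_reading_time(total_words, words_per_minute=250):
    """Convert a word count to {'hours': N, 'minutes': N} at a fixed reading speed."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    total_minutes = total_words // words_per_minute  # whole minutes only
    hours, minutes = divmod(total_minutes, 60)
    return {"hours": hours, "minutes": minutes}


print(estimate_reading_time(84287))
```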
List Chapters
epub-sage list book.epub
Architecture
epub_sage/
├── core/ # Low-level parsers
│ ├── dublin_core_parser.py
│ ├── structure_parser.py
│ ├── toc_parser.py
│ └── content_classifier.py
├── extractors/ # EPUB handling
│ ├── epub_extractor.py
│ └── content_extractor.py
├── processors/ # High-level pipelines
│ └── simple_processor.py
├── models/ # Pydantic data models
├── services/ # Export, search
└── utils/ # Helpers
Data Flow:
1. EpubExtractor unzips the EPUB file.
2. DublinCoreParser reads content.opf for metadata.
3. EpubStructureParser analyzes chapters, images, and TOC.
4. ContentExtractor pulls text content from HTML files.
5. SimpleEpubProcessor orchestrates all steps and returns SimpleEpubResult.
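As an illustration of the text-extraction step, the sketch below strips markup from a chapter's HTML and counts the remaining words using only the standard library. It is a simplified stand-in, not ContentExtractor itself; `TextCollector` and `count_words` are hypothetical names, and the library may count words differently.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gather visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def count_words(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    # Joining with a space keeps adjacent block elements from fusing into one word.
    return len(" ".join(collector.parts).split())


chapter = (
    "<html><head><style>p { color: red }</style></head>"
    "<body><h1>Chapter 1</h1><p>It was a <em>dark</em> night.</p></body></html>"
)
print(count_words(chapter))
```

Joining fragments with a space means a word split by inline markup (for example a single emphasized letter) would be counted twice; that trade-off is accepted here for simplicity.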
Error Handling
SimpleEpubResult.success indicates overall status. Errors are collected in SimpleEpubResult.errors:
result = process_epub("book.epub")
if not result.success:
    for error in result.errors:
        print(f"Error: {error}")
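The pattern here is to collect problems on the result object instead of raising exceptions. A minimal sketch of that pattern follows, reusing two of the error messages this README lists as common. `Result` and `check_epub` are hypothetical, and deriving success from an empty error list is an assumption about the design, not confirmed library behavior.

```python
import os
import tempfile
import zipfile
from dataclasses import dataclass, field


@dataclass
class Result:
    errors: list = field(default_factory=list)

    @property
    def success(self):
        # Assumption for this sketch: success means no errors were recorded.
        return not self.errors


def check_epub(path):
    """Run cheap pre-flight checks and record failures instead of raising."""
    result = Result()
    if not os.path.isfile(path):
        result.errors.append("File not found")
    elif not zipfile.is_zipfile(path):
        result.errors.append("Invalid ZIP/EPUB file")
    return result


with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "fake.epub")
    with open(fake, "w") as handle:
        handle.write("not a zip archive")
    bad = check_epub(fake)
    missing = check_epub(os.path.join(tmp, "missing.epub"))

print(bad.success, bad.errors)
print(missing.success, missing.errors)
```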
Common errors:
- "File not found" - EPUB path invalid.
- "Invalid ZIP/EPUB file" - Corrupted or non-EPUB file.
- "No content.opf file found" - Missing required metadata file.
- "Content extraction error: ..." - HTML parsing issue.
Development
make install # Setup environment with uv
make format # Run autopep8 and ruff
make lint # Check code quality
make test # Run test suite (60+ tests)
make clean # Remove caches
See CONTRIBUTING.md for contribution guidelines.
License
MIT License. See LICENSE for details.