A professional EPUB structure and content extraction library with Dublin Core metadata support.

EpubSage

EpubSage is a Python library for extracting structured content and metadata from EPUB files. It handles the complexity of diverse publisher formats (Manning, O'Reilly, Packt, etc.) and provides a clean, unified API for accessing book data.

Why EpubSage?

EPUB files vary significantly between publishers. Headers can be nested in <span> tags, chapters can be split across files, and metadata formats differ. EpubSage abstracts this complexity:

Publisher-Agnostic: Tested against real-world books from major technical publishers.
Complete Extraction: Returns metadata, chapters, word counts, and reading time estimates.
Modular: Use the high-level process_epub() function, or access individual parsers directly.
CLI Included: Extract books from the command line without writing code.
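
As an illustration of the first problem mentioned above, here is a minimal standard-library sketch of collecting heading text even when a publisher wraps it in nested <span> tags. It is not EpubSage's implementation; the class name `HeadingCollector`, the function `extract_headings`, and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of h1-h6 elements, ignoring inline wrapper tags."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []   # finished heading strings
        self._parts = None   # text fragments of the heading being read

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._parts = []

    def handle_data(self, data):
        # Text inside nested <span> tags still arrives here.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._parts is not None:
            # Collapse whitespace left over from the markup.
            self.headings.append(" ".join("".join(self._parts).split()))
            self._parts = None


def extract_headings(html):
    """Return the text of every heading found in an HTML string."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


sample = '<h1><span class="num">1</span> <span><span>Getting</span> Started</span></h1>'
print(extract_headings(sample))  # ['1 Getting Started']
```

The collector only tracks whether it is inside a heading, so any depth of inline wrappers yields the same text.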

Table of Contents

Installation
Quick Start
API Reference
Command Line Interface
Architecture
Error Handling
Development
License

Installation

pip install epubsage

To add it as a dependency of a uv-managed project:

uv add epubsage

Quick Start

Basic Usage

from epub_sage import process_epub

result = process_epub("my_book.epub")

if result.success:
    print(f"Title: {result.title}")
    print(f"Author: {result.author}")
    print(f"Chapters: {result.total_chapters}")
    print(f"Words: {result.total_words}")
else:
    print(f"Errors: {result.errors}")

Export to JSON

from epub_sage import process_epub, save_to_json

result = process_epub("my_book.epub")

output = {
    "title": result.title,
    "author": result.author,
    "chapters": result.chapters
}
save_to_json(output, "book_data.json")

API Reference

High-Level Functions

Function	Description
`process_epub(path)`	Process an EPUB file. Returns `SimpleEpubResult`.
`quick_extract(path)`	Extract EPUB to a directory. Returns path string.
`get_epub_info(path)`	Get file info without extraction. Returns dict.
`save_to_json(data, path)`	Save data to JSON with datetime support.
`parse_content_opf(path)`	Parse `.epub` or `.opf` file. Returns `ParsedContentOpf`.

Note: All functions accept .epub files directly. No need to extract first!
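
save_to_json is described as supporting datetime values, which the standard json module rejects by default. Below is a minimal sketch of one common way to get that behaviour, assuming ISO 8601 strings are acceptable output. The helper names are hypothetical, and this is not necessarily how EpubSage implements it.

```python
import json
from datetime import date, datetime


def _json_default(value):
    """Fallback used by json.dumps for types it cannot serialize itself."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Not JSON serializable: {type(value).__name__}")


def dumps_with_dates(data):
    """Serialize data to a JSON string, writing dates as ISO 8601 text."""
    return json.dumps(data, default=_json_default, indent=2, ensure_ascii=False)


# Made-up record for illustration
record = {"title": "My Book", "published": datetime(2025, 12, 30, 9, 30)}
print(dumps_with_dates(record))
```

Raising TypeError for unknown types keeps the default json behaviour for anything that is not a date.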

Classes

SimpleEpubProcessor

Main processing class for full control over the extraction pipeline.

from epub_sage import SimpleEpubProcessor

processor = SimpleEpubProcessor(temp_dir="/tmp/work")
result = processor.process_epub("book.epub", cleanup=True)

# Or process a pre-extracted directory
result = processor.process_directory("/path/to/extracted/")

Methods:

Method	Description
`process_epub(path, cleanup=True)`	Full pipeline: extract, parse, return result.
`process_directory(path)`	Process already-extracted EPUB contents.
`quick_info(path)`	Return metadata only, minimal processing.

EpubExtractor

Low-level ZIP handling and file management.

from epub_sage import EpubExtractor

extractor = EpubExtractor(base_dir="/tmp/epubs")
path = extractor.extract_epub("book.epub")
opf = extractor.find_content_opf(path)
extractor.cleanup_extraction(path)

Methods:

Method	Description
`extract_epub(path)`	Extract ZIP to managed directory.
`get_epub_info(path)`	File stats without extraction.
`find_content_opf(dir)`	Locate `content.opf` in extracted tree.
`validate_epub_structure(path)`	Check EPUB spec compliance.
`cleanup_extraction(dir)`	Delete extracted files.
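
For background on what locating the package file involves: an EPUB is a ZIP archive whose META-INF/container.xml names the .opf package document. The sketch below shows that lookup with the standard library only. It is an illustration of the container format, not EpubSage's code (which may search the extracted tree differently), and `find_opf_path` is a hypothetical name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(epub_file):
    """Return the archive path of the package (.opf) file inside an EPUB."""
    with zipfile.ZipFile(epub_file) as archive:
        # The EPUB container format requires this exact mimetype entry.
        mimetype = archive.read("mimetype").decode("ascii").strip()
        if mimetype != "application/epub+zip":
            raise ValueError(f"Unexpected mimetype: {mimetype!r}")
        # META-INF/container.xml points at the package document.
        root = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml lists no rootfile")
        return rootfile.attrib["full-path"]


# Build a tiny in-memory EPUB skeleton to exercise the function.
CONTAINER_XML = """<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("META-INF/container.xml", CONTAINER_XML)
    archive.writestr("OEBPS/content.opf", "<package/>")

print(find_opf_path(buffer))  # OEBPS/content.opf
```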

DublinCoreService

High-level service for metadata extraction. Accepts both .epub and .opf files.

from epub_sage import create_service

service = create_service()

# Works with .epub files directly!
metadata = service.extract_basic_metadata("book.epub")
print(f"Title: {metadata['title']}")
print(f"Author: {metadata['author']}")

# Also works with .opf files for backward compatibility
metadata = service.extract_basic_metadata("/path/to/content.opf")

DublinCoreParser

Low-level parser for content.opf files (requires extracted EPUB).

from epub_sage import DublinCoreParser

parser = DublinCoreParser()
result = parser.parse_file("/path/to/content.opf")

print(result.metadata.title)
print(result.metadata.get_primary_author())
print(result.metadata.get_isbn())

EpubStructureParser

Full structure analysis: chapters, parts, images, navigation.

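from epub_sage import EpubStructureParser, DublinCoreParser

# Placeholder paths into an already-extracted EPUB; adjust them to
# wherever the archive was extracted and content.opf actually lives
epub_dir = "/path/to/extracted/"
opf_path = "/path/to/content.opf"

# Parse the package file first; the structure parser builds on its result
dc_parser = DublinCoreParser()
opf_data = dc_parser.parse_file(opf_path)

struct_parser = EpubStructureParser()
structure = struct_parser.parse_structure(opf_data, epub_dir)

print(f"Chapters: {len(structure.chapters)}")
print(f"Images: {len(structure.images)}")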
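
For readers curious what structure analysis involves at the file level: an OPF package lists every resource in its manifest and gives the reading order in its spine. The sketch below resolves both with the standard library and a made-up sample package. It is not EpubSage's parser, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Made-up package document; note the manifest order differs from the spine.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="ch2" href="text/chapter02.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="text/chapter01.xhtml" media-type="application/xhtml+xml"/>
    <item id="cover" href="images/cover.jpg" media-type="image/jpeg"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>"""


def reading_order(opf_xml):
    """Return content document hrefs in spine (reading) order."""
    root = ET.fromstring(opf_xml)
    # Map each manifest id to its href.
    hrefs = {
        item.attrib["id"]: item.attrib["href"]
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    # The spine references manifest ids; skip dangling references.
    return [
        hrefs[ref.attrib["idref"]]
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
        if ref.attrib["idref"] in hrefs
    ]


def image_hrefs(opf_xml):
    """Return hrefs of manifest items whose media type is an image."""
    root = ET.fromstring(opf_xml)
    return [
        item.attrib["href"]
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
        if item.attrib.get("media-type", "").startswith("image/")
    ]


print(reading_order(SAMPLE_OPF))  # ['text/chapter01.xhtml', 'text/chapter02.xhtml']
print(image_hrefs(SAMPLE_OPF))    # ['images/cover.jpg']
```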

Data Models

SimpleEpubResult

Returned by process_epub().

Field	Type	Description
`title`	`str`	Book title
`author`	`str`	Primary author
`publisher`	`str`	Publisher name
`language`	`str`	Language code
`chapters`	`list[dict]`	Chapter data with content
`total_chapters`	`int`	Chapter count
`total_words`	`int`	Word count
`estimated_reading_time`	`dict`	`{'hours': N, 'minutes': N}`
`success`	`bool`	Processing status
`errors`	`list[str]`	Error messages

DublinCoreMetadata

Pydantic model for metadata.

Field	Type	Description
`title`	`str`	Book title
`creators`	`list`	Author objects with roles
`publisher`	`str`	Publisher name
`language`	`str`	ISO language code
`identifiers`	`list`	ISBN, UUID, etc.
`dates`	`list`	Publication dates
`description`	`str`	Book description

Helper Methods: get_primary_author(), get_isbn(), get_publication_date()
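
The fields above correspond to Dublin Core elements in the package file's metadata block. As a rough illustration of where they come from, the sketch below reads a few of them with the standard library. The sample document, its identifier, and the function name `read_dublin_core` are made up, and the returned dict is a simplification, not the DublinCoreMetadata model.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up package metadata for illustration
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:creator>John Roe</dc:creator>
    <dc:language>en</dc:language>
    <dc:identifier>urn:isbn:9780000000002</dc:identifier>
  </metadata>
</package>"""


def read_dublin_core(opf_xml):
    """Return a plain dict with a few Dublin Core fields from an OPF string."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("opf:metadata", NS)
    if metadata is None:
        raise ValueError("package has no metadata element")

    def texts(name):
        # All non-empty values of one Dublin Core element, in document order.
        return [el.text.strip() for el in metadata.findall(f"dc:{name}", NS) if el.text]

    titles = texts("title")
    languages = texts("language")
    return {
        "title": titles[0] if titles else None,
        "creators": texts("creator"),
        "language": languages[0] if languages else None,
        "identifiers": texts("identifier"),
    }


info = read_dublin_core(SAMPLE_OPF)
print(info["title"])     # Example Book
print(info["creators"])  # ['Jane Doe', 'John Roe']
```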

Command Line Interface

Extract to JSON

epub-sage extract book.epub -o output.json

Display Metadata

epub-sage info book.epub

Output:

----------------------------------------
Title:     Build a Large Language Model (From Scratch)
Author:    Sebastian Raschka
Publisher: Manning Publications Co.
Words:     84287
Est. Time: 5 hours, 37 min
Chapters:  21
----------------------------------------
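
The sample figures above are consistent with a reading speed of 250 words per minute: 84287 / 250 is about 337 minutes, or 5 hours 37 minutes. That rate is inferred from this one output and is not documented, so treat the sketch below as an assumption. It returns a dict in the same shape as estimated_reading_time; the function name is hypothetical.

```python
def estimate_reading_time(total_words, words_per_minute=250):
    """Return {'hours': N, 'minutes': N} for a word count.

    The 250 wpm default is an assumption inferred from the sample output.
    """
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    total_minutes = total_words // words_per_minute
    hours, minutes = divmod(total_minutes, 60)
    return {"hours": hours, "minutes": minutes}


print(estimate_reading_time(84287))  # {'hours': 5, 'minutes': 37}
```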

List Chapters

epub-sage list book.epub

Architecture

epub_sage/
├── core/                 # Low-level parsers
│   ├── dublin_core_parser.py
│   ├── structure_parser.py
│   ├── toc_parser.py
│   └── content_classifier.py
├── extractors/           # EPUB handling
│   ├── epub_extractor.py
│   └── content_extractor.py
├── processors/           # High-level pipelines
│   └── simple_processor.py
├── models/               # Pydantic data models
├── services/             # Export, search
└── utils/                # Helpers

Data Flow:

1. EpubExtractor unzips the EPUB file.
2. DublinCoreParser reads content.opf for metadata.
3. EpubStructureParser analyzes chapters, images, and TOC.
4. ContentExtractor pulls text content from HTML files.
5. SimpleEpubProcessor orchestrates all steps and returns SimpleEpubResult.

Error Handling

SimpleEpubResult.success indicates overall status. Errors are collected in SimpleEpubResult.errors:

result = process_epub("book.epub")

if not result.success:
    for error in result.errors:
        print(f"Error: {error}")

Common errors:

"File not found" - EPUB path invalid.
"Invalid ZIP/EPUB file" - Corrupted or non-EPUB file.
"No content.opf file found" - Missing required metadata file.
"Content extraction error: ..." - HTML parsing issue.
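
When reporting these messages to users, it can help to attach a short hint to each one. The sketch below does this by prefix matching, under the assumption that error strings begin with the exact texts listed above; the mapping, the hints, and the function name `explain` are illustrative only.

```python
# Hints paraphrase the list above; prefix matching is an assumption
# about how the error strings are formatted.
KNOWN_ERRORS = {
    "File not found": "check the EPUB path",
    "Invalid ZIP/EPUB file": "the file is corrupted or not an EPUB",
    "No content.opf file found": "the archive lacks its metadata file",
    "Content extraction error": "an HTML file could not be parsed",
}


def explain(error):
    """Append a hint to a known error message; return others unchanged."""
    for prefix, hint in KNOWN_ERRORS.items():
        if error.startswith(prefix):
            return f"{error} ({hint})"
    return error


# Made-up error list standing in for SimpleEpubResult.errors
errors = ["File not found", "Content extraction error: unclosed tag"]
for message in errors:
    print(explain(message))
```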

Development

make install   # Setup environment with uv
make format    # Run autopep8 and ruff
make lint      # Check code quality
make test      # Run test suite (60+ tests)
make clean     # Remove caches

See CONTRIBUTING.md for contribution guidelines.

License

MIT License. See LICENSE for details.

Release history

0.3.0 (Jan 9, 2026)
0.2.0 (Jan 3, 2026)
0.1.1 (Dec 30, 2025), this version
0.1.0 (Dec 30, 2025)

Download files

Source Distribution

epubsage-0.1.1.tar.gz (40.2 kB), uploaded Dec 30, 2025, Source

Built Distribution

epubsage-0.1.1-py3-none-any.whl (49.8 kB), uploaded Dec 30, 2025, Python 3

File details

Details for the file epubsage-0.1.1.tar.gz.

File metadata

Download URL: epubsage-0.1.1.tar.gz
Upload date: Dec 30, 2025
Size: 40.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for epubsage-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`836d34fdbacca4ba83a89a922c775d27e3dd5dedb39c99286b5058837077c8c3`
MD5	`f2e48a5ae7f9cd3fb4e03c98fcec85e7`
BLAKE2b-256	`4d17a5109258178c7ac51370930ba8b72cf1f14de7a6eaf7221db2527c622d98`

Provenance

The following attestation bundles were made for epubsage-0.1.1.tar.gz:

Publisher: publish.yml on Abdullah-Wex/epubsage

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: epubsage-0.1.1.tar.gz
- Subject digest: 836d34fdbacca4ba83a89a922c775d27e3dd5dedb39c99286b5058837077c8c3
- Sigstore transparency entry: 782571533
- Sigstore integration time: Dec 30, 2025
Source repository:
- Permalink: Abdullah-Wex/epubsage@f7dd2ea23a07057a1c138b89e94f21b6b8671db9
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Abdullah-Wex
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f7dd2ea23a07057a1c138b89e94f21b6b8671db9
- Trigger Event: release

File details

Details for the file epubsage-0.1.1-py3-none-any.whl.

File metadata

Download URL: epubsage-0.1.1-py3-none-any.whl
Upload date: Dec 30, 2025
Size: 49.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for epubsage-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9953183e82205e12f988923e5a58152936fe5f9d683e89c1f16a4fd2990adda5`
MD5	`dc37073b3d41f0c12c216987a15e838c`
BLAKE2b-256	`3bd652fe717f563cc3f567535e30e7d8a163266222e142c1aca7c735327bd968`

Provenance

The following attestation bundles were made for epubsage-0.1.1-py3-none-any.whl:

Publisher: publish.yml on Abdullah-Wex/epubsage

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: epubsage-0.1.1-py3-none-any.whl
- Subject digest: 9953183e82205e12f988923e5a58152936fe5f9d683e89c1f16a4fd2990adda5
- Sigstore transparency entry: 782571540
- Sigstore integration time: Dec 30, 2025
Source repository:
- Permalink: Abdullah-Wex/epubsage@f7dd2ea23a07057a1c138b89e94f21b6b8671db9
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/Abdullah-Wex
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@f7dd2ea23a07057a1c138b89e94f21b6b8671db9
- Trigger Event: release
