
A professional EPUB structure and content extraction library with Dublin Core metadata support.


EpubSage


EpubSage is a Python library for extracting structured content and metadata from EPUB files. It handles the complexity of diverse publisher formats (Manning, O'Reilly, Packt, etc.) and provides a clean, unified API for accessing book data.


Why EpubSage?

EPUB files vary significantly between publishers. Headings can be wrapped in <span> tags, chapters can be split across several files, and metadata formats differ. EpubSage abstracts this complexity:

  • Publisher-Agnostic: Tested against real-world books from major technical publishers.
  • Complete Extraction: Returns metadata, chapters, word counts, and reading time estimates.
  • Modular: Use the high-level process_epub() function, or access individual parsers directly.
  • CLI Included: Extract books from the command line without writing code.
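To make the <span> problem concrete, here is a minimal sketch of reading heading text regardless of inline wrappers. It uses only the standard library and is not EpubSage's implementation; the HeadingCollector class and extract_headings function are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the visible text of h1-h6 elements, ignoring inline wrappers."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []   # finished heading strings
        self._parts = None   # text fragments of the heading being read

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._parts = []

    def handle_data(self, data):
        # Text inside nested <span>, <a>, or <em> elements still arrives here.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._parts is not None:
            # Collapse whitespace left over from markup formatting.
            self.headings.append(" ".join("".join(self._parts).split()))
            self._parts = None


def extract_headings(html):
    """Return the text of every heading found in an HTML string."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


sample = '<h1><span class="num">1</span> <span>Getting  Started</span></h1><p>Body</p>'
print(extract_headings(sample))  # ['1 Getting Started']
```

Real publisher markup has more variations (headings styled as <p class="title">, for example), which is the kind of case the library is meant to absorb.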

Table of Contents

  1. Installation
  2. Quick Start
  3. API Reference
  4. Command Line Interface
  5. Architecture
  6. Error Handling
  7. Development
  8. License

Installation

pip install epubsage

To add it to a project managed with uv:

uv add epubsage

Quick Start

Basic Usage

from epub_sage import process_epub

result = process_epub("my_book.epub")

if result.success:
    print(f"Title: {result.title}")
    print(f"Author: {result.author}")
    print(f"Chapters: {result.total_chapters}")
    print(f"Words: {result.total_words}")
else:
    print(f"Errors: {result.errors}")

Export to JSON

from epub_sage import process_epub, save_to_json

result = process_epub("my_book.epub")

output = {
    "title": result.title,
    "author": result.author,
    "chapters": result.chapters
}
save_to_json(output, "book_data.json")

API Reference

High-Level Functions

Function                   Description
process_epub(path)         Process an EPUB file. Returns SimpleEpubResult.
quick_extract(path)        Extract EPUB to a directory. Returns path string.
get_epub_info(path)        Get file info without extraction. Returns dict.
save_to_json(data, path)   Save data to JSON with datetime support.
parse_content_opf(path)    Parse .epub or .opf file. Returns ParsedContentOpf.

Note: All functions accept .epub files directly. No need to extract first!
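The "datetime support" in save_to_json matters because Python's json module rejects datetime objects by default. The sketch below shows one common way to handle that with the standard library. It is an assumption about the general technique, not a description of how save_to_json is implemented; json_default and dump_with_datetimes are hypothetical names.

```python
import json
from datetime import date, datetime


def json_default(value):
    """Fallback for objects the json module cannot serialize natively."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Not JSON serializable: {type(value).__name__}")


def dump_with_datetimes(data, path):
    """Write data to a JSON file, converting dates to ISO 8601 strings."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle, default=json_default, ensure_ascii=False, indent=2)


record = {"title": "Example", "published": datetime(2024, 9, 1, 12, 0)}
print(json.dumps(record, default=json_default))
# {"title": "Example", "published": "2024-09-01T12:00:00"}
```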

Classes

SimpleEpubProcessor

Main processing class for full control over the extraction pipeline.

from epub_sage import SimpleEpubProcessor

processor = SimpleEpubProcessor(temp_dir="/tmp/work")
result = processor.process_epub("book.epub", cleanup=True)

# Or process a pre-extracted directory
result = processor.process_directory("/path/to/extracted/")

Methods:

Method                             Description
process_epub(path, cleanup=True)   Full pipeline: extract, parse, return result.
process_directory(path)            Process already-extracted EPUB contents.
quick_info(path)                   Return metadata only, minimal processing.

EpubExtractor

Low-level ZIP handling and file management.

from epub_sage import EpubExtractor

extractor = EpubExtractor(base_dir="/tmp/epubs")
path = extractor.extract_epub("book.epub")
opf = extractor.find_content_opf(path)
extractor.cleanup_extraction(path)

Methods:

Method                          Description
extract_epub(path)              Extract ZIP to managed directory.
get_epub_info(path)             File stats without extraction.
find_content_opf(dir)           Locate content.opf in extracted tree.
validate_epub_structure(path)   Check EPUB spec compliance.
cleanup_extraction(dir)         Delete extracted files.
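For background, the EPUB container format requires a mimetype entry as the first item in the ZIP archive and a META-INF/container.xml file that points to the package document (often, but not always, named content.opf). The following standard-library sketch checks those two rules and resolves the package path. It illustrates the container rules only and makes no claim about what validate_epub_structure or find_content_opf actually check; find_opf_in_archive is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_in_archive(epub_file):
    """Return (problems, opf_path) for an EPUB given as a path or file object."""
    problems = []
    opf_path = None
    with zipfile.ZipFile(epub_file) as archive:
        names = archive.namelist()
        if not names or names[0] != "mimetype":
            problems.append("mimetype must be the first archive entry")
        elif archive.read("mimetype").strip() != b"application/epub+zip":
            problems.append("unexpected mimetype content")
        if "META-INF/container.xml" not in names:
            problems.append("META-INF/container.xml is missing")
        else:
            root = ET.fromstring(archive.read("META-INF/container.xml"))
            rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
            if rootfile is None:
                problems.append("container.xml declares no rootfile")
            else:
                # full-path is relative to the archive root.
                opf_path = rootfile.get("full-path")
    return problems, opf_path


# Build a tiny in-memory archive to exercise the checks.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr(
        "META-INF/container.xml",
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles>'
        "</container>",
    )
buffer.seek(0)
print(find_opf_in_archive(buffer))  # ([], 'OEBPS/content.opf')
```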

DublinCoreService

High-level service for metadata extraction. Accepts both .epub and .opf files.

from epub_sage import create_service

service = create_service()

# Works with .epub files directly!
metadata = service.extract_basic_metadata("book.epub")
print(f"Title: {metadata['title']}")
print(f"Author: {metadata['author']}")

# Also works with .opf files for backward compatibility
metadata = service.extract_basic_metadata("/path/to/content.opf")

DublinCoreParser

Low-level parser for content.opf files (requires extracted EPUB).

from epub_sage import DublinCoreParser

parser = DublinCoreParser()
result = parser.parse_file("/path/to/content.opf")

print(result.metadata.title)
print(result.metadata.get_primary_author())
print(result.metadata.get_isbn())
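For readers unfamiliar with the file format, the sketch below shows where these values live in a package document. It parses a hand-written sample with the standard library and is not DublinCoreParser's implementation; read_dublin_core, the sample XML, and the placeholder ISBN are all made up for illustration. The two namespace URIs are the standard OPF and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample; the ISBN is a placeholder, not a real book.
SAMPLE_OPF = """
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:creator>John Roe</dc:creator>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">urn:isbn:9780000000002</dc:identifier>
  </metadata>
</package>
"""


def read_dublin_core(opf_xml):
    """Extract a few Dublin Core fields from package document XML."""
    root = ET.fromstring(opf_xml.strip())
    metadata = root.find("opf:metadata", NS)
    if metadata is None:
        raise ValueError("package document has no <metadata> element")

    def texts(tag):
        # Elements such as dc:creator may repeat, so always return a list.
        return [el.text.strip() for el in metadata.findall(f"dc:{tag}", NS) if el.text]

    titles = texts("title")
    languages = texts("language")
    return {
        "title": titles[0] if titles else None,
        "creators": texts("creator"),
        "language": languages[0] if languages else None,
        "identifiers": texts("identifier"),
    }


print(read_dublin_core(SAMPLE_OPF))
```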

EpubStructureParser

Full structure analysis: chapters, parts, images, navigation.

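from epub_sage import DublinCoreParser, EpubExtractor, EpubStructureParser

# Extract the archive first: both parsers below work on extracted files.
extractor = EpubExtractor(base_dir="/tmp/epubs")
epub_dir = extractor.extract_epub("book.epub")
opf_path = extractor.find_content_opf(epub_dir)

dc_parser = DublinCoreParser()
opf_data = dc_parser.parse_file(opf_path)

struct_parser = EpubStructureParser()
structure = struct_parser.parse_structure(opf_data, epub_dir)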

print(f"Chapters: {len(structure.chapters)}")
print(f"Images: {len(structure.images)}")

Data Models

SimpleEpubResult

Returned by process_epub().

Field                    Type         Description
title                    str          Book title
author                   str          Primary author
publisher                str          Publisher name
language                 str          Language code
chapters                 list[dict]   Chapter data with content
total_chapters           int          Chapter count
total_words              int          Word count
estimated_reading_time   dict         {'hours': N, 'minutes': N}
success                  bool         Processing status
errors                   list[str]    Error messages

DublinCoreMetadata

Pydantic model for metadata.

Field         Type   Description
title         str    Book title
creators      list   Author objects with roles
publisher     str    Publisher name
language      str    ISO language code
identifiers   list   ISBN, UUID, etc.
dates         list   Publication dates
description   str    Book description
Helper Methods: get_primary_author(), get_isbn(), get_publication_date()


Command Line Interface

Extract to JSON

epub-sage extract book.epub -o output.json

Display Metadata

epub-sage info book.epub

Output:

----------------------------------------
Title:     Build a Large Language Model (From Scratch)
Author:    Sebastian Raschka
Publisher: Manning Publications Co.
Words:     84287
Est. Time: 5 hours, 37 min
Chapters:  21
----------------------------------------
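The sample figures above are consistent with a reading speed of 250 words per minute: 84287 words at that rate is 337 minutes, or 5 hours 37 minutes. The rate is inferred from this output, not documented, so treat the sketch below as an assumption about the arithmetic; estimate_reading_time is a hypothetical function name.

```python
def estimate_reading_time(total_words, words_per_minute=250):
    """Convert a word count into whole hours and minutes of reading time."""
    total_minutes = total_words // words_per_minute
    hours, minutes = divmod(total_minutes, 60)
    return {"hours": hours, "minutes": minutes}


print(estimate_reading_time(84287))  # {'hours': 5, 'minutes': 37}
```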

List Chapters

epub-sage list book.epub

Architecture

epub_sage/
├── core/                 # Low-level parsers
│   ├── dublin_core_parser.py
│   ├── structure_parser.py
│   ├── toc_parser.py
│   └── content_classifier.py
├── extractors/           # EPUB handling
│   ├── epub_extractor.py
│   └── content_extractor.py
├── processors/           # High-level pipelines
│   └── simple_processor.py
├── models/               # Pydantic data models
├── services/             # Export, search
└── utils/                # Helpers

Data Flow:

  1. EpubExtractor unzips the EPUB file.
  2. DublinCoreParser reads content.opf for metadata.
  3. EpubStructureParser analyzes chapters, images, and TOC.
  4. ContentExtractor pulls text content from HTML files.
  5. SimpleEpubProcessor orchestrates all steps and returns SimpleEpubResult.
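Step 4 is where word counts come from. As a rough illustration of that step, the sketch below strips markup from chapter HTML and counts words using only the standard library. It is not ContentExtractor's code; TextExtractor and chapter_stats are hypothetical names, and the real library may tokenize differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Gather visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def chapter_stats(html):
    """Return plain text and a word count for one chapter's HTML."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Joining chunks with a space keeps adjacent block elements apart;
    # the split/join pass then collapses any repeated whitespace.
    text = " ".join(" ".join(extractor.chunks).split())
    return {"text": text, "word_count": len(text.split())}


chapters = [
    "<html><head><style>p{}</style></head>"
    "<body><h1>One</h1><p>Hello <em>EPUB</em> world.</p></body></html>",
    "<p>Second chapter text</p>",
]
stats = [chapter_stats(html) for html in chapters]
print(sum(item["word_count"] for item in stats))  # 7
```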

Error Handling

SimpleEpubResult.success indicates overall status. Errors are collected in SimpleEpubResult.errors:

result = process_epub("book.epub")

if not result.success:
    for error in result.errors:
        print(f"Error: {error}")

Common errors:

  • "File not found" - EPUB path invalid.
  • "Invalid ZIP/EPUB file" - Corrupted or non-EPUB file.
  • "No content.opf file found" - Missing required metadata file.
  • "Content extraction error: ..." - HTML parsing issue.
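If you want to reject bad inputs before handing them to the library, a pre-flight check along these lines can help. It reuses the first three message strings from the list above, but the conditions under which EpubSage itself emits them are an assumption here; preflight_errors is a hypothetical helper built on the standard library only.

```python
import os
import zipfile


def preflight_errors(path):
    """Return a list of problems found before any real parsing starts."""
    if not os.path.isfile(path):
        return ["File not found"]
    if not zipfile.is_zipfile(path):
        return ["Invalid ZIP/EPUB file"]
    with zipfile.ZipFile(path) as archive:
        # Accept any .opf name: the package document is not always content.opf.
        has_opf = any(name.lower().endswith(".opf") for name in archive.namelist())
    if not has_opf:
        return ["No content.opf file found"]
    return []


print(preflight_errors("does_not_exist.epub"))  # ['File not found']
```

An empty list means the file passed these basic checks; it does not guarantee that full processing will succeed.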

Development

make install   # Setup environment with uv
make format    # Run autopep8 and ruff
make lint      # Check code quality
make test      # Run test suite (60+ tests)
make clean     # Remove caches

See CONTRIBUTING.md for contribution guidelines.


License

MIT License. See LICENSE for details.
