Skip to main content

An EPUB3 python library

Project description

EPUBLib

A spec compliant, memory efficient EPUB3 library. Designed for editing EPUBs, but can also create them.

  • Spec compliant: code aims at being compliant with the EPUB 3.3 specification (although it does not attempt to validate the EPUB. Use Ace by Daisy and EPUBCheck for that);
  • Memory efficient: leverages python standard library's zipfile module to load data into memory as needed only;
  • Designed for editing: handles EPUBs non intrusively (e.g. won't recreate the manifest and the metadata).

Installation

pip install epublib

Dependencies

Installing EPUBLib will also install its dependencies:

  • BeautifulSoup (pip install beautifulsoup)
  • lxml (pip install lxml)

Contributing

  1. Clone from gitlab.com/joaoseckler/epublib.
  2. Use uv to manage development dependencies. Sync with uv sync --all-packages
  3. pre-commit install
  4. Use basedpyright to type check your contribution. There is currently no pre-commit rule for it, but contributions are expected not to introduce any type checking errors or warnings.

Related

Usage

Basic usage

from epublib import EPUB

with EPUB("book.epub") as book:
    book.metadata.title = "New title"

    for doc in book.documents:
        new_script = doc.soup.new_tag("script", attrs={"src": "../Misc/myscript.js"})
        doc.soup.head.append(new_script)

        new_heading = doc.soup.new_tag("h1", string="New heading")
        doc.soup.body.insert(0, new_heading)

    book.update_manifest_properties()
    book.write("book-modified.epub")

Reading, writing and creating

from epublib import EPUB

# From path
with EPUB("book.epub") as book:
    book.write("book-modified.epub")

# From file
with open("book.epub", "rb") as read_file:
    with EPUB(read_file) as book, open("book-modified.epub", "wb") as f:
            book.write(f)

# Read from folder path (unzipped EPUB)
with EPUB("book-folder/") as book:
    book.write_to_folder("book-folder-modified/")

# Create new EPUB
with EPUB() as book:
    book.metadata.title = "A new book"
    book.metadata.identifier = "urn:uuid:123e4567-e89b-12d3-a456-426614174000"
    book.metadata.language = "en"
    book.nav.soup.title.string = "Navigation title"

    # the default TOC comes with one single self referential item
    book.nav.toc.text = "Toc title" # Title of the toc
    item_referencing_toc = next(book.nav.toc.items_referencing(book.nav.filename))
    item_referencing_toc.text = "Toc title"

EPUBLib does not guarantee the validity of the EPUB resulting from calling EPUB(). It is the user's responsability to add, at least:

  • a title (book.metadata.title = <title>)
  • an identifier (book.metadata.identifier = <id>)
  • a language (book.metadata.language = <language>)
  • A title for the navigation document (book.nav.soup.title.string = <title>)
  • A title for the elements of the table of contents (see example above for one way of doing it)

Accessing resources

Each resource corresponds to a file in the EPUB archive.

import zipfile

from epublib import EPUB
from epublib.media_type import MediaType, Category

with EPUB("book.epub") as book:
    book.resources #  all resources
    print([resource.filename for resource in book.resources])
    # [
    #     "mimetype",
    #     "META-INF/container.xml",
    #     "content.opf",
    #     "Text/chapter1.xhtml",
    #     "Images/image.png",
    #     ...,
    # ]

    resource = book.resources.get("Text/chapter1.xhtml")

    assert resource.filename == "Text/chapter1.xhtml"
    assert isinstance(resource.content, bytes)
    assert isinstance(resource.zipinfo, zipfile.ZipInfo)

    documents = book.documents # All XHTML and SVG resources
    images = book.images # All image resources
    scripts = book.scripts # All JavaScript resources
    styles = book.styles # All style resources

    assert book.resources.get("Text/chapter1.xhtml") # ContentDocument(Text/chapter1.xhtml)
    assert book.documents.get("Text/chapter1.xhtml") is book.resources.get("Text/chapter1.xhtml")
    assert book.images.get("Text/chapter1.xhtml") is None
    assert book.resources.get("Images/image.png") # PublicationResource(Images/image.png)

    pngs = book.resources.filter(MediaType.IMAGE_PNG) # All PNG images
    assert all(img.media_type is MediaType.IMAGE_PNG for img in pngs)

    images = book.resources.filter(Category.IMAGE) # All images. Same as book.images()
    assert all(img.media_type.category is Category.IMAGE for img in images)

Creating

from epublib import EPUB
from epublib.identifier import EPUBId
from epublib.resources import PublicationResource, ContentDocument
from epublib.resources.create import create_resource_from_path, create_resource

with EPUB("book.epub") as book:
    # Create a new resource from filesystem path
    new_resource = create_resource_from_path("new-image.jpg", "Images/name-in-epub.jpg")
    assert isinstance(new_resource, PublicationResource)
    book.resources.add(resource=new_resource)

    # Create a new resource from content

    xhtml = """
    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
    <head>
      <title>A Small Document</title>
    </head>
    <body>
      <p>A simple page!</p>
    </body>
    </html>
    """

    new_resource = create_resource(xhtml.encode(), "Text/Chapter4.xhtml")
    assert isinstance(new_resource, ContentDocument)
    book.resources.add(resource=new_resource)

    # More options when adding are available (see full signature in the API
    # documentation)
    new_resource = create_resource(xhtml.encode(), "Text/Chapter5.xhtml")
    book.resources.add(
        resource=new_resource,
        is_cover = False,
        position = 0, # position in book.resources list
                      # (and thus in archive). Default: None
        after = "Text/Chapter1.xhtml", # insert after this resource, default: None
        before = None,                 # insert before this resource

        # if None, it will be added unless it is the mimetype or the container.xml
        # file. caution: setting this to False or True may yield invalid EPUBs
        add_to_manifest = None,

        add_to_spine = None,
        spine_position = None,
        linear = None,
        add_to_toc = None,
        toc_position = None,
    )

Removing

from epublib import EPUB
from epublib.identifier import EPUBId

with EPUB("book.epub") as book:
    resource = book.resources.get("Text/chapter1.xhtml")
    book.resources.remove(resource)

    # It is possible to use the filename directly
    book.resources.remove("Images/image.png")

    # or the manifest item id
    book.resources.remove(EPUBId("image2"))

    # If it is a CSS or JS file, you can set the remove_css_js_links flag
    # To remove any <link rel="stylesheet"> or <script> tags pointing to it
    book.resources.remove("Styles/style.css", remove_css_js_links=True)

    # If it has any other type, you'll have to individually remove any
    # references to it

Renaming

from epublib import EPUB
from epublib.identifier import EPUBId

with EPUB("book.epub") as book:
    resource = book.resources.get("Text/chapter1.xhtml")
    book.resources.rename(resource, "Text/chapter-one.xhtml")

    # The same can be achieved by
    book.resources.rename("Text/chapter-one.xhtml", "Text/chapter1.xhtml")

    # or
    book.resources.rename(EPUBId("chapter1"), "Text/chapter-one.xhtml")

By default, renaming a resource will update all references to it in the rest of the book -- namely, in every XMLResource (see below). If you want to rename a resource without updating references to it, you can set the update_references flag to False:

from epublib import EPUB

with EPUB("book.epub") as book:
    book.resources.rename(
        "Text/chapter1.xhtml",
        "Text/chapter-one.xhtml",
        update_references=False,
    )

By default, these references are looked up by using the following XML attributes: ["href", "src", "full-path", "xlink:href"]. If you want to use a different set of attributes, you can pass them as a list to the reference_attrs parameter:

from epublib import EPUB

with EPUB("book.epub") as book:
    book.resources.rename(
        "Text/chapter1.xhtml",
        "Text/chapter-one.xhtml",
        reference_attrs=["data-src", "href"],
    )

Internal representation

Resources are represented by instances of epublib.resources.Resource or one of its subclasses, depending on the type of resource:

  • Resource: generic resource. Usually, the only file in the EPUB that is represented by a generic Resource is the mimetype file;

  • XMLResource: XML resources (XHTML, SVG, XML). Provides a soup attribute representing the content as a BeautifulSoup object. Subclasses Resource;

  • PublicationResource: A resource that contributes to the logic and rendering of the publication. This includes CSS files, fonts, images, JavaScript files, XHTML and SVG (although the last two have their own specific subclass: see below). All publication resources should have a manifest entry associated to them. Provides a media_type: MediaType (more on media types below). Subclasses Resource;

  • ContentDocument: A XHTML or SVG document. Subclasses XMLResource and PublicationResource;

  • PackageDocument: The package document (content.opf). Subclasses XMLResource. More about the package document below;

  • NavigationDocument: A XHTML or SVG document that represents the navigation document of the EPUB (the one with properties="nav" in the manifest). Subclasses ContentDocument. More about the navigation document below.

  • NCXFile: A XML document that represents the NCX file of the EPUB (if it exists). Subclasses PublicationResource and XMLResource. More about the NCX file below.

The class hierarchy is as follows:

                     ┌────────┐
                ┌────│Resource│───────┐
                │    └────────┘       │
                │                     │
                │                     │
                │                     │
           ┌────▼──────┐    ┌───────────────────┐
      ┌────│XMLResource│──┬─│PublicationResource│
      │    └───────────┘  │ └───────────────────┘
      │                   │
      │                   ├─────────────┐
      │                   │             │
┌─────▼─────────┐ ┌───────▼───────┐ ┌───▼───┐
│PackageDocument│ │ContentDocument│ │NCXFile│
└───────────────┘ └───────────────┘ └───────┘
                         │
                         │
                 ┌───────▼──────────┐
                 │NavigationDocument│
                 └──────────────────┘

The package document

The package document (sometimes referred to as OPF or content.opf) is "an XML document that consists of a set of elements that each encapsulate information about a particular aspect of an EPUB publication" (from the spec). It contains:

  • Metadata: title, author, language, date, etc;
  • Manifest: list of all resources in the EPUB;
  • Spine: reading order of resources;
  • Collections (optional): groupings of resources;
  • Manifest fallback chains (optional): define equivalence of resources to be used as fallbacks.

EPUBLib has specific features for handling the first three elements. Further reading at the spec section about the package document. The package document itself is a resource from the epub and is available at book.package_document.

Metadata

from datetime import datetime
from epublib import EPUB

with EPUB("book.epub") as book:
    print(book.metadata) # BookMetadata(10 items)

    # book.metadata is an alias of book.package_document.metadata
    assert book.metadata is book.package_document.metadata

    # Mandatory metadata fields are available as attributes of convenient types
    assert isinstance(book.metadata.title, str)
    assert isinstance(book.metadata.language, str)
    assert isinstance(book.metadata.modified, datetime)
    book.metadata.title = "New title"
    book.metadata.modified = datetime.now()

    # Access as item (read-only) yields internal representation
    print(book.metadata["title"])
    # DublinCoreMetadataItem(
    #     name='title',
    #     tag=<dc:title>New title</dc:title>,
    #     value='New title',
    #     id=None,
    #     dir=None,
    #     lang=None
    # )
Adding metadata
from epublib import EPUB
from epublib.package.metadata import (
    GenericMetadataItem,
    DublinCoreMetadataItem,
)

with EPUB("book.epub") as book:
    new_item = book.metadata.add("pageBreakSource", "Our print version, 1976")
    new_item_dc = book.metadata.add_dc("rights", "© 1976 Our Publisher")

    assert isinstance(new_item, GenericMetadataItem)
    assert isinstance(new_item_dc, DublinCoreMetadataItem)

    print(new_item)
    # GenericMetadataItem(name='pageBreakSource',
    #     tag=<meta property="pageBreakSource">Our print version,
    #     1976</meta>,
    #     value='Our print version,
    #     1976',
    #     id=None,
    #     dir=None,
    #     lang=None,
    #     refines=None,
    #     scheme=None
    # )

    print(new_item_dc)
    # DublinCoreMetadataItem(
    #     name='rights',
    #     tag=<dc:rights>© 1976 Our Publisher</dc:rights>,
    #     value='© 1976 Our Publisher',
    #     id=None,
    #     dir=None,
    #     lang=None
    # )
Adding other types of metadata
from epublib import EPUB
from epublib.package.metadata import MetadataItem, LinkMetadataItem

with EPUB("book.epub") as book:
    link_item = LinkMetadataItem(
        soup=book.package_document.soup,
        href="front.xhtml#meta-json",
        rel="record",
        media_type="application/xhtml+xml",
        hreflang="en",
    )
    book.metadata.add_item(link_item)

    # You can also create your own custom metadata items by subclassing MetadataItem
    from custom_item import create_some_custom_item

    custom_item = create_some_custom_item()
    assert isinstance(custom_item, MetadataItem)
    book.metadata.add_item(custom_item)
Getting all metadata
from epublib import EPUB

with EPUB("book.epub") as book:
    book.metadata.items # Each item in internal representation
    book.metadata.tag # The full metadata tag as an bs4.Tag element

Manifest

From the spec, the manifest "provides an exhaustive list of publication resources used in the rendering of the content." Each of its items needs to have:

  • an href, a relative path to the resource in the archive;
  • a media-type (see media types below);
  • a unique identifier;

and can optionally have:

The manifest is internally represented by BookManifest, and each item by ManifestItem. Instead of the relative path, we primarily use the absolute path of each resource to identify it in the EPUB (corresponding to the href and filename attributes of ManifestItem, respectivelly). If you whish to use the identifier instead, you can signal that by using EPUBId, a str subclass, to wrap the identifier string.

from epublib import EPUB
from epublib.package.manifest import BookManifest, ManifestItem
from epublib.identifier import EPUBId

with EPUB("book.epub") as book:
    # book.manifest is an alias of book.package_document.manifest
    assert book.manifest is book.package_document.manifest

    print(book.manifest) # BookManifest(4 items)
    assert all(isinstance(item, ManifestItem) for item in book.manifest.items)

    # Get manifest item by filename (absolute path). Raise KeyError if not found
    item = book.manifest["Text/chapter1.xhtml"]
    assert item

    # Get manifest item, return None if not found
    item = book.manifest.get("Text/chapter99.xhtml")
    assert item is None

    # Get manifest item by identifier (EPUBId)
    nav_item = book.manifest[EPUBId("nav")]
    assert nav_item

Adding and removing manifest items are normally done when adding or removing resources (see above), which is done under the hood by EPUB.resources. If you need custom control of manifest items regardless of their resource counterparts, you can use the add_item, insert_item and remove_item methods of BookManifest. Caution is advised, as this may result in invalid EPUBs.

Manifest properties

Each manifest item can have a set of properties, which convey additional information about the resource (read more in the spec). A non-exhaustive list of properties follows:

from epublib import EPUB
from epublib.identifier import EPUBId

with EPUB("book.epub") as book:
    item = book.manifest.get("Text/chapter1.xhtml")

    # Only do this if there are external links in chapter 1
    item.add_property("remote-resources")
    # Only do this if there are math expressions in chapter 1
    item.add_property("mathml")

    item.remove_property("remote-resources")

    assert item.has_property("mathml")
    assert not item.has_property("remote-resources")

    # There are shortcuts to the nav item and the cover image item.
    assert book.manifest.nav is book.manifest[EPUBId("nav")]

    # Get the manifest item corresponding to the cover image. Currently,
    # there is no cover.
    assert book.manifest.cover_image is None

    # Promote some image to cover image
    book.resources.set_cover_image("Images/image.png")

    assert book.manifest.cover_image is book.manifest["Images/image.png"]
    assert book.resources.cover_image is book.resources["Images/image.png"]

Spine

The spine defines the default reading order of the publication. Each spine item conveys the following information:

  • idref (required): the identifier of the corresponding manifest item;
  • linear: whether the item is part of the default reading order or not;
  • properties (optional): additional information about the item;
  • id: an identifier for the spine item itself.

Only the first one is mandatory. The spine is internally represented by BookSpine (found at book.spine, an alias of book.package_document.spine), and each item by SpineItemRef. Different than manifest items, spine items are primarily identified by their idref (their only required attribute).

from epublib import EPUB
import random

with EPUB("book.epub") as book:
    print(book.spine) # BookSpine(2 items)

    assert book.spine["nav"]
    assert book.spine["chapter1"]

    # Getting spine item by position
    assert book.spine[0] is book.spine["chapter1"]

    # If you need to get a spine item by its filename, go through the
    # manifest first (since the filename information is not stored in the spine):
    item = book.spine[book.manifest["Text/chapter1.xhtml"].id]

    # To reorder the spine, you can use the move_item method:
    book.spine.move_item("nav", 0) # Move nav to the beginning of the spine
    assert book.spine[0].idref == "nav"

    # Or completely reorder the spine
    new_order = list(book.spine.items)
    random.shuffle(new_order)

    book.spine.reorder(new_order)
    assert list(book.spine.items) == new_order

As with the manifest, adding and removing spine items are normally done when adding or removing resources (see above). Refer to the following parameters of the EPUB.resources.add method:

  • after and before;
  • add_to_spine;
  • spine_position;
  • linear.

If you need custom control of spine items the add_item, insert_item and remove_item methods of BookSpine. Caution is advised, as this may result in invalid EPUBs.

Navigation document

The navigation document is a special XHTML document that contains "human- and machine-readable global navigation information." (from the spec). In other words, it is a regular XHTML file with some extra requirements:

  • Must include exactly one nav html element with epub:type="toc" (the table of contents);
  • All nav html elements with a epub:type attribute, including the table of contents, must follow a specific structure, using only ordered lists (ol, possibly nested), list items (li), spans (span) and anchors (a);

There may also exist other nav elements with different epub:type attributes. The spec talks about two other types:

  • page-list: a list of links to the locations in the publication that correspond to page numbers in a print edition of the work;
  • landmarks: a list of links to important locations in the publication, such as the title page, table of contents, main content, bibliography, etc.

This requirements allow EPUBLib to provide specific features for handling the navigation document, which is represented by a NavigationDocument resource, available at book.nav. There are features for handling the table of contents, page list and landmarks.

from epublib import EPUB
from epublib.resources import ContentDocument

with EPUB("book.epub") as book:
    # Table of contents
    book.reset_toc(
        targets_selector = "h1, h2, h3",  # defaults to all headings, in which case
                                          # a nested toc is created
        include_filenames = False,        # Whether to include filenames in TOC entries
                                          # (i.e. hrefs with no fragments)
        spine_only = True,                # Only read from resources in the spine
                                          # (yields correctly orderered TOC)
        resource_class = ContentDocument, # Only consider resources of this class
        title="Table of contents",        # Title of the TOC
    )

    # Landmarks
    book.create_landmarks(
        include_toc = True,                          # Include TOC in landmarks
        targets_selector = "#landmark1, #landmark2", # Defaults to None,
                                                     # selecting no landmark
    )

    # This will error if a landmarks list already exists. Use the following
    # to force recreation
    book.reset_landmarks()


    # Page list
    book.create_page_list(
        id_format = "page_{page}", # If a page breaks is identified but has
                                   # no id, use this format to attribute one
        label_format = "{page}",   # Format for page label, shown in the page list
        pagebreak_selector = '[role="doc-pagebreak"], [epub|type="pagebreak"]',
    )

    # This will error if a toc already exists. Use the following to force recreation
    book.reset_page_list()

NCX file

The NCX file is an XML file used in EPUB 2 publications to define the table of contents. It has been superseded by the navigation document, but may optionally be included in EPUB 3 publications for backwards compatibility with EPUB 2 readers. There are several features of the NCX format, only part of which are represented in EPUBLib:

  • head element contains metadata, some of which are required (uid, depth, totalPageCount, maxPageNumber);
  • docTitle element contains the title of the publication;
  • docAuthor elements contain the authors of the publication;
  • navMap element contains the actual table of contents;
  • pageList element contains the list of pages.
  • navList elements (any number of them) can contains other lists of points of interest.

Currentrly, epublib does not handle any features of the NCX related to SMIL and doesn't not handle audio and image tags inside navPoints or pageTarget's.

Refer to the specification for more details.

from epublib import EPUB
from epublib.ncx import NCXHead, NCXNavMap, NCXPageList

with EPUB("book.epub") as book:
    book.generate_ncx() # use reset_ncx if one already exists
    assert book.ncx
    assert book.ncx.nav_map
    assert book.ncx.head


    assert isinstance(book.ncx.head, NCXHead)
    assert isinstance(book.ncx.nav_map, NCXNavMap)
    assert book.ncx.page_list is None # No page list yet!

    item = book.ncx.nav_map.items[0]

    assert item.href == "Text/chapter1.xhtml"
    assert item.text == "Start"

    # Will recreate the nav_map unless reset_ncx is False or there is no NCX file
    book.reset_toc(reset_ncx=True)

    # Will recreate the page_list unless reset_ncx is False or there is no NCX file
    book.reset_page_list(reset_ncx=True)
    assert isinstance(book.ncx.page_list, NCXPageList)


    # To synchronize specific parts of the NCX file with the rest of the book:
    book.ncx.sync_head(book.metadata)
    book.ncx.sync_toc(book.nav)
    book.ncx.sync_page_list(book.nav)

    # Update metadata numbers in the head of the NCX which are calculated
    # (depth, total page count, max page number and play order)
    book.ncx.update_numbers()

    # Use reset_ncx to do all of the above at once
    book.reset_ncx()

Soup and internal representations

tl;dr: If possible, do not alter the soup attribute of PackageDocument, NavigationDocument or NCXFile directly. If you do need to alter them, make sure to call book.package_document.on_soup_change() or book.nav.on_soup_change() and book.ncx.on_soup_change() afterwards.

The features described above for handling the package document, the navigation document and the NCX file involve parsing the corresponding XML/XHTML files and building a internal representation of their content. These representations are built lazily (i.e., the parsing only occurs when some of the representation is accessed). Due to the mutable nature of BeautifulSoup objects, the user may inadvertently introduce discrepancies between them and the internal representation, which may lead to errors. For example, if a user adds an item tag directly to the soup of the package document, there is no way for EPUBLib to know about the new item and add it to the BookManifest object.

If you do need to alter the soup attribute of these resources (or the tag attributes of the internal representations), there may be two scenarios:

  1. You don't need the internal representation, so we're all good.

    from epublib import EPUB
    
    with EPUB("book.epub") as book:
        new_tag = book.package_document.soup.new_tag(
            "item",
            attrs={"href": "file.txt", "media-type": "text/plain", "id": "file"},
        )
        book.manifest.tag.append(new_tag)
        book.write("book-modified.epub") # All good
    
  2. You do need the internal representation. In this case, you need to call the on_soup_change method of the corresponding resource after altering its soup.

    from epublib import EPUB
    
    with EPUB("book.epub") as book:
        new_tag = book.package_document.soup.new_tag(
            "item",
            attrs={"href": "file.txt", "media-type": "text/plain", "id": "file"},
        )
        book.package_document.soup.manifest.append(new_tag)
    
        # Mark the internal representation for reparsing
        book.package_document.on_soup_change()
    
        # Internal representation is up to date
        assert book.manifest.get("file.txt")
    

Note that the internal representation reflect its changes to the soup, so you don't need to do anything to see the changes there.

from epublib import EPUB
from epublib.resources.create import create_resource

with EPUB("book.epub") as book:
    book.resources.add_to_manifest(
        create_resource(b"Some text content", "Text/file.txt"),
        identifier="new-item"
    )

    assert book.package_document.soup.find(id="new-item")

If you completely overwrite the soup attribute of these resources, there is also no need to call on_soup_change, as the property setter will already do that for you. This is why there is no similar issue with the contents attribute: since bytes are immutable, every change to it will trigger a reparse from the property setter.

Media types

Media types (also known as MIME types or content types) are strings that represent the format of a file. They are used in EPUBs to describe the format of each resource, and are required in every manifest item.

EPUBLib provides a MediaType class that represents media types, both core and foreign.

We also introduce a helper class called Category, which represents the main category of a media type. For example, the media type image/png (MediaType.IMAGE_PNG) has the category Category.IMAGE.

from epublib.media_type import MediaType, Category

# From filename
assert MediaType.from_filename("image.png") is MediaType.IMAGE_PNG
assert MediaType.from_filename("image.jpg") is MediaType.IMAGE_JPEG
assert MediaType.from_filename("audio.ogg") is MediaType.AUDIO_OGG

# From mimetype string
assert MediaType("font/ttf") is MediaType.FONT_TTF
assert MediaType("text/css") is MediaType.CSS

# Utilities
assert MediaType.from_filename("script.js").is_js()
assert MediaType.from_filename("style.css").is_css()

# The category and mimetype are available as a properties in MediaType instances
media_type = MediaType.from_filename("image.png")
assert media_type.category is Category.IMAGE
assert media_type.value == "image/png"

The MediaType is a "flexible" enum, meaning you can instantiate it with any string value. This implementation comes from the non restrictive nature of the spec regargind media types. Any valid mime type is allowed, even if not listed as one the "core media types". In this case, the resource in question is called a foreign resource.

from epublib.media_type import MediaType, Category

media_type = MediaType("application/x-zerosize")
assert media_type.value == "application/x-zerosize"
assert media_type.category is Category.FOREIGN
assert media_type == MediaType("application/x-zerosize")

media_type = MediaType.from_filename("file.jar")
assert media_type.value == "application/java-archive"
assert media_type.category == Category.FOREIGN

Utilities

Relative path resolution

When dealing with EPUBs it is often necessary to, given a relative path (e.g. in an href or src attribute), find the full path of the referred file. The other way around may also be necessary: given the absolute filename, find the relative path from some resource to that filename. Two helper functions are provided for this:

from epublib.util import get_absolute_href, get_relative_href
from epublib import EPUB

with EPUB("book.epub") as book:
    href = book.nav.soup.select_one("a")["href"] # "chapter1.xhtml"
    absolute_path = get_absolute_href(
        origin_href=book.nav.filename, # "Text/nav.xhtml"
        href=href,                     # "chapter1.xhtml"
    )

    assert absolute_path == "Text/chapter1.xhtml"

    # Vice versa:
    relative_path = get_relative_href(
        relative_to=book.nav.filename, # "Text/nav.xhtml"
        absolute_href="Text/chapter1.xhtml",
    )

    assert relative_path == "chapter1.xhtml"

At a higher level, the EPUB.resources provides a method for resolving a string representing an href (possibly with a fragment) to the actual resource it refers to (and optionally to the tag is refers to): resolve_href.

import bs4
from epublib import EPUB

with EPUB("book.epub") as book:
    resource = book.resources.resolve_href("Text/chapter1.xhtml#section1", with_tag=False)
    assert resource is book.resources.get("Text/chapter1.xhtml")

    # If the href is found inside some resource, you can use the
    # `relative_to` parameter
    resource = book.resources.resolve_href(
        "../Text/chapter1.xhtml#section1",
        with_tag=False,
        relative_to="Styles/style.css",
    )
    assert resource is book.resources.get("Text/chapter1.xhtml")

    # To capture the tag the href refers to, use the `with_tag` parameter:
    resource, tag = book.resources.resolve_href(
        "../Text/nav.xhtml#toc",
        with_tag=True,
        relative_to="Styles/style.css",
    )
    assert resource.filename == "Text/nav.xhtml"
    assert isinstance(tag, bs4.Tag)
    assert tag["id"] == "toc"

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

epublib-0.1.3.tar.gz (3.3 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

epublib-0.1.3-py3-none-any.whl (82.1 kB view details)

Uploaded Python 3

File details

Details for the file epublib-0.1.3.tar.gz.

File metadata

  • Download URL: epublib-0.1.3.tar.gz
  • Upload date:
  • Size: 3.3 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.15

File hashes

Hashes for epublib-0.1.3.tar.gz
Algorithm Hash digest
SHA256 005e884400234f6bea8026f3368bb48d7cc0af8025d181b3efdc1142c602e80b
MD5 6540376c65dad36b749400e635b9e3c4
BLAKE2b-256 811cbba2beca8f8089548d8742024d86e879c3924013b321c185c3c2c1a6461d

See more details on using hashes here.

File details

Details for the file epublib-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: epublib-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 82.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.15

File hashes

Hashes for epublib-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 6546141f82a552b54c82b7c65cea845c0899b156150becc1f9ed748220c59609
MD5 cdf1264f29b5383fb88ee20807e67087
BLAKE2b-256 2315ae4dffbd53bd7dde9e4d8a4f3f248def98bf47e6e7f3dce710b7c8dc18f1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page