Pure-Python EPUB → markdown converter plugin for dikw client import.

dikw-converter-epub

Pure-Python EPUB → markdown converter plugin for dikw-core's dikw client import. Once installed alongside dikw-core, running

dikw client import book.epub

parses the EPUB locally and commits the converted markdown + assets into <base>/sources/book/.

What it produces

Given book.epub, the plugin writes:

<base>/sources/book/
├── book.md                   # H1 book title (if any), italic author, chapter H2s
└── assets/
    ├── book.epub             # original, kept as provenance
    └── <opf-relative-path>/  # extracted images, named by their OPF manifest href
        ├── images/cover.jpg
        └── images/figure-1.png

Asset paths inside assets/ match each image's href in the EPUB's OPF manifest, i.e. the path relative to the OPF file's directory. So a Calibre-produced EPUB whose cover lives at zip path OEBPS/images/cover.jpg lands at assets/images/cover.jpg: manifest hrefs are resolved relative to the OPF file, so the OEBPS/ publication-root prefix never appears in the output path. A Pandoc EPUB whose images live under EPUB/media/ likewise produces assets/media/....
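The mapping described above can be sketched in a few lines. This is an illustration only, not the plugin's actual code: the function name asset_path_for and the OPF file name content.opf are assumptions, and the sketch does not handle percent-encoded hrefs or hrefs that climb out of the OPF directory with "..".

```python
import posixpath


def asset_path_for(opf_zip_path: str, manifest_href: str) -> tuple[str, str]:
    """Hypothetical helper: map one manifest item to its zip member and output path.

    Returns (path of the file inside the EPUB zip, path under the output dir).
    """
    # Manifest hrefs are relative to the directory that holds the OPF file.
    opf_dir = posixpath.dirname(opf_zip_path)
    zip_member = posixpath.normpath(posixpath.join(opf_dir, manifest_href))
    # The output path reuses the href itself, so the OPF directory
    # (OEBPS/, EPUB/, ...) never shows up under assets/.
    asset_rel = posixpath.normpath(manifest_href)
    return zip_member, posixpath.join("assets", asset_rel)


# Calibre-style layout (OPF file name is an assumption).
calibre = asset_path_for("OEBPS/content.opf", "images/cover.jpg")
# Pandoc-style layout (OPF file name is an assumption).
pandoc = asset_path_for("EPUB/content.opf", "media/file0.png")
```

Here calibre is the pair ("OEBPS/images/cover.jpg", "assets/images/cover.jpg"): the first element is where the bytes are read from, the second is where they would be written.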

Design choices (v0.1)

  • No third-party dependencies. Uses only zipfile, xml.etree.ElementTree, and html.parser from the Python stdlib. No ebooklib, no markdownify. Trade-off: ~5% of edge-case EPUBs (non-standard OPF layouts, exotic inline XHTML) may need follow-up patches.
  • Asset references use wikilink syntax (![[path|alt]]). dikw-core's md_inspect accepts both ![alt](path) and the wikilink form; the wikilink form is the only one that handles asset paths containing ( or ) — common in user-named EPUB files (book(1).epub) — and alt text containing ].
  • Fresh output_dir assumed. dikw-core's importer creates a fresh temp directory and hands it to convert(). If you're calling this plugin directly, pass the path of an empty directory you control; reusing a dirty directory will leave stale assets from a previous run.
  • One markdown file per EPUB. Chapters become H2 sections in a single <stem>.md. Per-chapter splitting is deferred to a future minor version.
  • Deterministic output. The same EPUB bytes produce byte-identical markdown + assets on every run (no timestamps, no random IDs).
  • <nav> / <header> / <footer> / <aside> / <script> / <style> are stripped during the XHTML walk. Navigation blocks that repeat in every chapter don't survive into the markdown.
  • Heading levels are shifted so that the book title is the only H1, chapter titles are H2, and a chapter's internal XHTML headings sit under that. If the EPUB has no <dc:title> metadata, the H1 line is skipped and chapters become the top level.
  • Non-UTF-8 XHTML is decoded with errors="replace". XHTML in the wild lies about its encoding often enough that strict decoding causes more pain than the occasional replacement character in output.
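Several of the choices above (stripped elements, shifted headings, wikilink image references, lenient decoding) can be illustrated with a small stdlib-only walker. This is a minimal sketch under stated assumptions, not the plugin's implementation: the names ChapterSketch and chapter_to_markdown are hypothetical, the shift amount of one level is a guess, and UTF-8 is assumed as the decoding target.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class ChapterSketch(HTMLParser):
    """Hypothetical walker: drops SKIP_TAGS subtrees, shifts headings, emits wikilinks."""

    def __init__(self, heading_shift: int = 1):
        super().__init__(convert_charrefs=True)
        self.heading_shift = heading_shift  # assumed shift; the real value may differ
        self.skip_depth = 0                 # > 0 while inside a stripped element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag in HEADINGS:
            # Shift the heading down, capping at markdown's deepest level.
            level = min(int(tag[1]) + self.heading_shift, 6)
            self.parts.append("\n" + "#" * level + " ")
        elif tag == "img":
            a = dict(attrs)
            # Wikilink form tolerates ( and ) in the path.
            self.parts.append(f"![[{a.get('src') or ''}|{a.get('alt') or ''}]]")
        elif tag == "p":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(self.skip_depth - 1, 0)
        elif not self.skip_depth and (tag in HEADINGS or tag == "p"):
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def chapter_to_markdown(raw: bytes, heading_shift: int = 1) -> str:
    # Lenient decode: undecodable bytes become U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    parser = ChapterSketch(heading_shift)
    parser.feed(text)
    parser.close()
    return "".join(parser.parts).strip()


sample = (
    b'<html><body><nav><a href="x">TOC</a></nav><h1>Intro</h1>'
    b'<p>Hello <img src="images/fig(1).png" alt="A fig"/></p>'
    b"<script>var x=1;</script></body></html>"
)
markdown = chapter_to_markdown(sample)
```

For this sample the nav and script contents are dropped, the chapter's h1 becomes an H2, and the image is emitted in wikilink form with its parenthesised path intact. The sketch ignores lists, emphasis, links, and tables.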

Install

# In a real dikw client environment:
pip install dikw-converter-epub

# Upgrade later:
pip install --upgrade dikw-converter-epub

# Pin a specific version:
pip install 'dikw-converter-epub==0.1.0'

# Uninstall — the entry-point disappears on next discovery.
pip uninstall dikw-converter-epub

# For local development from this monorepo:
pip install -e packages/dikw-converter-epub
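
After any of the commands above, one way to confirm which version (if any) is visible to the current interpreter is the stdlib importlib.metadata module. The helper name installed_version is hypothetical; only the distribution name comes from this page.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "dikw-converter-epub"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version string after an install, or None after an uninstall.
print(installed_version())
```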

Changelog

See CHANGELOG.md for the per-release history. Each GitHub Release also carries the same notes; published wheels and sdists are attached there for offline / air-gapped installs.

Run the tests

uv run pytest packages/dikw-converter-epub
