Extract text and images from EPUB files and reconstruct as Markdown
epub-content-extractor
A CLI tool that extracts text and images from EPUB files and writes them out as a set of Markdown files. It also runs as an MCP server built on FastMCP.
Installation
pip install epub-content-extractor
Run directly with uvx:
uvx epub-content-extractor --help
CLI usage
epub-extract INPUT.epub [OUTPUT_DIR]
- INPUT.epub: path to the input EPUB file (required)
- OUTPUT_DIR: output directory (defaults to {epub_dir}/{epub_stem}/ when omitted)
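The default output location can be mirrored with `pathlib`. This is only a sketch of the documented rule; `default_output_dir` is a hypothetical helper name and is not part of the package's API.

```python
from pathlib import Path


def default_output_dir(epub_path: str) -> Path:
    """Return {epub_dir}/{epub_stem}/ for the given EPUB path."""
    path = Path(epub_path)
    # parent is the directory holding the EPUB; stem is the file name without ".epub"
    return path.parent / path.stem


result = default_output_dir("books/sample.epub")
print(result.as_posix())
```

So `epub-extract books/sample.epub` with no second argument would be expected to write into a `sample` directory next to the EPUB.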
Example output
output/
├── chapter_001.md
├── chapter_002.md
└── images/
    └── fig001.png
Each .md file starts with YAML front matter:
---
title: "Book Title"
authors:
- "Author Name"
language: ja
publisher: "Publisher"
identifier: "urn:isbn:..."
epub_layout: fixed-layout
page_progression_direction: rtl
chapter_title: "Chapter 1"
spine_order: 1
---
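Downstream scripts often need to separate this front matter from the chapter text. The following is a minimal, standard-library-only sketch; `split_front_matter` is a hypothetical helper, it reads only flat `key: value` lines, and it skips list entries such as those under `authors:`. A real YAML parser is the better choice for full fidelity.

```python
def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a Markdown document into flat front matter fields and the body."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text  # no front matter present
    try:
        # Index of the closing '---' delimiter
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, text  # unterminated front matter: treat everything as body
    fields: dict[str, str] = {}
    for line in lines[1:end]:
        if line.lstrip().startswith("-") or ":" not in line:
            continue  # skip list items and anything that is not key: value
        key, _, value = line.partition(":")
        value = value.strip().strip('"')
        if value:  # keys that only introduce a list (e.g. authors:) are skipped
            fields[key.strip()] = value
    return fields, "\n".join(lines[end + 1:])


sample = """---
title: "Book Title"
authors:
- "Author Name"
language: ja
identifier: "urn:isbn:..."
spine_order: 1
---
Chapter text.
"""
fields, body = split_front_matter(sample)
print(fields["title"], fields["spine_order"])
```

All values come back as strings, so `spine_order` needs an explicit `int(...)` if it is used for sorting.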
Using as an MCP server
epub-content-extractor
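MCP tools
| Tool name | Description |
|---|---|
| extract_epub | Extracts all EPUB content as Markdown |
| get_epub_metadata | Retrieves EPUB metadata (without extracting content) |
| list_epub_spine | Lists spine items (chapters) |
Supported EPUB layouts
- Reflowable: extracts text in natural reading order from the HTML structure
- Fixed-layout: sorts elements by their position: absolute CSS coordinates (RTL/LTR supported)
- AHL: determines per spine item whether it is fixed-layout or reflowable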
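To illustrate the coordinate-based ordering for fixed-layout pages, here is a rough sketch. The package's actual sort key is not documented here, so the rule below (top to bottom, then left to right, with the horizontal order reversed for RTL) is an assumption, and `Block` and `reading_order` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    left: float  # CSS `left` of the absolutely positioned element, in px
    top: float   # CSS `top` of the absolutely positioned element, in px


def reading_order(blocks: list[Block], direction: str = "ltr") -> list[Block]:
    """Sort blocks top-to-bottom, then horizontally in the given direction.

    Assumption: for "rtl", blocks further to the right come first.
    """
    sign = -1 if direction == "rtl" else 1
    return sorted(blocks, key=lambda b: (b.top, sign * b.left))


blocks = [
    Block("B", left=300, top=10),
    Block("A", left=20, top=10),
    Block("C", left=20, top=80),
]
ltr = [b.text for b in reading_order(blocks, "ltr")]
rtl = [b.text for b in reading_order(blocks, "rtl")]
print(ltr, rtl)
```

A real page would likely also need a tolerance when comparing `top` values, since elements on the same visual line rarely share exactly the same coordinate.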
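Verifying on TestPyPI
Before a release, verify the package uploaded to TestPyPI with uvx.
Because lxml>=5.0 is not available on TestPyPI, you need to add PyPI as a supplementary index with --extra-index-url and pass --index-strategy unsafe-best-match so that the best matching version is selected across all indexes.
# Check that it starts as an MCP server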
uvx --from "epub-content-extractor" \
--index "https://test.pypi.org/simple/" \
--extra-index-url "https://pypi.org/simple/" \
--index-strategy unsafe-best-match \
epub-content-extractor
# Check that it works as a CLI tool
uvx --from "epub-content-extractor" \
--index "https://test.pypi.org/simple/" \
--extra-index-url "https://pypi.org/simple/" \
--index-strategy unsafe-best-match \
epub-extract <path to EPUB file>
# To pin a specific version
uvx --from "epub-content-extractor==0.2.2" \
--index "https://test.pypi.org/simple/" \
--extra-index-url "https://pypi.org/simple/" \
--index-strategy unsafe-best-match \
epub-content-extractor
Development
uv sync --group dev
uv run pytest tests/ -v
uv run ruff check .
File details
Details for the file epub_content_extractor-0.2.5.tar.gz.
File metadata
- Download URL: epub_content_extractor-0.2.5.tar.gz
- Upload date:
- Size: 155.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 04a3cbde9c9451a2f7bd6aefd2c5cdff99fdded4562671b7ed7465ac942361bc |
| MD5 | 63bf26e61c8964f15e31fc588d8bbcd2 |
| BLAKE2b-256 | 50bf79060a337074d75d73e533b1dc544d697555691c9019c7eb9936157670c2 |
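A downloaded archive can be checked against the SHA256 digest listed above with the standard library. The helper below is a minimal sketch; `sha256_matches` is a hypothetical name, and the demo runs on a throwaway file rather than on the real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_matches(path, expected_hex: str) -> bool:
    """Return True if the file's SHA256 digest equals expected_hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large files are not loaded into memory at once
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


# Demo on a temporary file with known contents
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo.bin"
    demo_file.write_bytes(b"example")
    demo_ok = sha256_matches(demo_file, hashlib.sha256(b"example").hexdigest())
    demo_bad = sha256_matches(demo_file, "0" * 64)

print(demo_ok, demo_bad)
```

For the real file, pass the path of the downloaded `epub_content_extractor-0.2.5.tar.gz` together with the SHA256 value from the table.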
Provenance
The following attestation bundles were made for epub_content_extractor-0.2.5.tar.gz:
Publisher: publish.yml on HizZaniya/epub-content-extractor
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: epub_content_extractor-0.2.5.tar.gz
  - Subject digest: 04a3cbde9c9451a2f7bd6aefd2c5cdff99fdded4562671b7ed7465ac942361bc
- Sigstore transparency entry: 1808354239
- Sigstore integration time:
- Permalink: HizZaniya/epub-content-extractor@559957bf194c43e94cb8784634af74832d4340ee
- Branch / Tag: refs/heads/main
- Owner: https://github.com/HizZaniya
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@559957bf194c43e94cb8784634af74832d4340ee
- Trigger Event: workflow_run
File details
Details for the file epub_content_extractor-0.2.5-py3-none-any.whl.
File metadata
- Download URL: epub_content_extractor-0.2.5-py3-none-any.whl
- Upload date:
- Size: 12.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fc2f0d8785bbda3992338bcf4ba23a1dc9797b60400c131dbbdaa1141d9c0bf9 |
| MD5 | 57f670afb399232753b4ea0c26d00d83 |
| BLAKE2b-256 | 1aa160dd2f8c2bad29eb7985e82ff9bca5cfce892016c271e645119dce2d5f42 |
Provenance
The following attestation bundles were made for epub_content_extractor-0.2.5-py3-none-any.whl:
Publisher: publish.yml on HizZaniya/epub-content-extractor
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: epub_content_extractor-0.2.5-py3-none-any.whl
  - Subject digest: fc2f0d8785bbda3992338bcf4ba23a1dc9797b60400c131dbbdaa1141d9c0bf9
- Sigstore transparency entry: 1808354257
- Sigstore integration time:
- Permalink: HizZaniya/epub-content-extractor@559957bf194c43e94cb8784634af74832d4340ee
- Branch / Tag: refs/heads/main
- Owner: https://github.com/HizZaniya
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@559957bf194c43e94cb8784634af74832d4340ee
- Trigger Event: workflow_run