EPUB TOC
A Python tool for extracting table of contents from EPUB files with hierarchical structure support.
Features
- Multiple extraction methods support (NCX, epub_meta, OPF)
- Automatic best method selection
- Hierarchical TOC structure preservation
- Russian and English language support
- JSON output format
- Detailed logging
- EPUB file analysis reports
Installation
pip install epub_toc
Usage
As a module
from epub_toc import EPUBTOCParser
# Create parser
parser = EPUBTOCParser('path/to/book.epub')
# Extract TOC
toc = parser.extract_toc()
# Print to console
parser.print_toc()
# Save to JSON
parser.save_toc_to_json('output.json')
From command line
epub-toc path/to/book.epub
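To process several books at once, the module API shown above can be wrapped in a small loop. The following is a minimal sketch, not part of the package: export_tocs and its parser_cls parameter are hypothetical names, and the only parser calls it uses are the ones already shown (the constructor, extract_toc and save_toc_to_json). How the parser reports failures is not documented here, so the sketch assumes errors surface as exceptions and simply records them.

```python
from pathlib import Path


def export_tocs(epub_dir, out_dir, parser_cls=None):
    """Extract the TOC of every .epub file in epub_dir into out_dir as JSON.

    Returns a dict mapping each EPUB file name to the JSON path written,
    or to None if extraction failed for that file.
    """
    if parser_cls is None:
        # Imported lazily so the helper can also be run with a stand-in class.
        from epub_toc import EPUBTOCParser as parser_cls

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    results = {}
    for epub_path in sorted(Path(epub_dir).glob("*.epub")):
        target = out_dir / f"{epub_path.stem}_toc.json"
        try:
            parser = parser_cls(str(epub_path))
            parser.extract_toc()
            parser.save_toc_to_json(str(target))
            results[epub_path.name] = target
        except Exception as exc:  # assumption: failures surface as exceptions
            print(f"Skipping {epub_path.name}: {exc}")
            results[epub_path.name] = None
    return results
```

Calling export_tocs('books', 'toc') would then write one JSON file per book, following the book_name_toc.json naming used for TOC files in the analysis reports.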
EPUB File Analysis
To analyze all EPUB files in the tests/data/epub_samples directory:
python tests/integration/test_epub_analysis.py
Analysis results are saved in the reports/ directory:
- epub_analysis_YYYYMMDD_HHMMSS.json: detailed report in JSON format
- epub_analysis_YYYYMMDD_HHMMSS.txt: brief report in text format
- toc/*.json: extracted TOCs for each EPUB file
Report structure:
- JSON report contains:
  - Overall statistics for all files
  - Extraction methods success rate
  - Detailed results for each file
  - Links to extracted TOC files
- Text report includes:
  - Brief statistics
  - Information about each file
  - Paths to extracted TOCs
- TOC files:
  - Saved in the toc/ subdirectory
  - Named as book_name_toc.json
  - Contain complete TOC in JSON format
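Because each run writes a new timestamped report, it can be handy to locate the most recent one. The sketch below relies only on the epub_analysis_YYYYMMDD_HHMMSS.json naming described above; latest_report is a hypothetical helper, not something the package provides, and it makes no assumptions about the keys inside the report.

```python
from datetime import datetime
from pathlib import Path


def latest_report(reports_dir="reports"):
    """Return (path, timestamp) of the newest epub_analysis_*.json, or None."""
    newest = None
    for path in Path(reports_dir).glob("epub_analysis_*.json"):
        stamp = path.stem[len("epub_analysis_"):]
        try:
            when = datetime.strptime(stamp, "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # file name does not follow the timestamp pattern
        if newest is None or when > newest[1]:
            newest = (path, when)
    return newest
```

The timestamp is parsed from the file name rather than taken from the file's modification time, so copied or restored reports still sort by when the analysis was run.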
Output Format
TOC is saved in JSON format with the following structure:
{
  "metadata": {
    "title": "Book Title",
    "authors": ["Author 1", "Author 2"],
    "publisher": "Publisher Name",
    "publication_date": "2024-01-01",
    "language": "en",
    "description": "Book description",
    "cover_image_path": "path/to/cover.jpg",
    "isbn": "978-3-16-148410-0",
    "rights": "Copyright information",
    "series": "Series Name",
    "series_index": 1,
    "identifiers": {
      "isbn13": "978-3-16-148410-0",
      "uuid": "550e8400-e29b-41d4-a716-446655440000"
    },
    "subjects": ["Fiction", "Adventure"],
    "file_size": 1234567,
    "file_name": "book.epub"
  },
  "toc": [
    {
      "title": "Chapter 1",
      "href": "chapter1.html",
      "level": 0,
      "children": [
        {
          "title": "Section 1.1",
          "href": "chapter1.html#section1",
          "level": 1,
          "children": []
        }
      ]
    }
  ]
}
All metadata fields are optional and will be omitted if not available in the EPUB file.
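A saved TOC file can be consumed with the standard json module alone. The following is a minimal sketch based on the structure shown above; iter_toc and summarize are hypothetical helper names, and every metadata lookup uses a default because any field may be missing.

```python
import json


def iter_toc(entries):
    """Yield (level, title, href) for every entry, depth-first."""
    for entry in entries:
        yield entry.get("level", 0), entry.get("title", ""), entry.get("href", "")
        yield from iter_toc(entry.get("children", []))


def summarize(path):
    """Load a saved TOC file and return an indented outline as a list of lines."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)

    # Metadata fields are optional, so read them with fallbacks.
    metadata = data.get("metadata", {})
    title = metadata.get("title", "(untitled)")
    authors = ", ".join(metadata.get("authors", [])) or "(unknown author)"

    lines = [f"{title} by {authors}"]
    for level, entry_title, href in iter_toc(data.get("toc", [])):
        lines.append(f"{'  ' * level}{entry_title} -> {href}")
    return lines
```

For example, print("\n".join(summarize("output.json"))) prints the book title followed by one line per TOC entry, indented two spaces per level.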
Testing
The module has been successfully tested on various EPUB files:
- Russian books (NCX method)
- English books (epub_meta method)
- Files with different TOC structures
- Files of different sizes (from 400KB to 8MB)
Requirements
- Python 3.7+
- epub_meta>=0.0.7
- lxml>=4.9.3
- beautifulsoup4>=4.12.2
Contributing
We welcome contributions! If you'd like to help:
- Fork the repository
- Create a branch for your changes
- Make changes and add tests
- Ensure all tests pass
- Create a Pull Request
See CONTRIBUTING.md for details.
Security
If you discover a security vulnerability, please do not create a public issue. Instead, send a report following the instructions in SECURITY.md.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Roadmap
- Additional EPUB format support
- Improved complex hierarchical structure handling
- Integration with popular e-readers
- Web service API
- Additional language support