EPUB TOC

A Python tool for extracting table of contents from EPUB files with hierarchical structure support.

Features

  • Support for multiple extraction methods (NCX, epub_meta, OPF)
  • Automatic selection of the best extraction method
  • Hierarchical TOC structure preservation
  • Russian and English language support
  • JSON output format
  • Detailed logging
  • EPUB file analysis reports
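
The README does not describe how the best method is chosen. As a minimal sketch of one possible fallback strategy, the hypothetical helper below tries a list of extractor callables in order and keeps the first one that returns a non-empty TOC. The function name, the extractor signature, and the "first non-empty result wins" rule are all assumptions made for illustration, not the library's actual logic.

```python
from typing import Callable, Dict, List, Optional, Tuple

TocEntry = Dict[str, object]
# Hypothetical extractor signature: takes an EPUB path, returns TOC entries.
Extractor = Callable[[str], List[TocEntry]]


def extract_with_fallback(
    path: str, extractors: List[Tuple[str, Extractor]]
) -> Tuple[Optional[str], List[TocEntry]]:
    """Try each (name, extractor) pair in order; return the first non-empty result."""
    for name, extractor in extractors:
        try:
            toc = extractor(path)
        except Exception:
            # A failing method is skipped so the next one can be tried.
            continue
        if toc:
            return name, toc
    # No method produced any entries.
    return None, []
```

Returning the method name alongside the entries makes it possible to log which method succeeded for a given file.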

Installation

pip install epub_toc

Usage

As a module

from epub_toc import EPUBTOCParser

# Create parser
parser = EPUBTOCParser('path/to/book.epub')

# Extract TOC
toc = parser.extract_toc()

# Print to console
parser.print_toc()

# Save to JSON
parser.save_toc_to_json('output.json')
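
The same three calls can be applied to a whole directory. The sketch below is a hypothetical helper (export_all is not part of epub_toc) that takes the parser class as an argument and relies only on the constructor, extract_toc() and save_toc_to_json() calls shown above. The book_name_toc.json naming is borrowed from the analysis reports and is an assumption here.

```python
from pathlib import Path
from typing import Callable, List


def export_all(epub_dir: str, out_dir: str, parser_factory: Callable) -> List[Path]:
    """Extract the TOC of every .epub file in epub_dir into out_dir as JSON."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for epub_path in sorted(Path(epub_dir).glob("*.epub")):
        parser = parser_factory(str(epub_path))
        parser.extract_toc()
        # One output file per book, named after the EPUB file's stem.
        target = out / f"{epub_path.stem}_toc.json"
        parser.save_toc_to_json(str(target))
        written.append(target)
    return written
```

With the package installed, this would be called as export_all('books', 'toc', EPUBTOCParser) after from epub_toc import EPUBTOCParser. A file that fails to parse will raise and stop the loop; wrap the loop body in try/except if the remaining files should still be processed.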

From command line

epub-toc path/to/book.epub

EPUB File Analysis

To analyze all EPUB files in the tests/data/epub_samples directory:

python tests/integration/test_epub_analysis.py

Analysis results are saved in the reports/ directory:

  • epub_analysis_YYYYMMDD_HHMMSS.json - detailed report in JSON format
  • epub_analysis_YYYYMMDD_HHMMSS.txt - brief report in text format
  • toc/*.json - extracted TOCs for each EPUB file

Report structure:

  1. JSON report contains:

    • Overall statistics for all files
    • Success rate of each extraction method
    • Detailed results for each file
    • Links to extracted TOC files
  2. Text report includes:

    • Brief statistics
    • Information about each file
    • Paths to extracted TOCs
  3. TOC files:

    • Saved in toc/ subdirectory
    • Named as book_name_toc.json
    • Contain complete TOC in JSON format
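
Because report names embed a timestamp, the most recent JSON report can be found from the file names alone. The following is a minimal sketch that assumes the epub_analysis_YYYYMMDD_HHMMSS.json pattern described above; latest_report is a hypothetical helper, not something shipped with the package.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Matches names such as epub_analysis_20240101_120000.json
REPORT_NAME = re.compile(r"^epub_analysis_(\d{8}_\d{6})\.json$")


def latest_report(names: Iterable[str]) -> Optional[str]:
    """Return the JSON report name with the newest timestamp, or None."""
    best_name, best_time = None, None
    for name in names:
        match = REPORT_NAME.match(name)
        if not match:
            continue
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        except ValueError:
            # Digits that do not form a valid date or time are ignored.
            continue
        if best_time is None or stamp > best_time:
            best_name, best_time = name, stamp
    return best_name
```

A typical call would pass the directory listing, for example latest_report(os.listdir('reports')).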

Output Format

The TOC is saved in JSON format with the following structure:

{
  "metadata": {
    "title": "Book Title",
    "authors": ["Author 1", "Author 2"],
    "publisher": "Publisher Name",
    "publication_date": "2024-01-01",
    "language": "en",
    "description": "Book description",
    "cover_image_path": "path/to/cover.jpg",
    "isbn": "978-3-16-148410-0",
    "rights": "Copyright information",
    "series": "Series Name",
    "series_index": 1,
    "identifiers": {
      "isbn13": "978-3-16-148410-0",
      "uuid": "550e8400-e29b-41d4-a716-446655440000"
    },
    "subjects": ["Fiction", "Adventure"],
    "file_size": 1234567,
    "file_name": "book.epub"
  },
  "toc": [
    {
      "title": "Chapter 1",
      "href": "chapter1.html",
      "level": 0,
      "children": [
        {
          "title": "Section 1.1",
          "href": "chapter1.html#section1",
          "level": 1,
          "children": []
        }
      ]
    }
  ]
}

All metadata fields are optional and will be omitted if not available in the EPUB file.
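
To illustrate how this structure can be consumed, the sketch below walks the nested "children" lists recursively and reads metadata with defaults, since any metadata field may be absent. The helper names (iter_entries, format_outline) and the placeholder strings are illustrative assumptions; only the JSON keys come from the format above.

```python
import json
from typing import Dict, Iterator, List, Tuple


def iter_entries(entries: List[Dict]) -> Iterator[Tuple[int, str, str]]:
    """Yield (level, title, href) for every TOC entry, depth first."""
    for entry in entries:
        yield entry.get("level", 0), entry.get("title", ""), entry.get("href", "")
        yield from iter_entries(entry.get("children", []))


def format_outline(document: Dict) -> str:
    """Render the metadata header and an indented outline as plain text."""
    meta = document.get("metadata", {})
    # Metadata fields are optional, so fall back to placeholders.
    title = meta.get("title", "(untitled)")
    authors = ", ".join(meta.get("authors", [])) or "(unknown author)"
    lines = [f"{title} by {authors}"]
    for level, entry_title, href in iter_entries(document.get("toc", [])):
        lines.append(f"{'  ' * (level + 1)}{entry_title} -> {href}")
    return "\n".join(lines)


sample = json.loads("""
{
  "metadata": {"title": "Book Title", "authors": ["Author 1", "Author 2"]},
  "toc": [
    {"title": "Chapter 1", "href": "chapter1.html", "level": 0,
     "children": [
       {"title": "Section 1.1", "href": "chapter1.html#section1",
        "level": 1, "children": []}
     ]}
  ]
}
""")
print(format_outline(sample))
```

For a file written by save_toc_to_json, replace the inline sample with json.load on the opened file.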

Testing

The module has been successfully tested on various EPUB files:

  • Russian books (NCX method)
  • English books (epub_meta method)
  • Files with different TOC structures
  • Files of different sizes (from 400 KB to 8 MB)

Requirements

  • Python 3.7+
  • epub_meta>=0.0.7
  • lxml>=4.9.3
  • beautifulsoup4>=4.12.2

Contributing

We welcome contributions! If you'd like to help:

  1. Fork the repository
  2. Create a branch for your changes
  3. Make changes and add tests
  4. Ensure all tests pass
  5. Create a Pull Request

See CONTRIBUTING.md for details.

Security

If you discover a security vulnerability, please DO NOT create a public issue. Instead, send a report following the instructions in SECURITY.md.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Roadmap

  • Additional EPUB format support
  • Improved complex hierarchical structure handling
  • Integration with popular e-readers
  • Web service API
  • Additional language support
