A Python package for parsing chapters from EPUBs.

EpubChapterize

A tool that splits EPUB documents into chapters. It currently targets Project Gutenberg EPUB3 files only.

Setup

To set up the project, follow these steps:

  1. Clone the repository:

    git clone https://github.com/yourusername/EpubChapterize.git
    cd EpubChapterize
    
  2. Create a virtual environment:

    python -m venv venv
    
  3. Activate the virtual environment:

    • On macOS/Linux:
      source venv/bin/activate
      
    • On Windows:
      venv\Scripts\activate
      
  4. Install the required dependencies:

    pip install -r requirements.txt
    
  5. Install additional language models for spaCy (if needed):

    Depending on the languages you plan to process, you may need to install specific spaCy language models. Use the following commands to install them:

    • For English:
      python -m spacy download en_core_web_trf
      
    • For German:
      python -m spacy download de_dep_news_trf
      
    • For Italian (spaCy does not publish a transformer pipeline for Italian, so the large CNN model is used here):
      python -m spacy download it_core_news_lg
      
    • For Spanish:
      python -m spacy download es_dep_news_trf
      
    • For French:
      python -m spacy download fr_dep_news_trf
      

    If you are not using spaCy, skip this step.
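    Because `spacy download` installs each model as an ordinary Python package, you can check which ones are still missing before running the tool. The following is a minimal sketch, not part of EpubChapterize itself: the helper name `missing_models` and the `WANTED` list are hypothetical, and only two of the model names from the list above are used for illustration.

```python
import importlib.util

# Illustrative subset of the model names listed in the setup step.
WANTED = ["en_core_web_trf", "de_dep_news_trf"]


def missing_models(names):
    """Return the names that cannot be found as importable packages."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_models(WANTED):
        # Print the same download command the setup step uses.
        print(f"Not installed: {name} (run: python -m spacy download {name})")
```

    The check relies only on the standard library, so it works even when spaCy itself is not installed.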

Usage

This tool is primarily designed to extract chapters from Project Gutenberg ePub3 files. It works by analyzing the navigation structure, matching headers, and attempting to identify chapter divisions. Note that it may also include some preamble content, and its accuracy is not guaranteed.
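To illustrate what "analyzing the navigation structure" can look like, here is a minimal sketch that reads the table of contents from an EPUB3 navigation document. It is not the code used by chapterize.py; the function name `toc_entries` and the sample document are assumptions made for this example, and a real EPUB would first need the navigation file extracted from the archive.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
EPUB = "http://www.idpf.org/2007/ops"


def toc_entries(nav_xhtml):
    """Return (title, href) pairs from the toc nav of an EPUB3 navigation document."""
    root = ET.fromstring(nav_xhtml)
    entries = []
    for nav in root.iter(f"{{{XHTML}}}nav"):
        # epub:type may hold several space-separated values; only toc is wanted.
        if "toc" not in (nav.get(f"{{{EPUB}}}type") or "").split():
            continue
        for link in nav.iter(f"{{{XHTML}}}a"):
            # Collapse whitespace inside the link text to get a clean title.
            title = " ".join("".join(link.itertext()).split())
            entries.append((title, link.get("href")))
    return entries


# Hypothetical navigation document, for demonstration only.
SAMPLE_NAV = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:epub="http://www.idpf.org/2007/ops">
  <body>
    <nav epub:type="toc">
      <ol>
        <li><a href="ch01.xhtml">Chapter I</a></li>
        <li><a href="ch02.xhtml">Chapter II</a></li>
      </ol>
    </nav>
    <nav epub:type="landmarks">
      <ol><li><a href="cover.xhtml">Cover</a></li></ol>
    </nav>
  </body>
</html>"""

print(toc_entries(SAMPLE_NAV))
```

The landmarks list is skipped on purpose: it points at structural pages such as the cover, which is one way preamble content can end up mixed in with chapters.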

To use the tool, run:

python chapterize.py /path/to/your/epub/files/ 

or

python chapterize.py 

which uses the books directory by default.
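The optional path with a books fallback can be expressed with a single optional positional argument. The sketch below shows one way to do that with argparse; it is an assumption about how such a command line could be handled, not the actual contents of chapterize.py, and the names `parse_args`, `find_epubs`, and `epub_dir` are hypothetical.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; the directory argument is optional."""
    parser = argparse.ArgumentParser(description="Split EPUB files into chapters.")
    parser.add_argument(
        "epub_dir",
        nargs="?",              # zero or one positional argument
        default=Path("books"),  # fallback when no path is given
        type=Path,
        help="directory containing .epub files (default: books)",
    )
    return parser.parse_args(argv)


def find_epubs(directory):
    """Return the .epub files directly inside the directory, sorted by name."""
    return sorted(Path(directory).glob("*.epub"))


if __name__ == "__main__":
    args = parse_args()
    for path in find_epubs(args.epub_dir):
        print(path)
```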

Notes

  • The tool is not perfect and may require manual adjustments to the output.
  • It is currently a standalone script but may be packaged in the future.
  • Feel free to fork the repository and modify it as needed.

Contributing

If you encounter any issues, please raise a ticket in the repository. Contributions are welcome!
