Convert common book file types to text for machine learning

Convert Ebook File

Overview

This Python script converts common ebook file formats (EPUB, DOCX, PDF, TXT) into a standardized plain-text format. It processes each file, identifies chapters, and replaces chapter headers with asterisks. It also uses GPT-4o to perform OCR (Optical Character Recognition) on image-based text, and standardizes the result by "desmartening" punctuation, that is, replacing smart quotes and similar characters with ASCII equivalents.
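The chapter-marking step described above can be pictured with a short sketch. This is not the package's actual detection logic, which is not documented here; the pattern, the `mark_chapters` name, and the three-asterisk marker are all assumptions made for illustration.

```python
import re

# Assumption: a header sits on its own line and looks like "Chapter 12"
# or "CHAPTER TWELVE: Title". Real books vary far more than this.
# Caveat: a prose line that starts with "Chapter" plus a word also matches.
CHAPTER_RE = re.compile(
    r"^[ \t]*chapter[ \t]+(?:\d+|[a-z-]+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)

# Assumption: the exact marker is not specified beyond "asterisks".
BREAK_MARK = "***"


def mark_chapters(text: str) -> str:
    """Replace each line that looks like a chapter header with BREAK_MARK."""
    return CHAPTER_RE.sub(BREAK_MARK, text)


sample = "Chapter 1\nIt was dark.\nCHAPTER TWO: Dawn\nLight."
marked = mark_chapters(sample)
```

A uniform marker gives downstream tooling a single token to split on, whatever the source format called its chapters.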

Features

  • File Format Support: Handles EPUB, DOCX, PDF, and TXT formats.
  • Chapter Identification: Detects and marks chapter breaks.
  • OCR Capability: Extracts text from images using GPT-4o.
  • Text Standardization: Replaces smart punctuation with ASCII equivalents.
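As an illustration of the text standardization feature, here is a minimal sketch of replacing smart punctuation with ASCII. The `desmarten` name and the choice of characters in the table are assumptions; the package may handle a different or larger set.

```python
# Assumed mapping from common "smart" characters to ASCII stand-ins.
# Unicode escapes are used so the table is unambiguous in any editor.
SMART_TO_ASCII = str.maketrans(
    {
        "\u2018": "'",    # left single quotation mark
        "\u2019": "'",    # right single quotation mark / apostrophe
        "\u201c": '"',    # left double quotation mark
        "\u201d": '"',    # right double quotation mark
        "\u2013": "-",    # en dash
        "\u2014": "-",    # em dash (assumption: a single hyphen)
        "\u2026": "...",  # horizontal ellipsis
    }
)


def desmarten(text: str) -> str:
    """Return text with the mapped smart characters replaced by ASCII."""
    return text.translate(SMART_TO_ASCII)


cleaned = desmarten("\u201cDon\u2019t\u2026\u201d")
```

`str.translate` makes a single pass over the string, so the cost stays linear even for book-length input.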

Requirements

To run this script, you need Python 3.9 or above and the following packages:

  • python-docx
  • openai
  • python-dotenv
  • bs4
  • pdfminer.six
  • pillow

Usage

  1. Ensure all dependencies are installed.
  2. Set your environment variable for the OpenAI API key.
  3. Place your ebook files in a known directory.
  4. Call convert_file with the path to the ebook file and a metadata dictionary containing the keys 'title' and 'author'.
  • Set save_file to False if you want the text returned as a string instead of written to a file.
  • Pass a Path object as save_path to write the output to a custom filename.
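The steps above might look like the following in practice. This is a sketch under assumptions: the import path `ebook2text` is inferred from the distribution name and is not confirmed by this page, and `convert_book` is a hypothetical wrapper. Only the `convert_file` parameters come from the documented signature.

```python
from pathlib import Path
from typing import Optional


def convert_book(
    book: Path, title: str, author: str, out: Optional[Path] = None
) -> Optional[str]:
    """Hypothetical wrapper: return the text, or write it to `out` if given."""
    # Assumption: convert_file is importable from the top-level package.
    # Imported lazily so this module loads even if the package is absent.
    from ebook2text import convert_file

    metadata = {"title": title, "author": author}
    if out is None:
        # save_file=False asks for the converted text back as a string.
        return convert_file(book, metadata, save_file=False)
    # save_path names the custom output file.
    return convert_file(book, metadata, save_file=True, save_path=out)
```

For example, `convert_book(Path("books/novel.epub"), "A Novel", "A. Writer")` would return the text as a string, while adding `out=Path("novel.txt")` would write it to that file instead. Both paths are placeholders.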

Functions

  • convert_file(file_path: Path, metadata: dict, *, save_file: bool = True, save_path: Optional[Path] = None) -> Union[str, None]: Main function to convert an ebook file to text.

Contributing

Contributions to this project are welcome. Please format your code with Ruff so that it matches the existing style, and follow the ProsePal Open Source Contributor's Code of Conduct.

TODO

  • Increase test coverage
    • Tests for text converter
    • More edge cases and failure states
  • Better handling of ebooklib dependency
  • Add additional AI models for OCR as plugins
  • Explore additional filetypes
  • Other options for determining filetype

License

This project is licensed by ProsePal LLC under the MIT License.
