Convert common book file types to text for machine learning
Project description
Convert Ebook File
Overview
This Python script converts ebook files (EPUB, DOCX, PDF, TXT) into a standardized plain-text format. It processes each file, identifies chapters, and replaces chapter headers with asterisks. It also uses GPT-4o to perform OCR (Optical Character Recognition) on image-based text, and it standardizes the result by "desmartening" punctuation, that is, replacing smart punctuation with ASCII equivalents.
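As a rough illustration of the chapter-marking step described above, the sketch below replaces lines that look like chapter headers with asterisks. The regular expression, the `mark_chapters` name, and the `***` marker are assumptions made for this example; the script's own detection logic may differ.

```python
import re

# Hypothetical pattern: a line made up of the word "Chapter" followed by a
# number or a single word (for example "Chapter 1" or "CHAPTER Two").
CHAPTER_RE = re.compile(r"^[ \t]*chapter[ \t]+\w+[ \t]*$", re.IGNORECASE | re.MULTILINE)


def mark_chapters(text: str, marker: str = "***") -> str:
    """Replace every line that matches CHAPTER_RE with the marker."""
    return CHAPTER_RE.sub(marker, text)


sample = "Chapter 1\nIt was a dark night.\nCHAPTER Two\nMorning came."
marked = mark_chapters(sample)
print(marked)
```

A single fixed marker gives downstream tools one unambiguous string to split on, regardless of how the source book formatted its headings.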
Features
- File Format Support: Handles EPUB, DOCX, PDF, and TXT formats.
- Chapter Identification: Detects and marks chapter breaks.
- OCR Capability: Extracts text from images using OCR.
- Text Standardization: Replaces smart punctuation with ASCII equivalents.
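The text standardization feature can be pictured with a small translation table. This is a minimal sketch, assuming a particular set of characters; the `desmarten` name and the exact mapping are illustrative and may not match what the script covers.

```python
# Assumed mapping from smart punctuation to ASCII equivalents.
SMART_TO_ASCII = str.maketrans(
    {
        "\u2018": "'",    # left single quotation mark
        "\u2019": "'",    # right single quotation mark / apostrophe
        "\u201c": '"',    # left double quotation mark
        "\u201d": '"',    # right double quotation mark
        "\u2013": "-",    # en dash
        "\u2014": "--",   # em dash
        "\u2026": "...",  # horizontal ellipsis
    }
)


def desmarten(text: str) -> str:
    """Return text with the mapped smart punctuation replaced by ASCII."""
    return text.translate(SMART_TO_ASCII)


example = desmarten("\u201cHello\u201d \u2014 it\u2019s done\u2026")
print(example)
```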
Requirements
To run this script, you need Python 3.9 or above and the following packages:
- python-docx
- openai
- python-dotenv
- bs4
- pdfminer.six
- pillow
Usage
- Ensure all dependencies are installed.
- Set your environment variable for the OpenAI API key.
- Place your ebook files in a known directory.
- Run the script with the path to the ebook file and a metadata dictionary with the keys 'title' and 'author' as arguments.
- Set save_file to False if you want the text returned as a string.
- To use a custom output filename, pass a Path object for the file to be written as save_path.
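Before running a conversion that needs OCR, it can help to confirm that the API key is actually visible to the process. The sketch below assumes the key is stored in `OPENAI_API_KEY`, which is the variable the `openai` client library reads by default; whether this script uses that exact name is an assumption, and `require_api_key` is a hypothetical helper.

```python
import os
from typing import Mapping


def require_api_key(
    env: Mapping[str, str] = os.environ,
    name: str = "OPENAI_API_KEY",  # assumed variable name
) -> str:
    """Return the API key from env, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it or add it to a .env file")
    return key
```

Calling `require_api_key()` at startup fails fast with a clear message instead of failing later in the middle of a long conversion.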
Functions
convert_file(file_path: Path, metadata: dict, *, save_file: bool = True, save_path: Optional[Path] = None) -> Union[str, None]: Main function to convert an ebook file to text.
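The two calling styles implied by the signature above can be sketched as follows. Because the import path for `convert_file` is not stated here, the function is passed in as an argument rather than imported; `ebook_to_string` and `ebook_to_file` are hypothetical wrapper names used only for illustration.

```python
from pathlib import Path
from typing import Callable, Optional

# Type of a callable shaped like the documented convert_file signature.
Converter = Callable[..., Optional[str]]


def ebook_to_string(convert_file: Converter, file_path: Path, title: str, author: str) -> str:
    """Convert an ebook and return the text instead of writing a file."""
    metadata = {"title": title, "author": author}
    text = convert_file(file_path, metadata, save_file=False)
    if text is None:
        raise ValueError("expected a string when save_file is False")
    return text


def ebook_to_file(
    convert_file: Converter, file_path: Path, title: str, author: str, save_path: Path
) -> None:
    """Convert an ebook and write the text to a custom output path."""
    metadata = {"title": title, "author": author}
    convert_file(file_path, metadata, save_path=save_path)
```

With the package installed, the real `convert_file` would be supplied as the first argument, for example `ebook_to_string(convert_file, Path("book.epub"), "Title", "Author")`.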
Contributing
Contributions to this project are welcome. Please format your code with Ruff so that it follows the existing style, and follow the ProsePal Open Source Contributor's Code of Conduct.
TODO
- Increase test coverage
- Tests for text converter
- More edge cases and failure states
- Better handling of ebooklib dependency
- Add additional AI models for OCR as plugins
- Explore additional filetypes
- Other options for determining filetype
License
This project is licensed by ProsePal LLC under the MIT License.
Download files
File details
Details for the file ebook2text-2.0.3.tar.gz.
File metadata
- Download URL: ebook2text-2.0.3.tar.gz
- Upload date:
- Size: 29.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.5.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5905b368e8eb06e2b563dfeb04fb9be971e540a675d8d0bdeceb87fbe5e50d32 |
| MD5 | 58fbee4638e1181926f922b5a6afc237 |
| BLAKE2b-256 | 6ac40fae5a4882ca00d949d89033576775a26852b7f0144cdea758bddd89d21c |
File details
Details for the file ebook2text-2.0.3-py3-none-any.whl.
File metadata
- Download URL: ebook2text-2.0.3-py3-none-any.whl
- Upload date:
- Size: 28.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.5.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 340588ff5dafab8abca9fb58f1aba454b1bbd962da316d879dcb462b8186d89c |
| MD5 | 5026b7788e5e2a7e062014fdf1a24b70 |
| BLAKE2b-256 | 883e7c70f43f95deac24aa723639046c72c01c0b634e9463ca3650326e315b16 |
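A downloaded file can be checked against the SHA256 digests listed above using only the standard library. This is a minimal sketch; `sha256_of` and `matches_published` are hypothetical helper names, and the local file path is whatever location you saved the download to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the expected value would be the SHA256 digest shown for ebook2text-2.0.3.tar.gz.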