Skip to main content

Convert common book file types to text for machine learning

Project description

Convert Ebook File

Overview

This Python script provides functionality for converting various ebook file formats (EPUB, DOCX, PDF, TXT) into a standardized text format. The script processes each file, identifying chapters, and replaces chapter headers with asterisks. It also performs OCR (Optical Character Recognition) for image-based text using GPT-4o and standardizes the text by desmartening punctuation.

Features

  • File Format Support: Handles EPUB, DOCX, PDF, and TXT formats.
  • Chapter Identification: Detects and marks chapter breaks.
  • OCR Capability: Converts text from images using OCR.
  • Text Standardization: Replaces smart punctuation with ASCII equivalents.

Requirements

To run this script, you need Python 3.9 or above and the following packages:

  • python-docx
  • openai
  • python-dotenv
  • bs4
  • pdfminer.six
  • pillow

Usage

  1. Ensure all dependencies are installed.
  2. Set your environment variable for the OpenAI API key.
  3. Place your ebook files in a known directory.
  4. Run the script with the path to the ebook file and a metadata dictionary with keys of 'title' and 'author' as arguments.
  • set save_file to False, if you want a string returned.
  • provide a Path object of a file name to be written to, to use a custom output filename.

Functions

  • convert_file(file_path: Path, metadata: dict, *, save_file: bool = True, save_path: Optional[Path] = None) -> Union[str, None]: Main function to convert an ebook file to text.

Contributing

Contributions to this project are welcome. Please use Ruff for formatting to ensure that your code follows the existing style for consistency, and follow the ProsePal Open Source Contributor's Code of Contact.

TODO

  • Increase test coverage
    • Tests for text converter
    • More edge cases and failure states
  • Better handling of ebooklib dependency
  • Add additional AI models for OCR as plugins
  • Explore additional filetypes
  • Other options for determining filetype

License

This project is licensed by ProsePal LLC under the MIT license

Version History

  • v0.1.0 (Release date: November 30, 2023)

    • Initial release
  • v0.1.1 (Release date: December 2, 2023)

    • fixed false positives for is_number
  • v0.2.0 (Release date: December 3, 2023)

    • Conversion of docx files
  • v0.3.0 (Release date: December 8, 2023)

    • Conversion of PDF files
  • v0.3.1 (Release date: January 23, 2024)

    • fixed concatenation of text in pdf conversion
    • updated pillow version to secure version
  • v1.0.0 (Release date: January 23, 2024)

    • created library instead of single module
  • v1.0.1 (Release date: March 13, 2024)

    • setup.py and requirements.txt typo fixed
  • v1.0.2 (Release date: May 17, 2024)

    • added tests, fixed minor typos
  • v1.1.0 (Release date: May 30, 2024)

    • Change to abstract factory pattern
  • v1.1.1 (Release date: May 31, 2024)

    • Pull current version of ebooklib from Github and folded it into library since package repo out of date
  • v1.1.2 (Release date: May 31, 2024)

    • FIX: Put ebooklib in correct directory.
  • v1.1.3 (Release date: October 27, 2024)

    • FIX: Initialize logging
  • v1.1.4 (Release date: November 7, 2024)

    • YANKED
  • v1.1.5 (Release date: November 7, 2024)

    • FIX: Move logging to own module
  • v1.1.6 (Release date: November 9, 2024)

    • FIX: Catch PDFSyntaxError and empty image lists, small performance improvement to run_ocr
  • v1.1.7 (Release date November 10, 2024)

    • FIX: Line concatenation issue in PDFs
  • v2.0.0 (Release date December 4, 2024)

    • REFACTOR: Converters are now packages with more streamlined constructors.
    • BREAKING FEATURE: ebook2text now takes Path objects instead of string filenames.
    • BREAKING FEATURE: Converters no longer have a ChapterSplit class. This is handled by the BookConversion class, with no more circular imports.
    • NEW FEATURE: convert_file now has optional save_file and save_path arguments to allow for custom output filenames or for a string to be returned instead.
  • v2.0.1 (Release date December 16, 2024)

    • FIX: Re-raise errors raised by PDFConverter._readfile(filename)
  • v2.0.2 (Release date December 17, 2024)

    • FIX: Add missing . to extension name for text files in converter initializer

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ebook2text-2.0.2.tar.gz (50.1 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ebook2text-2.0.2-py3-none-any.whl (53.4 kB view details)

Uploaded Python 3

File details

Details for the file ebook2text-2.0.2.tar.gz.

File metadata

  • Download URL: ebook2text-2.0.2.tar.gz
  • Upload date:
  • Size: 50.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.5.6

File hashes

Hashes for ebook2text-2.0.2.tar.gz
Algorithm Hash digest
SHA256 cdd32bd9b05981ca3c1173e793d6df28f8f0445511e416e7bb793f6b9a9e8014
MD5 ada2cd8f82eb8a529b66fcb6f6e7da16
BLAKE2b-256 b0283c8f996ea145b865f35b849d42b8b7da5e8184cc84c12ec38690e48df9b5

See more details on using hashes here.

File details

Details for the file ebook2text-2.0.2-py3-none-any.whl.

File metadata

  • Download URL: ebook2text-2.0.2-py3-none-any.whl
  • Upload date:
  • Size: 53.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.5.6

File hashes

Hashes for ebook2text-2.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 9d9c318a053a71d263a8f630c5bbd9029c414a4c29540e8334840fadb12148f6
MD5 7f682124e105ca864e6094ff7c0c7f4c
BLAKE2b-256 cd60927d7c0bc3cbb56bcf7dfb6854e68848058f6875e466b85dd0df0d88d05b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page