Convert common book file types to text for machine learning

Convert Ebook File

Overview

This Python library converts ebook files (EPUB, DOCX, PDF, TXT) into a standardized plain-text format. It processes each file, identifies chapters, and replaces chapter headers with asterisks. It also uses GPT-4o to perform OCR (Optical Character Recognition) on image-based text, and it standardizes the output by replacing smart punctuation with ASCII equivalents ("desmartenizing").
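
To make the chapter-marking step concrete, here is a minimal sketch of one way to replace chapter headers with asterisks. It is not the library's implementation: the regular expression, the `mark_chapters` name, and the `***` marker are all assumptions chosen for illustration, and real chapter detection would need to cover many more header styles.

```python
import re

# Hypothetical pattern (an assumption, not the library's rule): a line that
# consists only of the word "chapter" followed by one token, for example
# "Chapter 12", "CHAPTER IV", or "Chapter Seven".
CHAPTER_HEADER = re.compile(r"^\s*chapter\s+\w+\s*$", re.IGNORECASE)


def mark_chapters(text: str, marker: str = "***") -> str:
    """Replace every line that looks like a chapter header with `marker`."""
    lines = text.splitlines()
    return "\n".join(
        marker if CHAPTER_HEADER.match(line) else line for line in lines
    )


sample = "Chapter 1\nIt was a dark night.\nCHAPTER II\nMorning came."
print(mark_chapters(sample))
```

Lines that merely contain the word "chapter" in running prose are left untouched, because the pattern must match the whole line.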

Features

  • File Format Support: Handles EPUB, DOCX, PDF, and TXT formats.
  • Chapter Identification: Detects and marks chapter breaks.
  • OCR Capability: Extracts text from images using GPT-4o-based OCR.
  • Text Standardization: Replaces smart punctuation with ASCII equivalents.
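
The text standardization feature can be illustrated with a short sketch that maps smart punctuation to ASCII. The character table and the `desmarten` function name below are assumptions for illustration; the library may handle a different set of characters.

```python
# Illustrative mapping of common "smart" punctuation to ASCII equivalents.
# The exact set of characters the library replaces is not documented here,
# so this table is an assumption.
SMART_TO_ASCII = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
}

# str.maketrans accepts a dict of single characters to replacement strings.
_TRANSLATION_TABLE = str.maketrans(SMART_TO_ASCII)


def desmarten(text: str) -> str:
    """Return `text` with smart punctuation replaced by ASCII."""
    return text.translate(_TRANSLATION_TABLE)


print(desmarten("\u201cIt\u2019s fine\u2026\u201d"))
```

Using a translation table keeps the replacement to a single pass over the string, which matters for book-length inputs.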

Requirements

To run this script, you need Python 3.8 or above and the following packages:

  • python-docx
  • openai
  • python-dotenv
  • bs4
  • ebooklib
  • pdfminer.six
  • pillow
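
A quick way to confirm that the packages above are available is to check whether their import names resolve. The sketch below uses only the standard library. The mapping from distribution name to import name reflects how these packages are normally imported, and the `missing_packages` helper is a hypothetical name introduced for this example.

```python
from importlib.util import find_spec

# Distribution name (as installed with pip) -> top-level import name.
REQUIRED = {
    "python-docx": "docx",
    "openai": "openai",
    "python-dotenv": "dotenv",
    "bs4": "bs4",
    "ebooklib": "ebooklib",
    "pdfminer.six": "pdfminer",
    "pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the distribution names whose import name cannot be found."""
    return [
        dist for dist, module in required.items() if find_spec(module) is None
    ]


# Prints an empty list when every dependency is importable.
print(missing_packages())
```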

Usage

  1. Ensure all dependencies are installed.
  2. Set your environment variable for the OpenAI API key.
  3. Place your ebook files in a known directory.
  4. Call convert_file with the path to the ebook file and a metadata dictionary containing the keys 'title' and 'author'.
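
The steps above might look like the following sketch. Several details are assumptions rather than documented facts: the import path `ebook2text`, the environment variable name `OPENAI_API_KEY` (the name the openai package conventionally reads), and the helper names `build_metadata` and `convert_book`. Only the `convert_file(file_path, metadata)` signature comes from this README.

```python
import os


def build_metadata(title: str, author: str) -> dict:
    """Build the metadata dictionary with the two required keys."""
    return {"title": title, "author": author}


def convert_book(file_path: str, title: str, author: str) -> str:
    """Convert one ebook to text, failing early if no API key is set."""
    # Assumption: the key is read from OPENAI_API_KEY.
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("Set OPENAI_API_KEY before converting.")
    # Assumed import path; adjust it if the installed package exposes
    # convert_file from a different module.
    from ebook2text import convert_file

    return convert_file(file_path, build_metadata(title, author))


# Example call (requires the package, an API key, and a real file):
# text = convert_book("books/example.epub", "Example Title", "Example Author")
```

Importing inside the function keeps the module importable even when the package is not installed, so the key check can run first and report a clear error.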

Functions

  • convert_file(file_path: str, metadata: dict) -> str: Main function to convert an ebook file to text.

Contributing

Contributions to this project are welcome. Please ensure that your code follows the existing style for consistency.

License

This project is licensed by ProsePal LLC under the MIT license.

Version History

  • v0.1.0 (Release date: November 30, 2023)

    • Initial release
  • v0.1.1 (Release date: December 2, 2023)

    • fixed false positives for is_number
  • v0.2.0 (Release date: December 3, 2023)

    • Conversion of docx files
  • v0.3.0 (Release date: December 8, 2023)

    • Conversion of PDF files
  • v0.3.1 (Release date: January 23, 2024)

    • fixed concatenation of text in PDF conversion
    • updated pillow to a secure version
  • v1.0.0 (Release date: January 23, 2024)

    • created library instead of single module
  • v1.0.1 (Release date: March 13, 2024)

    • setup.py and requirements.txt typo fixed
  • v1.0.2 (Release date: May 17, 2024)

    • added tests, fixed minor typos
  • v1.1.0 (Release date: May 30, 2024)

    • Change to abstract factory pattern
  • v1.1.1 (Release date: May 31, 2024)

    • Pulled the current version of ebooklib from GitHub and folded it into the library, since the package repository is out of date
  • v1.1.2 (Release date: May 31, 2024)

    • FIX: Put ebooklib in correct directory.
  • v1.1.3 (Release date: October 27, 2024)

    • FIX: Initialize logging
