A package to convert various document types to plain text.

Document Converter

Document Converter is a Python package designed to convert various document types (such as PDF, DOCX, and images) into plain text. This package supports preprocessing of images to enhance OCR (Optical Character Recognition) results and can handle multiple file types without needing to specify the document type manually.

Features

  • Convert PDFs, DOCX, and images to plain text.
  • Image preprocessing using OpenCV functions like grayscale conversion, blurring, thresholding, and edge detection.
  • Automatic selection of the appropriate converter based on the document type.
  • Easy to use with a single method to convert documents.
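The automatic converter selection described above can be pictured as a lookup keyed on the file extension. The following is only a minimal sketch of that idea, not the package's actual code: `convert_pdf`, `convert_docx`, `convert_image`, `pick_converter`, and `convert` are hypothetical names, and the converter bodies are placeholders standing in for real PDF, DOCX, and OCR logic.

```python
from pathlib import Path
from typing import Callable, Dict


# Hypothetical placeholder converters; a real implementation would
# extract text from the file instead of returning a label.
def convert_pdf(path: Path) -> str:
    return f"[pdf text from {path.name}]"


def convert_docx(path: Path) -> str:
    return f"[docx text from {path.name}]"


def convert_image(path: Path) -> str:
    return f"[image text from {path.name}]"


# Map lowercase file extensions to the converter that handles them.
CONVERTERS: Dict[str, Callable[[Path], str]] = {
    ".pdf": convert_pdf,
    ".docx": convert_docx,
    ".png": convert_image,
    ".jpg": convert_image,
    ".jpeg": convert_image,
}


def pick_converter(path) -> Callable[[Path], str]:
    """Return the converter registered for the file's extension."""
    suffix = Path(path).suffix.lower()
    try:
        return CONVERTERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported document type: {suffix or '(none)'}") from None


def convert(path) -> str:
    """Single entry point: choose a converter, then run it."""
    p = Path(path)
    return pick_converter(p)(p)
```

With a table like this, supporting a new format means registering one more extension, and callers only ever touch the single `convert` entry point.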

Installation

You can install the package via pip:

pip install document-converter

You also need to install the following system dependencies:

On Linux

sudo apt-get install -y poppler-utils
sudo apt-get install -y tesseract-ocr

On Windows

  1. Poppler:

    • Download the latest Poppler binaries from Poppler for Windows.
    • Extract the contents and add the bin folder to your system's PATH environment variable.
  2. Tesseract:

    • Download the latest Tesseract installer from Tesseract at UB Mannheim.
    • Run the installer and follow the prompts.
    • Make sure to add Tesseract to your system's PATH during installation.

On macOS

brew install poppler
brew install tesseract
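After installing the system dependencies, it can help to confirm that they are reachable on your PATH. The sketch below uses only the standard library. It assumes the relevant executables are `pdftoppm` (shipped with Poppler) and `tesseract`; which binaries the package actually calls is not stated here, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Executable name -> human-readable label.
# Assumption: these are the binaries worth checking for.
REQUIRED_TOOLS = {
    "pdftoppm": "Poppler",
    "tesseract": "Tesseract OCR",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return labels for every tool that cannot be found on PATH."""
    return [f"{label} ({exe})" for exe, label in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system dependencies: " + ", ".join(missing))
    else:
        print("All system dependencies were found on PATH.")
```

If a tool is reported missing on Windows right after installation, opening a new terminal usually picks up the updated PATH.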

Download files

  • Source distribution: textify_docs-1.0.0.tar.gz (20.1 kB)
  • Built distribution: textify_docs-1.0.0-py3-none-any.whl (21.3 kB, Python 3)