A package to convert various document types to plain text.

Document Converter

Document Converter is a Python package that converts various document types (such as PDF, DOCX, and images) into plain text. It can preprocess images to improve OCR (Optical Character Recognition) results, and it detects the file type automatically, so you do not need to specify the document type manually.

Features

  • Convert PDFs, DOCX, and images to plain text.
  • Image preprocessing using OpenCV functions like grayscale conversion, blurring, thresholding, and edge detection.
  • Automatic selection of the appropriate converter based on the document type.
  • Easy to use with a single method to convert documents.
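
The automatic converter selection described above can be pictured as a lookup keyed on the file extension. The following is a minimal sketch, not the package's actual implementation: the mapping `CONVERTER_BY_EXTENSION`, the function `pick_converter`, and the list of extensions are all hypothetical names chosen for illustration.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a converter label.
# The package's real converter names and supported extensions may differ.
CONVERTER_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".tif": "image",
    ".tiff": "image",
}


def pick_converter(path: str) -> str:
    """Return the converter label for a path, based on its extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match
    if suffix not in CONVERTER_BY_EXTENSION:
        raise ValueError(f"Unsupported document type: {suffix or '(no extension)'}")
    return CONVERTER_BY_EXTENSION[suffix]


print(pick_converter("report.PDF"))  # pdf
print(pick_converter("scan.jpeg"))   # image
```

Keying on the extension keeps the caller's code to a single call regardless of input type. A more robust variant might also inspect the file's content, since extensions can be wrong.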

Installation

You can install the package via pip:

pip install textify-docs

You also need to install the following system dependencies:

On Linux

sudo apt-get install -y poppler-utils
sudo apt-get install -y tesseract-ocr

On Windows

  1. Poppler:

    • Download the latest Poppler binaries from Poppler for Windows.
    • Extract the contents and add the bin folder to your system's PATH environment variable.
  2. Tesseract:

    • Download the latest Tesseract installer from Tesseract at UB Mannheim.
    • Run the installer and follow the prompts.
    • Make sure to add Tesseract to your system's PATH during installation.

On macOS

brew install poppler
brew install tesseract
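
After installing the system dependencies on any platform, it can help to confirm that they are reachable from Python. The sketch below assumes the relevant executables are `pdftoppm` (shipped with Poppler) and `tesseract`; the helper `find_missing_tools` is a hypothetical name and is not part of this package.

```python
import shutil

# Executables expected on PATH. Assumption: pdftoppm (from Poppler) and
# tesseract (from Tesseract OCR) are the binaries that matter here.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("Poppler and Tesseract were found on PATH.")
```

If a tool is reported missing on Windows, the usual cause is that its bin folder was not added to the PATH environment variable, or that the terminal was not restarted after the change.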
