Document Converter

Document Converter is a Python package that converts various document types (such as PDF, DOCX, and images) into plain text. It can preprocess images to improve OCR (Optical Character Recognition) results, and it selects the appropriate converter for each file, so you do not need to specify the document type manually.

Features

Convert PDFs, DOCX, and images to plain text.
Image preprocessing using OpenCV functions like grayscale conversion, blurring, thresholding, and edge detection.
Automatic selection of the appropriate converter based on the document type.
Easy to use with a single method to convert documents.
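
The automatic converter selection described above can be pictured as a lookup keyed on the file extension. The following is only a minimal sketch under that assumption: the names convert, CONVERTERS, pdf_to_text, docx_to_text and image_to_text are hypothetical and are not taken from the package, and the converter bodies are placeholders rather than real extraction code.

```python
from pathlib import Path
from typing import Callable, Dict, Union


# Hypothetical placeholder converters. They only stand in for real
# PDF, DOCX and image (OCR) extraction, which this sketch does not implement.
def pdf_to_text(path: Path) -> str:
    return f"[pdf text from {path.name}]"


def docx_to_text(path: Path) -> str:
    return f"[docx text from {path.name}]"


def image_to_text(path: Path) -> str:
    return f"[image text from {path.name}]"


# Map lowercase file extensions to the converter that handles them.
CONVERTERS: Dict[str, Callable[[Path], str]] = {
    ".pdf": pdf_to_text,
    ".docx": docx_to_text,
    ".png": image_to_text,
    ".jpg": image_to_text,
    ".jpeg": image_to_text,
}


def convert(path_like: Union[str, Path]) -> str:
    """Pick a converter from the file extension and run it."""
    path = Path(path_like)
    suffix = path.suffix.lower()  # treat ".PDF" and ".pdf" alike
    try:
        converter = CONVERTERS[suffix]
    except KeyError:
        shown = suffix or "(no extension)"
        raise ValueError(f"Unsupported document type: {shown}") from None
    return converter(path)


if __name__ == "__main__":
    print(convert("report.PDF"))
    print(convert("scan.jpeg"))
```

Keeping the mapping in one dictionary means a new file type only needs one extra entry, which is one plausible way to offer a single entry point for all document types.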

Installation

You can install the package via pip (the distribution published on this page is named textify_docs):

pip install textify-docs

You also need to install some system dependencies:

On Linux (Debian/Ubuntu)

sudo apt-get install -y poppler-utils
sudo apt-get install -y tesseract-ocr

On Windows

Poppler:
- Download the latest Poppler binaries from Poppler for Windows.
- Extract the contents and add the bin folder to your system's PATH environment variable.
Tesseract:
- Download the latest Tesseract installer from Tesseract at UB Mannheim.
- Run the installer and follow the prompts.
- Make sure to add Tesseract to your system's PATH during installation.

On macOS

brew install poppler
brew install tesseract
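
After installing the system dependencies, it can help to confirm that they are reachable on PATH. The sketch below is an illustration, not part of the package: it assumes that Poppler is detected through its pdftoppm executable (which ships with Poppler's command-line utilities) and Tesseract through the tesseract executable, and the names REQUIRED_TOOLS and find_system_tools are hypothetical.

```python
import shutil
from typing import Dict, Optional

# Assumption: these executables indicate that the system packages are installed.
# pdftoppm is part of Poppler's command-line tools; tesseract is the OCR engine's CLI.
REQUIRED_TOOLS = {"poppler": "pdftoppm", "tesseract": "tesseract"}


def find_system_tools() -> Dict[str, Optional[str]]:
    """Return the resolved path of each tool, or None if it is not on PATH."""
    return {name: shutil.which(exe) for name, exe in REQUIRED_TOOLS.items()}


if __name__ == "__main__":
    for name, location in find_system_tools().items():
        status = location if location else "NOT FOUND on PATH"
        print(f"{name}: {status}")
```

If a tool is reported as not found on Windows, the usual cause is that its bin folder was not added to the PATH environment variable, or that the terminal was opened before PATH was changed.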

Release history

1.0.1 (Aug 24, 2024)
1.0.0 (Aug 23, 2024), this version
0.1.1 (Aug 12, 2024)
0.1.0 (Aug 12, 2024)

Download files

Source Distribution

textify_docs-1.0.0.tar.gz (20.1 kB)

Uploaded Aug 23, 2024 Source

Built Distribution

textify_docs-1.0.0-py3-none-any.whl (21.3 kB)

Uploaded Aug 23, 2024 Python 3

Hashes for textify_docs-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`25405be89d687fb7363bb301c69b1cd0d17c59210f7c3160f520b0f75abd073a`
MD5	`c771709cb01d5da0b4c52f064f4abc63`
BLAKE2b-256	`279997536d32f2ec1fa3135f45055d64e38057e1916173eeb16124ee7c9b645f`

Hashes for textify_docs-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e3077441130076bab12ed35bd6f80be2e833e897c479115fa9bf4d5b19e49588`
MD5	`ba0d6e4db02ecef5e4e151beda8f6cba`
BLAKE2b-256	`22f070fe002e37d23de3aac7dabe50add460716c8441b927bd80c3a381b07e48`