A package to convert various document types to plain text.
Document Converter
Document Converter is a Python package that converts various document types (such as PDF, DOCX, and images) into plain text. It preprocesses images to improve OCR (Optical Character Recognition) results and selects the appropriate converter automatically, so you do not have to specify the document type manually.
Features
- Convert PDFs, DOCX, and images to plain text.
- Image preprocessing using OpenCV functions like grayscale conversion, blurring, thresholding, and edge detection.
- Automatic selection of the appropriate converter based on the document type.
- Easy to use with a single method to convert documents.
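The automatic converter selection mentioned above could work along these lines. This is only an illustrative sketch, not the package's actual code: the `CONVERTER_BY_EXTENSION` table, the `pick_converter` function, and the list of extensions are hypothetical names and assumptions made for this example.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a converter label.
# The package's real converters and supported extensions may differ.
CONVERTER_BY_EXTENSION = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
}


def pick_converter(path: str) -> str:
    """Return the converter label for a file, based on its extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive match
    try:
        return CONVERTER_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported document type: {path!r}") from None


print(pick_converter("report.PDF"))
print(pick_converter("scan.jpeg"))
```

Dispatching on the file extension keeps the caller's side to a single call, at the cost of trusting the file name; a stricter variant could inspect the file contents instead.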
Installation
You can install the package via pip:
pip install document-converter
You also need to install some system dependencies:
On Linux
sudo apt-get install -y poppler-utils
sudo apt-get install -y tesseract-ocr
On Windows
- Poppler:
  - Download the latest Poppler binaries from Poppler for Windows.
  - Extract the contents and add the bin folder to your system's PATH environment variable.
- Tesseract:
  - Download the latest Tesseract installer from Tesseract at UB Mannheim.
  - Run the installer and follow the prompts.
  - Make sure to add Tesseract to your system's PATH during installation.
On macOS
brew install poppler
brew install tesseract
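After installing the system dependencies on any platform, a quick check that the executables are reachable on PATH can save some debugging. The sketch below assumes the relevant executables are `pdftoppm` (shipped with Poppler) and `tesseract`; which tools the package actually invokes is an assumption, and `missing_tools` is a hypothetical helper, not part of the package.

```python
import shutil

# Assumed executables: `pdftoppm` ships with Poppler, `tesseract` with
# Tesseract OCR. The tools the package really calls may differ.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Poppler and Tesseract are both on PATH.")
```

If a tool is reported as missing on Windows, the usual cause is that its folder was not added to the PATH environment variable, or that the terminal was opened before the change was made.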