A package to convert PDFs to images and perform OCR.
MangoCR
MangoCR is a Python package that converts PDF files to text using Optical Character Recognition (OCR). It processes single or multiple PDF files and outputs the results in a clean Markdown format.
Features
- Process single or multiple PDF files in one go
- High-quality OCR using Tesseract
- Markdown-formatted output
- Progress tracking for each PDF and page
- Maintains document structure with clear page separation
- High-resolution image conversion (300 DPI) for optimal OCR results
Prerequisites
- Python 3.6 or higher
- Tesseract OCR engine
Installation
- Install the package from PyPI:
pip install mangoCR
- Install Tesseract OCR engine:
Windows
- Download the Tesseract installer from the official GitHub releases page
- Run the installer. Make sure to note the installation path
- Add Tesseract to your system PATH:
- Open System Properties → Advanced → Environment Variables
- Under System Variables, find and select "Path"
- Click "Edit" and add the Tesseract installation directory (typically C:\Program Files\Tesseract-OCR)
- Click "OK" to save
macOS
Using Homebrew:
brew install tesseract
Google Colab
!apt install tesseract-ocr
Linux (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install tesseract-ocr
Linux (Fedora)
sudo dnf install tesseract
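After installing, it can help to confirm that the `tesseract` executable is actually reachable from Python before running any OCR. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are illustrative and are not part of mangoCR.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `<executable> --version`."""
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr, since some builds report the version there
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print("Found {}: {}".format(path, tesseract_version(path)))
```

If the script reports that Tesseract was not found, revisit the PATH step for your platform and open a new terminal so the updated PATH is picked up.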
Usage
The package provides a single function, pdf2image_ocr, that can process either a single PDF or multiple PDFs.
Basic Usage
from mangoCR import pdf2image_ocr
# Process a single PDF
pdf2image_ocr("path/to/your/document.pdf")
# Process multiple PDFs
pdf_list = [
"path/to/first.pdf",
"path/to/second.pdf",
"path/to/third.pdf"
]
pdf2image_ocr(pdf_list)
Specifying Custom Output File
# Change the output file name/location
pdf2image_ocr("document.pdf", output_file="custom_output.md")
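When the PDFs live in one folder, the list can be built programmatically instead of by hand. The sketch below is one possible way to do this; `collect_pdfs`, `markdown_name_for`, `ocr_folder`, and the `ocr_results.md` file name are hypothetical helpers and values, not part of mangoCR. It also assumes that a list of paths and the `output_file` argument can be combined in one call, which the examples above show only separately.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the PDF files directly inside `folder` as sorted path strings."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def markdown_name_for(pdf_path):
    """Derive an output name next to the PDF, e.g. report.pdf -> report.md."""
    return str(Path(pdf_path).with_suffix(".md"))


def ocr_folder(folder, output_file="ocr_results.md"):
    """Run OCR over every PDF in `folder` and return the list of processed paths."""
    # Imported here so the helpers above still work without mangoCR installed.
    from mangoCR import pdf2image_ocr

    pdfs = collect_pdfs(folder)
    if pdfs:
        # Assumption: a list of paths and output_file can be passed together.
        pdf2image_ocr(pdfs, output_file=output_file)
    return pdfs
```

To write one Markdown file per PDF instead of a combined file, call `pdf2image_ocr(path, output_file=markdown_name_for(path))` for each path returned by `collect_pdfs`.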
Output Format
The OCR results are saved in a Markdown file with the following structure:
# OCR Results for document1.pdf
## Page 1
[Extracted text from page 1]
## Page 2
[Extracted text from page 2]
# OCR Results for document2.pdf
## Page 1
[Extracted text from page 1]
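Because the output uses predictable headings, it can be read back into a data structure for further processing. The following is a minimal sketch that assumes the headings appear exactly as shown above; `parse_ocr_markdown` is an illustrative helper, not part of mangoCR. Note that a line of OCR text that happens to look like one of these headings would be misread as a heading.

```python
import re

DOC_HEADING = re.compile(r"^# OCR Results for (.+)$")
PAGE_HEADING = re.compile(r"^## Page (\d+)$")


def parse_ocr_markdown(text):
    """Split OCR output into {pdf_name: {page_number: page_text}}."""
    results = {}
    doc = None
    page = None
    for line in text.splitlines():
        doc_match = DOC_HEADING.match(line)
        page_match = PAGE_HEADING.match(line)
        if doc_match:
            # A new document section starts; no page is open yet.
            doc = doc_match.group(1).strip()
            results[doc] = {}
            page = None
        elif page_match and doc is not None:
            page = int(page_match.group(1))
            results[doc][page] = []
        elif doc is not None and page is not None:
            results[doc][page].append(line)
    # Join the collected lines and trim blank padding around each page.
    return {
        name: {num: "\n".join(lines).strip() for num, lines in pages.items()}
        for name, pages in results.items()
    }
```

For example, `parse_ocr_markdown(open("custom_output.md", encoding="utf-8").read())` would return a nested dictionary keyed first by PDF name and then by page number.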
Troubleshooting
Common Issues
- Tesseract Not Found Error
EnvironmentError: Tesseract is not installed or not found in PATH
Solution: Ensure Tesseract is properly installed and added to your system PATH.
- Low Quality OCR Results
- Ensure your PDF is of good quality
- The default DPI is set to 300 for optimal results
- Consider preprocessing your PDFs if they contain complicated layouts
PDF Requirements
- PDFs should be readable and not password-protected
- Scanned documents should be clear and properly aligned
- For best results, use PDFs whose pages were scanned at a resolution of at least 300 DPI
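To get a feel for what 300 DPI means in practice, recall that PDF page sizes are expressed in points, where one point is 1/72 inch. The sketch below is plain arithmetic and makes no claim about how mangoCR performs the conversion internally; `pixel_size` is an illustrative helper, and an actual renderer may round slightly differently.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are measured in points (1/72 inch)


def pixel_size(width_pt, height_pt, dpi=300):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


if __name__ == "__main__":
    # US Letter is 612 x 792 points (8.5 x 11 inches).
    print(pixel_size(612, 792))          # (2550, 3300)
    print(pixel_size(612, 792, dpi=72))  # (612, 792)
```

A US Letter page at 300 DPI is therefore about 8.4 megapixels, which is why multi-page documents can take noticeable time and memory to process.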
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Thanks to the Tesseract OCR team for providing the OCR engine
- Thanks to the PyMuPDF team for the excellent PDF processing library