A package to convert PDFs to images and perform OCR.

MangoCR

Turn obnoxious PDFs into tasty mango slices!

MangoCR is a Python package that converts PDF files to text using Optical Character Recognition (OCR). It processes single or multiple PDF files and outputs the results in a clean Markdown format.

Features

  • Process single or multiple PDF files in one go
  • High-quality OCR using Tesseract
  • Markdown-formatted output
  • Progress tracking for each PDF and page
  • Maintains document structure with clear page separation
  • High-resolution image conversion (300 DPI) for optimal OCR results

Prerequisites

  • Python 3.6 or higher
  • Tesseract OCR engine

Installation

  1. Install the package from PyPI:
pip install mangoCR
  2. Install the Tesseract OCR engine for your platform:

Windows

  1. Download the Tesseract installer from the official GitHub releases page
  2. Run the installer and note the installation path.
  3. Add Tesseract to your system PATH:
    • Open System Properties → Advanced → Environment Variables
    • Under System Variables, find and select "Path"
    • Click "Edit" and add the Tesseract installation directory (typically C:\Program Files\Tesseract-OCR)
    • Click "OK" to save

macOS

Using Homebrew:

brew install tesseract

Google Colab

!apt install tesseract-ocr

Linux (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install tesseract-ocr

Linux (Fedora)

sudo dnf install tesseract
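
Whichever platform you are on, it can help to confirm that the tesseract executable is reachable before running any OCR. The following is a minimal sketch that uses only the Python standard library. The helper names find_tesseract and tesseract_version are made up for this example and are not part of mangoCR.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # Some Tesseract builds print the version to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print(f"Found {path}: {tesseract_version(path)}")
```

If the script reports that tesseract was not found, revisit the PATH step for your platform and open a new terminal before trying again.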

Usage

The package provides a single function, pdf2image_ocr, that accepts either one PDF path or a list of PDF paths.

Basic Usage

from mangoCR import pdf2image_ocr

# Process a single PDF
pdf2image_ocr("path/to/your/document.pdf")

# Process multiple PDFs
pdf_list = [
    "path/to/first.pdf",
    "path/to/second.pdf",
    "path/to/third.pdf"
]
pdf2image_ocr(pdf_list)

Specifying Custom Output File

# Change the output file name/location
pdf2image_ocr("document.pdf", output_file="custom_output.md")
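
When the PDFs live in one folder, the list can be built instead of typed by hand. This is a small sketch under that assumption; collect_pdfs is a hypothetical helper written for this example, not a mangoCR function, and the call to pdf2image_ocr is left as a comment so the snippet runs without mangoCR installed.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path, as strings."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    pdf_list = collect_pdfs(".")
    print(pdf_list)
    # With mangoCR installed, the list can be passed straight to pdf2image_ocr:
    # from mangoCR import pdf2image_ocr
    # pdf2image_ocr(pdf_list, output_file="custom_output.md")
```

Sorting the paths keeps the order of the documents in the output file predictable between runs.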

Output Format

The OCR results are saved in a Markdown file with the following structure:

# OCR Results for document1.pdf

## Page 1

[Extracted text from page 1]

## Page 2

[Extracted text from page 2]

# OCR Results for document2.pdf

## Page 1

[Extracted text from page 1]
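
Because the headings follow a fixed pattern, the output file can be split back into documents and pages for further processing. The sketch below assumes the exact heading text shown above ("# OCR Results for ..." and "## Page N"); parse_ocr_markdown is a hypothetical helper for this example and is not shipped with mangoCR. Extracted text that happens to contain a line matching one of these headings would be misread as a new section.

```python
import re

DOC_HEADING = re.compile(r"^# OCR Results for (.+)$")
PAGE_HEADING = re.compile(r"^## Page (\d+)$")


def parse_ocr_markdown(markdown_text):
    """Split mangoCR-style Markdown into {pdf_name: {page_number: text}}."""
    results = {}
    current_doc = None
    current_page = None
    for line in markdown_text.splitlines():
        doc_match = DOC_HEADING.match(line)
        page_match = PAGE_HEADING.match(line)
        if doc_match:
            # A new document starts; pages are collected under its name.
            current_doc = doc_match.group(1).strip()
            results[current_doc] = {}
            current_page = None
        elif page_match and current_doc is not None:
            current_page = int(page_match.group(1))
            results[current_doc][current_page] = []
        elif current_doc is not None and current_page is not None:
            results[current_doc][current_page].append(line)
    # Join the collected lines and trim the blank lines around each page.
    return {
        doc: {page: "\n".join(lines).strip() for page, lines in pages.items()}
        for doc, pages in results.items()
    }
```

A typical use would be to read the output file with open(...).read() and look up a page with parsed["document1.pdf"][2].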

Troubleshooting

Common Issues

  1. Tesseract Not Found Error

    EnvironmentError: Tesseract is not installed or not found in PATH
    

    Solution: Ensure Tesseract is properly installed and added to your system PATH.

  2. Low Quality OCR Results

    • Ensure your PDF is of good quality
    • The default DPI is set to 300 for optimal results
    • Consider preprocessing your PDFs if they contain complicated layouts
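
To get a feel for what 300 DPI means in practice: PDF page sizes are measured in points, with 72 points per inch by default, so rendering at 300 DPI scales each dimension by 300/72. The sketch below works through that arithmetic. The helper rendered_size is hypothetical and only illustrates the calculation; it is not mangoCR's internal code.

```python
POINTS_PER_INCH = 72  # PDF user space uses 72 points per inch by default


def rendered_size(width_pt, height_pt, dpi=300):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


if __name__ == "__main__":
    # US Letter is 612 x 792 points (8.5 x 11 inches).
    print(rendered_size(612, 792))           # (2550, 3300)
    print(rendered_size(612, 792, dpi=150))  # (1275, 1650)
```

A Letter page therefore becomes an image of roughly 8.4 megapixels at 300 DPI, which is one reason large documents take noticeably longer to process.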

PDF Requirements

  • PDFs should be readable and not password-protected
  • Scanned documents should be clear and properly aligned
  • For best results, use PDFs scanned at a resolution of at least 300 DPI

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Thanks to the Tesseract OCR team for providing the OCR engine
  • Thanks to the PyMuPDF team for the excellent PDF processing library
