A package to convert PDFs to images and perform OCR.

MangoCR

MangoCR is a Python package that converts PDF files to text using Optical Character Recognition (OCR). It processes single or multiple PDF files and outputs the results in a clean Markdown format.

Features

  • Process single or multiple PDF files in one go
  • High-quality OCR using Tesseract
  • Markdown-formatted output
  • Progress tracking for each PDF and page
  • Maintains document structure with clear page separation
  • High-resolution image conversion (300 DPI) for optimal OCR results

Prerequisites

  • Python 3.6 or higher
  • Tesseract OCR engine

Installation

  1. Install the package from PyPI:
pip install mangoCR
  2. Install the Tesseract OCR engine:

Windows

  1. Download the Tesseract installer from the official GitHub releases page
  2. Run the installer and note the installation path
  3. Add Tesseract to your system PATH:
    • Open System Properties → Advanced → Environment Variables
    • Under System Variables, find and select "Path"
    • Click "Edit" and add the Tesseract installation directory (typically C:\Program Files\Tesseract-OCR)
    • Click "OK" to save

macOS

Using Homebrew:

brew install tesseract

Google Colab

!apt install tesseract-ocr

Linux (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install tesseract-ocr

Linux (Fedora)

sudo dnf install tesseract
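
Whichever platform you installed on, it can help to confirm that the `tesseract` executable is actually reachable before running any OCR. The following is a minimal sketch using only the Python standard library; the helper names `find_tesseract` and `tesseract_version` are hypothetical and are not part of MangoCR.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `<executable> --version`."""
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Depending on the release, the version banner may go to stdout or stderr.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print(f"Found {path}: {tesseract_version(path)}")
```

If the script reports that nothing was found, revisit the PATH step for your platform and open a new terminal so the change takes effect.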

Usage

The package provides a single function, pdf2image_ocr, that accepts either the path to one PDF or a list of paths.

Basic Usage

from mangoCR import pdf2image_ocr

# Process a single PDF
pdf2image_ocr("path/to/your/document.pdf")

# Process multiple PDFs
pdf_list = [
    "path/to/first.pdf",
    "path/to/second.pdf",
    "path/to/third.pdf"
]
pdf2image_ocr(pdf_list)

Specifying Custom Output File

# Change the output file name/location
pdf2image_ocr("document.pdf", output_file="custom_output.md")
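
To process every PDF in a folder, the list of paths can be built programmatically. This is a rough sketch: `collect_pdfs` and `ocr_folder` are hypothetical helpers, not part of MangoCR, and combining a list input with `output_file` in one call is an assumption based on the two usages shown above.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the PDF files directly inside `folder` as a sorted list of path strings."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def ocr_folder(folder, output_file):
    """Run pdf2image_ocr over every PDF in `folder`, writing one combined Markdown file."""
    # Imported here so collect_pdfs can be used without mangoCR installed.
    from mangoCR import pdf2image_ocr

    pdfs = collect_pdfs(folder)
    if not pdfs:
        raise FileNotFoundError(f"No PDF files found in {folder}")
    # Assumption: a list of paths and output_file can be passed together.
    pdf2image_ocr(pdfs, output_file=output_file)
    return pdfs
```

For example, `ocr_folder("scans", "scans.md")` would pass every PDF in the `scans` directory to `pdf2image_ocr` in sorted order.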

Output Format

The OCR results are saved in a Markdown file with the following structure:

# OCR Results for document1.pdf

## Page 1

[Extracted text from page 1]

## Page 2

[Extracted text from page 2]

# OCR Results for document2.pdf

## Page 1

[Extracted text from page 1]
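
Because the headings follow a fixed pattern, the output file can be read back into a nested dictionary for further processing. The parser below is a sketch that assumes the headings appear exactly as shown above; `parse_ocr_markdown` is a hypothetical helper, not part of MangoCR.

```python
import re

DOC_HEADING = re.compile(r"^# OCR Results for (.+)$")
PAGE_HEADING = re.compile(r"^## Page (\d+)$")


def parse_ocr_markdown(text):
    """Return {pdf_name: {page_number: page_text}} from MangoCR-style Markdown."""
    results = {}
    current_doc = None
    current_page = None
    buffer = []

    def flush():
        # Store the text collected so far under the current document and page.
        if current_doc is not None and current_page is not None:
            results[current_doc][current_page] = "\n".join(buffer).strip()

    for line in text.splitlines():
        doc_match = DOC_HEADING.match(line)
        page_match = PAGE_HEADING.match(line)
        if doc_match:
            flush()
            current_doc = doc_match.group(1).strip()
            results.setdefault(current_doc, {})
            current_page = None
            buffer = []
        elif page_match and current_doc is not None:
            flush()
            current_page = int(page_match.group(1))
            buffer = []
        else:
            buffer.append(line)
    flush()
    return results
```

Note that OCR text which happens to contain a line such as `## Page 3` would be treated as a heading, so this approach is only as reliable as the extracted text allows.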

Troubleshooting

Common Issues

  1. Tesseract Not Found Error

    EnvironmentError: Tesseract is not installed or not found in PATH
    

    Solution: Ensure Tesseract is properly installed and added to your system PATH.

  2. Low Quality OCR Results

    • Ensure your PDF is of good quality
    • The default DPI is set to 300 for optimal results
    • Consider preprocessing your PDFs if they contain complicated layouts
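
To get a feel for what 300 DPI means in practice, the pixel size of a rendered page can be worked out from its size in PDF points (1 point = 1/72 inch). This is illustrative arithmetic only; `rendered_size` is a hypothetical helper and does not reflect how MangoCR renders pages internally.

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points, 72 per inch


def rendered_size(width_pt, height_pt, dpi=300):
    """Return the (width, height) in pixels of a page rasterised at `dpi`."""
    width_px = round(width_pt * dpi / POINTS_PER_INCH)
    height_px = round(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(rendered_size(612, 792))       # (2550, 3300)
print(rendered_size(612, 792, 150))  # (1275, 1650)
```

Halving the DPI halves each dimension, so the image has a quarter of the pixels, which is why small print tends to degrade quickly at lower resolutions.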

PDF Requirements

  • PDFs should be readable and not password-protected
  • Scanned documents should be clear and properly aligned
  • For best results, use PDFs whose pages were scanned at 300 DPI or higher

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Thanks to the Tesseract OCR team for providing the OCR engine
  • Thanks to the PyMuPDF team for the excellent PDF processing library
