A package to convert PDFs to images and perform OCR.

MangoCR

Turn obnoxious PDFs into tasty mango slices!

MangoCR is a Python package that converts PDF files to text using Optical Character Recognition (OCR). It processes single or multiple PDF files and outputs the results in a clean Markdown format.

Features

  • Process single or multiple PDF files in one go
  • High-quality OCR using Tesseract
  • Markdown-formatted output
  • Progress tracking for each PDF and page
  • Maintains document structure with clear page separation
  • High-resolution image conversion (300 DPI) for optimal OCR results

Prerequisites

  • Python 3.6 or higher
  • Tesseract OCR engine

Installation

  1. Install the package from PyPI:
pip install mangoCR
  2. Install the Tesseract OCR engine:

Windows

  1. Download the Tesseract installer from the official GitHub releases page
  2. Run the installer and note the installation path.
  3. Add Tesseract to your system PATH:
    • Open System Properties → Advanced → Environment Variables
    • Under System Variables, find and select "Path"
    • Click "Edit" and add the Tesseract installation directory (typically C:\Program Files\Tesseract-OCR)
    • Click "OK" to save

macOS

Using Homebrew:

brew install tesseract

Google Colab

!apt install tesseract-ocr

Linux (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install tesseract-ocr

Linux (Fedora)

sudo dnf install tesseract
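
Whichever platform you installed on, it can help to confirm that the tesseract executable is reachable from Python before running any OCR. The following is a minimal sketch using only the standard library. The helper names find_tesseract and tesseract_version are illustrative and are not part of mangoCR.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version`, or "" if nothing was printed."""
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Check both streams, since the version banner may appear on either one.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print(f"Found {path}: {tesseract_version(path)}")
```

If the script reports that tesseract was not found, revisit the PATH steps for your platform above.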

Usage

The package provides a simple function, pdf2image_ocr, that accepts either the path to a single PDF or a list of paths.

Basic Usage

from mangoCR import pdf2image_ocr

# Process a single PDF
pdf2image_ocr("path/to/your/document.pdf")

# Process multiple PDFs
pdf_list = [
    "path/to/first.pdf",
    "path/to/second.pdf",
    "path/to/third.pdf"
]
pdf2image_ocr(pdf_list)
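
When the PDFs live in one folder, the list can be built programmatically instead of typed by hand. This is a sketch under the assumption that pdf2image_ocr accepts a list of path strings, as in the example above. The helpers collect_pdfs and ocr_folder are hypothetical names introduced here for illustration.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the PDF files directly inside `folder` as a sorted list of path strings."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def ocr_folder(folder):
    """Run pdf2image_ocr over every PDF found in `folder` and return the list used."""
    # Imported here so collect_pdfs can be used without mangoCR installed.
    from mangoCR import pdf2image_ocr

    pdf_list = collect_pdfs(folder)
    if pdf_list:
        pdf2image_ocr(pdf_list)
    return pdf_list
```

Sorting keeps the order of documents in the output predictable from run to run.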

Specifying Custom Output File

# Change the output file name/location
pdf2image_ocr("document.pdf", output_file="custom_output.md")
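
If you would rather have one Markdown file per PDF than a single combined file, the output_file argument can be set per document. The sketch below assumes output_file accepts any writable path. The helper names and the ocr_output directory are illustrative, and the directory is created up front because it is not documented whether mangoCR creates missing folders.

```python
from pathlib import Path


def output_path_for(pdf_path, output_dir="ocr_output"):
    """Map e.g. scans/report.pdf to ocr_output/report.md."""
    return str(Path(output_dir) / (Path(pdf_path).stem + ".md"))


def ocr_to_separate_files(pdf_paths, output_dir="ocr_output"):
    """Call pdf2image_ocr once per PDF, writing each result to its own file."""
    # Imported here so output_path_for can be used without mangoCR installed.
    from mangoCR import pdf2image_ocr

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in pdf_paths:
        pdf2image_ocr(pdf_path, output_file=output_path_for(pdf_path, output_dir))
```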

Output Format

The OCR results are saved in a Markdown file with the following structure:

# OCR Results for document1.pdf

## Page 1

[Extracted text from page 1]

## Page 2

[Extracted text from page 2]

# OCR Results for document2.pdf

## Page 1

[Extracted text from page 1]
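
Because the headings follow a fixed pattern, the output file can be read back into a nested dictionary for further processing. The following sketch assumes the headings look exactly as shown above. parse_ocr_markdown is a hypothetical helper, not part of mangoCR, and it will mis-split pages whose extracted text happens to contain a line in the same heading format.

```python
import re

DOC_HEADING = re.compile(r"^# OCR Results for (.+)$")
PAGE_HEADING = re.compile(r"^## Page (\d+)$")


def parse_ocr_markdown(text):
    """Split OCR output into {document name: {page number: page text}}."""
    pages = {}  # {document name: {page number: [lines]}}
    doc = None
    page = None
    for raw in text.splitlines():
        line = raw.strip()
        doc_match = DOC_HEADING.match(line)
        page_match = PAGE_HEADING.match(line)
        if doc_match:
            # A new document starts; wait for its first page heading.
            doc, page = doc_match.group(1), None
            pages.setdefault(doc, {})
        elif page_match and doc is not None:
            page = int(page_match.group(1))
            pages[doc].setdefault(page, [])
        elif doc is not None and page is not None:
            pages[doc][page].append(raw)
    return {
        name: {number: "\n".join(lines).strip() for number, lines in doc_pages.items()}
        for name, doc_pages in pages.items()
    }
```

For example, parse_ocr_markdown(open("custom_output.md", encoding="utf-8").read()) would return the text keyed by file name and page number.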

Troubleshooting

Common Issues

  1. Tesseract Not Found Error

    EnvironmentError: Tesseract is not installed or not found in PATH
    

    Solution: Ensure Tesseract is properly installed and added to your system PATH.

  2. Low Quality OCR Results

    • Ensure your PDF is of good quality
    • The default DPI is set to 300 for optimal results
    • Consider preprocessing your PDFs if they contain complicated layouts
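
To get a feel for what 300 DPI means in practice, the arithmetic below converts a page size in PDF points (1/72 inch) to pixel dimensions. This is a sketch of the general relationship only. Whether mangoCR computes its render size exactly this way is an assumption, and the function names are illustrative.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points, 1/72 of an inch


def zoom_for_dpi(dpi):
    """Scale factor that maps PDF points to pixels at the given DPI."""
    return dpi / POINTS_PER_INCH


def pixels_at_dpi(width_pt, height_pt, dpi=300):
    """Pixel dimensions of a page of the given size when rendered at `dpi`."""
    return round(width_pt * dpi / POINTS_PER_INCH), round(height_pt * dpi / POINTS_PER_INCH)


# A US Letter page is 612 x 792 points.
print(pixels_at_dpi(612, 792))  # (2550, 3300)
```

Large pages therefore produce large images, which is worth keeping in mind when processing long documents.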

PDF Requirements

  • PDFs should be readable and not password-protected
  • Scanned documents should be clear and properly aligned
  • For best results, use documents scanned at a resolution of at least 300 DPI
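
A quick pre-flight check can catch files that are missing or are not PDFs at all before they reach the OCR step. The sketch below only looks at the file signature. It does not detect password protection or scan quality, and looks_like_pdf is a hypothetical helper rather than a mangoCR function.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if `path` is an existing file that starts with the PDF signature."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        # A well-formed PDF starts with "%PDF-" followed by the version number.
        return fh.read(5) == b"%PDF-"
```

Filtering a list with [p for p in pdf_list if looks_like_pdf(p)] keeps only the files that pass this check.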

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Thanks to the Tesseract OCR team for providing the OCR engine
  • Thanks to the PyMuPDF team for the excellent PDF processing library
