A package to convert PDFs to images and perform OCR.

MangoCR

MangoCR is a Python package that converts PDF files to text using Optical Character Recognition (OCR). It processes single or multiple PDF files and outputs the results in a clean Markdown format.

Features

  • Process single or multiple PDF files in one go
  • High-quality OCR using Tesseract
  • Markdown-formatted output
  • Progress tracking for each PDF and page
  • Maintains document structure with clear page separation
  • High-resolution image conversion (300 DPI) for optimal OCR results

Prerequisites

  • Python 3.6 or higher
  • Tesseract OCR engine

Installation

  1. Install the package from PyPI:
pip install mangoCR
  2. Install the Tesseract OCR engine for your platform:

Windows

  1. Download the Tesseract installer from the official GitHub releases page
  2. Run the installer. Make sure to note the installation path
  3. Add Tesseract to your system PATH:
    • Open System Properties → Advanced → Environment Variables
    • Under System Variables, find and select "Path"
    • Click "Edit" and add the Tesseract installation directory (typically C:\Program Files\Tesseract-OCR)
    • Click "OK" to save

macOS

Using Homebrew:

brew install tesseract

Google Colab

Linux (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install tesseract-ocr

Linux (Fedora)

sudo dnf install tesseract
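
After installing, it can help to confirm that the `tesseract` executable is reachable from Python before running any OCR. The following is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are hypothetical and are not part of mangoCR.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


def tesseract_version(path):
    """Return the first line printed by `<path> --version`."""
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some builds print the version to stderr
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; revisit the installation steps.")
    else:
        print(f"Found Tesseract at {path}: {tesseract_version(path)}")
```

If the script reports that Tesseract was not found right after installation, opening a new terminal session is often enough for the updated PATH to take effect.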

Usage

The package provides a simple function pdf2image_ocr that can process either a single PDF or multiple PDFs.

Basic Usage

from mangoCR import pdf2image_ocr

# Process a single PDF
pdf2image_ocr("path/to/your/document.pdf")

# Process multiple PDFs
pdf_list = [
    "path/to/first.pdf",
    "path/to/second.pdf",
    "path/to/third.pdf"
]
pdf2image_ocr(pdf_list)

Specifying Custom Output File

# Change the output file name/location
pdf2image_ocr("document.pdf", output_file="custom_output.md")
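
Because `pdf2image_ocr` accepts a list of paths, a whole folder can be processed by building that list first. The sketch below assumes a hypothetical helper `collect_pdfs` and a hypothetical `scans` directory; neither is part of mangoCR. The mangoCR call itself is shown as a comment so the helper can be tried without Tesseract installed.

```python
from pathlib import Path


def collect_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by name, as strings."""
    folder = Path(directory)
    return sorted(
        str(p)
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Usage with mangoCR (requires the package and Tesseract to be installed):
#
#     from mangoCR import pdf2image_ocr
#     pdf2image_ocr(collect_pdfs("scans"), output_file="scans.md")
```

The helper does not descend into subdirectories; swapping `iterdir()` for `rglob("*")` would be one way to include them.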

Output Format

The OCR results are saved in a Markdown file with the following structure:

# OCR Results for document1.pdf

## Page 1

[Extracted text from page 1]

## Page 2

[Extracted text from page 2]

# OCR Results for document2.pdf

## Page 1

[Extracted text from page 1]
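
Since the output follows a regular heading pattern, it can be read back into a nested dictionary for further processing. This is a rough sketch that assumes the headings appear exactly as shown above (`# OCR Results for <name>` and `## Page <n>`); the function name `parse_ocr_markdown` is hypothetical, and OCR text that happens to contain a line in the same form would be misread as a heading.

```python
import re

DOC_HEADING = re.compile(r"^# OCR Results for (.+)$")
PAGE_HEADING = re.compile(r"^## Page (\d+)$")


def parse_ocr_markdown(markdown):
    """Split OCR output into {pdf_name: {page_number: text}}."""
    results = {}
    current_doc = None
    current_page = None
    for line in markdown.splitlines():
        doc_match = DOC_HEADING.match(line)
        page_match = PAGE_HEADING.match(line)
        if doc_match:
            # A new document starts; no page is open until its first page heading.
            current_doc = doc_match.group(1).strip()
            results[current_doc] = {}
            current_page = None
        elif page_match and current_doc is not None:
            current_page = int(page_match.group(1))
            results[current_doc][current_page] = []
        elif current_page is not None:
            results[current_doc][current_page].append(line)
    # Join the collected lines and trim the blank lines around each page.
    return {
        doc: {page: "\n".join(lines).strip() for page, lines in pages.items()}
        for doc, pages in results.items()
    }
```

For example, `parse_ocr_markdown(text)["document1.pdf"][2]` would return the extracted text of page 2 of `document1.pdf`.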

Troubleshooting

Common Issues

  1. Tesseract Not Found Error

    EnvironmentError: Tesseract is not installed or not found in PATH
    

    Solution: Ensure Tesseract is properly installed and added to your system PATH.

  2. Low Quality OCR Results

    • Ensure your PDF is of good quality
    • The default DPI is set to 300 for optimal results
    • Consider preprocessing your PDFs if they contain complicated layouts

PDF Requirements

  • PDFs should be readable and not password-protected
  • Scanned documents should be clear and properly aligned
  • For best results, use PDFs whose scanned pages have a resolution of at least 300 DPI
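
To make the 300 DPI figure concrete: PDF page sizes are measured in points (1/72 inch), so rendering at 300 DPI scales each dimension by 300/72. The short calculation below illustrates the resulting image size; `rendered_size` is a hypothetical helper, and whether mangoCR computes its images in exactly this way is an assumption.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points (1/72 inch)


def rendered_size(width_pt, height_pt, dpi=300):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(rendered_size(612, 792))          # (2550, 3300)
# At 72 DPI the page stays at one pixel per point.
print(rendered_size(612, 792, dpi=72))  # (612, 792)
```

A Letter-size page therefore becomes an image of roughly 8.4 megapixels at 300 DPI, which is worth keeping in mind for memory use on long documents.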

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Thanks to the Tesseract OCR team for providing the OCR engine
  • Thanks to the PyMuPDF team for the excellent PDF processing library

