
Converts a scanned PDF into an OCR'ed PDF using Tesseract-OCR and Ghostscript


This script takes a PDF file and generates the corresponding OCR’ed version.

Usage:

Single conversion:

python pypdfocr.py filename.pdf

--> filename_ocr.pdf will be generated
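
The single-file conversion can also be driven from another Python script. The following is a minimal sketch, not part of PyPDFOCR itself: the helper names `ocr_output_path` and `convert` are hypothetical, and the naming rule is inferred only from the `filename.pdf` --> `filename_ocr.pdf` example above. It also assumes the output lands next to the input file.

```python
import os
import subprocess
import sys


def ocr_output_path(pdf_path):
    """Return the expected output name: 'filename.pdf' -> 'filename_ocr.pdf'.

    Assumption: the '_ocr' suffix is inserted before the extension and the
    output is written next to the input file.
    """
    root, ext = os.path.splitext(pdf_path)
    return root + "_ocr" + ext


def convert(pdf_path, script="pypdfocr.py"):
    """Run the single-file conversion and return the expected output path.

    Hypothetical helper: it runs the same command shown above,
    `python pypdfocr.py filename.pdf`, with the current interpreter.
    """
    subprocess.check_call([sys.executable, script, pdf_path])
    return ocr_output_path(pdf_path)
```

Calling `convert("filename.pdf")` raises `subprocess.CalledProcessError` if the script exits with an error, so a caller can tell a failed conversion apart from a finished one.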

Folder monitoring (new!):

python pypdfocr.py -w watch_directory

--> Every time a pdf file is added to `watch_directory` it will be OCR'ed
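
To illustrate what folder monitoring amounts to, here is a rough sketch using only the standard library. It is not how PyPDFOCR does it: the script lists watchdog as a dependency, whereas this sketch polls the directory on a timer as a stand-in. The names `find_new_pdfs` and `watch` are hypothetical, and skipping files that end in `_ocr.pdf` is an assumption made to avoid re-processing outputs that land in the watched folder.

```python
import os
import time


def find_new_pdfs(directory, seen):
    """Return PDFs in `directory` not seen before, skipping *_ocr.pdf outputs.

    `seen` is a set of paths that is updated in place.
    """
    new_files = []
    for name in sorted(os.listdir(directory)):
        lower = name.lower()
        # Ignore non-PDF files and (by assumption) already OCR'ed outputs.
        if not lower.endswith(".pdf") or lower.endswith("_ocr.pdf"):
            continue
        path = os.path.join(directory, name)
        if path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files


def watch(directory, handle, interval=2.0):
    """Poll `directory` forever, calling `handle(path)` for each new PDF."""
    seen = set()
    while True:
        for path in find_new_pdfs(directory, seen):
            handle(path)
        time.sleep(interval)
```

A call such as `watch("watch_directory", print)` would print each new PDF path as it appears. Because `seen` starts empty, PDFs already present in the folder are reported on the first pass. Polling is simpler than event-based watching but reacts only once per interval.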

For those on Windows, because it’s such a pain to get all the PIL and PDF dependencies installed, I’ve gone ahead and made an executable available under:

dist/pypdfocr.exe

You still need to install Tesseract and Ghostscript as detailed below in the dependencies list.

Caveats

This code is brand new, barely commented, and has no unit tests. I plan to improve things as time allows.

Dependencies:

PyPDFOCR relies on the following (free) programs being installed and in the path:

  • Tesseract-OCR
  • Ghostscript
  • PIL
  • ReportLab
  • watchdog

On Mac OS X, you can install the first two using homebrew:

brew install tesseract
brew install ghostscript

The last three can be installed using a regular python manager such as pip:

pip install pil
pip install reportlab
pip install watchdog
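
Since the external programs must be on the path, a quick check before running the script can save some head-scratching. The sketch below is an illustration only. The executable names are assumptions (`tesseract` for Tesseract; `gs` for Ghostscript on Unix-like systems, `gswin64c` or `gswin32c` on Windows) and may differ on your installation. `missing_programs` and `REQUIRED` are hypothetical names.

```python
import shutil


def missing_programs(candidates):
    """Return the labels whose executables are not found on the PATH.

    `candidates` maps a label to a list of possible executable names;
    a label counts as present if any one of its names is found.
    """
    missing = []
    for label, names in candidates.items():
        if not any(shutil.which(name) for name in names):
            missing.append(label)
    return missing


# Executable names are assumptions; adjust them for your installation.
REQUIRED = {
    "Tesseract": ["tesseract"],
    "Ghostscript": ["gs", "gswin64c", "gswin32c"],
}

if __name__ == "__main__":
    for label in missing_programs(REQUIRED):
        print(label + " was not found on the PATH")
```

Run on its own, the script prints one line per missing program and nothing if both are found.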
