Converts a scanned PDF into an OCR'ed PDF using Tesseract-OCR and Ghostscript

Project description

This script takes a PDF file and generates the corresponding OCR’ed version.

Usage:

Single conversion:

python pypdfocr.py filename.pdf

--> filename_ocr.pdf will be generated
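
The naming convention above can be sketched in a few lines. This is an illustrative helper, not code taken from pypdfocr itself: the function name `ocr_output_name` is hypothetical, and it assumes the `_ocr` suffix is inserted just before the extension, as the usage line suggests.

```python
import os

def ocr_output_name(pdf_path):
    """Return the assumed output path: 'filename.pdf' -> 'filename_ocr.pdf'."""
    # Split off the extension so the suffix goes before it
    root, ext = os.path.splitext(pdf_path)
    return root + "_ocr" + ext

print(ocr_output_name("filename.pdf"))  # filename_ocr.pdf
```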

Folder monitoring (new!):

python pypdfocr.py -w watch_directory

--> Every time a pdf file is added to `watch_directory` it will be OCR'ed
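
To illustrate the idea behind folder monitoring, here is a minimal polling sketch that uses only the standard library. It is not how pypdfocr does it (the dependency list suggests the script relies on watchdog for this); the names `new_pdfs`, `watch`, and `handle` are hypothetical, and skipping files that end in `_ocr.pdf` assumes the output is written next to the input.

```python
import os
import time

def new_pdfs(directory, seen):
    """Return PDF paths in `directory` that are not yet in `seen`, and record them."""
    found = []
    for name in sorted(os.listdir(directory)):
        lowered = name.lower()
        # Skip non-PDFs and files that look like OCR output (assumed '_ocr' suffix)
        if not lowered.endswith(".pdf") or lowered.endswith("_ocr.pdf"):
            continue
        path = os.path.join(directory, name)
        if path not in seen:
            seen.add(path)
            found.append(path)
    return found

def watch(directory, handle, interval=2.0):
    """Poll `directory` forever, calling `handle(path)` once for each new PDF."""
    seen = set()
    while True:
        for path in new_pdfs(directory, seen):
            handle(path)
        time.sleep(interval)
```

Calling `watch("watch_directory", print)` would print each newly added PDF path; a real handler would run the OCR step instead. Polling is simpler than event-based watching but reacts only once per interval.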

For those on Windows, because it’s such a pain to get all the PIL and PDF dependencies installed, I’ve gone ahead and made an executable available under:

dist/pypdfocr.exe

You still need to install Tesseract and Ghostscript as detailed below in the dependencies list.

Caveats

This code is brand new, barely commented, and includes no unit tests. I plan to improve things as time allows in the near future.

Dependencies:

PyPDFOCR relies on the following (free) programs being installed and in the path:

  • Tesseract OCR
  • Ghostscript
  • PIL
  • ReportLab
  • watchdog

On Mac OS X, you can install the first two using Homebrew:

brew install tesseract
brew install ghostscript

The last three can be installed using a regular Python package manager such as pip:

pip install pil
pip install reportlab
pip install watchdog
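
Since the external programs must be in the path, a quick check before running the script can save some confusion. The sketch below is an assumption-laden example, not part of pypdfocr: `missing_programs` is a hypothetical helper, and the executable names `tesseract` and `gs` are the usual ones on Mac OS X and Linux (Windows builds of Ghostscript typically use different names).

```python
import shutil

# Assumed executable names; adjust for your platform.
REQUIRED_PROGRAMS = ("tesseract", "gs")

def missing_programs(programs=REQUIRED_PROGRAMS):
    """Return the names in `programs` that cannot be found on the PATH."""
    return [name for name in programs if shutil.which(name) is None]

if __name__ == "__main__":
    missing = missing_programs()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Tesseract and Ghostscript were found on PATH.")
```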

Download files

Source Distribution

pypdfocr-0.2.1.tar.gz (12.6 kB view details)

Uploaded Source

File details

Details for the file pypdfocr-0.2.1.tar.gz.

File metadata

  • Download URL: pypdfocr-0.2.1.tar.gz
  • Upload date:
  • Size: 12.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for pypdfocr-0.2.1.tar.gz
Algorithm     Hash digest
SHA256        e43558b2dc3ea8e7d2a90c1caf396b1857276de360cf11eeae2179101d69d16b
MD5           c33d198b0ba37284bd27c748a6ad2e67
BLAKE2b-256   b72ce12ca87130502b0567a9143133146925dda0ba028aa64721a22ba0dae272
