Converts a scanned PDF into an OCR'ed PDF using Tesseract-OCR and Ghostscript

Project description

This script takes a PDF file and generates the corresponding OCR’ed version.

Usage:

Single conversion:

python pypdfocr.py filename.pdf

--> filename_ocr.pdf will be generated
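
The naming rule above (filename.pdf becomes filename_ocr.pdf) can be sketched in Python. This is only an illustration, not part of pypdfocr: the helper names `expected_output_path` and `run_single_conversion` are hypothetical, and it assumes the output is written next to the input and that the interpreter running the sketch can also run pypdfocr.py.

```python
import subprocess
import sys
from pathlib import Path


def expected_output_path(pdf_path):
    """Return <name>_ocr.pdf for <name>.pdf, following the README's naming."""
    pdf_path = Path(pdf_path)
    # Assumption: the OCR'ed file lands in the same folder as the input.
    return pdf_path.with_name(pdf_path.stem + "_ocr.pdf")


def run_single_conversion(pdf_path, script="pypdfocr.py"):
    """Run `python pypdfocr.py <pdf_path>` and return the expected output path.

    `script` is assumed to be the path to pypdfocr.py; a non-zero exit
    status raises subprocess.CalledProcessError.
    """
    subprocess.run([sys.executable, script, str(pdf_path)], check=True)
    return expected_output_path(pdf_path)
```

Calling `run_single_conversion("filename.pdf")` would then be equivalent to the command line shown above, with the returned path pointing at the file to look for afterwards.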

Folder monitoring (new!):

python pypdfocr.py -w watch_directory

--> Every time a pdf file is added to `watch_directory` it will be OCR'ed
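
To make the idea of folder monitoring concrete, here is a simplified polling sketch that uses only the standard library. It is not how pypdfocr implements `-w` (the dependency list suggests it uses the watchdog package); the function names are hypothetical, and skipping files that end in `_ocr` is an assumption made here so that generated output is not picked up again.

```python
import time
from pathlib import Path


def find_new_pdfs(watch_directory, seen):
    """Return PDFs in watch_directory that are not yet in `seen`, and record them."""
    new_files = []
    for path in sorted(Path(watch_directory).iterdir()):
        is_pdf = path.is_file() and path.suffix.lower() == ".pdf"
        # Assumption: *_ocr.pdf files are generated output, so skip them.
        if is_pdf and not path.stem.endswith("_ocr") and path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files


def watch(watch_directory, handle_pdf, interval_seconds=2.0):
    """Poll forever, calling handle_pdf(path) for each newly added PDF."""
    seen = set()
    find_new_pdfs(watch_directory, seen)  # ignore files that are already present
    while True:
        for path in find_new_pdfs(watch_directory, seen):
            handle_pdf(path)
        time.sleep(interval_seconds)
```

Polling is chosen here only because it needs no third-party packages; an event-based watcher reacts faster and avoids rescanning the folder.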

For those on Windows, because it’s such a pain to get all the PIL and PDF dependencies installed, I’ve gone ahead and made an executable available under:

dist/pypdfocr.exe

You still need to install Tesseract and Ghostscript as detailed in the dependencies list below.

Caveats

This code is brand new, barely commented, and has no unit tests. I plan to improve things as time allows in the near future.

Dependencies:

PyPDFOCR relies on the following (free) programs being installed and in the path:

  • Tesseract OCR
  • Ghostscript
  • PIL (Python Imaging Library)
  • ReportLab
  • Watchdog

On Mac OS X, you can install the first two using Homebrew:

brew install tesseract
brew install ghostscript

The last three can be installed using a Python package manager such as pip:

pip install pil
pip install reportlab
pip install watchdog
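
Since the programs must be "installed and in the path", a quick check before running the script can save some confusion. The sketch below is an illustration, not part of pypdfocr. The executable names are assumptions (`tesseract` for Tesseract; `gs` for Ghostscript on Mac OS X and Linux, `gswin64c` or `gswin32c` on Windows), as are the importable module names.

```python
import importlib.util
import shutil


def missing_programs(candidates, which=shutil.which):
    """Return the labels for which none of the executable names is on PATH.

    `candidates` maps a label to a list of executable names to try.
    """
    return [
        label
        for label, names in candidates.items()
        if not any(which(name) for name in names)
    ]


def missing_modules(module_names, find_spec=importlib.util.find_spec):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed executable and module names; adjust them for your installation.
PROGRAMS = {
    "Tesseract": ["tesseract"],
    "Ghostscript": ["gs", "gswin64c", "gswin32c"],
}
MODULES = ["PIL", "reportlab", "watchdog"]
```

For example, `print(missing_programs(PROGRAMS), missing_modules(MODULES))` prints two lists, both empty when everything was found.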


Download files

Source Distribution

pypdfocr-0.2.2.tar.gz (12.8 kB)

Uploaded Source

File details

Details for the file pypdfocr-0.2.2.tar.gz.

File metadata

  • Download URL: pypdfocr-0.2.2.tar.gz
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for pypdfocr-0.2.2.tar.gz
Algorithm Hash digest
SHA256 4052f16c048f8020c3225b2abc820512ca688b0c11a72d251a09c854b951f248
MD5 e0f450f2727ec511866c00cd70a089e9
BLAKE2b-256 1a0c9cd45a1bb266a4b101d658ab04589c9edd32efc6d0121f4ea6a5ac24901a
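
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using Python's standard `hashlib`; the function names are hypothetical, and the local file path is whatever you saved the download as.

```python
import hashlib

# SHA256 digest published above for pypdfocr-0.2.2.tar.gz.
EXPECTED_SHA256 = "4052f16c048f8020c3225b2abc820512ca688b0c11a72d251a09c854b951f248"


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected digest."""
    return sha256_of_file(path) == expected.lower()
```

`matches_published_hash("pypdfocr-0.2.2.tar.gz")` returns `False` if the file was corrupted or altered in transit.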
