pypdfocr

Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript

Project description

This script will take a pdf file and generate the corresponding OCR’ed version.

Usage:

Single conversion:

python pypdfocr.py filename.pdf

--> filename_ocr.pdf will be generated
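
The naming convention above can be sketched in a few lines. This is only an illustration: the helper name `ocr_output_path` is hypothetical, and placing the output next to the input file is an assumption, since the text above only states the resulting file name.

```python
from pathlib import Path


def ocr_output_path(pdf_path: str) -> Path:
    """Return the assumed output path: same folder, stem plus '_ocr', '.pdf' suffix."""
    source = Path(pdf_path)
    # Assumption: the OCR'ed file is written alongside the original.
    return source.with_name(f"{source.stem}_ocr.pdf")


print(ocr_output_path("filename.pdf"))  # filename_ocr.pdf
```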

Folder monitoring (new!):

python pypdfocr.py -w watch_directory

--> Every time a pdf file is added to `watch_directory` it will be OCR'ed
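
To make the idea of folder monitoring concrete, here is a minimal sketch that polls a directory using only the standard library. It is not how pypdfocr does it: the script relies on Watchdog's filesystem events, and polling is swapped in here to keep the example dependency-free. The function name `find_new_pdfs` and the rule of skipping files ending in `_ocr.pdf` (so that generated output is not processed again) are assumptions made for illustration.

```python
import os


def find_new_pdfs(directory, seen):
    """Return PDF paths in `directory` that are not yet in `seen`, and record them."""
    new_files = []
    for name in sorted(os.listdir(directory)):
        lower = name.lower()
        # Skip non-PDFs and (by assumption) files that are already OCR output.
        if not lower.endswith(".pdf") or lower.endswith("_ocr.pdf"):
            continue
        path = os.path.join(directory, name)
        if path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files
```

Calling `find_new_pdfs("watch_directory", seen)` in a loop with a short `time.sleep` between calls, starting from an empty `seen` set, yields each newly added PDF exactly once.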

For those on Windows, because it’s such a pain to get all the PIL and PDF dependencies installed, I’ve gone ahead and made an executable available under:

dist/pypdfocr.exe

You still need to install Tesseract and Ghostscript as detailed below in the dependencies list.

Caveats

This code is brand new and barely commented, with no unit tests included. I plan to improve things as time allows in the near future.

Dependencies:

PyPDFOCR relies on the following (free) software being installed. The command-line programs (Tesseract and Ghostscript) must also be in the path:

Tesseract OCR software https://code.google.com/p/tesseract-ocr/
Ghostscript http://www.ghostscript.com/
PIL (Python Imaging Library) http://www.pythonware.com/products/pil/
ReportLab (PDF generation library) http://www.reportlab.com/software/opensource/
Watchdog (Cross-platform filesystem events monitoring) https://pypi.python.org/pypi/watchdog

On Mac OS X, you can install the first two using Homebrew:

brew install tesseract
brew install ghostscript

The last three can be installed using a regular Python package manager such as pip:

pip install pil
pip install reportlab
pip install watchdog
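
Once everything is installed, a quick check like the following can confirm that the dependencies are visible. This is a sketch and not part of pypdfocr: the helper names are hypothetical, `tesseract` and `gs` are the usual executable names on Unix-like systems (Windows builds of Ghostscript typically use a different executable name), and `PIL`, `reportlab` and `watchdog` are assumed to be the import names of the three Python packages.

```python
import importlib.util
import shutil


def missing_programs(names):
    """Return the executables from `names` that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_modules(names):
    """Return the top-level Python modules from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed executable and module names; adjust them for your platform.
print("Missing programs:", missing_programs(["tesseract", "gs"]))
print("Missing modules:", missing_modules(["PIL", "reportlab", "watchdog"]))
```

An empty list on both lines suggests the dependencies can be found; anything listed still needs to be installed or added to the path.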

Project details

Release history

Version	Released
0.9.1	Oct 11, 2016
0.9.0	Mar 2, 2016
0.8.5	Feb 22, 2016
0.8.4	Feb 18, 2016
0.8.3	Feb 18, 2016
0.8.2	Dec 8, 2014
0.8.1	Dec 5, 2014
0.8.0	Oct 27, 2014
0.7.6	Sep 10, 2014
0.7.5	Aug 18, 2014
0.7.4	Mar 28, 2014
0.7.3	Mar 27, 2014
0.7.2	Mar 26, 2014
0.7.1	Mar 26, 2014
0.7.0	Mar 25, 2014
0.6.1	Feb 16, 2014
0.6.0	Jan 17, 2014
0.5.4	Jan 13, 2014
0.5.3	Dec 12, 2013
0.5.2	Dec 11, 2013
0.5.1	Nov 3, 2013
0.5	Oct 31, 2013
0.4.1	Oct 29, 2013
0.4	Oct 28, 2013
0.3.1	Oct 24, 2013
0.3	Oct 23, 2013
0.2.2	Oct 22, 2013
0.2.1	Oct 22, 2013
0.2 (this version)	Oct 22, 2013

Download files

Source Distribution

pypdfocr-0.2.tar.gz (8.6 kB)

Uploaded Oct 22, 2013 Source

Hashes for pypdfocr-0.2.tar.gz

Algorithm	Hash digest
SHA256	`8aac482a39a17979469f9142826a7ae031c6f2b1a8f7216e6338580b2ae53efd`
MD5	`550edbfecffd87fdfeb741729fcd57ee`
BLAKE2b-256	`828175e0d6798d55a6709b36fc7cc4083d809f224fae03dddc50bf89aa8cbe9c`
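
A downloaded archive can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The sketch below is illustrative only; the helper names `sha256_of_file` and `matches_published_sha256` are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_sha256(path, expected_hex):
    """Return True if the file's SHA256 digest equals the published value."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_sha256("pypdfocr-0.2.tar.gz", "8aac482a39a17979469f9142826a7ae031c6f2b1a8f7216e6338580b2ae53efd")` should return `True` for an intact copy of the source distribution, assuming the file was saved under that name in the current directory.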