Converts a scanned PDF into an OCR'ed pdf using Tesseract-OCR and Ghostscript
Project description
This script will take a pdf file and generate the corresponding OCR’ed version.
Usage:
Single conversion:
python pypdfocr.py filename.pdf --> filename_ocr.pdf will be generated
Folder monitoring (new!):
python pypdfocr.py -w watch_directory --> Every time a pdf file is added to `watch_directory` it will be OCR'ed
For those on Windows, because it’s such a pain to get all the PIL and PDF dependencies installed, I’ve gone ahead and made an executable available under:
dist/pypdfocr.exe
You still need to install Tesseract and GhostScript as detailed below in the dependencies list.
Caveats
This code is brand-new, and is barely commented with no unit-tests included. I plan to improve things as time allows in the near-future.
Dependencies:
PyPDFOCR relies on the following (free) programs being installed and in the path:
Tesseract OCR software https://code.google.com/p/tesseract-ocr/
GhostScript http://www.ghostscript.com/
PIL (Python Imaging Library) http://www.pythonware.com/products/pil/
ReportLab (PDF generation library) http://www.reportlab.com/software/opensource/
Watchdog (Cross-platform fhlesystem events monitoring) https://pypi.python.org/pypi/watchdog
On Mac OS X, you can install the first two using homebrew:
brew install tesseract brew install ghostscript
The last three can be installed using a regular python manager such as pip:
pip install pil pip install reportlab pip install watchdog
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file pypdfocr-0.2.tar.gz.
File metadata
- Download URL: pypdfocr-0.2.tar.gz
- Upload date:
- Size: 8.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8aac482a39a17979469f9142826a7ae031c6f2b1a8f7216e6338580b2ae53efd
|
|
| MD5 |
550edbfecffd87fdfeb741729fcd57ee
|
|
| BLAKE2b-256 |
828175e0d6798d55a6709b36fc7cc4083d809f224fae03dddc50bf89aa8cbe9c
|