Light-weight OCR engine.
This library provides a clean interface to segment and recognize text in an image. It’s optimized for printed text, e.g. scanned documents and website screenshots.
pip install liteocr
The installation includes both the liteocr Python3 library and a command line executable.
Performs OCR on an image file and writes the recognition results to JSON.
usage: LiteOCR [-h] [-d] [--extra-whitelist str] [--all-unicode] [--lang str] [--min-text-size int] [--max-text-size int] [--uniformity-thresh :0.0<=float<1.0] [--thin-line-thresh :odd int] [--conf-thresh :0<=int<100] [--box-expand-factor :0.0<=float<1.0] [--horizontal-pooling int] str str positional arguments: str image file str output JSON file optional arguments: -h, --help show this help message and exit -d, --display display recognized bounding boxes and text on top of the image engine: parameters to liteocr.OCREngine constructor --extra-whitelist str string of extra chars for Tesseract to consider only takes effect when all_unicode is False --all-unicode if True, Tesseract will consider all possible unicode characters --lang str language in the text. Defaults to English. recognition: parameters to OCREngine.recognize() method --min-text-size int min text height/width in pixels, below which will be ignored --max-text-size int max text height/width in pixels, above which will be ignored --uniformity-thresh :0.0<=float<1.0 ignore a region if the number of pixels neither black nor white < [thresh] --thin-line-thresh :odd int remove all lines thinner than [thresh] pixels.can be used to remove the thin borders of web page textboxes. --conf-thresh :0<=int<100 ignore regions with OCR confidence < thresh. --box-expand-factor :0.0<=float<1.0 expand the bounding box outwards in case certain chars are cutoff. --horizontal-pooling int result bounding boxes will be more connected with more pooling, but large pooling might lower accuracy.
from liteocr import OCREngine, load_img, draw_rect, draw_text, disp image_file = 'my_img.png' img = load_img(image_file) # you can either use context manager or call engine.close() manually at the end. with OCREngine() as engine: # engine.recognize() can accept a file name, a numpy image, or a PIL image. for text, box, conf in engine.recognize(image_file): print(box, '\tconfidence =', conf, '\ttext =', text) draw_rect(img, box) draw_text(img, text, box, color='bw') # display the image with recognized text boxes overlaid disp(img, pause=False)
I deprecated and moved the old code into a separate folder. The old API calls Tesseract directly on the entire image. The low recall wasn’t trivial to fix at all, as I realized later:
- The command-line Tesseract makes really weird global page segmentation decisions. It ignores certain text regions with no apparent patterns. I have tried many different combinations of a handful of tuneable parameters, but none of them helps. My hands are tied because Tesseract is poorly documented and very few people asks such questions on Stackoverflow.
- Tesserocr is a python package that builds a .pyx wrapper around Tesseract’s C++ API. There are a few native API methods that can iterate through text regions, but they randomly fail with SegFault (ughh!!!). I spent a lot of time trying to fix it, but gave up in despair …
- Tesseract is the best open-source OCR engine, which means I don’t have other choices. I thought about using Google’s online OCR API, but we shouldn’t be bothered by internet connection and API call limits.
So I ended up using a new workflow:
- Apply OpenCV magic to produce better text segmentation.
- Run Tesseract on each of the segmented text box. It’s much more transparent than running on the whole image.
- Collect text result and mean confidence level (yield as a generator).
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.