
Light-weight OCR engine.

Project description

This library provides a clean interface to segment and recognize text in an image. It’s optimized for printed text, e.g. scanned documents and website screenshots.


Installation

pip install liteocr

The installation includes both the liteocr Python 3 library and a command-line executable.

Usage

$ liteocr

Performs OCR on an image file and writes the recognition results to JSON.

usage: LiteOCR [-h] [-d] [--extra-whitelist str] [--all-unicode] [--lang str]
               [--min-text-size int] [--max-text-size int]
               [--uniformity-thresh :0.0<=float<1.0]
               [--thin-line-thresh :odd int] [--conf-thresh :0<=int<100]
               [--box-expand-factor :0.0<=float<1.0]
               [--horizontal-pooling int]
               str str

positional arguments:
  str                   image file
  str                   output JSON file

optional arguments:
  -h, --help            show this help message and exit
  -d, --display         display recognized bounding boxes and text on top of the image

engine:
  parameters to liteocr.OCREngine constructor

  --extra-whitelist str
                        string of extra chars for Tesseract to consider;
                        only takes effect when all_unicode is False
  --all-unicode         if True, Tesseract will consider all possible unicode
                        characters
  --lang str            language in the text. Defaults to English.

recognition:
  parameters to OCREngine.recognize() method

  --min-text-size int   min text height/width in pixels, below which will be
                        ignored
  --max-text-size int   max text height/width in pixels, above which will be
                        ignored
  --uniformity-thresh :0.0<=float<1.0
                        ignore a region if the fraction of pixels that are
                        neither black nor white is less than [thresh]
  --thin-line-thresh :odd int
                        remove all lines thinner than [thresh] pixels. Can
                        be used to remove the thin borders of web page
                        textboxes.
  --conf-thresh :0<=int<100
                        ignore regions with OCR confidence < thresh.
  --box-expand-factor :0.0<=float<1.0
                        expand the bounding box outwards in case certain
                        chars are cut off.
  --horizontal-pooling int
                        larger pooling produces more connected bounding
                        boxes, but may lower accuracy.

Python3 library

from liteocr import OCREngine, load_img, draw_rect, draw_text, disp

image_file = 'my_img.png'
img = load_img(image_file)

# you can either use context manager or call engine.close() manually at the end.
with OCREngine() as engine:
    # engine.recognize() can accept a file name, a numpy image, or a PIL image.
    for text, box, conf in engine.recognize(image_file):
        print(box, '\tconfidence =', conf, '\ttext =', text)
        draw_rect(img, box)
        draw_text(img, text, box, color='bw')

# display the image with recognized text boxes overlaid
disp(img, pause=False)
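As a small illustration of the JSON-output stage described above, the sketch below collects `(text, box, conf)` tuples, like those yielded by `engine.recognize()`, into a JSON string. The `results_to_json` helper and its schema (`text`/`box`/`confidence` keys) are assumptions for illustration, not necessarily the exact format the liteocr CLI writes.

```python
import json

def results_to_json(results):
    """Serialize (text, box, conf) tuples into a JSON array.
    Hypothetical helper; the real CLI's schema may differ."""
    return json.dumps(
        [{"text": text, "box": list(box), "confidence": conf}
         for text, box, conf in results],
        indent=2)

# fake recognition results standing in for engine.recognize() output
fake = [("Hello", (10, 20, 50, 12), 96.5)]
print(results_to_json(fake))
```

In real use you would replace `fake` with the generator returned by `engine.recognize(image_file)` and write the string to the output file.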

Notes

I have deprecated the old code and moved it into a separate folder. The old API called Tesseract directly on the entire image. As I realized later, its low recall was not trivial to fix:

  • The command-line Tesseract makes really weird global page segmentation decisions. It ignores certain text regions with no apparent pattern. I have tried many different combinations of its handful of tunable parameters, but none of them helped. My hands are tied because Tesseract is poorly documented and very few people ask such questions on Stack Overflow.

  • Tesserocr is a Python package that builds a Cython (.pyx) wrapper around Tesseract’s C++ API. There are a few native API methods that can iterate through text regions, but they randomly fail with segfaults (ughh!!!). I spent a lot of time trying to fix them, but gave up in despair …

  • Tesseract is the best open-source OCR engine, which means I don’t have many other choices. I thought about using Google’s online OCR API, but I’d rather not depend on an internet connection or be subject to API call limits.

So I ended up using a new workflow:

  1. Apply OpenCV magic to produce better text segmentation.

  2. Run Tesseract on each of the segmented text boxes. This is much more transparent than running it on the whole image.

  3. Collect the text results and mean confidence levels (yielded as a generator).
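The segmentation idea in step 1 can be sketched without OpenCV using simple projection profiles: binarize the image, "pool" ink horizontally so neighboring characters merge, then read off bounding boxes from contiguous inked rows and columns. This is a simplified stand-in for liteocr's actual OpenCV pipeline; `segment_text_boxes` and its parameters are hypothetical illustrations, not the library's API.

```python
import numpy as np

def segment_text_boxes(binary, min_size=3, horizontal_pooling=5):
    """Rough projection-profile segmentation: find bands of inked rows,
    then column spans within each band. Returns (x, y, w, h) boxes."""
    # spread ink horizontally so characters of one word merge (cheap "pooling")
    pooled = binary.copy()
    for shift in range(1, horizontal_pooling + 1):
        pooled[:, :-shift] |= binary[:, shift:]
    boxes = []
    row_ink = pooled.any(axis=1)
    r, H = 0, len(row_ink)
    while r < H:
        if not row_ink[r]:
            r += 1
            continue
        r2 = r
        while r2 < H and row_ink[r2]:   # contiguous run of inked rows
            r2 += 1
        col_ink = pooled[r:r2].any(axis=0)
        c, W = 0, len(col_ink)
        while c < W:
            if not col_ink[c]:
                c += 1
                continue
            c2 = c
            while c2 < W and col_ink[c2]:  # contiguous run of inked columns
                c2 += 1
            if (r2 - r) >= min_size and (c2 - c) >= min_size:
                boxes.append((c, r, c2 - c, r2 - r))
            c = c2
        r = r2
    return boxes

# toy image: two "words" on one line
img = np.zeros((20, 40), dtype=bool)
img[5:12, 3:10] = True
img[5:12, 25:33] = True
print(segment_text_boxes(img, horizontal_pooling=2))
# → [(1, 5, 9, 7), (23, 5, 10, 7)]
```

Each box would then be cropped and passed to Tesseract individually, as in step 2, with the per-box confidence collected in step 3.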
