Python binding to Tesseract API

Pysseract

A Python binding to the Tesseract API. Tesseract is an open-source tool made available by Google for Optical Character Recognition (OCR), that is, getting a computer to read the text in an image. Tesseract lets you perform this task at several levels of granularity (one character at a time, one word at a time, and so on), with several page segmentation strategies (treating the whole page as one block of text, as a single line, or as text sparsely scattered across the source image), and with a choice of language models, including ones you have built yourself (pre-built models are available at https://github.com/tesseract-ocr/tessdata, among other places).

Pip 19.3.1 or greater is required to install the wheel for this package; otherwise, install from source. On Linux, the wheel includes Tesseract itself, but you still need to provide the Tesseract models. For example, to install the English model on a Linux system:

curl -O https://raw.githubusercontent.com/tesseract-ocr/tessdata_fast/4.0.0/eng.traineddata
sudo mkdir -p /usr/local/share/tessdata/ && sudo mv eng.traineddata /usr/local/share/tessdata/

The file goes in /usr/local/share/tessdata/ because that is the default value of TESSDATA_PREFIX, an environment variable that Tesseract uses to locate model files. You are free to override the value of TESSDATA_PREFIX, of course.
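Before running OCR, it can help to confirm that a model file is actually where Tesseract will look. The following is a minimal sketch, not part of pysseract: `find_model` and `DEFAULT_TESSDATA` are hypothetical names, and the sketch assumes TESSDATA_PREFIX points directly at the directory holding the `.traineddata` files, as described above.

```python
import os
from pathlib import Path

# Default location named in the text above; an assumption about your system.
DEFAULT_TESSDATA = "/usr/local/share/tessdata/"


def find_model(lang="eng", env=None):
    """Return the Path to <lang>.traineddata if it exists, otherwise None.

    env is a mapping of environment variables; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    # Honour an override of TESSDATA_PREFIX, falling back to the default.
    prefix = Path(env.get("TESSDATA_PREFIX", DEFAULT_TESSDATA))
    candidate = prefix / (lang + ".traineddata")
    return candidate if candidate.is_file() else None
```

Calling `find_model("eng")` before creating a Pysseract instance gives an early, readable failure when the model is missing.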

Documentation is hosted on readthedocs.

Basic usage

To get all the text from an image as a single string, run the following:

import pysseract
t = pysseract.Pysseract()
t.SetImageFromPath('tests/001-helloworld.png')
print(t.utf8Text)
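The same two calls extend naturally to a batch of images. The sketch below is an illustration only: `ocr_paths` is a hypothetical helper, and it assumes nothing about pysseract beyond the `SetImageFromPath` method and `utf8Text` attribute shown above. The engine class is passed in as an argument so the helper can be exercised without Tesseract installed.

```python
def ocr_paths(paths, make_engine):
    """Map each image path to its recognised text.

    make_engine is any zero-argument callable returning an object with
    SetImageFromPath() and utf8Text, for example pysseract.Pysseract.
    """
    results = {}
    for path in paths:
        engine = make_engine()  # fresh engine for every image
        engine.SetImageFromPath(path)
        results[path] = engine.utf8Text
    return results
```

With pysseract installed, `ocr_paths(['tests/001-helloworld.png'], pysseract.Pysseract)` would return a dictionary with one entry for that image.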

If instead you want to iterate through the text boxes found in an image at the TEXTLINE level (coarser than WORD but finer than BLOCK), you might run the following:

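with pysseract.Pysseract() as t:
    # Load an image first; this path is reused from the basic example
    t.SetImageFromPath('tests/001-helloworld.png')
    boxes = []
    lines = []
    confidences = []
    LEVEL = pysseract.PageIteratorLevel.TEXTLINE
    # Each iteration yields a bounding box, its text, and a confidence score
    for box, text, confidence in t.IterAt(LEVEL):
        lines.append(text)
        boxes.append(box)
        confidences.append(confidence)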
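Once you have (box, text, confidence) triples from `IterAt`, a common next step is to discard low-confidence results. The following is a small sketch under assumptions: `confident_lines` is a hypothetical helper, and the default threshold of 60 assumes confidences are reported on a 0 to 100 scale, as in Tesseract's C++ API; check the values you actually get before relying on it.

```python
def confident_lines(rows, min_conf=60.0):
    """Return the text of every row whose confidence is at least min_conf.

    rows is any iterable of (box, text, confidence) triples.
    """
    return [text for _box, text, conf in rows if conf >= min_conf]
```

Inside the `with` block, this could be called as `confident_lines(t.IterAt(LEVEL))`.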

A third possibility is that you may want to control how exactly the image is segmented. This is done before instantiating a ResultIterator, as follows:

with pysseract.Pysseract() as t:
    # Choose the segmentation mode before loading the image and iterating
    t.pageSegMode = pysseract.PageSegMode.SINGLE_BLOCK
    t.SetImageFromPath("002-quick-fox.jpg")
    t.SetSourceResolution(70)
    boxes = []
    lines = []
    confidences = []
    LEVEL = pysseract.PageIteratorLevel.TEXTLINE
    # Each iteration yields a bounding box, its text, and a confidence score
    for box, text, confidence in t.IterAt(LEVEL):
        lines.append(text)
        boxes.append(box)
        confidences.append(confidence)
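If you are unsure which segmentation mode suits an image, one rough approach is to try several and compare the mean confidence of the results. This is only a sketch: `mean_confidence` and `best_mode` are hypothetical helpers, `run_ocr` is a callable you supply yourself, and mean confidence is a crude proxy for quality rather than anything Tesseract recommends.

```python
def mean_confidence(rows):
    """Average the confidence over (box, text, confidence) triples."""
    confs = [conf for _box, _text, conf in rows]
    return sum(confs) / len(confs) if confs else 0.0


def best_mode(modes, run_ocr):
    """Return (mode, score) for the mode with the highest mean confidence.

    run_ocr(mode) must return an iterable of (box, text, confidence)
    triples. Raises ValueError if modes is empty.
    """
    scored = [(mean_confidence(list(run_ocr(mode))), mode) for mode in modes]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

Here `run_ocr` would typically set `t.pageSegMode` to the given mode, load the image, and return `list(t.IterAt(LEVEL))`.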

Finally, if you want to work with the low-level iterator built into Tesseract, the code below shows how. This is primarily intended for people who want fine-grained control when searching through the results. For instance, if you want to look at the first paragraph, jump to the next word, then the next block after that, then the next symbol after that, you would use this approach:

t = pysseract.Pysseract()
t.SetImageFromPath("002-quick-fox.jpg")
resIter = t.GetIterator()
boxes = []
lines = []
confidence = []

# First, look at the paragraph level
level = pysseract.PageIteratorLevel.PARA
boxes.append(resIter.BoundingBox(level))
lines.append(resIter.GetUTF8Text(level))
confidence.append(resIter.Confidence(level))

# Now the next word after the paragraph we just looked at
level = pysseract.PageIteratorLevel.WORD
resIter.Next(level)
boxes.append(resIter.BoundingBox(level))
lines.append(resIter.GetUTF8Text(level))
confidence.append(resIter.Confidence(level))

# Now the next block
level = pysseract.PageIteratorLevel.BLOCK
resIter.Next(level)
boxes.append(resIter.BoundingBox(level))
lines.append(resIter.GetUTF8Text(level))
confidence.append(resIter.Confidence(level))

# Lastly, look at the next symbol after the block we just looked at
level = pysseract.PageIteratorLevel.SYMBOL
resIter.Next(level)
boxes.append(resIter.BoundingBox(level))
lines.append(resIter.GetUTF8Text(level))
confidence.append(resIter.Confidence(level))
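The three-line pattern repeated above can be folded into a pair of helpers. The following is a sketch, not part of pysseract: `snapshot` and `walk` are hypothetical names, and they use only the iterator methods already shown (`BoundingBox`, `GetUTF8Text`, `Confidence`, and `Next`).

```python
def snapshot(res_iter, level):
    """Read the box, text and confidence at the iterator's current position."""
    return {
        "box": res_iter.BoundingBox(level),
        "text": res_iter.GetUTF8Text(level),
        "confidence": res_iter.Confidence(level),
    }


def walk(res_iter, levels):
    """Visit a sequence of levels, advancing before every level but the first.

    This mirrors the example above: read the first level in place, then
    call Next(level) before reading each later level.
    """
    results = []
    for index, level in enumerate(levels):
        if index > 0:
            res_iter.Next(level)
        results.append(snapshot(res_iter, level))
    return results
```

The walk above then becomes a single call to `walk(resIter, [...])` with the PARA, WORD, BLOCK and SYMBOL levels listed in order.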

Building the package

Requirements

  • gcc/clang with at least c++11 support
  • libtesseract, libtesseract-dev (equivalent on non-Debian/Ubuntu systems)
  • pybind11>=2.2

python3 setup.py build install test

Building the documentation

pip install sphinx sphinx_rtd_theme m2r
python3 setup.py build_sphinx

You should find the generated HTML in build/sphinx.

Contribute

Browse the Tesseract BaseAPI and add wrappers for the functions of interest to pymodule.cpp.

Please write a brief description in your wrapper function like those already in pymodule.cpp.

Reference

LICENSE

MIT
