
wrap Tesseract preprocessing, segmentation and recognition


ocrd_tesserocr

Crop, deskew, segment into regions / tables / lines / words, or recognize with tesserocr


Introduction

This package offers OCR-D compliant workspace processors for (much of) the functionality of Tesseract via its Python API wrapper tesserocr. (Each processor is a parameterizable step in a configurable workflow of the OCR-D functional model. There are usually various alternative processor implementations for each step. Data is represented with METS and PAGE.)

It includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, table, line, word segmentation), script identification, font style recognition and text recognition.

Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. In PAGE, image results are referenced (read and written) via AlternativeImage, text results via TextEquiv, font attributes via TextStyle, script via @primaryScript, deskewing via @orientation, cropping via Border and segmentation via Region / TextLine / Word elements with Coords/@points.

Installation

With docker

This is the best option if you want to run the software in a container.

You need to have Docker installed. To fetch the image:

docker pull ocrd/tesserocr

To run with docker:

docker run -v path/to/workspaces:/data ocrd/tesserocr ocrd-tesserocr-crop ...

From PyPI and Tesseract provided by system

If your operating system / distribution already provides Tesseract 4.1 or newer, then just install its development package:

# on Debian / Ubuntu:
sudo apt install libtesseract-dev

Otherwise, recent Tesseract packages for Ubuntu are available via PPA alex-p, which has up-to-date builds of Tesseract and its dependencies:

# on Ubuntu:
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt install libtesseract-dev

Once Tesseract is available, install ocrd_tesserocr from PyPI:

pip install ocrd_tesserocr

We strongly recommend setting up a venv first.

From git

Use this option if there is no suitable prebuilt version of Tesseract available on your system, or you want to change the source code or install the latest, unpublished changes.

git clone https://github.com/OCR-D/ocrd_tesserocr
cd ocrd_tesserocr
# install Tesseract:
sudo make deps-ubuntu # system dependencies just for the build
make deps
# install tesserocr and ocrd_tesserocr:
make install

We strongly recommend setting up a venv first.

Models

Tesseract comes with synthetically trained models for languages (tesseract-ocr-{eng,deu,deu_latf,...}) or scripts (tesseract-ocr-script-{latn,frak,...}). In addition, various models trained on scan data are available from the community.

Since all OCR-D processors must resolve file/data resources in a standardized way, and we want to stay interoperable with standalone Tesseract (which uses a single compile-time tessdata directory), ocrd-tesserocr-recognize expects the recognition models to be installed in its module resource location only. The module location is determined by the underlying Tesseract installation (compile-time tessdata directory, or run-time $TESSDATA_PREFIX environment variable). Other resource locations (data/system/cwd) will be ignored, and should not be used when installing models with the Resource Manager (ocrd resmgr download).

To see the module resource location of your installation:

ocrd-tesserocr-recognize -D

For a full description of available commands for resource management, see:

ocrd resmgr --help
ocrd resmgr list-available --help
ocrd resmgr download --help
ocrd resmgr list-installed --help

Note: In previous versions, the resource locations of standalone Tesseract and the OCR-D wrapper were different. If you already have models under $XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize (usually ~/.local/share/ocrd-resources/ocrd-tesserocr-recognize), then consider moving them to the new default reported by ocrd-tesserocr-recognize -D (usually /usr/share/tesseract-ocr/4.00/tessdata), or alternatively overriding the module directory by setting TESSDATA_PREFIX=$XDG_DATA_HOME/ocrd-resources/ocrd-tesserocr-recognize in the environment.

Cf. OCR-D model guide.

Models always use the filename suffix .traineddata, but are just loaded by their basename. You will need at least eng and osd installed (even for segmentation and deskewing), probably also Latin and Fraktur etc. So to get minimal models, do:

ocrd resmgr download ocrd-tesserocr-recognize eng.traineddata
ocrd resmgr download ocrd-tesserocr-recognize osd.traineddata

(These models will already be installed if you use the Docker or git installation option.)
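Since models are loaded by the basename of their .traineddata file, a quick check for the minimal set can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the directory is whatever ocrd-tesserocr-recognize -D prints on your system, and only files directly inside that directory are considered.

```python
from pathlib import Path

# The two models the text above names as the minimum.
REQUIRED_MODELS = {"eng", "osd"}


def installed_models(tessdata_dir):
    """Return the sorted basenames of all *.traineddata files in a directory."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def missing_models(tessdata_dir, required=REQUIRED_MODELS):
    """Return the sorted names of required models not found in the directory."""
    return sorted(set(required) - set(installed_models(tessdata_dir)))
```

Calling missing_models() on the module resource location returns an empty list when both eng and osd are present; any name it returns would still need to be fetched with ocrd resmgr download.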

As of v0.13.1, you can configure ocrd-tesserocr-recognize to select models dynamically segment by segment, either via custom conditions on the PAGE-XML annotation (presented as XPath rules), or by automatically choosing the model with highest confidence.
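The confidence-based variant can be pictured with the following Python sketch. It is not the processor's code: the function name, the data layout and the sample values are all made up for illustration, and only the idea (keep the result of whichever model is most confident for a given segment) comes from the text above.

```python
def pick_best(candidates):
    """Return (model, text, confidence) for the most confident candidate.

    candidates maps a model name to a (text, confidence) pair for one segment.
    """
    model = max(candidates, key=lambda name: candidates[name][1])
    text, confidence = candidates[model]
    return model, text, confidence


# Made-up results for a single text line, keyed by model name.
line_results = {"eng": ("Tlie quick", 0.62), "deu_latf": ("The quick", 0.91)}
best = pick_best(line_results)
```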

Usage

For details, see docstrings in the individual processors and ocrd-tool.json descriptions, or simply --help.

Available OCR-D processors are:

  • ocrd-tesserocr-crop (simplistic)
    • sets Border of pages and adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-deskew (for skew and orientation; mind operation_level)
    • sets @orientation of regions or pages and adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-binarize (Otsu – not recommended, unless already binarized and using tiseg)
    • adds AlternativeImage files to the output fileGrp
  • ocrd-tesserocr-recognize (optionally including segmentation; mind segmentation_level and textequiv_level)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions, ReadingOrder and AlternativeImage to Page and sets their @orientation (optionally)
    • adds TextRegions to TableRegions and sets their @orientation (optionally)
    • adds TextLines to TextRegions (optionally)
    • adds Words to TextLines (optionally)
    • adds Glyphs to Words (optionally)
    • adds TextEquiv
  • ocrd-tesserocr-segment (all-in-one segmentation – recommended; delegates to recognize)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions, ReadingOrder and AlternativeImage to Page and sets their @orientation
    • adds TextRegions to TableRegions and sets their @orientation
    • adds TextLines to TextRegions
    • adds Words to TextLines
    • adds Glyphs to Words
  • ocrd-tesserocr-segment-region (only regions – with overlapping bboxes; delegates to recognize)
    • adds TextRegions, TableRegions, ImageRegions, MathsRegions, SeparatorRegions, NoiseRegions and ReadingOrder to Page and sets their @orientation
  • ocrd-tesserocr-segment-table (only table cells; delegates to recognize)
    • adds TextRegions to TableRegions
  • ocrd-tesserocr-segment-line (only lines – from overlapping regions; delegates to recognize)
    • adds TextLines to TextRegions
  • ocrd-tesserocr-segment-word (only words; delegates to recognize)
    • adds Words to TextLines
  • ocrd-tesserocr-fontshape (only text style – via Tesseract 3 models)
    • adds TextStyle to Words

The text region @types detected are (from Tesseract's PolyBlockType):

  • paragraph: normal block (aligned with others in the column)
  • floating: unaligned block (is in a cross-column pull-out region)
  • heading: block that spans more than one column
  • caption: block for text that belongs to an image

If you are unhappy with these choices, then consider post-processing with a dedicated custom processor in Python, or by modifying the PAGE files directly (e.g. xmlstarlet ed --inplace -u '//pc:TextRegion/@type[.="floating"]' -v paragraph filegrp/*.xml).
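A Python equivalent of the xmlstarlet call might look like the sketch below. It uses only the standard library, works on the XML text of one PAGE file, and matches TextRegion by local name so that it does not depend on a particular PAGE namespace version. The function name is hypothetical, and a real OCR-D processor would go through the OCR-D Python API instead.

```python
import xml.etree.ElementTree as ET


def retype_text_regions(page_xml, old_type="floating", new_type="paragraph"):
    """Change TextRegion/@type from old_type to new_type.

    Returns the modified XML text and the number of regions changed.
    """
    root = ET.fromstring(page_xml)
    # Keep the document's namespace as the default namespace on output.
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}")[0])
    changed = 0
    for elem in root.iter():
        # Compare the local name only, ignoring the namespace part of the tag.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "TextRegion" and elem.get("type") == old_type:
            elem.set("type", new_type)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed
```

To apply it to files, read each PAGE file in the file group as text, pass it through the function and write the result back; keep a backup, since this rewrites the annotation in place just like xmlstarlet --inplace.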

All segmentation is currently done as bounding boxes only by default, i.e. without precise polygonal outlines. For dense page layouts this means that neighbouring regions and neighbouring text lines may overlap a lot. If this is a problem for your workflow, try post-processing like so:

  • after line segmentation: use ocrd-cis-ocropy-resegment for polygonalization, or ocrd-cis-ocropy-clip on the line level
  • after region segmentation: use ocrd-segment-repair with plausibilize (and sanitize after line segmentation)
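To judge whether overlap is a problem on your pages, you could measure it from the Coords/@points strings. The Python sketch below assumes integer coordinates in the usual "x1,y1 x2,y2 ..." notation; the function names and sample values are illustrative only.

```python
def bbox_from_points(points):
    """Parse a Coords/@points string into (xmin, ymin, xmax, ymax)."""
    xy = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    xs, ys = zip(*xy)
    return min(xs), min(ys), max(xs), max(ys)


def overlap_area(a, b):
    """Return the area shared by two (xmin, ymin, xmax, ymax) boxes, or 0."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


# Two made-up neighbouring regions that share a 20 x 10 pixel corner.
region_a = bbox_from_points("0,0 100,0 100,50 0,50")
region_b = bbox_from_points("80,40 200,40 200,90 80,90")
shared = overlap_area(region_a, region_b)
```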

Bounding-box segmentation also means that Tesseract should be allowed to segment across multiple hierarchy levels at once, to avoid introducing inconsistent/duplicate text line assignments in text regions, or word assignments in text lines. Hence,

  • prefer ocrd-tesserocr-recognize with segmentation_level=region
    over ocrd-tesserocr-segment followed by ocrd-tesserocr-recognize,
    if you want to do all in one with Tesseract,
  • prefer ocrd-tesserocr-recognize with segmentation_level=line
    over ocrd-tesserocr-segment-line followed by ocrd-tesserocr-recognize,
    if you want to do everything but region segmentation with Tesseract,
  • prefer ocrd-tesserocr-segment over ocrd-tesserocr-segment-region
    followed by (ocrd-tesserocr-segment-table and) ocrd-tesserocr-segment-line,
    if you want to do everything but recognition with Tesseract.
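As a sketch of the first recommendation, the snippet below assembles a single ocrd-tesserocr-recognize call from Python. The file group names are hypothetical, and the -I / -O / -P options and the model parameter name are stated here as assumptions about the OCR-D command line interface; check ocrd-tesserocr-recognize --help before relying on them.

```python
import shlex


def recognize_argv(input_grp, output_grp, segmentation_level, textequiv_level, model):
    """Build the argument list for one combined segmentation + recognition step."""
    return [
        "ocrd-tesserocr-recognize",
        "-I", input_grp,    # input fileGrp (hypothetical name supplied by caller)
        "-O", output_grp,   # output fileGrp (hypothetical name supplied by caller)
        "-P", "segmentation_level", segmentation_level,
        "-P", "textequiv_level", textequiv_level,
        "-P", "model", model,  # assumed parameter name, see --help
    ]


argv = recognize_argv("OCR-D-BIN", "OCR-D-OCR", "region", "word", "eng")
command_line = shlex.join(argv)  # shell-quoted string, e.g. for logging
```

Passing argv to subprocess.run() inside a workspace directory would then run segmentation and recognition in one pass instead of chaining ocrd-tesserocr-segment and ocrd-tesserocr-recognize.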

To get polygons instead of bounding boxes, you can also run ocrd-tesserocr-segment* and ocrd-tesserocr-recognize with shrink_polygons=True, which post-processes each segment by shrinking it to the convex hull of all its symbol outlines.
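For readers unfamiliar with the term, the convex hull of a set of points is the smallest convex polygon containing all of them. The Python sketch below computes it with Andrew's monotone chain algorithm; it only illustrates the geometric idea and is not how the processors implement shrink_polygons.

```python
def cross(o, a, b):
    """Z component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices of (x, y) points, starting at the lowest-x point."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a non-left turn (including collinear ones).
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Made-up outline points: a square plus one interior and one edge point.
outline = [(0, 0), (10, 0), (10, 10), (0, 10), (5, 5), (5, 0)]
hull = convex_hull(outline)
```

Interior points and points lying on an edge are dropped, so the hull of a segment's symbol outlines is usually tighter than its bounding box but never concave.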

Testing

make test

This downloads some test data from https://github.com/OCR-D/assets under repo/assets, and runs some basic tests of the Python API as well as the CLIs.

Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).
