Tesserocr bindings
Crop, deskew, segment into regions / lines / words, or recognize with tesserocr
Introduction
This package offers OCR-D compliant workspace processors for (much of) the functionality of Tesseract via its Python API wrapper tesserocr. (Each processor is a step in the OCR-D functional model and can be replaced with an alternative implementation. Data is represented within METS/PAGE.)
This includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, line, word segmentation) and OCR proper. Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. Image results are referenced (read and written) via AlternativeImage, text results via TextEquiv, deskewing via @orientation, cropping via Border and segmentation via Region / TextLine / Word elements with Coords/@points.
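To make the PAGE elements named above more concrete, here is a minimal sketch that reads a hand-written PAGE fragment with the Python standard library and extracts the Border polygon, the region @orientation and the line text. The sample XML, the schema version in the namespace and the helper names `parse_points` and `summarize_page` are illustrative assumptions, not output or API of the processors.

```python
import xml.etree.ElementTree as ET

# PAGE namespace for the 2019-07-15 schema version (assumed here; other versions exist).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hand-written sample for illustration only, not produced by any processor.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="800" imageHeight="600">
    <Border><Coords points="10,10 790,10 790,590 10,590"/></Border>
    <TextRegion id="r1" orientation="-1.5">
      <Coords points="50,40 700,40 700,200 50,200"/>
      <TextLine id="r1_l1">
        <Coords points="50,40 700,40 700,90 50,90"/>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a Coords/@points string ("x1,y1 x2,y2 ...") into a list of (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def summarize_page(xml_text):
    """Collect the cropping polygon, region orientations and line texts (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    border = root.find(".//pc:Border/pc:Coords", NS)
    regions = {}
    for region in root.iterfind(".//pc:TextRegion", NS):
        lines = {}
        for line in region.iterfind("pc:TextLine", NS):
            text = line.find("pc:TextEquiv/pc:Unicode", NS)
            lines[line.get("id")] = text.text if text is not None else None
        regions[region.get("id")] = {
            "orientation": float(region.get("orientation", 0.0)),
            "polygon": parse_points(region.find("pc:Coords", NS).get("points")),
            "lines": lines,
        }
    return {
        "border": parse_points(border.get("points")) if border is not None else None,
        "regions": regions,
    }


if __name__ == "__main__":
    print(summarize_page(SAMPLE))
```

In a real workflow the PAGE files would be accessed through the OCR-D workspace rather than parsed by hand; the sketch only shows where each kind of result lives in the document.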
Installation
Required Ubuntu packages:
Tesseract headers (libtesseract-dev)
Some tesseract language models (tesseract-ocr-{eng,deu,frk,...}) or script models (tesseract-ocr-script-{latn,frak,...})
Leptonica headers (libleptonica-dev)
    make deps-ubuntu # or manually
    make deps # or pip install -r requirements
    make install # or pip install .
If tesserocr fails to compile with an error like:

    $PREFIX/include/tesseract/unicharset.h:241:10: error: ‘string’ does not name a type; did you mean ‘stdin’?
    static string CleanupString(const char* utf8_str) {
           ^~~~~~
           stdin
This is due to inconsistencies in the installed tesseract C headers (a fix is expected with the next Ubuntu upgrade; Debian has already fixed it). Replace string with std::string at $PREFIX/include/tesseract/unicharset.h:265:5 and at $PREFIX/include/tesseract/unichar.h:164:10 and following.
If tesserocr fails with an error about LSTM/CUBE, the versions of the tesseract headers, data and pkg-config files do not match. apt policy libtesseract-dev lists the apt-installable versions; keep them consistent. Make sure there are no spurious pkg-config artifacts, e.g. in /usr/local/lib/pkgconfig/tesseract.pc. The same goes for language models.
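One way to spot such a mismatch is to compare the version reported by the tesseract binary with the one pkg-config advertises to the compiler. The following is a rough sketch under the assumption that both `tesseract` and `pkg-config` are on the PATH; the helper names `parse_version`, `query` and `versions_agree` are made up for illustration.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Return the first dotted version number in text as a tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)+", text or "")
    return tuple(int(part) for part in match.group(0).split(".")) if match else None


def query(cmd):
    """Run cmd and return its combined output, or None if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        return None
    # Some tesseract releases print the version to stderr, so look at both streams.
    return proc.stdout + proc.stderr


def versions_agree(versions):
    """True if all versions that could be determined are identical."""
    known = {v for v in versions.values() if v is not None}
    return len(known) <= 1


if __name__ == "__main__":
    found = {
        "tesseract binary": parse_version(query(["tesseract", "--version"])),
        "pkg-config": parse_version(query(["pkg-config", "--modversion", "tesseract"])),
    }
    for name, version in found.items():
        print(f"{name}: {version}")
    print("consistent" if versions_agree(found) else "MISMATCH: check headers and .pc files")
```

A disagreement here does not identify which installation is the stale one; it only indicates that more than one tesseract version is visible on the system.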
Usage
See the docstrings in the individual processors and the descriptions in ocrd-tool.json.
Available processors are:
Testing
make test
This downloads some test data from <https://github.com/OCR-D/assets> into repo/assets and runs some basic tests of the Python API as well as the CLIs.
Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).
Hashes for ocrd_tesserocr-0.4.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 4822713547e696dbb327a80f9dd5bad705be4b7dc1f44fdef1d44f9e03c21c1d
MD5 | 9d5ea4deb4c75bae31b7d44a4a8fdd0a
BLAKE2b-256 | ee2b483b44bf3180e81aa8a5bf7307ae47da4d1656e69dec1a704f9a8d558b88