Tesserocr bindings
ocrd_tesserocr
Crop, deskew, segment into regions / tables / lines / words, or recognize with tesserocr
Introduction
This package offers OCR-D compliant workspace processors for (much of) the functionality of Tesseract via its Python API wrapper tesserocr. (Each processor is a step in the OCR-D functional model, and can be replaced with an alternative implementation. Data is represented within METS/PAGE.)
This includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, table, line, word segmentation) and OCR proper. Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. Image results are referenced (read and written) via AlternativeImage, text results via TextEquiv, deskewing via @orientation, cropping via Border and segmentation via Region / TextLine / Word elements with Coords/@points.
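To make the PAGE representation above more concrete, here is a minimal sketch that reads line coordinates and text out of a PAGE fragment using only the Python standard library. The XML fragment is hand-written for illustration (its ids, file name and coordinates are hypothetical), and the namespace assumes the 2019-07-15 version of the PAGE schema; real files produced by the processors may use a different schema version and will contain more detail.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE schema version 2019-07-15; adjust for other versions.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical, hand-written PAGE fragment (not output of any processor).
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.tif" imageWidth="1000" imageHeight="1500">
    <Border><Coords points="50,50 950,50 950,1450 50,1450"/></Border>
    <TextRegion id="r1" orientation="0.5">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1_l1">
        <Coords points="100,100 900,100 900,180 100,180"/>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a Coords/@points string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_border(page_xml):
    """Return the Border polygon (the cropping result), or None if absent."""
    root = ET.fromstring(page_xml)
    coords = root.find(".//pc:Border/pc:Coords", NS)
    return None if coords is None else parse_points(coords.get("points"))


def read_lines(page_xml):
    """Return (id, polygon, text) for every TextLine in the document."""
    root = ET.fromstring(page_xml)
    lines = []
    for line in root.iterfind(".//pc:TextLine", NS):
        polygon = parse_points(line.find("pc:Coords", NS).get("points"))
        text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        lines.append((line.get("id"), polygon, text))
    return lines


if __name__ == "__main__":
    print(read_border(SAMPLE_PAGE))
    for line_id, polygon, text in read_lines(SAMPLE_PAGE):
        print(line_id, polygon, text)
```

For real workspaces, the OCR-D Python API is likely a better way to access PAGE data than raw XML parsing; the sketch only illustrates where each kind of result lives.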
Installation
Required Ubuntu packages:
- Tesseract headers (libtesseract-dev)
- Some Tesseract language models (tesseract-ocr-{eng,deu,frk,...}) or script models (tesseract-ocr-script-{latn,frak,...})
- Leptonica headers (libleptonica-dev)
From PyPI
This is the best option if you want to use the stable, released version.
NOTE
ocrd_tesserocr requires Tesseract >= 4.1.0. The Tesseract packages bundled with Ubuntu < 19.10 are too old. If you are on Ubuntu 18.04 LTS, please enable Alexander Pozdnyakov's PPA, which has up-to-date builds of Tesseract and its dependencies:
sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update
sudo apt-get install git python3 python3-pip libtesseract-dev libleptonica-dev tesseract-ocr-eng tesseract-ocr wget
pip install ocrd_tesserocr
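Since the version requirement above is easy to miss, the following sketch shows one way to check it from Python. It parses the first version number from the banner printed by tesseract --version and compares it against 4.1.0. The helper names are hypothetical and not part of ocrd_tesserocr; the check assumes the banner contains a dotted version number such as "tesseract 4.1.1".

```python
import re
import shutil
import subprocess

# Minimum version stated in the note above.
MINIMUM = (4, 1, 0)


def parse_version(text):
    """Extract the first dotted version number from text as a 3-tuple."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number found in %r" % text)
    # A missing patch level (e.g. "4.1") is treated as 0.
    return tuple(int(group) if group else 0 for group in match.groups())


def meets_minimum(text, minimum=MINIMUM):
    """Return True if the version found in text is at least the minimum."""
    return parse_version(text) >= minimum


def installed_tesseract_banner():
    """Return the output of `tesseract --version`, or None if not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--version"], capture_output=True, text=True)
    # Some builds print the banner to stderr rather than stdout.
    return result.stdout or result.stderr


if __name__ == "__main__":
    banner = installed_tesseract_banner()
    if banner is None:
        print("tesseract not found on PATH")
    elif meets_minimum(banner):
        print("Tesseract version is recent enough")
    else:
        print("Tesseract is older than 4.1.0")
```

Pre-release suffixes (such as "-beta.1") are ignored by this comparison, so treat the result as a rough guide only.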
With docker
This is the best option if you want to run the software in a container.
You need to have Docker installed.
docker pull ocrd/tesserocr
To run with docker:
docker run -v path/to/workspaces:/data ocrd/tesserocr ocrd-tesserocr-crop ...
From git
This is the best option if you want to change the source code or install the latest, unpublished changes.
We strongly recommend using venv.
git clone https://github.com/OCR-D/ocrd_tesserocr
cd ocrd_tesserocr
sudo make deps-ubuntu # or manually with apt-get
make deps # or pip install -r requirements.txt
make install # or pip install .
Usage
See the docstrings in the individual processors and the ocrd-tool.json descriptions.
Available processors are:
- ocrd-tesserocr-crop
- ocrd-tesserocr-deskew
- ocrd-tesserocr-binarize
- ocrd-tesserocr-segment-region
- ocrd-tesserocr-segment-table
- ocrd-tesserocr-segment-line
- ocrd-tesserocr-segment-word
- ocrd-tesserocr-recognize
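As a rough illustration of how these processors can be combined, the sketch below builds command lines for a chain in which each step reads the file group written by the previous one. It relies on the standard OCR-D command-line options -I (input file group) and -O (output file group). The order of the steps, the file group names, and the helper function are assumptions made for this example, not a recommended workflow; processor parameters are omitted entirely, so consult ocrd-tool.json before running anything.

```python
import shlex

PREFIX = "ocrd-tesserocr-"

# Assumption: one plausible order of steps; table segmentation is left out.
STEPS = [
    "ocrd-tesserocr-crop",
    "ocrd-tesserocr-binarize",
    "ocrd-tesserocr-deskew",
    "ocrd-tesserocr-segment-region",
    "ocrd-tesserocr-segment-line",
    "ocrd-tesserocr-segment-word",
    "ocrd-tesserocr-recognize",
]


def build_workflow(steps, first_input="OCR-D-IMG"):
    """Build one argument list per step, chaining output to next input.

    File group names are hypothetical: each output group is derived from
    the processor name, e.g. ocrd-tesserocr-crop writes to OCR-D-CROP.
    """
    commands = []
    input_grp = first_input
    for name in steps:
        output_grp = "OCR-D-" + name[len(PREFIX):].upper()
        commands.append([name, "-I", input_grp, "-O", output_grp])
        input_grp = output_grp  # the next step reads what this step wrote
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them; run from a workspace
    # directory (where mets.xml lives) if you want to execute them.
    for command in build_workflow(STEPS):
        print(" ".join(shlex.quote(part) for part in command))
```

Building argument lists rather than shell strings keeps the commands safe to pass to subprocess.run without shell quoting issues.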
Testing
make test
This downloads some test data from https://github.com/OCR-D/assets under repo/assets, and runs some basic tests of the Python API as well as the CLIs.
Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).
Hashes for ocrd_tesserocr-0.8.2-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 6523dd1b3131a8061f2a5cfae7e7b6a90e5bf7bd917fbd5f38b948e53b7866b9
MD5 | d8381d462e3c132e5215a4d50c9a5582
BLAKE2b-256 | d06f7f16c5f39824afe6d0baf5fb7ba63c8771dda149948f3b98d48738b00dfc