CIS OCR-D command line tools
ocrd_cis
CIS OCR-D command line tools for the automatic post-correction of OCR-results.
Introduction
ocrd_cis contains different tools for the automatic post-correction of OCR results. It contains tools for the training, evaluation and execution of the post-correction. Most of the tools follow the OCR-D CLI conventions.
There is a helper tool to align multiple OCR results as well as a version of ocropy that works with Python 3.
Installation
There are multiple ways to install the ocrd_cis tools:
- make install uses pip to install ocrd_cis (see below).
- make install-devel uses pip -e to install ocrd_cis (see below).
- pip install --upgrade pip ocrd_cis_dir
- pip install --upgrade pip -e ocrd_cis_dir
It is possible to install ocrd_cis in a custom directory using virtualenv:
python3 -m venv venv-dir
source venv-dir/bin/activate
make install # or any other command to install ocrd_cis (see above)
# use ocrd_cis
deactivate
Usage
Most tools follow the OCR-D CLI conventions. They accept the --input-file-grp, --output-file-grp, --parameter, --mets and --log-level command line arguments (short and long). Some tools (most notably the alignment tool) expect a comma-separated list of multiple input file groups.
The ocrd-tool.json contains a schema description of the parameter config file for the different tools that accept the --parameter argument.
ocrd-cis-post-correct.sh
This bash script runs the post-correction using a pre-trained model. If additional support OCRs are used, models for these OCR steps are required and must be configured in the corresponding configuration file (see ocrd-tool.json).
Arguments:
- --parameter: path to configuration file
- --input-file-grp: name of the master-OCR file group
- --output-file-grp: name of the post-correction file group
- --log-level: set log level
- --mets: path to METS file in workspace
ocrd-cis-align
Aligns tokens of multiple input file groups to one output file group. This tool is used to align the master OCR with any additional support OCRs. It accepts a comma-separated list of input file groups, which it aligns in order.
Arguments:
- --parameter: path to configuration file
- --input-file-grp: comma-separated list of the input file groups; the first input file group is the master OCR
- --output-file-grp: name of the file group for the aligned result
- --log-level: set log level
- --mets: path to METS file in workspace
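As a minimal sketch of how the comma-separated input list could be assembled, the snippet below joins several file group names and prints the resulting call instead of running it. The file group names (OCR-D-OCR-MASTER, OCR-D-OCR-SUPPORT-1, OCR-D-OCR-SUPPORT-2, OCR-D-ALIGN), the helper join_groups and the parameter path are hypothetical placeholders, not names defined by ocrd_cis.

```shell
#!/bin/sh
set -eu

# join_groups GROUP...: print all arguments joined by commas.
# The subshell keeps the IFS change from leaking into the caller.
join_groups() {
  ( IFS=,; printf '%s\n' "$*" )
}

# Hypothetical file group names; the master OCR has to come first.
INPUT_GRPS="$(join_groups OCR-D-OCR-MASTER OCR-D-OCR-SUPPORT-1 OCR-D-OCR-SUPPORT-2)"

# Print the call rather than executing it (placeholder paths and groups).
echo "ocrd-cis-align --input-file-grp $INPUT_GRPS --output-file-grp OCR-D-ALIGN --mets mets.xml --parameter file:///path/to/config.json"
```

Keeping the master OCR as the first argument of join_groups preserves the ordering that the alignment tool expects.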
ocrd-cis-train.sh
Script to train a model from a list of ground-truth archives (see ocrd-tool.json) for the post correction. The tool somewhat mimics the behaviour of other ocrd tools:
- --mets for the workspace
- --log-level is passed to other tools
- --parameter is used as configuration
- --output-file-grp defines the output file group for the model
ocrd-cis-data
Helper tool to get the path of the installed data files. Usage: ocrd-cis-data [-jar|-3gs] to get the path of the jar library or the path to the default 3-grams language model file.
ocrd-cis-wer
Helper tool to calculate the word error rate of aligned OCR files. It writes a simple JSON-formatted stats file to the given output file group.
Arguments:
- --input-file-grp: input file group of aligned OCR results with their respective ground truth
- --output-file-grp: name of the file group for the stats file
- --log-level: set log level
- --mets: path to METS file in workspace
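To illustrate the idea of a word error rate over aligned tokens, here is a rough sketch that is not the implementation used by ocrd-cis-wer. It assumes a hypothetical input format of one aligned token pair per line (OCR token, a tab, then the ground-truth token) and reports the fraction of pairs that differ. The function name wer_from_pairs and the sample tokens are made up for this example.

```shell
#!/bin/sh
set -eu

# wer_from_pairs: read "ocr<TAB>gt" token pairs from stdin and print
# mismatching pairs / total pairs with four decimal places.
wer_from_pairs() {
  LC_ALL=C awk -F '\t' '
    NF >= 2 { total++; if ($1 != $2) errors++ }
    END {
      if (total == 0) { print "0.0000"; exit }
      printf "%.4f\n", errors / total
    }
  '
}

# Four aligned pairs, two of which differ.
printf 'Tbe\tThe\nquick\tquick\nbrovvn\tbrown\nfox\tfox\n' | wer_from_pairs
```

Because the tokens are already aligned, no edit-distance computation over the word sequence is needed in this simplified setting.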
ocrd-cis-profile
Runs the profiler over the files of the given input file group and adds a gzipped JSON-formatted profile to the output file group of the workspace. This tool requires an installed language profiler.
Arguments:
- --parameter: path to configuration file
- --input-file-grp: name of the input file group to profile
- --output-file-grp: name of the output file group where the profile is stored
- --log-level: set log level
- --mets: path to METS file in the workspace
ocrd-cis-ocropy-train
The ocropy-train tool can be used to train LSTM models. It takes ground truth from the workspace and saves (image+text) snippets from the corresponding pages. Then a model is trained on all snippets for 1 million randomized iterations (or the number of iterations given in the parameter file).
ocrd-cis-ocropy-train \
--input-file-grp OCR-D-GT-SEG-LINE \
--mets mets.xml \
--parameter file:///path/to/config.json
ocrd-cis-ocropy-clip
The ocropy-clip tool can be used to remove intrusions of neighbouring segments in regions / lines of a workspace. It runs a (ad-hoc binarization and) connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to white. It references the resulting segment image files in the output PAGE (as AlternativeImage).
ocrd-cis-ocropy-clip \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-CLIP \
--mets mets.xml \
--parameter file:///path/to/config.json
ocrd-cis-ocropy-resegment
The ocropy-resegment tool can be used to remove overlap between lines of a workspace. It runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
ocrd-cis-ocropy-resegment \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-RES \
--mets mets.xml \
--parameter file:///path/to/config.json
ocrd-cis-ocropy-segment
The ocropy-segment tool can be used to segment regions into lines. It runs a (ad-hoc binarization and) line segmentation on every text region of every PAGE in the input file group, and adds a TextLine element with the resulting polygon outline to the annotation of the output PAGE.
ocrd-cis-ocropy-segment \
--input-file-grp OCR-D-SEG-BLOCK \
--output-file-grp OCR-D-SEG-LINE \
--mets mets.xml \
--parameter file:///path/to/config.json
ocrd-cis-ocropy-deskew
The ocropy-deskew tool can be used to deskew pages / regions of a workspace. It runs the Ocropy thresholding and deskewing estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
ocrd-cis-ocropy-deskew \
--input-file-grp OCR-D-SEG-LINE \
--output-file-grp OCR-D-SEG-LINE-DES \
--mets mets.xml \
--parameter file:///path/to/config.json
ocrd-cis-ocropy-denoise
The ocropy-denoise tool can be used to despeckle pages / regions / lines of a workspace. It runs the Ocropy "nlbin" denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).
ocrd-cis-ocropy-denoise \
--input-file-grp OCR-D-SEG-LINE-DES \
--output-file-grp OCR-D-SEG-LINE-DEN \
--mets mets.xml \
--parameter file:///path/to/config.json
ocrd-cis-ocropy-binarize
The ocropy-binarize tool can be used to binarize, denoise and deskew pages / regions / lines of a workspace. It runs the Ocropy "nlbin" adaptive thresholding, deskewing estimation and denoising on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
ocrd-cis-ocropy-binarize \
--input-file-grp OCR-D-SEG-LINE-DES \
--output-file-grp OCR-D-SEG-LINE-BIN \
--mets mets.xml \
--parameter file:///path/to/config.json
ocrd-cis-ocropy-dewarp
The ocropy-dewarp tool can be used to dewarp text lines of a workspace. It runs the Ocropy baseline estimation and dewarping on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).
ocrd-cis-ocropy-dewarp \
--input-file-grp OCR-D-SEG-LINE-BIN \
--output-file-grp OCR-D-SEG-LINE-DEW \
--mets mets.xml \
--parameter file:///path/to/config.json
ocrd-cis-ocropy-recognize
The ocropy-recognize tool can be used to recognize lines / words / glyphs from pages of a workspace. It runs the Ocropy optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
ocrd-cis-ocropy-recognize \
--input-file-grp OCR-D-SEG-LINE-DEW \
--output-file-grp OCR-D-OCR-OCRO \
--mets mets.xml \
--parameter file:///path/to/config.json
Tesserocr
Install essential system packages for Tesserocr
sudo apt-get install python3-tk \
tesseract-ocr libtesseract-dev libleptonica-dev \
libimage-exiftool-perl libxml2-utils
Then install Tesserocr from: https://github.com/OCR-D/ocrd_tesserocr
pip install -r requirements.txt
pip install .
Download Tesseract models from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files (or use your own models) and place them into: /usr/share/tesseract-ocr/4.00/tessdata
Workflow configuration
A decent pipeline might look like this:
1. page-level cropping
2. page-level binarization
3. page-level deskewing
4. page-level dewarping
5. region segmentation
6. region-level clipping
7. region-level deskewing
8. line segmentation
9. line-level clipping or resegmentation
10. line-level dewarping
11. line-level recognition
12. line-level alignment
If ground truth (GT) is used, steps 1, 5 and 8 can be omitted. Otherwise, if the segmentation used in steps 5 and 8 does not produce overlapping segments, steps 6 and 9 can be omitted.
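The line-level part of such a pipeline (line segmentation, clipping, dewarping, recognition) could be chained roughly as follows. This is a sketch under assumptions: the intermediate file group names are taken from the examples above but their chaining is chosen only for illustration, the parameter path is a placeholder, and DRY_RUN, run_step and run_pipeline are hypothetical helpers. In a real run each tool would likely need its own parameter file. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
set -eu

METS="mets.xml"
PARAMS="file:///path/to/config.json"  # placeholder path, as in the examples above
DRY_RUN="${DRY_RUN:-1}"               # hypothetical switch: 1 = print commands only

# run_step TOOL INPUT_GRP OUTPUT_GRP
run_step() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$1 --input-file-grp $2 --output-file-grp $3 --mets $METS --parameter $PARAMS"
  else
    "$1" --input-file-grp "$2" --output-file-grp "$3" --mets "$METS" --parameter "$PARAMS"
  fi
}

# Each step reads the file group written by the previous one.
run_pipeline() {
  run_step ocrd-cis-ocropy-segment   OCR-D-SEG-BLOCK     OCR-D-SEG-LINE
  run_step ocrd-cis-ocropy-clip      OCR-D-SEG-LINE      OCR-D-SEG-LINE-CLIP
  run_step ocrd-cis-ocropy-dewarp    OCR-D-SEG-LINE-CLIP OCR-D-SEG-LINE-DEW
  run_step ocrd-cis-ocropy-recognize OCR-D-SEG-LINE-DEW  OCR-D-OCR-OCRO
}

run_pipeline
```

Setting DRY_RUN=0 would execute the tools instead of printing them, which requires an installed ocrd_cis and a prepared workspace.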
Testing
To run a few basic tests, type make test (ocrd_cis has to be installed in order to run any tests).
OCR-D workspace
- Create a new (empty) workspace: ocrd workspace init workspace-dir
- cd into workspace-dir
- Add a new file to the workspace: ocrd workspace add file -G group -i id -m mimetype
OCR-D links