Collection of utilities for Tesseract OCR training
Python tools for Tesseract OCR training
Training tools for Tesseract OCR.
Installation
Install using pip:
pip install pytesstrain
This will also install the Python packages pytesseract (used for running Tesseract) and editdistance (used for calculating error rates).
Getting started
This package contains tools for specific problems:
text2image is crashing (issue #1781 @ Tesseract OCR)
The text2image tool crashes if text lines are too long. As stated in the issue above, rewrapping text lines to a shorter length is the official workaround for this problem. For example, to limit the line length to at most 35 characters, run
rewrap corpus.txt corpus-35.txt 35
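To illustrate what rewrapping does, here is a minimal sketch using Python's standard textwrap module. It is not the implementation of the rewrap tool; the function name rewrap_lines and the sample sentence are made up for this example.

```python
import textwrap


def rewrap_lines(lines, max_width):
    """Re-wrap every input line so that no output line exceeds max_width.

    Hypothetical helper for illustration; the actual rewrap tool may
    treat blank lines or very long words differently.
    """
    wrapped = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # textwrap.wrap breaks at whitespace and returns a list of lines
        wrapped.extend(textwrap.wrap(line, width=max_width))
    return wrapped


sample = [
    "The quick brown fox jumps over the lazy dog "
    "and keeps running through the forest."
]
short_lines = rewrap_lines(sample, 35)
for line in short_lines:
    print(line)
```

Every line in short_lines is at most 35 characters long, and joining the lines with single spaces gives back the original sentence.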
Creating dictionary data from corpus file
If you do not have a dictionary file for the training language, you can create one from the corpus file. To create a dictionary file for the language lang, run
create_dictdata -l lang -i corpus.txt -d ./langdata/lang
This tool creates the following files:
- lang.training_text (copy of the corpus file)
- lang.wordlist (dictionary)
- lang.word.bigrams (word bigrams)
- lang.training_text.bigram_freqs (character bigram frequencies)
- lang.training_text.unigram_freqs (character frequencies)
The file lang.wordlist.freq is usually created by training tools, such as tesstrain.sh and the like, so there is no need to create it with create_dictdata.
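The kind of statistics behind these files can be sketched with collections.Counter. This is an illustration only: the function name corpus_statistics is hypothetical, splitting on whitespace is an assumption, and the actual tokenisation and on-disk formats used by create_dictdata are not shown here.

```python
from collections import Counter


def corpus_statistics(text):
    """Compute a word list and n-gram counts from a corpus string.

    Assumption: words are separated by whitespace and character
    bigrams are counted within words only.
    """
    words = text.split()
    wordlist = sorted(set(words))                   # unique words
    word_bigrams = Counter(zip(words, words[1:]))   # adjacent word pairs
    char_unigrams = Counter(ch for ch in text if not ch.isspace())
    char_bigrams = Counter(
        a + b for word in words for a, b in zip(word, word[1:])
    )
    return wordlist, word_bigrams, char_unigrams, char_bigrams


wordlist, word_bigrams, char_unigrams, char_bigrams = corpus_statistics(
    "the cat saw the cat"
)
print(wordlist)
print(word_bigrams.most_common(1))
print(char_bigrams.most_common(2))
```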
Language metrics
The tool language_metrics runs Tesseract OCR over images of random word sequences, which are created out of the supplied wordlist, and calculates median metrics (currently CER and WER) from the results. It enables you to assess the quality of your .traineddata file.
To calculate metrics for the language lang with fonts Arial and Courier using wordlist file lang.wordlist, run
language_metrics -l lang -w lang.wordlist --fonts Arial,Courier
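For readers unfamiliar with the metrics, the sketch below shows one common way to define CER and WER: the edit distance between reference and recognised text, divided by the reference length in characters or words. It uses a small pure-Python Levenshtein function so that it runs without extra packages (the package itself relies on editdistance). The function names and the sample pairs are illustrative, and the exact normalisation used by language_metrics is an assumption.

```python
from statistics import median


def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits per reference character."""
    return levenshtein(ref, hyp) / len(ref) if ref else 0.0


def wer(ref, hyp):
    """Word error rate: word edits per reference word."""
    ref_words = ref.split()
    if not ref_words:
        return 0.0
    return levenshtein(ref_words, hyp.split()) / len(ref_words)


# (reference, recognised) pairs; made-up sample data
pairs = [
    ("hello world", "hello world"),
    ("hello world", "hallo world"),
    ("hello world", "hello w0rld"),
]
median_cer = median(cer(ref, hyp) for ref, hyp in pairs)
median_wer = median(wer(ref, hyp) for ref, hyp in pairs)
print(median_cer, median_wer)
```

Taking the median rather than the mean keeps a single badly recognised image from dominating the result.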
Creating unicharambigs file
This package contains two tools that enable automatic creation of a unicharambigs file.
The first tool, collect_ambiguities, compares the recognised text with the reference text, extracts the smallest possible differences as error and correction pairs, and stores them in a JSON file, sorted by frequency of occurrence. You may inspect the ambiguities yourself before converting them to a unicharambigs file with the second tool.
The second tool, json2unicharambigs, takes the intermediate JSON file and puts the ambiguities into the unicharambigs file. The resulting file has the v2 format. You may limit the ambiguities that go into the unicharambigs file with additional command-line switches.
To create the file lang.unicharambigs for the language lang using wordlist file lang.wordlist, run
collect_ambiguities -l lang -w lang.wordlist --fonts Arial,Courier -o ambigs.json
json2unicharambigs --mode safe --mandatory_only ambigs.json lang.unicharambigs
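The idea of extracting error and correction pairs can be sketched with difflib.SequenceMatcher from the standard library. This is not the algorithm used by collect_ambiguities, which is not described in detail here. The function name extract_ambiguities, the JSON layout, and the sample words are assumptions made for illustration.

```python
import json
from collections import Counter
from difflib import SequenceMatcher


def extract_ambiguities(reference, recognised):
    """Return (error, correction) pairs for substituted character runs."""
    matcher = SequenceMatcher(None, recognised, reference, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # recognised[i1:i2] was read where reference[j1:j2] was expected
            pairs.append((recognised[i1:i2], reference[j1:j2]))
    return pairs


# (reference, recognised) samples; made-up data
samples = [("modern", "modem"), ("clear", "dear"), ("modern", "modem")]
counts = Counter()
for reference, recognised in samples:
    counts.update(extract_ambiguities(reference, recognised))

# Most frequent pairs first; this JSON layout is hypothetical
ranked = [
    {"error": error, "correction": correction, "count": count}
    for (error, correction), count in counts.most_common()
]
print(json.dumps(ranked, indent=2))
```

In this sample, the most frequent pair is "m" recognised in place of "rn", followed by "d" in place of "cl".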
API Reference
The main workhorse is the function pytesstrain.train.run_test. There is also a parallel version, pytesstrain.train.run_tests, which uses a pool of threads to run the former function on multiple processors simultaneously (using threads instead of processes for parallelisation is possible, because the run_test function starts processes itself and is thus I/O-bound).
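The threading pattern described above can be sketched as follows. The sketch does not call run_test or run_tests, whose signatures are not documented here. Instead, a hypothetical function run_external launches a child Python process as a stand-in for an external OCR call, and run_many distributes such calls over a thread pool.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_external(word):
    """Stand-in for an OCR call: run a child process and return its output.

    The child is a Python one-liner that upper-cases its argument; a real
    worker would start an external tool such as tesseract instead.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", word],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def run_many(words, workers=4):
    """Run run_external for every word using a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in input order
        return list(pool.map(run_external, words))


outputs = run_many(["alpha", "beta", "gamma"])
print(outputs)
```

Each thread spends most of its time waiting for its child process to finish, so several child processes can run at the same time even though the workers are threads.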
The subpackage tesseract simply imports the package pytesseract. The subpackage text2image imitates the former one, but for the text2image tool instead of tesseract.
The subpackage metrics contains implementations of metrics, such as CER and WER. The subpackage utils has often-used, miscellaneous functions and the subpackage ambigs contains ambiguity processing functions.
Finally, the subpackage cli contains the console scripts.
License
Pytesstrain is released under Apache License 2.0.