Collection of utilities for Tesseract OCR training

Python tools for Tesseract OCR training

Training tools for Tesseract OCR.

Installation

Install using pip:

pip install pytesstrain

This also installs the Python packages pytesseract (used to run Tesseract) and editdistance (used to calculate error rates).

Getting started

This package contains tools for specific problems:

text2image is crashing (issue #1781 @ Tesseract OCR)

The text2image tool crashes if text lines are too long. As stated in the issue above, rewrapping the text to a shorter line length is the official workaround for this problem. For example, to limit lines to at most 35 characters, run

rewrap corpus.txt corpus-35.txt 35
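To illustrate what rewrapping does to a corpus, here is a minimal sketch built on the standard-library textwrap module. It is not the implementation of the rewrap tool; the function name rewrap_lines and the choice to keep long words intact are assumptions made for this example.

```python
import textwrap


def rewrap_lines(lines, max_len):
    """Re-wrap every input line so that output lines hold at most max_len characters.

    Hypothetical helper for illustration; not part of pytesstrain.
    """
    wrapped = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # drop empty lines
        # Words are never split, so a single word longer than max_len
        # would still end up on a line of its own.
        wrapped.extend(
            textwrap.wrap(
                line,
                width=max_len,
                break_long_words=False,
                break_on_hyphens=False,
            )
        )
    return wrapped


sample = ["The quick brown fox jumps over the lazy dog and keeps running"]
for out in rewrap_lines(sample, 35):
    print(out)
```

Applied to a whole corpus file, the same loop would read the input line by line and write the wrapped lines to the output file.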

Creating dictionary data from corpus file

If you do not have a dictionary file for the training language, you can create one from the corpus file. To create a dictionary file for the language lang, run

create_dictdata -l lang -i corpus.txt -d ./langdata/lang

This tool creates the following files:

  • lang.training_text (copy of the corpus file)
  • lang.wordlist (dictionary)
  • lang.word.bigrams (word bigrams)
  • lang.training_text.bigram_freqs (character bigram frequencies)
  • lang.training_text.unigram_freqs (character frequencies)

The file lang.wordlist.freq is usually created by training tools such as tesstrain.sh, so there is no need to create it with create_dictdata.
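The kind of counting behind these files can be sketched with collections.Counter. This is only an illustration under assumptions: whitespace tokenisation, character bigrams taken within words, and the function name corpus_statistics are choices made here, and the actual file formats written by create_dictdata are not reproduced.

```python
from collections import Counter


def corpus_statistics(text):
    """Count words, word bigrams, characters and character bigrams.

    Hypothetical helper for illustration; tokenisation is plain
    whitespace splitting, which is an assumption.
    """
    words = text.split()
    wordlist = Counter(words)
    # Pairs of adjacent words
    word_bigrams = Counter(zip(words, words[1:]))
    # Single characters, ignoring whitespace
    char_unigrams = Counter(c for c in text if not c.isspace())
    # Pairs of adjacent characters inside each word
    char_bigrams = Counter(a + b for w in words for a, b in zip(w, w[1:]))
    return wordlist, word_bigrams, char_unigrams, char_bigrams


wordlist, word_bigrams, char_unigrams, char_bigrams = corpus_statistics(
    "the cat saw the cat"
)
print(wordlist.most_common(2))
print(word_bigrams.most_common(1))
```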

Language metrics

The tool language_metrics runs Tesseract OCR over images of random word sequences generated from the supplied wordlist and calculates median metrics from the results (currently CER, the character error rate, and WER, the word error rate). It lets you assess the quality of your .traineddata file.

To calculate metrics for the language lang with fonts Arial and Courier using wordlist file lang.wordlist, run

language_metrics -l lang -w lang.wordlist --fonts Arial,Courier
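As a rough illustration of the two metrics, the sketch below computes CER and WER from an edit distance and takes the median over several samples. It assumes the common convention of dividing the edit distance by the length of the reference; the package's exact definitions may differ. The pure-Python levenshtein function is a stand-in that keeps the example self-contained (the package itself relies on editdistance).

```python
from statistics import median


def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cer(reference, recognised):
    """Character error rate, normalised by reference length (assumed convention)."""
    return levenshtein(reference, recognised) / max(len(reference), 1)


def wer(reference, recognised):
    """Word error rate, normalised by the number of reference words."""
    ref_words = reference.split()
    return levenshtein(ref_words, recognised.split()) / max(len(ref_words), 1)


# (reference, recognised) pairs; made-up sample data for illustration
samples = [("hello world", "hallo world"), ("good morning", "good morning")]
print("median CER:", median(cer(ref, rec) for ref, rec in samples))
print("median WER:", median(wer(ref, rec) for ref, rec in samples))
```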

Creating unicharambigs file

This package contains two tools that automate the creation of a unicharambigs file.

The first tool, collect_ambiguities, compares the recognised text with the reference text, extracts the smallest possible differences as pairs of error and correction, and stores them in a JSON file, sorted by frequency of occurrence. You can inspect the ambiguities yourself before converting them to a unicharambigs file with the second tool.

The second tool, json2unicharambigs, takes the intermediate JSON file and writes the ambiguities to the unicharambigs file. The resulting file is in v2 format. Additional command-line switches let you limit which ambiguities go into the unicharambigs file.

To create the file lang.unicharambigs for the language lang using the wordlist file lang.wordlist, run

collect_ambiguities -l lang -w lang.wordlist --fonts Arial,Courier -o ambigs.json
json2unicharambigs --mode safe --mandatory_only ambigs.json lang.unicharambigs
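The idea of extracting error and correction pairs and ranking them by frequency can be sketched with difflib from the standard library. This is not the algorithm used by collect_ambiguities; the function name extract_ambiguities, the sample data, and the JSON keys are all hypothetical and chosen only for this example.

```python
import json
from collections import Counter
from difflib import SequenceMatcher


def extract_ambiguities(reference, recognised):
    """Yield (error, correction) pairs for every substituted span.

    Hypothetical helper; only substitutions are reported, while pure
    insertions and deletions are ignored in this sketch.
    """
    matcher = SequenceMatcher(None, recognised, reference, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            yield recognised[i1:i2], reference[j1:j2]


# (reference, recognised) pairs; made-up OCR confusions for illustration
samples = [("modern", "rnodern"), ("corner", "comer"), ("morning", "rnorning")]

pairs = Counter()
for ref, rec in samples:
    pairs.update(extract_ambiguities(ref, rec))

# Most frequent ambiguities first; the key names are an assumption
report = [
    {"error": err, "correction": corr, "count": n}
    for (err, corr), n in pairs.most_common()
]
print(json.dumps(report, indent=2))
```

Reviewing such a ranked list before conversion makes it easier to spot pairs that are artefacts of a particular font rather than genuine ambiguities.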

API Reference

The main workhorse is the function pytesstrain.train.run_test. A parallel version, pytesstrain.train.run_tests, uses a pool of threads to run run_test on multiple processors simultaneously. Threads can be used for parallelisation instead of processes because run_test itself starts external processes and mostly waits for them, which makes it I/O-bound.
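The threading pattern described here can be sketched as follows. The sketch does not call run_test or run_tests, whose signatures are not documented above. Instead, run_external is a hypothetical stand-in that launches a child Python process, just as a real test run would launch tesseract or text2image.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_external(word):
    """Hypothetical stand-in for one test run: start a child process and return its output."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1].upper())", word],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def run_all(words, max_workers=4):
    """Run the stand-in on all inputs using a pool of threads.

    Each thread spends its time waiting for a child process, so the
    child processes can occupy several processors at once.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the order of the inputs
        return list(pool.map(run_external, words))


if __name__ == "__main__":
    print(run_all(["alpha", "beta", "gamma"]))
```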

The subpackage tesseract simply imports the package pytesseract. The subpackage text2image mirrors it, but wraps the text2image tool instead of tesseract.

The subpackage metrics contains implementations of metrics such as CER and WER. The subpackage utils holds frequently used miscellaneous functions, and the subpackage ambigs contains functions for processing ambiguities.

Finally, the subpackage cli contains the console scripts.

License

Pytesstrain is released under the Apache License 2.0.
