Collection of utilities for Tesseract OCR training

Python utilities for Tesseract OCR training

This module is a collection of training utilities for Tesseract OCR. The utilities are also installed as console scripts, so they can be run from the command line.

Utilities

Each utility lists its command-line switches when run with --help.

  • rewrap rewraps text lines to a specified maximum line length
  • create_dictdata creates all word and n-gram lists from a text file; these are then converted to DAWGs and added to the traineddata file
  • language_metrics creates random texts from a supplied word list and measures recognition error rates on them
  • collect_ambiguities extracts error-correction pairs from reference-hypothesis pairs and stores them in a JSON file
  • json2unicharambigs writes selected error-correction pairs from a JSON file to a unicharambigs file
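
To illustrate what extracting error-correction pairs from reference-hypothesis pairs can look like, here is a minimal sketch built on the standard library's difflib. It is not the implementation used by collect_ambiguities: the function names extract_pairs and collect, the sample strings, and the JSON layout are all assumptions made for this example.

```python
import json
from collections import Counter
from difflib import SequenceMatcher


def extract_pairs(reference, hypothesis):
    """Return (error, correction) substrings where the hypothesis was replaced.

    Hypothetical helper: aligns the OCR output (hypothesis) against the
    ground truth (reference) and keeps only the 'replace' operations.
    """
    matcher = SequenceMatcher(None, hypothesis, reference, autojunk=False)
    pairs = []
    for tag, h1, h2, r1, r2 in matcher.get_opcodes():
        if tag == "replace":
            pairs.append((hypothesis[h1:h2], reference[r1:r2]))
    return pairs


def collect(ref_hyp_pairs):
    """Count how often each (error, correction) pair occurs."""
    counts = Counter()
    for reference, hypothesis in ref_hyp_pairs:
        counts.update(extract_pairs(reference, hypothesis))
    return counts


# Made-up sample data: (reference, hypothesis) pairs.
samples = [("modern", "modem"), ("corner", "comer")]
counts = collect(samples)

# Assumed JSON layout, chosen for this sketch only.
records = [
    {"error": error, "correction": correction, "count": count}
    for (error, correction), count in counts.items()
]
print(json.dumps(records, indent=2))
```

Counting the pairs makes it possible to keep only the frequent confusions, which are the ones most likely to be worth recording in a unicharambigs file.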

Requirements

This module requires the following modules to work:

  • pytesseract (Running Tesseract OCR)
  • editdistance (Calculation of error rates)
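
As an illustration of how error rates follow from an edit distance, the sketch below computes character and word error rates. To stay self-contained it uses a small pure-Python Levenshtein function as a stand-in for the editdistance module; with that module installed, editdistance.eval(a, b) should return the same distance. The function names are assumptions for this example and are not taken from pytesstrain.metrics.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def char_error_rate(reference, hypothesis):
    """Character edits divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word edits divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


print(char_error_rate("modern", "modem"))
print(word_error_rate("the quick brown fox", "the quick brown fax"))
```

Both rates are normalised by the length of the reference, so they can exceed 1.0 when the hypothesis contains many insertions.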

Packages

The module is split into several packages:

  • pytesstrain.train contains the workhorse function run_text()
  • pytesstrain.cli contains the utilities you can run at the command line
  • pytesstrain.ambigs contains functions for working with unicharambigs files
  • pytesstrain.text2image contains the interface to the text2image command of Tesseract OCR; the interface relies on the pytesseract module and is modelled after it
  • pytesstrain.metrics contains error rate calculations, as well as the interface class Metrics
  • pytesstrain.utils contains auxiliary functions
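
For orientation, the sketch below shows roughly what a thin wrapper around the text2image command has to assemble. It does not reproduce the pytesstrain.text2image API: the function build_text2image_command and the file names are hypothetical, and only the text2image switches --text, --outputbase, --font and --fonts_dir are assumed from the Tesseract training tools.

```python
def build_text2image_command(text_file, output_base, font, fonts_dir=None):
    """Build the argument list for a text2image call (hypothetical helper)."""
    command = [
        "text2image",
        "--text=" + text_file,
        "--outputbase=" + output_base,
        "--font=" + font,
    ]
    if fonts_dir is not None:
        command.append("--fonts_dir=" + fonts_dir)
    return command


# Placeholder file names and font, chosen for this example only.
command = build_text2image_command("sample.txt", "sample", "Arial")
print(command)

# Running it requires the Tesseract training tools to be installed:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with font names that contain spaces.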
