Skip to main content

The OCR evaluation tool

Project description

dinglehopper

dinglehopper is an OCR evaluation tool and reads ALTO, PAGE and text files. It compares a ground truth (GT) document page with a OCR result page to compute metrics and a word/character differences report.

Build Status

Goals

  • Useful
    • As a UI tool
    • For an automated evaluation
    • As a library
  • Unicode support

Installation

It's best to use pip, e.g.:

sudo pip install .

Usage

Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX]

  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results.
  In that case, use --no-metrics to disable the then meaningless metrics and
  also change the color scheme from green/red to blue.

  The comparison report will be written to $REPORT_PREFIX.{html,json}, where
  $REPORT_PREFIX defaults to "report". The reports include the character
  error rate (CER) and the word error rate (WER).

  By default, the text of PAGE files is extracted on 'region' level. You may
  use "--textequiv-level line" to extract from the level of TextLine tags.

Options:
  --metrics / --no-metrics  Enable/disable metrics and green/red
  --textequiv-level LEVEL   PAGE TextEquiv level to extract text from
  --progress                Show progress bar
  --help                    Show this message and exit.

For example:

dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml

This generates report.html and report.json.

dinglehopper displaying metrics and character differences

dinglehopper-extract

The tool dinglehopper-extract extracts the text of the given input file on stdout, for example:

dinglehopper-extract --textequiv-level line OCR-D-GT-PAGE/00000024.page.xml

OCR-D

As a OCR-D processor:

ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL

This generates HTML and JSON reports in the OCR-D-OCR-TESS-EVAL filegroup.

The OCR-D processor has these parameters:

Parameter Meaning
-P metrics false Disable metrics and the green-red color scheme (default: enabled)
-P textequiv_level line (PAGE) Extract text from TextLine level (default: TextRegion level)

For example:

ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -P metrics false

Developer information

Please refer to README-DEV.md.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ocrd-fork-dinglehopper-0.0.1.tar.gz (15.6 kB view details)

Uploaded Source

Built Distribution

ocrd_fork_dinglehopper-0.0.1-py3-none-any.whl (24.1 kB view details)

Uploaded Python 3

File details

Details for the file ocrd-fork-dinglehopper-0.0.1.tar.gz.

File metadata

  • Download URL: ocrd-fork-dinglehopper-0.0.1.tar.gz
  • Upload date:
  • Size: 15.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.6.9

File hashes

Hashes for ocrd-fork-dinglehopper-0.0.1.tar.gz
Algorithm Hash digest
SHA256 ef93f47570fab14dae828b3f453f891bcc5938ff752efd3115653676fddc33d0
MD5 0f5e8860083f89cff4b063ac6234cf90
BLAKE2b-256 3cfc6a90c19039f1add96aa0aefd01806fa29b5d377592bc3fb047adccf0ac0c

See more details on using hashes here.

File details

Details for the file ocrd_fork_dinglehopper-0.0.1-py3-none-any.whl.

File metadata

  • Download URL: ocrd_fork_dinglehopper-0.0.1-py3-none-any.whl
  • Upload date:
  • Size: 24.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.6.9

File hashes

Hashes for ocrd_fork_dinglehopper-0.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 125b98faabc8931a1523fb8d5464c79b9233067ec94547fd132beb9a0cf7d1d3
MD5 23915557430c7afa47c25a8a502f7fd1
BLAKE2b-256 0d5e056511cf60f75182d7d124be0b4a98053705e13e15f13cc8b61c0a86f03c

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page