An OCR evaluation tool

dinglehopper

dinglehopper is an OCR evaluation tool that reads ALTO, PAGE and plain text files. It compares a ground truth (GT) document page with an OCR result page to compute metrics and a report of word and character differences. It also supports batch processing by generating, aggregating and summarizing multiple reports.

Goals

  • Useful
    • As a UI tool
    • For an automated evaluation
    • As a library
  • Unicode support

Installation

It's best to use pip to install the package from PyPI, e.g.:

pip install dinglehopper

Usage

Usage: dinglehopper [OPTIONS] GT OCR [REPORT_PREFIX] [REPORTS_FOLDER]

  Compare the PAGE/ALTO/text document GT against the document OCR.

  dinglehopper detects if GT/OCR are ALTO or PAGE XML documents to extract
  their text and falls back to plain text if no ALTO or PAGE is detected.

  The files GT and OCR are usually a ground truth document and the result of
  an OCR software, but you may use dinglehopper to compare two OCR results. In
  that case, use --no-metrics to disable the then meaningless metrics and also
  change the color scheme from green/red to blue.

  The comparison report will be written to
  $REPORTS_FOLDER/$REPORT_PREFIX.{html,json}, where $REPORTS_FOLDER defaults
  to the current working directory and $REPORT_PREFIX defaults to "report".
  The reports include the character error rate (CER) and the word error rate
  (WER).

  By default, the text of PAGE files is extracted on 'region' level. You may
  use "--textequiv-level line" to extract from the level of TextLine tags.

Options:
  --metrics / --no-metrics  Enable/disable metrics and green/red
  --differences BOOLEAN     Enable reporting character and word level
                            differences
  --textequiv-level LEVEL   PAGE TextEquiv level to extract text from
  --progress                Show progress bar
  --help                    Show this message and exit.

For example:

dinglehopper some-document.gt.page.xml some-document.ocr.alto.xml

This generates report.html and report.json.
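
To make the two reported metrics concrete, here is a minimal sketch of how a character error rate and a word error rate can be derived from an edit distance. It is an illustration only, not dinglehopper's implementation: the function names are hypothetical, and the simple NFC normalization and whitespace tokenization are assumptions that the real tool may handle differently.

```python
import unicodedata


def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(reference, compared):
    """Edit distance divided by the length of the reference sequence."""
    if len(reference) == 0:
        return 0.0 if len(compared) == 0 else float("inf")
    return levenshtein(reference, compared) / len(reference)


def cer(gt_text, ocr_text):
    # NFC normalization so that composed and decomposed forms of the
    # same character compare equal (an assumption of this sketch).
    gt = unicodedata.normalize("NFC", gt_text)
    ocr = unicodedata.normalize("NFC", ocr_text)
    return error_rate(gt, ocr)


def wer(gt_text, ocr_text):
    # Whitespace tokenization is a simplification made for this sketch.
    return error_rate(gt_text.split(), ocr_text.split())


print(cer("Schlyñ", "Schlym"))
print(wer("the quick brown fox", "the quick brovvn fox"))
```

Both rates are normalized by the length of the GT, so an OCR result with many inserted characters can yield a rate above 1.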

[Screenshot: dinglehopper displaying metrics and character differences]

Batch comparison between folders of GT and OCR files can be done by simply providing folders:

dinglehopper gt/ ocr/ report output_folder/

This assumes that you have files with the same name in both folders, e.g. gt/00000001.page.xml and ocr/00000001.alto.xml.

The example generates reports for each set of files, with the prefix report, in the (automatically created) folder output_folder/.
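
Before starting a long batch run, it can help to check which GT files have a counterpart in the OCR folder. The sketch below pairs file names by the part before the first dot, so that 00000001.page.xml matches 00000001.alto.xml. This matching rule and the function names are assumptions made for illustration; dinglehopper's own matching logic may differ.

```python
def base_name(file_name):
    """Hypothetical matching key: the file name up to the first dot."""
    return file_name.split(".", 1)[0]


def pair_by_base_name(gt_names, ocr_names):
    """Return (pairs, unmatched GT names) for two collections of file names."""
    ocr_by_base = {base_name(name): name for name in ocr_names}
    pairs, unmatched = [], []
    for gt_name in sorted(gt_names):
        ocr_name = ocr_by_base.get(base_name(gt_name))
        if ocr_name is None:
            unmatched.append(gt_name)
        else:
            pairs.append((gt_name, ocr_name))
    return pairs, unmatched


pairs, unmatched = pair_by_base_name(
    ["00000001.page.xml", "00000002.page.xml"],
    ["00000001.alto.xml"],
)
print(pairs)
print(unmatched)
```

For real folders, the two arguments could be os.listdir("gt") and os.listdir("ocr").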

By default, the JSON report does not contain the character and word differences, only the calculated metrics. If you want to include the differences, use the --differences flag:

dinglehopper gt/ ocr/ report output_folder/ --differences
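
For automated evaluation, the metrics can be read back from a JSON report. The following is a minimal sketch that assumes the report stores the metrics under the keys "cer" and "wer"; check a generated report.json to confirm the actual key names. The example values are made up for illustration.

```python
import json


def read_metrics(report_json):
    """Return (cer, wer) from the text of a JSON report."""
    report = json.loads(report_json)
    # .get() returns None when a key is absent, for example if the
    # report was generated with --no-metrics (assumed behaviour).
    return report.get("cer"), report.get("wer")


example = '{"cer": 0.05, "wer": 0.2}'  # made-up values, assumed key names
cer_value, wer_value = read_metrics(example)
print(f"CER: {cer_value:.2%}, WER: {wer_value:.2%}")
```

With a real report, pass the file contents instead, e.g. read_metrics(open("report.json", encoding="utf-8").read()).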

dinglehopper-summarize

A set of (JSON) reports can be summarized into a single set of reports. This is useful after having generated reports in batch. Example:

dinglehopper-summarize output_folder/

This generates summary.html and summary.json in the same output_folder.

If you are summarizing many reports and have used the --differences flag while generating them, it may be useful to limit the number of differences reported by using the --occurrences-threshold parameter. This will reduce the size of the generated HTML report, making it easier to open and navigate. Note that the JSON report will still contain all differences. Example:

dinglehopper-summarize output_folder/ --occurrences-threshold 10
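
The idea behind summarizing and thresholding can be sketched as follows. This is not the code of dinglehopper-summarize: the flat "differences" mapping, the output key names and the rule "keep differences that occur at least threshold times" are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def summarize(reports, occurrences_threshold=1):
    """Average the metrics and total the difference counts of several reports.

    Each report is a dict with the assumed keys "cer", "wer" and, optionally,
    "differences" (a mapping from a difference label to its count).
    """
    cers = [r["cer"] for r in reports if r.get("cer") is not None]
    wers = [r["wer"] for r in reports if r.get("wer") is not None]

    totals = Counter()
    for report in reports:
        # Counter.update() adds the counts of the given mapping.
        totals.update(report.get("differences", {}))

    # Keep only differences seen at least `occurrences_threshold` times.
    frequent = {d: n for d, n in totals.items() if n >= occurrences_threshold}

    return {
        "num_reports": len(reports),
        "cer_avg": mean(cers) if cers else None,
        "wer_avg": mean(wers) if wers else None,
        "differences": frequent,
    }


# Made-up reports for illustration only.
example_reports = [
    {"cer": 0.1, "wer": 0.2, "differences": {"ſ :: s": 2}},
    {"cer": 0.3, "wer": 0.4, "differences": {"ſ :: s": 1, "u :: n": 1}},
]
print(summarize(example_reports, occurrences_threshold=3))
```

A higher threshold keeps only the most frequent differences, which is what makes a large summary easier to browse.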

dinglehopper-line-dirs

You may also want to compare a directory of GT text files (e.g. gt/line0001.gt.txt) with a directory of OCR text files (e.g. ocr/line0001.some-ocr.txt). A separate CLI is available for this:

dinglehopper-line-dirs gt/ ocr/

dinglehopper-line-dirs can also work with GT text files that are in the same directories as the OCR text files. See dinglehopper-line-dirs --help for this case.

dinglehopper-extract

The tool dinglehopper-extract prints the text of the given input file to stdout, for example:

dinglehopper-extract --textequiv-level line OCR-D-GT-PAGE/00000024.page.xml

OCR-D

As an OCR-D processor:

ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-TESS -O OCR-D-OCR-TESS-EVAL

This generates HTML and JSON reports in the OCR-D-OCR-TESS-EVAL filegroup.

The OCR-D processor has these parameters:

Parameter                Meaning
-P metrics false         Disable metrics and the green-red color scheme (default: enabled)
-P textequiv_level line  (PAGE) Extract text from TextLine level (default: TextRegion level)

For example:

ocrd-dinglehopper -I ABBYY-FULLTEXT,OCR-D-OCR-CALAMARI -O OCR-D-OCR-COMPARE-ABBYY-CALAMARI -P metrics false

Developer information

Please refer to README-DEV.md.
