
Evaluate Mass Digitalization Data

Project description

digital eval


A Python 3 tool to report evaluation outcomes from mass digitalization workflows.

Features

  • automatically match groundtruth (i.e. reference data) and candidates by filename
  • use geometric information to evaluate only a specific frame (i.e. a specific column or region of a large page) of the candidates (requires ALTO or PAGE format)
  • aggregate evaluation outcomes over a domain range (with multiple subdomains)
  • choose from textual metrics based on characters or words, plus common Information Retrieval metrics
  • choose between accuracy and error rate, and between the different Unicode normalization forms available in Python
  • formats: ALTO, PAGE or plain text for both groundtruth and candidates
  • speed up evaluation with parallel execution
  • additional OCR util:
    • filter custom areas of single OCR files

Installation

pip install digital-eval

Usage

Metrics

Calculate similarity (acc) or difference (err) ratios between a single reference/groundtruth item and a test/candidate item.

Edit-Distance based

Character-based metrics (Cs, Characters) compare the text string without whitespace characters; letter-based metrics (Ls, Letters) additionally drop punctuation and digits. Word/token-based metrics compute the edit distance over single tokens, which are identified by whitespace.
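As an illustration of the edit-distance metrics described above, here is a minimal sketch using a plain Levenshtein distance. The function names (`levenshtein`, `strip_whitespace`, `letters_only`, `accuracy`) and the choice to express accuracy as a percentage of the reference length are assumptions made for this example; they are not taken from the digital-eval code base.

```python
import string


def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def strip_whitespace(text):
    """Character view: the text without any whitespace."""
    return "".join(text.split())


def letters_only(text):
    """Letter view: drop whitespace, punctuation and digits."""
    return "".join(c for c in text
                   if not (c.isspace() or c.isdigit() or c in string.punctuation))


def accuracy(reference, candidate):
    """Similarity in percent, relative to the reference length (assumed definition)."""
    if not reference:
        return 100.0 if not candidate else 0.0
    distance = levenshtein(reference, candidate)
    return max(0.0, (len(reference) - distance) / len(reference) * 100)


if __name__ == "__main__":
    gt, cand = "the quick fox", "the quick f0x"
    print("chars:", accuracy(strip_whitespace(gt), strip_whitespace(cand)))
    print("letters:", accuracy(letters_only(gt), letters_only(cand)))
    print("words:", accuracy(gt.split(), cand.split()))
```

The same `levenshtein` helper works on token lists, so the word-based variant only differs in how the input is split.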

Set based

Calculate the union of sets of tokens/words (BoW, BagOfWords). The following metrics operate on sets of tokens/words and take language-specific stopwords into account, using the nltk framework:

  • Precision (IRPre, Pre, Precision): How many tokens from candidate are in groundtruth reference?
  • Recall (IRRec, Rec, Recall): How many tokens from groundtruth reference should candidate include?
  • F-Measure (IRFMeasure, FM): weighted harmonic mean of Precision and Recall
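The three Information Retrieval metrics listed above can be sketched on plain Python sets. This is only an illustration: the helper names, the tiny hard-coded stopword set (standing in for an nltk stopword list) and the `beta` weighting parameter are assumptions of this example, not the tool's API.

```python
def token_set(text, stopwords=frozenset()):
    """Split on whitespace and drop stopwords (case-insensitive)."""
    return {t for t in text.split() if t.lower() not in stopwords}


def precision_recall_fmeasure(reference, candidate, beta=1.0):
    """Return (precision, recall, f_measure) for two token sets."""
    hits = len(reference & candidate)
    precision = hits / len(candidate) if candidate else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    stop = frozenset({"the", "on", "a"})  # stand-in for a real stopword list
    ref = token_set("the cat sat on the mat", stop)
    cand = token_set("the cat sat on a hat", stop)
    print(precision_recall_fmeasure(ref, cand))
```

With `beta=1.0` precision and recall are weighted equally; other values shift the weight towards recall (`beta > 1`) or precision (`beta < 1`).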

UTF-8 Normalisations

Uses Python's standard implementation of the Unicode normalization forms (e.g. NFC, NFKD); default: NFKD.
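To show why the normalization form matters when comparing groundtruth and candidate text, here is a small sketch based on `unicodedata.normalize` from the Python standard library. The wrapper `normalize_text` is a hypothetical helper for this example; whether digital-eval calls `unicodedata` in exactly this way is an assumption.

```python
import unicodedata


def normalize_text(text, form="NFKD"):
    """Normalize text with one of the forms NFC, NFD, NFKC or NFKD."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    precomposed = "\u00e9"       # 'é' as a single code point
    decomposed = "e\u0301"       # 'e' followed by a combining acute accent
    # Without normalization the two strings differ, although they look identical.
    print(precomposed == decomposed)
    print(normalize_text(precomposed) == normalize_text(decomposed))
    # Compatibility forms (NFKC/NFKD) also expand ligatures such as U+FB01.
    print(normalize_text("\ufb01"))
```

Applying the same form to both sides before computing a metric avoids counting such representation differences as OCR errors.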

Statistics

Statistics calculated via numpy include arithmetic mean, median and outlier detection with interquartile range and are based on the specific groundtruth/reference (ref) for each metric, i.e. char, letters or tokens.
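The statistics mentioned above can be reproduced with a few numpy calls. The following is a sketch only: the function name `summarize` and the conventional fence factor of 1.5 times the interquartile range are assumptions of this example and may differ from what the tool uses.

```python
import numpy as np


def summarize(values, factor=1.5):
    """Mean, median and IQR-based outliers for a list of metric values."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    outliers = data[(data < lower) | (data > upper)]
    return {
        "mean": float(data.mean()),
        "median": float(np.median(data)),
        "outliers": outliers.tolist(),
    }


if __name__ == "__main__":
    # Hypothetical per-page accuracies; one page is clearly off.
    print(summarize([95.0, 96.0, 97.0, 98.0, 40.0]))
```

The median is less sensitive to a single bad page than the arithmetic mean, which is why both are worth reporting together with the outliers.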

Evaluate treelike structures

To evaluate a batch of OCR candidate data against existing groundtruth, please make sure that your directory structures look like this:

groundtruth root/
├── <domain>/
│     └── <subdomain>/
│         └── <page-01>.gt.xml
candidate root/
├── <domain>/
│     └── <subdomain>/
│         └── <page-01>.xml

Now call via:

digital-eval <path-candidate-root>/domain/ -ref <path-groundtruth>/domain/

for an aggregated overview on stdout. Feel free to increase verbosity via -v (or even -vv) to get detailed information about each single data set which was evaluated.
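The pairing of candidates and groundtruth by filename in such a tree could look roughly like the following sketch. It assumes that groundtruth files carry a `.gt` marker before the `.xml` extension, as in the layout above; the function names are hypothetical and this is not how digital-eval is necessarily implemented internally.

```python
from pathlib import Path


def _match_key(path, root, marker=".gt"):
    """Relative path below root, with the groundtruth marker removed."""
    rel = path.relative_to(root)
    suffix = marker + ".xml"
    if rel.name.endswith(suffix):
        rel = rel.with_name(rel.name[:-len(suffix)] + ".xml")
    return rel


def match_by_filename(candidate_root, reference_root):
    """Return (candidate, reference) path pairs that share the same key."""
    candidate_root, reference_root = Path(candidate_root), Path(reference_root)
    references = {_match_key(p, reference_root): p
                  for p in reference_root.rglob("*.xml")}
    pairs = []
    for candidate in sorted(candidate_root.rglob("*.xml")):
        reference = references.get(_match_key(candidate, candidate_root))
        if reference is not None:  # candidates without groundtruth are skipped
            pairs.append((candidate, reference))
    return pairs
```

Keeping the subdomain directory in the key means that two pages with the same filename in different subdomains are not confused with each other.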

Structured OCR is expected to contain valid geometrical and textual data at word level, although for recent PAGE data line level is also possible.

Data problems

Inconsistent OCR groundtruth with empty texts (ALTO String elements missing CONTENT, or PAGE without TextEquiv) or invalid geometrical coordinates (fewer than 3 points, or even none) will lead to evaluation errors if geometry must be respected.
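Such problems can be spotted before running an evaluation. The sketch below is a hypothetical pre-check, not part of digital-eval: it walks an XML document with the standard library and reports ALTO `String` elements without `CONTENT`, PAGE `Word` elements without a `TextEquiv` child, and `Coords` elements whose `points` attribute holds fewer than 3 points. Namespaces are ignored by comparing local tag names only.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace like '{http://...}String' down to 'String'."""
    return tag.rsplit("}", 1)[-1]


def find_problems(xml_text):
    """Return a list of human-readable problem descriptions."""
    problems = []
    for element in ET.fromstring(xml_text).iter():
        name = local_name(element.tag)
        if name == "String" and not element.get("CONTENT", "").strip():
            problems.append(f"String {element.get('ID', '?')}: missing CONTENT")
        elif name == "Word":
            if not any(local_name(child.tag) == "TextEquiv" for child in element):
                problems.append(f"Word {element.get('id', '?')}: missing TextEquiv")
        elif name == "Coords":
            points = element.get("points", "").split()
            if len(points) < 3:
                problems.append(f"Coords: only {len(points)} point(s)")
    return problems
```

Running a check like this over the groundtruth tree first makes it easier to tell data problems apart from genuine OCR errors.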

Additional OCR Utils

Filter Area

You can filter a custom area of a page of an OCR file by providing the points of an arbitrary shape. The format of the -p, --points argument is <pt_1_x>,<pt_1_y> <pt_2_x>,<pt_2_y> <pt_3_x>,<pt_3_y> ... <pt_n_x>,<pt_n_y>. Simple rectangular areas can also be expressed with just two points, the first being the top left and the second the bottom right corner: <pt_top_left_x>,<pt_top_left_y> <pt_bottom_right_x>,<pt_bottom_right_y>.

The following example filters a rectangular area of 600x400 pixels from a page that is described by an input ALTO file, and saves the result to an output ALTO file:

ocr-util frame -i page_1.alto.xml -p "0,0 600,0 600,400 0,400" -o page_1_area.alto.xml

Short version with top left and bottom right:

ocr-util frame -i page_1.alto.xml -p "0,0 600,400" -o page_1_area.alto.xml
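To make the relation between the long and the short form explicit, here is a sketch of how a points string in the format described above could be parsed, with the two-point form expanded to the four corners of a rectangle. `parse_points` is a hypothetical helper written for this example and is not taken from the ocr-util source.

```python
def parse_points(argument):
    """Parse 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    points = []
    for pair in argument.split():
        coords = pair.split(",")
        if len(coords) != 2:
            raise ValueError(f"expected 'x,y' but got {pair!r}")
        points.append((int(coords[0]), int(coords[1])))
    if len(points) == 2:
        # Short form: top left and bottom right corner of a rectangle.
        (left, top), (right, bottom) = points
        points = [(left, top), (right, top), (right, bottom), (left, bottom)]
    if len(points) < 3:
        raise ValueError("a shape needs at least 3 points")
    return points


if __name__ == "__main__":
    print(parse_points("0,0 600,400"))
    print(parse_points("0,0 600,0 600,400 0,400"))
```

Both calls describe the same 600x400 rectangle, matching the two command lines above.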

Development

Platform: Intel(R) Core(TM) i5-6500 CPU@3.20GHz, 16GB RAM, Ubuntu 20.04 LTS, Python 3.8.

# clone local
git clone <repository-url> <local-dir>
cd <local-dir>

# enable virtual python environment (linux)
# and install libraries
python3 -m venv venv
. venv/bin/activate
python -m pip install -U pip
python -m pip install -r requirements.txt

# install
pip install .

# optional:
# install additional development dependencies
pip install -r tests/test_requirements.txt
pytest -v

# run
digital-eval --help

Contribute

Contributions, suggestions and proposals welcome!

Licence

Under terms of the MIT license.

NOTE: This software depends on other packages that may be licensed under different open source licenses.
