Evaluate Digitalization Data

digital eval

Python 3 tool to report evaluation outcomes from mass digitalization workflows.

Features

- OCR-D-compliant normalized similarity for edit-distance metrics based on characters, letters and words
- choose from textual metrics based on characters or words, plus common Information Retrieval metrics
- choose from different Unicode normalization forms provided by Python
- match groundtruth (i.e. reference data) and candidates by filename start
- use geometric information to evaluate only a specific frame (i.e. a specific column or region of a large page) of candidates (requires ALTO or PAGE format)
- aggregate evaluation outcomes on domain range (with multiple subdomains) according to folder layout
- formats: ALTO, PAGE or plain text for both groundtruth and candidates
- speedup with parallel execution
- additional OCR util:
  - filter custom areas of single ALTO OCR files

Installation

pip install digital-eval

Usage

Metrics

Edit-Distance similarity

Calculate the string similarity for each pair of reference (groundtruth) and test (candidate) items. The metric works either on the complete character-based text string (Cs, Characters) or on letters only (Ls, Letters), i.e. without whitespace, punctuation and common digits (Arabic, Persian). The word/token-based edit distance compares single tokens identified by markup elements or whitespace, depending on the data.
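The three granularities can be illustrated with a minimal sketch. It is not digital-eval's own code: the helper names are hypothetical, the normalization (1 minus distance divided by the longer length) is only one common choice and an assumption here, and `str.isalpha()` is a simplified stand-in for the letter filter described above.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two sequences."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(reference, candidate):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(reference), len(candidate))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(reference, candidate) / longest


def letters_only(text):
    """Keep alphabetic characters only (drops whitespace, punctuation, digits)."""
    return "".join(ch for ch in text if ch.isalpha())


reference = "Hello, World 42"
candidate = "Hallo, World 42"
char_sim = similarity(reference, candidate)                 # characters (Cs)
letter_sim = similarity(letters_only(reference),
                        letters_only(candidate))            # letters (Ls)
word_sim = similarity(reference.split(), candidate.split())  # whitespace tokens
```

A single wrong character costs little at character level but invalidates a whole token at word level, which is why the word-based score is usually the lowest of the three.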

Set based

Calculate union of sets of tokens/words (BoW, BagOfWords). Operate on sets of tokens/words with respect to language-specific stopwords, using the nltk framework, for:

- Precision (IRPre, Pre, Precision): How many tokens from the candidate are in the groundtruth reference?
- Recall (IRRec, Rec, Recall): How many tokens from the groundtruth reference does the candidate include?
- F-Measure (IRFMeasure, FM): weighted harmonic mean of Precision and Recall
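The set-based metrics can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: the function names are hypothetical, the tiny stopword set stands in for an nltk stopword list, and `beta=1.0` (the balanced F1 score) is assumed because the weighting used by digital-eval is not stated here.

```python
def tokens(text, stopwords=frozenset()):
    """Lower-cased, whitespace-separated tokens without stopwords, as a set."""
    return {tok for tok in text.lower().split() if tok not in stopwords}


def precision_recall_fmeasure(reference, candidate, beta=1.0):
    """Set-based Precision, Recall and F-measure (beta=1 weights both equally)."""
    hits = len(reference & candidate)
    precision = hits / len(candidate) if candidate else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Hypothetical stand-in for a language-specific nltk stopword list.
STOPWORDS = frozenset({"the", "a", "of"})

ref_tokens = tokens("the quick brown fox jumps", STOPWORDS)
cand_tokens = tokens("the quick brown dog", STOPWORDS)
pre, rec, fm = precision_recall_fmeasure(ref_tokens, cand_tokens)
```

Here two of the three candidate tokens occur in the reference (Precision 2/3), and two of the four reference tokens are found in the candidate (Recall 1/2).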

UTF-8 Normalization

Use Python's standard implementation of Unicode normalization forms; default: NFC (cf. OCR-D spec).
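Why normalization matters before comparing strings can be shown with the standard library's `unicodedata.normalize`. Whether digital-eval calls exactly this function is an assumption; the example only demonstrates the effect of the forms.

```python
import unicodedata

composed = "\u00e4"      # 'ä' as a single precomposed code point
decomposed = "a\u0308"   # 'a' followed by a combining diaeresis

# Both render identically, but compare as different strings.
raw_equal = composed == decomposed

# After NFC normalization both become the same single code point.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# NFD goes the other way and splits the precomposed character.
nfd_length = len(unicodedata.normalize("NFD", composed))
```

Without a shared normalization form, every such character would count as an edit-distance error even though groundtruth and candidate look the same.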

Statistics

Statistics are calculated with numpy and include the arithmetic mean and the median. An outlier detection based on the interquartile range is also included. Results are shown in relation to the total amount of groundtruth/reference (ref) data for each metric, i.e. characters, letters or tokens/words.
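A rough sketch of such a summary follows. To stay dependency-free it uses the standard-library `statistics` module instead of numpy, and the 1.5 x IQR fences are a common convention assumed here, not a documented setting of digital-eval; `summarize` is a hypothetical name.

```python
import statistics


def summarize(values):
    """Mean, median and IQR-based outliers for a list of metric values."""
    # 'inclusive' interpolates linearly between the sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "outliers": [v for v in values if v < lower or v > upper],
    }


# Four good pages and one badly recognized page.
scores = [0.95, 0.96, 0.97, 0.98, 0.40]
summary = summarize(scores)
```

The single bad page pulls the mean well below the median, which is the situation an outlier report is meant to expose.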

Evaluate treelike structures

To evaluate OCR candidate data in batches against existing groundtruth, make sure that your directory structures fit this layout:

groundtruth root/
├── <domain>/ 
│    └── <subdomain>/
│         └── <page-01>.gt.xml
candidate root/
├── <domain>/ 
│    └── <subdomain>/
│         └── <page-01>.xml

Now call:

digital-eval <path-candidate-root>/domain/ -ref <path-groundtruth>/domain/

for an aggregated overview on stdout. Increase verbosity with -v (or even -vv) to get detailed information about each single data item evaluated.
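How candidates and groundtruth from the layout above might be paired by filename start can be sketched like this. The function `match_by_filename_start` is hypothetical and simplified (it takes the first match and does not guard against ambiguous prefixes); the tool's real matching logic may differ.

```python
from pathlib import Path
import tempfile


def match_by_filename_start(candidate_root, groundtruth_root):
    """Pair each candidate file with the groundtruth file whose name begins
    with the candidate's stem, looking in the same relative folder."""
    pairs = []
    for candidate in sorted(Path(candidate_root).rglob("*.xml")):
        relative_dir = candidate.parent.relative_to(candidate_root)
        gt_dir = Path(groundtruth_root) / relative_dir
        matches = sorted(gt_dir.glob(candidate.stem + "*")) if gt_dir.is_dir() else []
        if matches:
            pairs.append((candidate, matches[0]))
    return pairs


# Build a tiny throw-away tree mirroring the layout above.
with tempfile.TemporaryDirectory() as tmp:
    gt_file = Path(tmp, "groundtruth", "domain", "subdomain", "page-01.gt.xml")
    cand_file = Path(tmp, "candidate", "domain", "subdomain", "page-01.xml")
    for path in (gt_file, cand_file):
        path.parent.mkdir(parents=True)
        path.touch()
    pairs = match_by_filename_start(Path(tmp, "candidate"), Path(tmp, "groundtruth"))
    matched = [(c.name, g.name) for c, g in pairs]
```

Keeping the relative folder in the lookup is what allows results to be aggregated per domain and subdomain afterwards.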

Structured OCR is expected to contain valid geometrical and textual data at word level; for recent PAGE data, line level is also possible.

Data problems

Inconsistent OCR groundtruth with empty texts (ALTO String elements missing CONTENT, or PAGE without TextEquiv) or invalid geometrical coordinates (fewer than 3 points, or none at all) will lead to evaluation errors if geometry must be respected.
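A pre-check for the two problems named above could look roughly like the following sketch. It uses only `xml.etree.ElementTree`; the sample document and both helper names are made up for illustration, and the checks digital-eval performs internally may be stricter.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment: one valid word, one blank, one without CONTENT.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String ID="w1" CONTENT="Hello"/>
    <String ID="w2" CONTENT=""/>
    <String ID="w3"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def strings_without_content(alto_xml):
    """Return IDs of ALTO String elements whose CONTENT is missing or blank."""
    root = ET.fromstring(alto_xml)
    return [
        element.get("ID")
        for element in root.iter()
        # Compare local names so any ALTO namespace version is accepted.
        if element.tag.rsplit("}", 1)[-1] == "String"
        and not (element.get("CONTENT") or "").strip()
    ]


def has_valid_polygon(points):
    """A usable region needs at least 3 coordinate pairs."""
    return len(points) >= 3


empty_ids = strings_without_content(ALTO_SAMPLE)
```

Running such a check on groundtruth before an evaluation makes it easier to repair the affected files instead of discovering them in the error report.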

Erroneous data files will be reported and excluded from evaluation.

Additional OCR Utils

Filter Area (ALTO)

You can filter a custom area of a page of an ALTO file by providing the points of an arbitrary shape. The format of the -p, --points argument is <pt_1_x>,<pt_1_y> <pt_2_x>,<pt_2_y> <pt_3_x>,<pt_3_y> ... <pt_n_x>,<pt_n_y>. Simple rectangular areas can also be expressed with two points, the first being the top left and the second the bottom right corner: <pt_top_left_x>,<pt_top_left_y> <pt_bottom_right_x>,<pt_bottom_right_y>.

The following example filters a rectangular area of 600x400 pixels from a page described by an input ALTO file and saves the result to an output ALTO file:

ocr-util frame -i page_1.alto.xml -p "0,0 600,0 600,400 0,400" -o page_1_area.alto.xml

For plain rectangles, a short form with only two points (top left and bottom right) exists:

ocr-util frame -i page_1.alto.xml -p "0,0 600,400" -o page_1_area.alto.xml
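The relation between the two forms can be made explicit with a small parser. This is a sketch of how a points argument in the described format might be interpreted; `parse_points` is a hypothetical helper and not part of ocr-util.

```python
def parse_points(argument):
    """Turn '<x>,<y> <x>,<y> ...' into a list of (x, y) integer tuples.
    Two points are read as top left and bottom right of a rectangle."""
    points = [tuple(int(value) for value in pair.split(","))
              for pair in argument.split()]
    if any(len(point) != 2 for point in points):
        raise ValueError(f"each point needs exactly two values: {argument!r}")
    if len(points) == 2:
        # Expand the short form clockwise, starting at the top left corner.
        (left, top), (right, bottom) = points
        points = [(left, top), (right, top), (right, bottom), (left, bottom)]
    if len(points) < 3:
        raise ValueError(f"a shape needs at least three points: {argument!r}")
    return points


short_form = parse_points("0,0 600,400")
long_form = parse_points("0,0 600,0 600,400 0,400")
```

Under this reading, both commands shown above describe the same four-cornered area.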

Development

Platform: Intel(R) Core(TM) i5-6500 CPU@3.20GHz, 16GB RAM, Ubuntu 22.04 LTS, Python 3.8+

# clone local
git clone <repository-url> <local-dir>
cd <local-dir>

# enable virtual python 3 environment (linux)
# and update pip itself
python3 -m venv venv
. venv/bin/activate
python -m pip install -U pip

# install
python -m pip install -e .

# install additional development dependencies
python -m pip install -r tests/test_requirements.txt

# run tests
python -m pytest -v

Contribute

Contributions, suggestions and proposals welcome!

License

Under terms of the MIT license.

NOTE: This software depends on packages that might be licensed under different terms.

Release history

1.10.1: Jan 27, 2026
1.9.1: Oct 16, 2025
1.9.0: Oct 7, 2025
1.8.0: Oct 2, 2025 (this version)
1.6.0: May 24, 2024
1.5.3: Jun 14, 2023
1.5.2: Jun 7, 2023
1.5.1: Apr 17, 2023
1.2.1: Nov 17, 2022
1.2.0: Nov 16, 2022

Download files

Source Distribution

digital_eval-1.8.0.tar.gz (47.5 kB)

Uploaded Oct 2, 2025 Source

Built Distribution

digital_eval-1.8.0-py3-none-any.whl (39.8 kB)

Uploaded Oct 2, 2025 Python 3

File details

Details for the file digital_eval-1.8.0.tar.gz.

File metadata

Download URL: digital_eval-1.8.0.tar.gz
Upload date: Oct 2, 2025
Size: 47.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for digital_eval-1.8.0.tar.gz
Algorithm	Hash digest
SHA256	`3c7c86c00b8e5791aaa3ddd62eb5bd6ce4eb1167eb8a05ce48d6dcf468888343`
MD5	`107cd12251a059d410aa4884d50a5942`
BLAKE2b-256	`4d5eec60bb5b0f2444dae198e4aedfd9b1a12d80763712eff0d810e0fd08323b`

File details

Details for the file digital_eval-1.8.0-py3-none-any.whl.

File metadata

Download URL: digital_eval-1.8.0-py3-none-any.whl
Upload date: Oct 2, 2025
Size: 39.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for digital_eval-1.8.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`33b3ad0a774f28de814dbefd638f4487432c43ba717fc9e2ccc564f72713abb6`
MD5	`353d9ec06509632efbc1b58cd22725f6`
BLAKE2b-256	`fa32869fb09d4996de13828e62b2d75003ccef3cd8cf3acf73e012a8993216e1`
