Evaluate Mass Digitalization Data
digital eval
A Python3 tool to report evaluation outcomes from mass digitalization workflows.
Features
- automatically match groundtruth (i.e. reference data) and candidates by filename
- use geometric information to evaluate only a specific frame (i.e. a specific column or region of a large page) of candidates (requires ALTO or PAGE format)
- aggregate evaluation outcomes over a domain range (with multiple subdomains)
- choose from textual metrics based on characters or words, plus common Information Retrieval metrics
- choose between accuracy / error rate and different Unicode normalization forms
- formats: ALTO, PAGE or plain text for both groundtruth and candidates
- speed up evaluation with parallel execution
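The filename-based matching of groundtruth and candidates can be pictured roughly as follows. This is a minimal sketch; the helper names and the exact suffix handling are assumptions for illustration, not the tool's actual API:

```python
from pathlib import Path

def stem(path: Path) -> str:
    """Strip all suffixes, so 'page-01.gt.xml' and 'page-01.xml' share a stem."""
    return path.name.split('.', 1)[0]

def match_by_filename(gt_dir, cand_dir):
    """Pair groundtruth and candidate files sharing the same base filename."""
    gt = {stem(p): p for p in Path(gt_dir).rglob('*') if p.is_file()}
    pairs = []
    for cand in Path(cand_dir).rglob('*'):
        if cand.is_file() and stem(cand) in gt:
            pairs.append((gt[stem(cand)], cand))
    return pairs
```

Candidates without a matching reference file are simply skipped in this sketch.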
Installation
pip install digital-eval
Usage
Metrics
Calculate similarity (acc) or difference (err) ratios between a single reference/groundtruth item and a test/candidate item.
Edit-Distance based
Character-based: the text string minus whitespace (Cs, Characters), or letter-based (Ls, Letters): minus whitespace, punctuation and digits.
Word/token-based: edit distance over single tokens delimited by whitespace.
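The character metrics reduce to a normalized Levenshtein distance. A minimal sketch, assuming the Cs/Characters preprocessing (whitespace stripped); function names are illustrative, not the package's API:

```python
def levenshtein(ref: str, cand: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(cand) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, c in enumerate(cand, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != c)))  # substitution
        prev = curr
    return prev[-1]

def char_accuracy(ref: str, cand: str) -> float:
    """Similarity ratio: 1.0 means identical, 0.0 means nothing matches."""
    ref = ''.join(ref.split())    # drop whitespace, as for Cs/Characters
    cand = ''.join(cand.split())
    if not ref:
        return 1.0 if not cand else 0.0
    return max(0.0, 1.0 - levenshtein(ref, cand) / len(ref))
```

The error rate is simply the complement, `1.0 - char_accuracy(...)`, clipped at zero.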
Set based
Calculate the union of the sets of tokens/words (BoW, BagOfWords).
Operate on sets of tokens/words, with respect to language-specific stopwords via the nltk framework, for:
- Precision (IRPre, Pre, Precision): how many tokens from the candidate are in the groundtruth reference?
- Recall (IRRec, Rec, Recall): how many tokens from the groundtruth reference does the candidate include?
- F-Measure (IRFMeasure, FM): the weighted harmonic mean of Precision and Recall
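The set-based metrics can be sketched as below. Stopword filtering is shown with a plain set instead of nltk's language-specific stopword lists, and the function name is an illustrative assumption:

```python
def ir_metrics(gt_tokens, cand_tokens, stopwords=frozenset()):
    """Set-based precision / recall / F1 over stopword-filtered token sets."""
    gt = set(gt_tokens) - stopwords
    cand = set(cand_tokens) - stopwords
    hits = len(gt & cand)
    precision = hits / len(cand) if cand else 0.0
    recall = hits / len(gt) if gt else 0.0
    # harmonic mean; zero when both precision and recall are zero
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1
```

With nltk, the stopword set would come from something like `nltk.corpus.stopwords.words('german')`.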
Unicode Normalizations
Use the standard Python implementation of Unicode normalization forms; default: NFKD.
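Python's standard implementation lives in the unicodedata module. For example, NFKD decomposes ligatures and accented characters, which makes visually equivalent spellings comparable:

```python
import unicodedata

def normalize(text: str, form: str = "NFKD") -> str:
    """Apply a Unicode normalization form (NFC, NFD, NFKC or NFKD)."""
    # NFKD decomposes compatibility characters: the ligature 'fi' becomes
    # 'f' + 'i', and a precomposed 'e'-acute becomes 'e' + combining accent.
    return unicodedata.normalize(form, text)
```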
Statistics
Statistics, calculated via numpy, include the arithmetic mean, the median, and outlier detection using the interquartile range; they are computed per metric (i.e. chars, letters or tokens) against the specific groundtruth/reference (ref).
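The aggregation can be sketched with numpy as follows, assuming the common 1.5 × IQR (Tukey) fences for outlier detection; the exact fences and function name here are assumptions:

```python
import numpy as np

def summarize(scores):
    """Mean, median, and IQR-based outliers (1.5 * IQR Tukey fences)."""
    arr = np.asarray(scores, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = arr[(arr < low) | (arr > high)]
    return {
        "mean": float(arr.mean()),
        "median": float(np.median(arr)),
        "outliers": outliers.tolist(),
    }
```

A single badly recognized page then shows up as an outlier instead of silently dragging the mean down unexplained.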
Evaluate treelike structures
To evaluate OCR candidate data in batch against existing groundtruth, please make sure your directory structures match the following layout:
groundtruth root/
├── <domain>/
│ └── <subdomain>/
│ └── <page-01>.gt.xml
candidate root/
├── <domain>/
│ └── <subdomain>/
│ └── <page-01>.xml
Now call via:
digital-eval <path-candidate-root>/domain/ -ref <path-groundtruth>/domain/
for an aggregated overview on stdout. Feel free to increase verbosity via -v (or even -vv) to get detailed information about each single data set that was evaluated.
Structured OCR is expected to contain valid geometrical and textual data at word level, though for recent PAGE formats line level is also possible.
Data problems
Inconsistent OCR groundtruth with empty texts (ALTO String elements missing CONTENT, or PAGE without TextEquiv) or invalid geometrical coordinates (fewer than 3 points, or even empty) will lead to evaluation errors if geometry must be respected.
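Such problems can be screened for up front. A rough sketch for ALTO input using the standard library; the element and attribute names follow the ALTO schema (String, CONTENT, Polygon, POINTS), while the helper itself is hypothetical:

```python
import xml.etree.ElementTree as ET

def local(tag: str) -> str:
    """Strip a namespace prefix like '{http://...}String'."""
    return tag.rsplit('}', 1)[-1]

def find_alto_problems(alto_xml: str):
    """Collect String elements with missing CONTENT or degenerate polygons."""
    problems = []
    for elem in ET.fromstring(alto_xml).iter():
        if local(elem.tag) != 'String':
            continue
        ident = elem.get('ID', '<no id>')
        if not elem.get('CONTENT'):
            problems.append(f'{ident}: empty CONTENT')
        for child in elem.iter():
            if local(child.tag) == 'Polygon':
                # POINTS is assumed to hold whitespace-separated "x,y" pairs;
                # a polygon needs at least 3 points to span an area
                points = (child.get('POINTS') or '').split()
                if len(points) < 3:
                    problems.append(f'{ident}: fewer than 3 points')
    return problems
```

Running such a check over the groundtruth tree before evaluation turns cryptic runtime errors into an explicit problem list.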
Development
Platform: Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz, 16GB RAM, Ubuntu 20.04 LTS, Python 3.8.
# clone repository locally
git clone <repository-url> <local-dir>
cd <local-dir>
# create and activate a virtual Python environment (Linux)
# and install libraries
python3 -m venv venv
. venv/bin/activate
python -m pip install -U pip
python -m pip install -r requirements.txt
# install digital-eval
pip install .
# optional:
# install additional development dependencies and run tests
pip install -r tests/test_requirements.txt
pytest -v
# run the CLI
digital-eval --help
Contribute
Contributions, suggestions and proposals welcome!
Licence
Licensed under the terms of the MIT License.
NOTE: This software depends on other packages that may be licensed under different open source licenses.