Skip to main content

Evaluate Digitalization Data

Project description

digital eval

example workflow PyPi version PyPI - Downloads PyPI - License PyPI - Python Version

Python3 Tool evaluation for the masses - evaluate outcomes from mass digitalization workflows to reference data with 1.000+ pages in resonable time.

Features

  • OCR-D compliant normalized similarity for edit-distance based metrics based on characters, letters and words
  • choose from textual metrics based on characters or words plus common Information Retrieval
  • choose from different UTF-8 Python norms
  • match groundtruth (i.e. reference data) and candidates by filename within local filesystems
  • use geometric information to consider specific areas (i.e. certain region) of reference / candidate Please note: requires both reference and ground truth to include geometrical information (ALTO / PAGE XML format)
  • includes descriptive statistics for different aggregation levels
  • formats: ALTO, PAGE or plain text for both groundtruth and candidates
  • speedup with parallel execution
  • additional OCR util:
    • filter custom area of ALTO file

Requirements

  • recent *nix-OS
  • Python3.10 Environment

Installation

pip install digital-eval

Usage

Metrics

Edit-Distance similarity

Calculate string similarity for each single reference/groundtruth and test/candidate item. Complete haracter-based text string (Cs, Characters) or Letter-based (Ls, Letters) minus whitespaces, punctuation and common digits (arabic, persian). Word/Token-based edit-distance of single tokens identified by markup elements or whitespaces, depending on data.

Set based

Calculate union of sets of tokens/words (BoW, BagOfWords). Operate on sets of tokens/words with respect to language specific stopwords using nltk -framework for:

  • Precision (IRPre, Pre, Precision): How many tokens from candidate are in groundtruth reference?
  • Recall (IRRec, Rec, Recall): How many tokens from groundtruth reference should candidate include?
  • F-Measure (IRFMeasure, FM): weighted ratio Precision / Recall

UTF-8 Normalization

Use standard Python Implementation of UTF-8 normalizations; default: NFC (cf.:OCR-D spec).

Statistics

Statistics calculated by numpy include arithmetic mean, median.

Includes outlier detection with interquartile range.

Shows differences Characters vs. Letters and Words vs. Bag of Words to illustrate impact of Character classes and Layout detection.

Data problems

Inconsistent OCR Groundtruth with empty texts (ALTO String elements missing CONTENT or PAGE without TextEquiv) or invalid geometrical coordinates (less than 3 points or even empty) will lead to evaluation errors if geometry must be respected.

Erroneous data files will be reported and excluded from evaluation.

Additional OCR Utils

Filter Area (ALTO)

You can filter a custom area of a page of an ALTO file by providing the points of an arbitrary shape. The format of the -p, --points argument is <pt_1_x>,<pt_1_y> <pt_2_x>,<pt_2_y> <pt_3_x>,<pt_3_y> ... <pt_n_x>,<pt_n_y> . For simple rectangular areas this can be expressed also with two points, with first point as top left and second point as bottom right: <pt_top_left_x>,<pt_top_left_y> <pt_bottom_right_x>,<pt_bottom_right_y>.

The following example filters a rectangular area of 600x400 pixels of a page, which is described by an input ALTO file and saves the result to an output ALTO file

ocr-util frame -i page_1.alto.xml -p "0,0 600,0 600,400 0,400" -o page_1_area.alto.xml

For plain rectangles exists a short form with only two points, top left and bottom right:

ocr-util frame -i page_1.alto.xml -p "0,0 600,400" -o page_1_area.alto.xml

Development

Plattform: Intel(R) Core(TM) i5-6500 CPU@3.20GHz, 16GB RAM, Ubuntu 22.04 LTS, Python 3.10+

# clone local
git clone <repository-url> <local-dir>
cd <local-dir>

# enable virtual python 3 environment (linux)
# and update pip itself
python3 -m venv venv
. venv/bin/activate
python -m pip install -U pip

# install
python -m pip install -e .

# install additional development dependencies
python -m pip install -r tests/test_requirements.txt

# run tests
python -m pytest -v

Contribute

Contributions, suggestions and proposals welcome!

License

Under terms of the MIT license.

NOTE: This software depends on packages that might be licensed under different terms.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

digital_eval-1.10.1.tar.gz (49.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

digital_eval-1.10.1-py3-none-any.whl (40.6 kB view details)

Uploaded Python 3

File details

Details for the file digital_eval-1.10.1.tar.gz.

File metadata

  • Download URL: digital_eval-1.10.1.tar.gz
  • Upload date:
  • Size: 49.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for digital_eval-1.10.1.tar.gz
Algorithm Hash digest
SHA256 f71f2da424733c5334bc0bd5106efec68ce10b36b0efab2847d51d9ef1c4ae7f
MD5 4bd12f1a626b869b078082bc661bc753
BLAKE2b-256 5cefaeadd35e5e4815b54ba5472cbd21fd6f2d54644e1ff7f3346147b01fa299

See more details on using hashes here.

File details

Details for the file digital_eval-1.10.1-py3-none-any.whl.

File metadata

  • Download URL: digital_eval-1.10.1-py3-none-any.whl
  • Upload date:
  • Size: 40.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for digital_eval-1.10.1-py3-none-any.whl
Algorithm Hash digest
SHA256 6af42c9e9ce21b0a042fb371cb200421f33e111ff44b3521db2f30d35fde7293
MD5 9ed5b214d6820a5536174bc480cd835d
BLAKE2b-256 f82b54808bea7a29fae33dbb229017e584de8f97cf62389457a55c3d9f686bcd

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page