HIPE-OCRepair-scorer
The HIPE-OCRepair-scorer is a Python module for evaluating OCR post-correction.
It is developed and used in the context of the HIPE-OCRepair-2026 ICDAR Competition on OCR post-correction for historical documents, which is part of the broader HIPEval (Historical Information Processing Evaluation) initiative, a series of shared tasks on historical document processing.
Related repositories and websites:
- HIPE-OCRepair: Website of the competition hosted at ICDAR-2026
- HIPE-OCRepair-2026-data: public data releases (training, validation and test sets) for the HIPE-OCRepair-2026 shared task.
- HIPE-OCRepair-2026-eval: repository backing the Hugging Face leaderboard
Release history
- February 2026: v0.9, initial release of the OCR post-correction scorer
- March 2026: release on PyPI
Main functionalities | Input format, scorer entry points, and naming conventions | Installation and usage | About
Main functionalities 📊
The scorer evaluates OCR post-correction outputs against ground-truth transcriptions. It computes match error rates at character and word level (cMER/wMER) as well as preference metrics that compare the post-correction output to the raw OCR hypothesis.
Metrics
All metrics are based on Match Error Rate (MER), computed as:
$$\text{MER} = \frac{S + D + I}{H + S + D + I}$$
where H = hits, S = substitutions, D = deletions, I = insertions. Unlike standard CER/WER, MER is capped in [0, 1] because insertions are included in the denominator. This reduces sensitivity to extreme hallucinations while remaining easy to interpret. MER is equivalent to the normalized CER in the sense of the OCR-D evaluation spec (see https://ocr-d.de/en/spec/ocrd_eval.html#character-error-rate-cer).
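To make the formula concrete, here is a minimal, illustrative implementation of character-level MER via a standard edit-distance dynamic program. The helper names `edit_counts` and `char_mer` are hypothetical; this is not the scorer's own code.

```python
def edit_counts(ref: str, hyp: str):
    """Return (S, D, I) for a minimum-edit alignment of hyp against ref."""
    n, m = len(ref), len(hyp)
    dp = [[(0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = (0, i, 0)  # delete all remaining reference chars
    for j in range(1, m + 1):
        dp[0][j] = (0, 0, j)  # insert all remaining hypothesis chars
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s, d, ins = dp[i - 1][j - 1]
            match_or_sub = (s + (ref[i - 1] != hyp[j - 1]), d, ins)
            d2 = dp[i - 1][j]
            delete = (d2[0], d2[1] + 1, d2[2])
            i2 = dp[i][j - 1]
            insert = (i2[0], i2[1], i2[2] + 1)
            dp[i][j] = min(match_or_sub, delete, insert, key=sum)
    return dp[n][m]

def char_mer(ref: str, hyp: str) -> float:
    """MER = (S + D + I) / (H + S + D + I), bounded in [0, 1]."""
    S, D, I = edit_counts(ref, hyp)
    H = len(ref) - S - D  # hits: reference chars aligned exactly
    denom = H + S + D + I
    return (S + D + I) / denom if denom else 0.0
```

Note how insertions inflate the denominator too: a heavily hallucinating hypothesis drives MER toward 1 but never beyond it.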
Primary metrics
- cMER (character-level MER, micro-averaged): corpus-level character match error rate, the main evaluation metric. Micro-averaged so longer documents contribute more than shorter ones.
- Preference score (macro average): a simple sign-based metric computed per input document and then averaged, unweighted, across documents. For each document we compute two cMER scores: one compares the post-correction output against the gold transcription, the other compares the raw OCR hypothesis against the gold. The sign of their difference yields 1 (improved), 0 (tied), or -1 (worse). This captures how consistently a system improves over its input, while the micro-averaged cMER (see above) captures the magnitude of improvement.
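The sign-based computation described above can be sketched as follows. This is a minimal illustration, not the scorer's API; `preference_score` is a hypothetical name, and its inputs are per-document cMER values of the post-correction output and of the raw OCR hypothesis, each measured against the gold.

```python
def preference_score(mer_corrected, mer_ocr):
    """Unweighted macro average of per-document improvement signs, in [-1, 1]."""
    signs = [
        1 if c < o else (-1 if c > o else 0)
        for c, o in zip(mer_corrected, mer_ocr)
    ]
    return sum(signs) / len(signs)
```

A score of 1 means every document improved, -1 that every document got worse, and values near 0 indicate mixed or neutral behaviour.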
Additional metrics
- wMER (word-level MER): reported for completeness, but cMER is preferred in historical OCR due to spelling variation and transcription conventions.
- Confidence intervals: 95% confidence intervals are computed for all measures to quantify statistical uncertainty.
Normalization and stratification
Before scoring, text is normalized as follows:
- Case-folded to lowercase
- Unicode letters and digits are kept (including accented characters such as é, ç, ü)
- All other characters (punctuation, symbols) are replaced with space
- Whitespace is collapsed
This means evaluation is case-insensitive and punctuation-insensitive, but sensitive to accented characters (é ≠ e).
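A minimal sketch of these normalization steps (assumed behaviour, not the scorer's exact code):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, keep Unicode letters/digits, map the rest to space, collapse runs."""
    text = text.lower()
    # \w covers Unicode letters and digits but also "_", which we drop explicitly
    text = re.sub(r"[^\w]|_", " ", text)
    return " ".join(text.split())
```

Accented characters survive normalization, so é and e remain distinct after this step.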
Reference records can also opt out of scoring by setting
ground_truth.exclude_from_icdar_evaluation to true. Such documents are
excluded from evaluation entirely and reported on stderr.
Results can be stratified by dataset or any user-defined mapping.
Input format, scorer entry points, and naming conventions
The scorer accepts two entry points (the same example structure is used in both):
- A pair of JSONL files: one for reference, one for hypothesis.
- A pair of folders: containing reference and hypothesis JSONL files respectively.
Each JSONL record contains a JSON object with these fields:
{
  "document_metadata": { "document_id": "...", "primary_dataset_name": "..." },
  "ground_truth": { "transcription_unit": "...", "exclude_from_icdar_evaluation": false },
  "ocr_hypothesis": { "transcription_unit": "..." },
  "ocr_postcorrection_output": { "transcription_unit": "..." }
}
The ground_truth.exclude_from_icdar_evaluation field is optional. When it is
set to true, the corresponding document is ignored during scoring.
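For illustration, a hypothetical filter implementing this opt-out rule (not part of the package API):

```python
def is_scorable(record: dict) -> bool:
    """True unless the reference record opts out of ICDAR evaluation."""
    return not record.get("ground_truth", {}).get(
        "exclude_from_icdar_evaluation", False
    )
```

Since the field is optional, records that omit it are treated as scorable.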
All JSON documents conform to the HIPE-OCRepair JSON Schema (add link later).
Sample data for quick inspection is available under data/sample.
Reference JSONL files
Reference files follow the HIPE-OCRepair canonical naming convention:
<file_basename>_<version>_<dataset>_<primary_version>_<split>_<language>.jsonl
Where:
- file_basename: always hipe-ocrepair-bench
- version: benchmark version
- dataset: dataset name (e.g., icdar2017)
- primary_version: primary dataset version
- split: data split (e.g., train, test)
- language: dataset language (e.g., en, fr)
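A hypothetical parser for this pattern (field names follow the convention above; `parse_reference_name` is not part of the scorer's API):

```python
import re

# One regex group per component of the canonical reference filename.
REF_NAME = re.compile(
    r"(?P<file_basename>hipe-ocrepair-bench)"
    r"_(?P<version>[^_]+)"
    r"_(?P<dataset>[^_]+)"
    r"_(?P<primary_version>[^_]+)"
    r"_(?P<split>[^_]+)"
    r"_(?P<language>[^_]+)\.jsonl$"
)

def parse_reference_name(name: str):
    """Return the filename components as a dict, or None if it does not match."""
    m = REF_NAME.match(name)
    return m.groupdict() if m else None
```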
Submission (hypothesis) JSONL files
Submission files to be evaluated are named as:
teamname_<inputfile>_runX.jsonl
Installation and usage 🔧
The scorer requires Python 3.12. Install it from PyPI:
pip install hipe-ocrepair-scorer
Or install it as an editable dependency for development:
python3 -m venv venv
source venv/bin/activate
pip install -e .
Python usage
import json
from hipe_ocrepair_scorer import Evaluation, align_records
# Load your JSONL files
def load_jsonl(path):
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
REF = load_jsonl("reference.jsonl")
PRED = load_jsonl("hypothesis.jsonl")
# 1. Align the files by document_id
merged_data = align_records(REF, PRED)
# 2. Run the evaluation; .score_over_datasets stratifies by `primary_dataset_name`
evaluator = Evaluation(merged_data)
results = evaluator.score_over_datasets(normalize=True)
# 3. Print results (e.g., Micro-averaged Character MER)
score, lo, hi = results["averaged_scores"]["cmer_micro"]
print(f"Character MER: {score:.4f} (95% CI: {lo:.4f} - {hi:.4f})")
To try this out, you can use a toy reference file and a toy hypothesis file as reference.jsonl and hypothesis.jsonl (see the sample data under data/sample).
CLI usage
After installation, the hipe-ocrepair-scorer command is available.
Evaluate a single file pair
hipe-ocrepair-scorer \
--reference hipe_ocrepair_scorer/data/sample/reference/hipe-ocrepair-bench_v0.9_icdar2017_v1.2_train_fr.sample.jsonl \
--hypothesis hipe_ocrepair_scorer/data/sample/hypothesis/no_edits_baseline/no_edits_hipe-ocrepair-bench_v0.9_icdar2017_v1.2_train_fr.sample_run1.jsonl
Evaluate all files in a folder pair
hipe-ocrepair-scorer \
--reference-dir hipe_ocrepair_scorer/data/sample/reference/ \
--hypothesis-dir hipe_ocrepair_scorer/data/sample/hypothesis/no_edits_baseline/
In folder mode, the scorer matches each reference file to its corresponding hypothesis file by filename. Hypothesis files are expected to contain the reference filename stem (see naming conventions above).
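The pairing rule can be sketched on plain filename lists (a simplified illustration; `pair_filenames` is a hypothetical helper, and the real scorer works on directories):

```python
def pair_filenames(ref_names, hyp_names):
    """Map each reference filename to the first hypothesis filename containing its stem."""
    pairs = {}
    for ref in ref_names:
        stem = ref[: -len(".jsonl")] if ref.endswith(".jsonl") else ref
        matches = [h for h in hyp_names if stem in h]
        if matches:
            pairs[ref] = matches[0]
    return pairs
```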
Output format
Results are printed to stdout as JSON.
File mode returns scores for the single file pair:
{
  "averaged_scores": {
    "metric_name": [score, lower_ci, upper_ci],
    ...
  },
  "fold_scores": {
    "dataset_name": {
      "metric_name": [score, lower_ci, upper_ci],
      ...
    }
  }
}
Folder mode returns per-file results for each reference/hypothesis pair:
{
  "per_file": {
    "reference_filename_1": {
      "averaged_scores": { ... },
      "fold_scores": { ... }
    },
    "reference_filename_2": {
      "averaged_scores": { ... },
      "fold_scores": { ... }
    }
  }
}
Each metric is a tuple of (score, lower_95%_CI, upper_95%_CI). Metrics include
cmer_micro, wmer_micro, cmer_macro, wmer_macro, pref_score_cmer_macro,
pref_score_wmer_macro, pcis_cmer_macro, and pcis_wmer_macro.
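For example, folder-mode output captured from stdout can be post-processed like this (the filenames and scores below are made-up placeholders):

```python
import json

# Collect the micro-averaged cMER point estimate per reference file
# from folder-mode output, whose structure is documented above.
raw = '''
{
  "per_file": {
    "ref_a.jsonl": {"averaged_scores": {"cmer_micro": [0.12, 0.10, 0.14]}},
    "ref_b.jsonl": {"averaged_scores": {"cmer_micro": [0.08, 0.06, 0.10]}}
  }
}
'''
results = json.loads(raw)
cmer_by_file = {
    name: scores["averaged_scores"]["cmer_micro"][0]
    for name, scores in results["per_file"].items()
}
```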
About
License
See the LICENSE file in the repository for details.
Acknowledgments
The HIPE-2026 organising team expresses its sincere appreciation to the ICDAR 2026 Conference and Competition Committee for hosting the task. HIPE-eval editions are organised within the framework of the Impresso - Media Monitoring of the Past project, funded by the Swiss National Science Foundation under grant No. CRSII5_213585 and by the Luxembourg National Research Fund under grant No. 17498891.
Source Distribution
File details
Details for the file hipe_ocrepair_scorer-0.9.9.tar.gz.
File metadata
- Download URL: hipe_ocrepair_scorer-0.9.9.tar.gz
- Upload date:
- Size: 20.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b75ad03b84f02c8d3a5dbc08f890907cd15556b71385393c139be69bcea8563b |
| MD5 | 9330eeb478e8f7f08d86db1d0e8a30cd |
| BLAKE2b-256 | ee1499eca076e33772634a91841cd6618970774d40d71e9b67d3ce7bd5f8e1d2 |