HTRMoPo repository reference implementation

HTRMoPo

HTRMoPo is a schema and an implementation for an automatic text recognition (ATR) model repository hosted on the Zenodo research data infrastructure. It is designed to make models discoverable across a wide range of software and ATR-related tasks and to aid in model selection.

There are two versions of the schema: v0 and v1. v0 is the legacy kraken model schema for the Zenodo repository. It is fairly limited, in particular because it does not support non-recognition models and offers only limited ways of incorporating model cards. v1 is intended for all kinds of machine learning models involved in ATR, independent of the software used.

Schema

v0

v0 is retained mostly for historical interest. Records in v0 format consist of a JSON metadata file and at most one model file that is referenced in it.

v1

Repository records following the v1 schema consist of a Markdown model card with YAML metadata front matter and an arbitrary number of files in the record. There is an example model card inspired by the Hugging Face example template, but in principle model cards are free-form. The front matter can be validated against a JSON schema found here.
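As a rough illustration of the record layout described above, the following sketch splits a model card into its front matter and Markdown body using only the Python standard library. It assumes the common `---` delimiter convention for front matter. The helper name `split_front_matter` and the keys in the sample card are hypothetical placeholders, not the actual v1 schema fields, and validation against the JSON schema would be a separate step.

```python
def split_front_matter(text: str) -> tuple[str, str]:
    """Return (front_matter, body) for a document that starts with '---'."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("document does not start with a front matter block")
    # Look for the closing delimiter after the opening one.
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            front = "\n".join(lines[1:idx])
            body = "\n".join(lines[idx + 1:]).lstrip("\n")
            return front, body
    raise ValueError("front matter block is not terminated")


# Hypothetical card: the keys are placeholders, not the real schema fields.
SAMPLE_CARD = """---
summary: Example recognition model
model_type: recognition
---
# Example model

Free-form description of the model.
"""

front_matter, body = split_front_matter(SAMPLE_CARD)
print(front_matter)
print(body)
```

The front matter string could then be handed to a YAML parser and checked against the schema before publishing.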

How does it work?

Install the Python library and prepare a model card for your ATR model, whether it performs segmentation, recognition, reading order determination, post-correction, or another task. Afterwards you need to create an account on Zenodo and create an API access token as described here.

With the HTRMoPo reference implementation and the access token you can then create model deposits on Zenodo. Deposits will be immediately accessible to the whole world but won't be discoverable until the community inclusion request is manually approved by one of the repository administrators.

Using a research data infrastructure like Zenodo ensures long-term accessibility of the deposited models while also enabling good scientific practices such as reproducibility and crediting contributions.

Deposits and Identifiers

Each model in the repository consists of a model card with metadata and one or more model files, and is identified by two persistent, unique DOIs. One DOI refers to the deposit itself, meaning one specific version of the model, while the second is called the concept DOI. An example is 10.5281/zenodo.7051646 with concept DOI 10.5281/zenodo.7051645. When a new version of a model is uploaded to the repository, a new DOI is created, for example 10.5281/zenodo.14585602 for the above model, but the concept DOI remains the same. The concept DOI therefore aggregates all versions of a model under a single identifier and always links to the latest version.

Python Library

A reference implementation for interacting with the repository on Zenodo is in the htrmopo directory, which contains both a Python library and command line drivers.

The library can be installed using pip:

~> pip install htrmopo

CLI

The htrmopo command line tool is used to query the repository, download existing models, upload new models, and update existing ones.

Querying the repository

To get a listing of all models:

~> htrmopo list
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ DOI                         ┃ summary                        ┃ model type   ┃ keywords                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 10.5281/zenodo.7051645      │                                │              │                                │
│ ├── 10.5281/zenodo.14585602 │ Printed Urdu Base Model        │ recognition  │ automatic-text-recognition     │
│ ├── 10.5281/zenodo.14574660 │ Printed Urdu Base Model        │ recognition  │ kraken_pytorch                 │
│ └── 10.5281/zenodo.7051646  │ Printed Urdu Base Model        │ recognition  │ kraken_pytorch                 │
│                             │                                │              │                                │
│                             │                                │              │                                │
│                             │                                │              │                                │
│ 10.5281/zenodo.10066218     │                                │              │                                │
│ ├── 10.5281/zenodo.12743230 │ CATMuS Medieval 1.5.0          │ recognition  │ kraken_pytorch; handwritten    │
│ └── 10.5281/zenodo.10066219 │ CATMuS Medieval                │ recognition  │ text recognition; htr; middle  │
│                             │                                │              │ ages                           │
│                             │                                │              │ kraken_pytorch; handwritten    │
│                             │                                │              │ text recognition; htr; middle  │
│                             │                                │              │ ages                           │
│ 10.5281/zenodo.13788176     │                                │              │                                │
│ └── 10.5281/zenodo.13788177 │ McCATMuS - Transcription model │ recognition  │ kraken_pytorch; HTR; OCR;      │
│                             │ for handwritten, printed and   │              │ generic model                  │
│                             │ typewritten documents from the │              │                                │
│                             │ 16th century to the 21st       │              │                                │
│                             │ century                        │              │                                │
│ 10.5281/zenodo.14602568     │                                │              │                                │
│ └── 10.5281/zenodo.14602569 │ General segmentation model for │ segmentation │ multiscriptal                  │
│                             │ print and handwriting          │              │                                │
│ 10.5281/zenodo.5468572      │                                │              │                                │
│ └── 10.5281/zenodo.5468573  │ Medieval Hebrew manuscripts in │ recognition  │ kraken_pytorch                  
...

Records are represented as a tree in the left-most column. The DOI at the root of each tree is a concept DOI, which always links to the most recent version of a model. The leaves of the tree are particular versions of the record, ordered chronologically. Either type of DOI is accepted as an argument by the commands below, although it is recommended to reference a concrete version in contexts where reproducibility is desired.
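The relationship between concept DOIs and version DOIs can be sketched as a small lookup table. The DOIs below are copied from the listing above. The dictionary layout, the helper names `resolve` and `doi_url`, and the assumption that the listing shows the newest version first are illustrative only and do not reflect how the htrmopo library stores records.

```python
# Concept DOI -> version DOIs, copied from the listing above.
# Assumption: the listing shows the newest version first.
RECORDS = {
    "10.5281/zenodo.7051645": [
        "10.5281/zenodo.14585602",
        "10.5281/zenodo.14574660",
        "10.5281/zenodo.7051646",
    ],
}


def resolve(doi: str) -> str:
    """Map a concept DOI to its newest version; return version DOIs unchanged."""
    if doi in RECORDS:
        return RECORDS[doi][0]
    for versions in RECORDS.values():
        if doi in versions:
            return doi
    raise KeyError(f"unknown DOI: {doi}")


def doi_url(doi: str) -> str:
    """Build the public resolver URL for a DOI."""
    return f"https://doi.org/{doi}"


print(resolve("10.5281/zenodo.7051645"))  # concept DOI resolves to newest version
print(resolve("10.5281/zenodo.7051646"))  # version DOI stays pinned
print(doi_url("10.5281/zenodo.7051646"))
```

Pinning a version DOI rather than the concept DOI keeps a workflow reproducible even after newer versions of the model are published.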

To fetch the metadata for a single model (both v0 and v1 schema):

~> htrmopo show 10.5281/zenodo.10800223

            HTR model for documentary Latin, Old French and Spanish medieval manuscripts (11th-16th)            
┌──────────────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│ DOI              │ 10.5281/zenodo.10800223                                                                   │
│ concept DOI      │ 10.5281/zenodo.7547437                                                                    │
│ publication date │ 2024-03-14T01:47:02+00:00                                                                 │
│ model type       │ recognition                                                                               │
│ script           │ Latin                                                                                     │
│ alphabet         │ ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; = > ? @ A B C D E F G H I J K L M N │
│                  │ O P Q R S T U V W X Y Z [ \ ] ^ _ a b c d e f g h i j k l m n o p q r s t u v w x y z { | │
│                  │ } ~ ¡ £ § ª « ¬ ° ¶ º » ½ ¾ À Ä Ç È É Ë Ï Û Ü à á â ä æ ç è é ê ë ì í î ï ñ ò ó ô ö ù ú û │
│                  │ ü ÿ ā ă ē ĕ ę ī ō ŏ œ ŭ ƒ ȩ ˀ ο а е о с ᗅ – — ‘ ’ ” „ † … ⁖ ₎ 〈 〉 ✳ ꝫ                   │
│                  │ 0x9, SPACE, 0x92, 0x97, NO-BREAK SPACE, COMBINING MACRON, COMBINING LATIN SMALL LETTER A, │
│                  │ COMBINING LATIN SMALL LETTER E, COMBINING LATIN SMALL LETTER O, COMBINING LATIN SMALL     │
│                  │ LETTER U, COMBINING LATIN SMALL LETTER C, WORD JOINER, 0xf2f7                             │
│ keywords         │ Handwritten text recognition                                                              │
│                  │ Handwritten text recognition for Medieval manuscripts                                     │
│                  │ Digital Paleography                                                                       │
│ metrics          │ cer: 7.82                                                                                 │
│ license          │ MIT License                                                                               │
│ creators         │ Torres Aguilar, Sergio (https://orcid.org/0000-0002-1801-3147) (University of Luxembourg) │
│                  │ Jolivet, Vincent (École nationale des chartes)                                            │
│                  │ Sergio Torres Aguilar (University of Luxembourg)                                          │
│ description      │ The model was trained on diplomatic transcriptions of documentary manuscripts from the    │
│                  │ Late-medieval period (12-15th) and early modernity (16th). The training and evaluation    │
│                  │ sets entail 215k lines and 2.4M of tokens using open source corpora.                      │
│                  │                                                                                           │
└──────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┘

Downloading a single model:

~> htrmopo get 10.5281/zenodo.7547437 
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Model name: /home/mittagessen/.local/share/htrmopo/0ac39ba5-8f85-5ea1-913a-f84a13ca756f

By default, models are placed in reproducible locations in the application state directory. The path is printed once the download has finished. The -o option overrides the location:

~> htrmopo get -o manu 10.5281/zenodo.7547437
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Model name: /home/mittagessen/manu

Publishing models

There are two modes of publishing ATR models with the htrmopo command. The first creates a new stand-alone deposit, while the second creates a new version of an existing record. All versions of a record are grouped under the same concept DOI. A model deposit is usually updated when a prior model has been retrained with additional training data, the metadata has been refined, or additional evaluation has been done.

The calls for both modes are very similar. The only difference is the -d option, which gives the DOI of an existing model deposit in the repository:

~> htrmopo publish -i model_card.md -a ${ACCESS_TOKEN} model_dir
Uploading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
model PID: 10.5072/zenodo.146629

~> htrmopo publish -d 10.5072/zenodo.146502 -i model_card.md -a ${ACCESS_TOKEN} model_dir
Uploading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
model PID: 10.5072/zenodo.146627
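When publishing is scripted, the two calls above can be assembled by one helper that adds -d only when an existing deposit is being updated. This is a sketch and not part of the htrmopo library: the function name `publish_command` and its parameters are hypothetical, and only the -i, -a and -d options are taken from the examples above.

```python
import shlex
from typing import List, Optional


def publish_command(model_dir: str,
                    model_card: str,
                    access_token: str,
                    previous_doi: Optional[str] = None) -> List[str]:
    """Build the argument list for `htrmopo publish`.

    Without previous_doi a new stand-alone deposit is requested; with it,
    a new version of the existing deposit is requested via -d.
    """
    cmd = ["htrmopo", "publish"]
    if previous_doi is not None:
        cmd += ["-d", previous_doi]
    cmd += ["-i", model_card, "-a", access_token, model_dir]
    return cmd


new_deposit = publish_command("model_dir", "model_card.md", "TOKEN")
new_version = publish_command("model_dir", "model_card.md", "TOKEN",
                              previous_doi="10.5072/zenodo.146502")
print(shlex.join(new_deposit))
print(shlex.join(new_version))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where htrmopo is installed. Passing the arguments as a list avoids shell quoting problems with the access token.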

Configuration

The tool is intended to work out of the box, but for testing purposes it can be useful to point it at another InvenioRDM instance, such as the Zenodo sandbox, so that the main repository is not polluted with spurious deposits.

You can set the OAI-PMH API endpoint (required for querying) and the InvenioRDM endpoint (needed for querying and publishing) with the MODEL_REPO_OAI_URL and MODEL_REPO_URL environment variables, for example:

MODEL_REPO_URL=https://sandbox.zenodo.org/api/ htrmopo publish -i model_card.md -a ....

will upload a model to the sandbox instance of Zenodo.
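The same override can be prepared from Python when the tool is driven by a script. The sketch below only builds the environment mapping. The helper name `sandbox_env` is hypothetical, the sandbox URL is the one from the example above, and the OAI-PMH endpoint is left as a parameter because its value for the sandbox is not given here.

```python
import os
from typing import Dict, Mapping, Optional

# URL taken from the example above.
SANDBOX_URL = "https://sandbox.zenodo.org/api/"


def sandbox_env(base_env: Optional[Mapping[str, str]] = None,
                oai_url: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of the environment that points htrmopo at the sandbox."""
    env = dict(os.environ if base_env is None else base_env)
    env["MODEL_REPO_URL"] = SANDBOX_URL
    # Only set the OAI-PMH endpoint when the caller supplies one.
    if oai_url is not None:
        env["MODEL_REPO_OAI_URL"] = oai_url
    return env


env = sandbox_env({"PATH": "/usr/bin"})
print(env["MODEL_REPO_URL"])
```

The returned mapping can be passed as the `env` argument of `subprocess.run` so that the override applies to a single invocation without changing the surrounding shell.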
