HTRMoPo repository reference implementation

HTRMoPo

HTRMoPo is a schema and an implementation for an automatic text recognition (ATR) model repository hosted on the Zenodo research data infrastructure. It is designed to make models discoverable across a wide range of software and ATR-related tasks and to aid in model selection.

There are two versions of the schema: v0 and v1. v0 is the legacy kraken model schema for the Zenodo repository. It is fairly limited, in particular because it does not support non-recognition models and offers few ways of incorporating model cards. v1 is intended for all kinds of machine learning models involved in ATR, independent of the software used.

Schema

v0

v0 is retained mostly for historical interest. Records in v0 format consist of a JSON metadata file and at most a single model file that is referenced in it.

v1

Repository records following the v1 schema consist of a Markdown model card with YAML metadata front matter and an arbitrary number of files in the record. There is an example model card inspired by the Hugging Face example template, but in principle model cards are free-form. The front matter can be validated against a JSON schema found here.
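To make the layout of a v1 model card concrete, here is a minimal sketch that separates the front matter from the free-form Markdown body. It assumes the common convention that the front matter is delimited by lines containing only `---`. The function name `split_front_matter` and the field names in the example card are illustrative and are not taken from the v1 schema. Parsing the YAML and validating it against the JSON schema would be a further step that is not shown.

```python
from typing import Tuple


def split_front_matter(card_text: str) -> Tuple[str, str]:
    """Split a model card into (front_matter, body).

    Assumes the front matter is the first thing in the file and is
    delimited by lines that contain only '---'.
    """
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("model card does not start with a '---' delimiter")
    for idx in range(1, len(lines)):
        if lines[idx].strip() == "---":
            front_matter = "\n".join(lines[1:idx])
            body = "\n".join(lines[idx + 1:])
            return front_matter, body
    raise ValueError("front matter is not closed by a second '---' line")


# Hypothetical card: the field names are illustrative, not schema-accurate.
EXAMPLE_CARD = """---
summary: Example recognition model
license: MIT
---
# Example model

Free-form description of the model.
"""

front_matter, body = split_front_matter(EXAMPLE_CARD)
print(front_matter)
```

The returned front matter string can then be handed to a YAML parser and a JSON schema validator of your choice.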

How does it work?

Install the Python library and prepare a model card for your ATR model, whether it performs segmentation, recognition, reading order determination, postcorrection, or another task. Afterwards, create an account on Zenodo and an API access token as described here.

With the HTRMoPo reference implementation and the access token you can then create model deposits on Zenodo. Deposits will be immediately accessible to the whole world but won't be discoverable until the community inclusion request is manually approved by one of the repository administrators.

Using a research data infrastructure like Zenodo ensures long-term accessibility of the deposited models while also enabling good scientific practices such as reproducibility and crediting contributions.

Deposits and Identifiers

Each model in the repository consists of a model card with metadata and one or more model files, and is identified by two persistent, unique DOIs. The first refers to the deposit itself, that is, a single version of the model; the second is the concept DOI. An example is 10.5281/zenodo.7051646 with concept DOI 10.5281/zenodo.7051645. When a new version of a model is uploaded to the repository, it receives a new DOI, for example 10.5281/zenodo.14585602 for the model above, while the concept DOI remains the same. The concept DOI therefore aggregates all versions of a model under a single identifier and always links to the latest one.
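The relationship between the two kinds of DOI can be sketched as a small mapping, using only the identifiers from the example above. The helper `doi_to_url` is a hypothetical name; it relies on the general fact that any DOI can be resolved by prefixing it with `https://doi.org/`.

```python
from typing import Dict, List

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a bare DOI into a URL for the public DOI resolver."""
    return DOI_RESOLVER + doi.strip()


# Concept DOI -> version DOIs of the example model, oldest first.
VERSIONS: Dict[str, List[str]] = {
    "10.5281/zenodo.7051645": [
        "10.5281/zenodo.7051646",
        "10.5281/zenodo.14585602",
    ],
}

concept_doi = "10.5281/zenodo.7051645"
# Cite a concrete version when results must be reproducible ...
pinned_url = doi_to_url(VERSIONS[concept_doi][0])
# ... and the concept DOI when "whatever is latest" is what you want.
latest_url = doi_to_url(concept_doi)
print(pinned_url)
print(latest_url)
```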

Python Library

A reference implementation for interacting with the repository on Zenodo is in the htrmopo directory, which contains both a Python library and command line drivers.

The library can be installed using pip:

~> pip install htrmopo

CLI

The htrmopo command line tool is used to query the repository, download existing models, and upload new items or update existing ones.

Querying the repository

To get a listing of all models:

~> htrmopo list
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ DOI                         ┃ summary                        ┃ model type   ┃ keywords                       ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 10.5281/zenodo.7051645      │                                │              │                                │
│ ├── 10.5281/zenodo.14585602 │ Printed Urdu Base Model        │ recognition  │ automatic-text-recognition     │
│ ├── 10.5281/zenodo.14574660 │ Printed Urdu Base Model        │ recognition  │ kraken_pytorch                 │
│ └── 10.5281/zenodo.7051646  │ Printed Urdu Base Model        │ recognition  │ kraken_pytorch                 │
│                             │                                │              │                                │
│                             │                                │              │                                │
│                             │                                │              │                                │
│ 10.5281/zenodo.10066218     │                                │              │                                │
│ ├── 10.5281/zenodo.12743230 │ CATMuS Medieval 1.5.0          │ recognition  │ kraken_pytorch; handwritten    │
│ └── 10.5281/zenodo.10066219 │ CATMuS Medieval                │ recognition  │ text recognition; htr; middle  │
│                             │                                │              │ ages                           │
│                             │                                │              │ kraken_pytorch; handwritten    │
│                             │                                │              │ text recognition; htr; middle  │
│                             │                                │              │ ages                           │
│ 10.5281/zenodo.13788176     │                                │              │                                │
│ └── 10.5281/zenodo.13788177 │ McCATMuS - Transcription model │ recognition  │ kraken_pytorch; HTR; OCR;      │
│                             │ for handwritten, printed and   │              │ generic model                  │
│                             │ typewritten documents from the │              │                                │
│                             │ 16th century to the 21st       │              │                                │
│                             │ century                        │              │                                │
│ 10.5281/zenodo.14602568     │                                │              │                                │
│ └── 10.5281/zenodo.14602569 │ General segmentation model for │ segmentation │ multiscriptal                  │
│                             │ print and handwriting          │              │                                │
│ 10.5281/zenodo.5468572      │                                │              │                                │
│ └── 10.5281/zenodo.5468573  │ Medieval Hebrew manuscripts in │ recognition  │ kraken_pytorch                  
...

Records are represented as a tree structure in the left-most column. The DOI at the root of each tree is a concept DOI, which always links to the most recent version of a model. The leaves of the tree are particular versions of the record, ordered chronologically. Either type of DOI is accepted as an argument by the commands below, although it is recommended to reference a concrete version in contexts where reproducibility is desired.

To fetch the metadata for a single model (both v0 and v1 schema):

~> htrmopo show 10.5281/zenodo.10800223

            HTR model for documentary Latin, Old French and Spanish medieval manuscripts (11th-16th)            
┌──────────────────┬───────────────────────────────────────────────────────────────────────────────────────────┐
│ DOI              │ 10.5281/zenodo.10800223                                                                   │
│ concept DOI      │ 10.5281/zenodo.7547437                                                                    │
│ publication date │ 2024-03-14T01:47:02+00:00                                                                 │
│ model type       │ recognition                                                                               │
│ script           │ Latin                                                                                     │
│ alphabet         │ ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; = > ? @ A B C D E F G H I J K L M N │
│                  │ O P Q R S T U V W X Y Z [ \ ] ^ _ a b c d e f g h i j k l m n o p q r s t u v w x y z { | │
│                  │ } ~ ¡ £ § ª « ¬ ° ¶ º » ½ ¾ À Ä Ç È É Ë Ï Û Ü à á â ä æ ç è é ê ë ì í î ï ñ ò ó ô ö ù ú û │
│                  │ ü ÿ ā ă ē ĕ ę ī ō ŏ œ ŭ ƒ ȩ ˀ ο а е о с ᗅ – — ‘ ’ ” „ † … ⁖ ₎ 〈 〉 ✳ ꝫ                   │
│                  │ 0x9, SPACE, 0x92, 0x97, NO-BREAK SPACE, COMBINING MACRON, COMBINING LATIN SMALL LETTER A, │
│                  │ COMBINING LATIN SMALL LETTER E, COMBINING LATIN SMALL LETTER O, COMBINING LATIN SMALL     │
│                  │ LETTER U, COMBINING LATIN SMALL LETTER C, WORD JOINER, 0xf2f7                             │
│ keywords         │ Handwritten text recognition                                                              │
│                  │ Handwritten text recognition for Medieval manuscripts                                     │
│                  │ Digital Paleography                                                                       │
│ metrics          │ cer: 7.82                                                                                 │
│ license          │ MIT License                                                                               │
│ creators         │ Torres Aguilar, Sergio (https://orcid.org/0000-0002-1801-3147) (University of Luxembourg) │
│                  │ Jolivet, Vincent (École nationale des chartes)                                            │
│                  │ Sergio Torres Aguilar (University of Luxembourg)                                          │
│ description      │ The model was trained on diplomatic transcriptions of documentary manuscripts from the    │
│                  │ Late-medieval period (12-15th) and early modernity (16th). The training and evaluation    │
│                  │ sets entail 215k lines and 2.4M of tokens using open source corpora.                      │
│                  │                                                                                           │
└──────────────────┴───────────────────────────────────────────────────────────────────────────────────────────┘

Downloading a single model:

~> htrmopo get 10.5281/zenodo.7547437 
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Model name: /home/mittagessen/.local/share/htrmopo/0ac39ba5-8f85-5ea1-913a-f84a13ca756f

By default, models are placed in reproducible locations in the application state directory; the location is printed once the download has finished. The -o option changes the output location:

~> htrmopo get -o manu 10.5281/zenodo.7547437
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
Model name: /home/mittagessen/manu
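When the download is scripted, the location can be recovered from the command output. The following is a minimal sketch that assumes the `Model name: ` line keeps the format shown in the transcripts above, which may change between versions. The function name `parse_model_path` is hypothetical and not part of htrmopo.

```python
from pathlib import Path
from typing import Optional

MODEL_NAME_PREFIX = "Model name: "


def parse_model_path(cli_output: str) -> Optional[Path]:
    """Return the path printed after 'Model name: ', or None if absent."""
    for line in cli_output.splitlines():
        if line.startswith(MODEL_NAME_PREFIX):
            return Path(line[len(MODEL_NAME_PREFIX):].strip())
    return None


# Output copied from the `htrmopo get -o manu ...` transcript above.
sample_output = (
    "Processing 100% 0:00:00\n"
    "Model name: /home/mittagessen/manu\n"
)
model_path = parse_model_path(sample_output)
print(model_path)
```

In a real script the text would come from something like `subprocess.run(["htrmopo", "get", doi], capture_output=True, text=True).stdout`.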

Publishing models

There are two modes of publishing ATR models with the htrmopo command. The first creates a new stand-alone deposit, while the second creates a new version of an existing record; all versions of a record are grouped under the same concept DOI. A model deposit is usually updated when a prior model has been retrained with additional training data, the metadata has been refined, or additional evaluation has been done.

The calls for both modes are very similar. The only difference is the -d option, which gives the DOI of an existing model deposit in the repository:

~> htrmopo publish -i model_card.md -a ${ACCESS_TOKEN} model_dir
Uploading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
model PID: 10.5072/zenodo.146629

~> htrmopo publish -d 10.5072/zenodo.146502 -i model_card.md -a ${ACCESS_TOKEN} model_dir
Uploading ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
model PID: 10.5072/zenodo.146627
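The difference between the two calls can be sketched as a small helper that assembles the argument list, using only the options shown in the transcripts above (`-i`, `-a`, `-d`). The function name `build_publish_command` is hypothetical. Reading the token from an `ACCESS_TOKEN` environment variable mirrors the `${ACCESS_TOKEN}` placeholder above and is an assumption about how you store it.

```python
import os
from typing import List, Optional


def build_publish_command(
    model_dir: str,
    model_card: str,
    access_token: str,
    existing_doi: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for `htrmopo publish`.

    Passing `existing_doi` adds -d, which publishes a new version of that
    deposit instead of creating a stand-alone one.
    """
    cmd = ["htrmopo", "publish"]
    if existing_doi is not None:
        cmd += ["-d", existing_doi]
    cmd += ["-i", model_card, "-a", access_token, model_dir]
    return cmd


# Placeholder token when the variable is not set; never hard-code a real one.
token = os.environ.get("ACCESS_TOKEN", "<token>")
new_deposit = build_publish_command("model_dir", "model_card.md", token)
new_version = build_publish_command(
    "model_dir", "model_card.md", token, existing_doi="10.5072/zenodo.146502"
)
print(new_deposit)
print(new_version)
```

Either list can be passed to `subprocess.run(cmd, check=True)` to perform the upload.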

Configuration

The tool is intended to work out of the box, but for testing purposes it can be useful to point it at another InvenioRDM instance, such as the Zenodo sandbox, so that the main repository is not polluted with spurious deposits.

You can set the OAI-PMH API endpoint (required for querying) and the InvenioRDM endpoint (needed for querying and publishing) with the MODEL_REPO_OAI_URL and MODEL_REPO_URL environment variables, for example:

MODEL_REPO_URL=https://sandbox.zenodo.org/api/ htrmopo publish -i model_card.md -a ....

will upload a model to the sandbox instance of Zenodo.
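The same override can be applied from a Python script by passing a modified environment to the subprocess. This is a minimal sketch: the helper name `sandbox_env` is hypothetical, and only MODEL_REPO_URL is set because the sandbox value for MODEL_REPO_OAI_URL is not given here.

```python
import os
from typing import Dict, Mapping, Optional

# Sandbox endpoint taken from the example above.
SANDBOX_API_URL = "https://sandbox.zenodo.org/api/"


def sandbox_env(base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment that targets the Zenodo sandbox."""
    env = dict(os.environ if base_env is None else base_env)
    env["MODEL_REPO_URL"] = SANDBOX_API_URL
    return env


env = sandbox_env({"PATH": "/usr/bin"})
print(env["MODEL_REPO_URL"])
```

A call such as `subprocess.run(cmd, env=sandbox_env(), check=True)` then runs the htrmopo command against the sandbox without changing the environment of the calling process.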
