CLI for preparing Handwritten Text Recognition (HTR) datasets for PyLaia training.

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

ndrscl

These details have not been verified by PyPI

Project description

HTR CLI

CLI for preparing HTR datasets

This CLI tool offers a quick way to (optionally) pull PAGE-XML data from Transkribus[^1] and to prepare a valid dataset/ directory ready to feed to PyLaia.

If you find this project useful, please leave a star and consider buying me a coffee (PayPal or GitHub Sponsors).

Installation

uv tool install htr-cli
# or
pipx install htr-cli
# or
pip install htr-cli

Verify the install:

htr-cli --help

The pull-transkribus subcommand depends on transkribus-client, which pins lxml==4.6.3 (no wheels for Python 3.11+ on most platforms). Until that pin is relaxed upstream, you need to override it at install time. uv is the only package manager that supports this cleanly:

echo "lxml>=5.0" > overrides.txt
uv tool install --override overrides.txt 'htr-cli[transkribus]'

For devs:

git clone https://github.com/ndrscalia/htr-cli.git
cd htr-cli
uv sync --all-extras # override already wired in pyproject.toml

Run the test suite and the linter:

uv run pytest
uv run ruff check

Usage

The CLI exposes the following sub-commands (in pipeline order):

init
scaffold
pull-transkribus
port-escriptorium
data-extraction
split-dataset
process-images
process-images-tfe

init

Interactive setup. If used, it asks for your Transkribus email and password and stores them in your OS keyring under the service name htr-cli.

scaffold

Creates the directory layout the rest of the pipeline expects:

.
└── root/
    ├── dataset/
    │   ├── images/
    │   │   ├── train
    │   │   ├── val
    │   │   └── test
    │   └── PyLaia's files and more
    └── data/
        ├── images
        └── xml_texts
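The layout above can be reproduced with a few lines of Python. This is only a sketch of what scaffold produces, not the CLI's actual implementation; the helper name `scaffold` and the `SUBDIRS` constant are hypothetical, and the extra PyLaia files under dataset/ are left out.

```python
from pathlib import Path

# Hypothetical list of the directories shown in the tree above.
SUBDIRS = [
    "dataset/images/train",
    "dataset/images/val",
    "dataset/images/test",
    "data/images",
    "data/xml_texts",
]


def scaffold(root: Path) -> list[Path]:
    """Create the expected directory tree under root and return the paths."""
    created = []
    for sub in SUBDIRS:
        path = root / sub
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `scaffold(Path("root"))` twice is harmless, since existing directories are left untouched.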

pull-transkribus

Downloads ground-truth (GT) pages from Transkribus (default). For every collection, it walks documents and pages, filtering by --page-status, and writes each page's image to data/images/ and its PAGE-XML counterpart to data/xml_texts/. Naming pattern: {collection}_{docId}_{pageId}_{imageId}_{filename}.{jpg,xml}. This subcommand relies on transkribus-client, which in turn relies on a legacy API that might be discontinued soon. Install with the optional [transkribus] extra (see the installation section above).
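To make the naming pattern concrete, here is a small sketch that builds the two output paths for one page. The function name `page_paths` and the sample values are hypothetical, and it assumes that {filename} is the original file name without its extension.

```python
from pathlib import Path


def page_paths(data_dir, collection, doc_id, page_id, image_id, filename):
    """Return (image_path, xml_path) following the documented naming pattern."""
    stem = f"{collection}_{doc_id}_{page_id}_{image_id}_{filename}"
    image_path = Path(data_dir) / "images" / f"{stem}.jpg"
    xml_path = Path(data_dir) / "xml_texts" / f"{stem}.xml"
    return image_path, xml_path


# Made-up identifiers, for illustration only.
image_path, xml_path = page_paths("data", 1234, 56, 789, 1011, "page_0001")
```

Both files share the same stem, so an image and its PAGE-XML can be paired by name later in the pipeline.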

port-escriptorium

Normalizes eScriptorium-style PAGE-XML into the Transkribus convention for which data-extraction was written:

lines get nested in the typed TextRegion;
readingOrder is injected on regions and lines;
already compliant files are skipped;
the anchor for assignment and sorting is the baseline midpoint (with the polygon's centroid as fallback), because a centroid can land just outside a region's bbox even if the line belongs to it.

Warning: files are modified in place. If needed, prepare a backup first.
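The anchor choice described for port-escriptorium can be sketched as follows. This is an illustration under assumptions, not the tool's code: "baseline midpoint" is taken here to mean the point halfway along the baseline polyline, the fallback is a plain average of the polygon's vertices, and all function names are hypothetical.

```python
import math

Point = tuple[float, float]


def baseline_midpoint(baseline: list[Point]) -> Point:
    """Point halfway along the baseline polyline (assumed definition)."""
    segments = list(zip(baseline, baseline[1:]))
    lengths = [math.dist(a, b) for a, b in segments]
    half = sum(lengths) / 2
    walked = 0.0
    for (a, b), length in zip(segments, lengths):
        if length > 0 and walked + length >= half:
            t = (half - walked) / length  # fraction of this segment to travel
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        walked += length
    return baseline[-1]  # single point or zero-length baseline


def vertex_centroid(polygon: list[Point]) -> Point:
    """Average of the polygon's vertices, used only as a fallback."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def line_anchor(baseline: list[Point], polygon: list[Point]) -> Point:
    """Prefer the baseline midpoint; fall back to the centroid if no baseline."""
    return baseline_midpoint(baseline) if baseline else vertex_centroid(polygon)


def in_bbox(point: Point, bbox: tuple[float, float, float, float]) -> bool:
    """True if point lies inside bbox given as (x_min, y_min, x_max, y_max)."""
    x, y = point
    x_min, y_min, x_max, y_max = bbox
    return x_min <= x <= x_max and y_min <= y <= y_max
```

A line would then be assigned to the region whose bounding box contains `line_anchor(...)`, and lines within a region could be sorted by the same anchor.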

data-extraction

Parses every XML found in data/xml_texts/ and emits the intermediate dataset files:

polygons_coordinates.json: per-line polygon coordinates for later cropping.
lines.csv: one row per line: page, region id, region order, reading order, line id, raw text, "unclear" flag.
dataset/syms.txt: PyLaia's CTC symbol vocabulary.
dataset/{tokens,lexicon_characters,dictionary}.txt: different files for KenLM model creation and further post-processing (coming soon).

An optional positional argument filters by region type, and --unclear-name sets the name of the custom tag that marks a line as unclear. E.g.:

htr-cli data-extraction paragraph --unclear-name "unclear"

NBSP characters are normalized to plain spaces.
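As a rough illustration of the NBSP normalization and of how a character vocabulary can be gathered from the transcriptions, consider the sketch below. The function names are hypothetical, and the exact layout of PyLaia's syms.txt (special tokens, index column) is deliberately not reproduced here.

```python
NBSP = "\u00a0"  # non-breaking space


def normalize_text(text: str) -> str:
    """Replace every NBSP with a plain space."""
    return text.replace(NBSP, " ")


def collect_symbols(lines: list[str]) -> list[str]:
    """Sorted list of the distinct characters found after normalization."""
    return sorted({ch for line in lines for ch in normalize_text(line)})
```

Normalizing before collecting symbols keeps NBSP and the plain space from ending up as two separate entries in the vocabulary.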

split-dataset

Reads lines.csv and partitions by page (not line) using a deterministic random_state. The following options are available:

--val-size: validation set size (defaults to 10%);
--test-size: test set size (defaults to 0%);
--omit-unclear / -u: omit lines where an "unclear" tag appears.

If you only want a validation set, omit --test-size.
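Splitting by page rather than by line keeps all lines of a page in the same subset. A minimal sketch of that idea, assuming a mapping from line id to page name and a fixed seed, is shown below; the function name, the seed value 42, and the rounding rule are assumptions, not the command's actual behavior.

```python
import random


def split_pages(line_pages, val_size=0.10, test_size=0.0, random_state=42):
    """Partition line ids into train/val/test so that no page is shared.

    line_pages maps each line id to the name of the page it belongs to.
    """
    pages = sorted(set(line_pages.values()))
    # A seeded generator makes the shuffle reproducible across runs.
    random.Random(random_state).shuffle(pages)
    n_val = round(len(pages) * val_size)
    n_test = round(len(pages) * test_size)
    val_pages = set(pages[:n_val])
    test_pages = set(pages[n_val:n_val + n_test])

    split = {"train": [], "val": [], "test": []}
    for line_id, page in line_pages.items():
        if page in val_pages:
            split["val"].append(line_id)
        elif page in test_pages:
            split["test"].append(line_id)
        else:
            split["train"].append(line_id)
    return split
```

Because the generator is seeded, calling the function twice with the same input returns the same partition.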

To supply your own page-level split instead of the random one, pass --custom-train / --custom-val / --custom-test, each pointing at a text file with one id per line. Files accept either bare page names or full line ids, so the dataset/{train,val,test}_ids.txt files this command writes can be fed back in to reproduce a previous split. --custom-test is optional (omit it for a train+val split). When any custom flag is set, --val-size and --test-size are ignored.

This subcommand writes the following files:

dataset/{train,val,test}_ids.txt
dataset/{train,val,test}.txt
dataset/{train,val,test}_text.txt
dataset/corpus_characters.txt (corpus for char KenLM training)
missing_reading_order.csv (lines dropped if no reading order available)

process-images

The pure-Python preprocessing pipeline. For each entry in polygons_coordinates.json: loads the source image, masks out everything outside the polygon, crops to the polygon's bounding box, converts to grayscale, then runs the selected pipeline:

--full-pipeline (default): contrast stretch - modified Sauvola - deslope - deslant (TFE port) - moment-normalize.
--light-pipeline: deslope - deslant (vendored DeslantImg) - modified Sauvola.

Each stage is individually toggleable (--no-contrast-stretch, --no-enhance-sauvola, etc.). The final image is resized to norm_height (positional arg, default 64 px) preserving aspect ratio, padded 10 px left/right with white, and written to dataset/images/{train,val,test}/ based on which split the line belongs to.
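The geometry of the crop and of the final resize can be worked out without any imaging library. The sketch below only computes sizes, under the assumption that the width is rounded to the nearest pixel; the function names are hypothetical and the actual pixel operations (masking, binarization, deslanting) are not shown.

```python
def polygon_bbox(polygon):
    """Bounding box (x_min, y_min, x_max, y_max) of a list of (x, y) points."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def output_size(width, height, norm_height=64, pad=10):
    """Final (width, height) after aspect-preserving resize and side padding."""
    scale = norm_height / height
    new_width = max(1, round(width * scale))  # never collapse to zero width
    return new_width + 2 * pad, norm_height


# A 100 x 40 px line crop becomes 160 x 64 after resizing, 180 x 64 after padding.
x_min, y_min, x_max, y_max = polygon_bbox([(10, 20), (110, 25), (108, 60), (12, 58)])
final_size = output_size(x_max - x_min, y_max - y_min)
```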

At the beginning of every run, dataset/images_processing_ckpt.txt is written, which lets you interrupt and resume processing. If you want to discard the progress made so far and start from scratch, delete that file.
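The resume behavior can be pictured with a simple checkpoint loop. This sketch assumes the checkpoint file holds one processed line id per row, which may not match the real file format; `load_done`, `process_all`, and `process_one` are hypothetical names.

```python
from pathlib import Path


def load_done(ckpt: Path) -> set[str]:
    """Read the ids already processed; an absent file means a fresh start."""
    if not ckpt.exists():
        return set()
    rows = ckpt.read_text(encoding="utf-8").splitlines()
    return {row.strip() for row in rows if row.strip()}


def process_all(line_ids, ckpt: Path, process_one) -> list[str]:
    """Process every id not yet in the checkpoint and record it once done."""
    done = load_done(ckpt)
    processed = []
    with ckpt.open("a", encoding="utf-8") as fh:
        for line_id in line_ids:
            if line_id in done:
                continue  # finished in a previous run
            process_one(line_id)
            fh.write(line_id + "\n")
            fh.flush()  # keep the checkpoint current if the run is interrupted
            processed.append(line_id)
    return processed
```

Deleting the checkpoint file makes `load_done` return an empty set, so the next run starts over.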

process-images-tfe

Same as process-images, but uses the TextFeatExtractor C++ library, with default parameters based on Transkribus' params:

tfe = TextFeatExtractor(
    stretch=True,
    enh=True,
    enh_win=30,
    enh_prm=0.1,
    enh3_prm0=0,
    enh3_prm2=0,
    deslope=True,
    deslant=True,
    normxheight=0,
    normheight=64,
    momentnorm=True,
    fcontour_dilate=0,
    padding=10,
    maxwidth=6000,
)

This subcommand requires the textfeat and pagexml Python bindings, which are not on PyPI and only run on Linux (see the Dockerfile.tfe section below).

Use this if you want the C++ behavior, and process-images when you want a pure-Python install that works on macOS or outside Docker in general.

I have not extensively compared its results against my simpler implementation, but the two commands take more or less the same time to run and both mishandle some images.

To get further help:

htr-cli --help
htr-cli COMMAND --help

Dockerfile.tfe

This is needed to run htr-cli process-images-tfe if you are not on Linux.

Build the docker image (one-time, from the repo root):

docker build -t textfeatextractor .

To run the docker image that allows you to use TextFeatExtractor, mount the entire project directory into the container so the htr-cli package, data/, and dataset/ are all visible inside:

docker run -it -v "$(pwd)":/workspace -w /workspace textfeatextractor bash

-v "$(pwd)":/workspace bind-mounts the repo root at /workspace. Use an absolute path (e.g. /Users/andreascalia/code/my_project) instead of $(pwd) if you're invoking the command from outside the project.
-w /workspace sets the working directory so relative paths like data/, dataset/, and polygons_coordinates.json resolve the way the CLI expects.

The Dockerfile only installs the C++ deps (pagexml, textfeat); install the CLI as explained at the beginning of these docs.

Future Updates

Further post-processing options to get better CER and WER.
A standard config file to make PyLaia easier to use.
ALTO XML format support.
Kraken support.

[^1]: This feature relies on a legacy API. It might stop working in the near future.

Release history

0.3.0 (this version): Jun 8, 2026
0.2.0: Jun 5, 2026
0.1.2: May 31, 2026
0.1.1 (yanked): May 31, 2026
0.1.0 (yanked): May 30, 2026

File details

Details for the file htr_cli-0.3.0.tar.gz.

File metadata

Download URL: htr_cli-0.3.0.tar.gz
Upload date: Jun 8, 2026
Size: 117.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for htr_cli-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`1506e6ce0183078333a037d71fe8a07d203658f1385f50b496f55d2ec339a9da`
MD5	`af4bdbe0d46c04ee0a657a4735c13cc2`
BLAKE2b-256	`21983d2f901a323f82959dcaa3597cccd898a7463afda67bd60ddcb148a1a10e`

Provenance

The following attestation bundles were made for htr_cli-0.3.0.tar.gz:

Publisher: publish.yml on ndrscalia/htr-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: htr_cli-0.3.0.tar.gz
- Subject digest: 1506e6ce0183078333a037d71fe8a07d203658f1385f50b496f55d2ec339a9da
- Sigstore transparency entry: 1758184768
- Sigstore integration time: Jun 8, 2026
Source repository:
- Permalink: ndrscalia/htr-cli@73479bec300c4425419a039707d3a0ca4021ad15
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/ndrscalia
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@73479bec300c4425419a039707d3a0ca4021ad15
- Trigger Event: release

File details

Details for the file htr_cli-0.3.0-py3-none-any.whl.

File metadata

Download URL: htr_cli-0.3.0-py3-none-any.whl
Upload date: Jun 8, 2026
Size: 27.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for htr_cli-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d093bd21c0e159c3642a3c504e964edf2f46f40dc6c41b319924e94fdbf0463d`
MD5	`d5c0ab6d8e23cddcd05954acf3f91f6f`
BLAKE2b-256	`ade1a02e83fcc80b08e5eb11f3af060f88dec7d3cd509f29a9a17e8c04a80ee9`

Provenance

The following attestation bundles were made for htr_cli-0.3.0-py3-none-any.whl:

Publisher: publish.yml on ndrscalia/htr-cli

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: htr_cli-0.3.0-py3-none-any.whl
- Subject digest: d093bd21c0e159c3642a3c504e964edf2f46f40dc6c41b319924e94fdbf0463d
- Sigstore transparency entry: 1758184914
- Sigstore integration time: Jun 8, 2026
Source repository:
- Permalink: ndrscalia/htr-cli@73479bec300c4425419a039707d3a0ca4021ad15
- Branch / Tag: refs/tags/v0.3.0
- Owner: https://github.com/ndrscalia
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@73479bec300c4425419a039707d3a0ca4021ad15
- Trigger Event: release
