HTRVX, HTR Validation with XSD

HTRVX : HTR Validation for eXtra-quality controlled documents

HTRVX - pronounced Ashterux - allows for quality control of XML using XSD schema validation, Segmonto validation and other verifications.

How to install

Simply run pip install htrvx

How to run

The basic way to run the script is htrvx PATHTOFILES --format FORMAT, e.g. htrvx ./tests/test_data/page/*.xml --format page

Every verification is opt-in: you must explicitly request each check you want to run.

  • --segmonto will check for Segmonto compliance
    • You can use your own vocabulary or a restricted Segmonto vocabulary by using --zone ZONENAME and --line LINENAME, such as htrvx [...] --line DefaultLine --line HeadingLine --zone MainZone
    • You can use --allow-untagged with line, zone or both so that lines or zones without a type are allowed. To limit the number of such lines or zones, combine it with --max-untagged-zones N or --max-untagged-lines N, where N is the number of allowed occurrences.
  • --xsd will check if the data are compliant with XML Schemas
  • --check-empty will check if regions have no lines or if lines have no text
    • --check-empty can be refined with --raise-empty to throw an error if empty elements are found; otherwise they are simply reported.
  • --check-image checks the image link in the XML. Links are checked relative to the XML file, i.e. if the XML file ./data/element.xml points to file.jpeg, the file ./data/file.jpeg is expected to exist.
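The relative-link rule used by --check-image can be illustrated with a short Python sketch. This is not HTRVX's own code: the function names resolve_image_path and image_exists are hypothetical, and the sketch only assumes that the image reference is a path relative to the directory containing the XML file, as described above.

```python
from pathlib import Path


def resolve_image_path(xml_path: str, image_ref: str) -> Path:
    """Resolve an image reference against the directory of the XML file.

    Hypothetical helper: it mirrors the rule described above, not the
    actual HTRVX implementation.
    """
    return Path(xml_path).parent / image_ref


def image_exists(xml_path: str, image_ref: str) -> bool:
    """Return True if the referenced image file exists next to the XML file."""
    return resolve_image_path(xml_path, image_ref).is_file()
```

With this rule, resolve_image_path("./data/element.xml", "file.jpeg") gives the path data/file.jpeg, which is the file expected to exist.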

Other parameters mainly control verbosity: --verbose displays details about errors, and --group groups errors by type instead of showing one line per error.
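To give an idea of what grouping by error type means, here is a minimal Python sketch. It is only an illustration under assumptions: the function group_errors, the (file, error type) pairs and the error type names are all hypothetical and do not reflect HTRVX's internal data structures or its actual log output.

```python
from collections import Counter
from typing import Iterable, Tuple


def group_errors(errors: Iterable[Tuple[str, str]]) -> Counter:
    """Count errors per error type instead of listing them one by one.

    `errors` is an iterable of (file name, error type) pairs; both the
    structure and the error type names are assumptions for this sketch.
    """
    return Counter(error_type for _file_name, error_type in errors)


# Hypothetical, ungrouped list: one entry per error
errors = [
    ("a.xml", "empty-line"),
    ("a.xml", "empty-line"),
    ("b.xml", "untagged-zone"),
]

# Grouped view: one entry per error type, with its number of occurrences
for error_type, count in group_errors(errors).items():
    print(f"{error_type}: {count}")
```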

| Parameters | Default | Function |
| --- | --- | --- |
| -v, --verbose | False | Prints more information |
| -f, --format [alto,page] | alto | Format of files |
| -s, --segmonto | False | Apply Segmonto zoning verification |
| -e, --check-empty | False | Check for empty lines or empty zones |
| -r, --raise-empty | False | Fail (rather than only warn) if empty lines or empty zones are found |
| -x, --xsd | False | Apply XSD Schema verification |
| -g, --group | False | Group error types (reduces verbosity) |
| -i, --check-image | False | Check if the image link in the XML points to the right path |
| -l, --verbose-level | zen | Level of detail and amount of color shown in the logs (see below) |
| --zone TEXT | None | Provide a custom zone to control zone types instead of Segmonto |
| --line TEXT | None | Provide a custom line to control line types instead of Segmonto |
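If you drive HTRVX from a Python script rather than from a shell, the options above can be assembled into an argument list. The sketch below is an assumption-laden illustration: build_htrvx_command is a hypothetical helper, not part of HTRVX, and it only uses flags listed in the table above.

```python
from typing import List, Sequence


def build_htrvx_command(
    files: Sequence[str],
    fmt: str = "alto",
    verbose: bool = False,
    group: bool = False,
    segmonto: bool = False,
    xsd: bool = False,
    check_empty: bool = False,
    raise_empty: bool = False,
    check_image: bool = False,
) -> List[str]:
    """Build an htrvx argument list (hypothetical helper, not an HTRVX API)."""
    command = ["htrvx", *files, "--format", fmt]
    # Every check is opt-in, so a flag is only added when explicitly requested
    optional_flags = [
        (verbose, "--verbose"),
        (group, "--group"),
        (segmonto, "--segmonto"),
        (xsd, "--xsd"),
        (check_empty, "--check-empty"),
        (raise_empty, "--raise-empty"),
        (check_image, "--check-image"),
    ]
    command.extend(flag for enabled, flag in optional_flags if enabled)
    return command
```

The resulting list can be passed to subprocess.run. Note that, without a shell, patterns such as *.xml are not expanded, so the file list should be built beforehand, for instance with glob.glob.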

Verbosity levels

  • minimal: shows only failing tests, no details.
  • low: shows only failing tests and their details, such as which lines fail in a file.
  • zen (default): shows all tests and their details, but displays only one color (red for errors).
  • all: shows everything.

Github Action code

If you want to add this to your GitHub repository as a continuous integration workflow, add a file htrvx.yml in the path .github/workflows of your repository.

# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: HTRVX

on: [push, pull_request] # You can edit this of course !

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: 3.8
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install htrvx
    - name: Run HTRVX
      run: |
        htrvx --verbose --group --format alto --segmonto --xsd --check-empty --raise-empty UNIX/Path/to/**/your/*.xml

Logo by Alix Chagué.
