HTRVX, HTR Validation with XSD
HTRVX : HTR Validation for eXtra-quality controlled documents
HTRVX - pronounced Ashterux - allows for quality control of XML using XSD schema validation, Segmonto validation and other verifications.
How to install
Simply run pip install htrvx
How to run
The basic way to run the script is htrvx PATHTOFILES --format FORMAT, e.g. htrvx ./tests/test_data/page/*.xml --format page
Every verification is opt-in: a check only runs if you explicitly pass its flag.
- --segmonto will check for Segmonto compliance.
  - You can use your own vocabulary or a restricted Segmonto vocabulary by using --zone ZONENAME and --line LINENAME, such as htrvx [...] --line DefaultLine --line HeadingLine --zone MainZone
  - You can use --allow-untagged with either line, zone or both so that lines and/or zones without a type are allowed. If you want to limit such lines or zones, combine it with --max-untagged-zones N or --max-untagged-lines N, where N is the number of allowed occurrences.
- --xsd will check if the data are compliant with the XML Schemas.
- --check-empty will check if regions have no lines or if lines have no text.
  - --check-empty can be refined with --raise-empty to throw an error if empty elements are found; otherwise they are simply reported.
- --check-image checks the image link in the XML. Links are resolved relative to the XML file, i.e. if the XML file ./data/element.xml points to file.jpeg, the file ./data/file.jpeg is expected to exist.
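The relative-path rule behind --check-image can be illustrated with a short sketch. This is not HTRVX's own code: the function names resolve_image_link and image_exists are hypothetical, and the snippet only assumes the behaviour described above, namely that the link is resolved against the directory containing the XML file.

```python
from pathlib import Path


def resolve_image_link(xml_file: str, image_link: str) -> Path:
    """Resolve an image link against the directory of the XML file.

    Hypothetical helper written to illustrate the rule described above;
    it is not part of the HTRVX API.
    """
    return Path(xml_file).parent / image_link


def image_exists(xml_file: str, image_link: str) -> bool:
    """Return True if the linked image is a file next to the XML file."""
    return resolve_image_link(xml_file, image_link).is_file()


if __name__ == "__main__":
    # ./data/element.xml pointing to file.jpeg is looked up in ./data/
    print(resolve_image_link("./data/element.xml", "file.jpeg"))
```

In practice this means the images should be stored (or linked) alongside the XML files when the check is enabled, for instance in a continuous integration run.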
Other parameters mainly have to do with verbosity: --verbose displays details about errors, --group groups errors (instead of showing one line per error, groups by error types).
| Parameters | Default | Function |
|---|---|---|
| -v, --verbose | False | Prints more information |
| -f, --format [alto,page] | alto | Format of files |
| -s, --segmonto | False | Apply Segmonto Zoning verification |
| -e, --check-empty | False | Check for empty lines or empty zones |
| -r, --raise-empty | False | Fails instead of only warning if empty lines or empty zones are found (refines --check-empty) |
| -x, --xsd | False | Apply XSD Schema verification |
| -g, --group | False | Group error types (reduce verbosity) |
| -i, --check-image | False | Check if the image link in the XML points to the right path |
| -l, --verbose-level | zen | Level of details and amount of color shown in the logs (see below). |
| --zone TEXT | None | Provide a custom zone to control zone types instead of Segmonto |
| --line TEXT | None | Provide a custom line to control Line types instead of Segmonto |
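Since every check is opt-in, a command line can be assembled from the flags in the table above. The following is a minimal sketch, assuming only the documented options; build_htrvx_command is a hypothetical helper, not something shipped with HTRVX, and it only builds the argument list without running anything.

```python
import shlex
from typing import List, Sequence


def build_htrvx_command(
    files: Sequence[str],
    fmt: str = "alto",
    segmonto: bool = False,
    xsd: bool = False,
    check_empty: bool = False,
    raise_empty: bool = False,
    check_image: bool = False,
    verbose: bool = False,
    group: bool = False,
) -> List[str]:
    """Build an htrvx argument list; only the requested checks are added.

    Hypothetical helper for illustration, using flags from the table above.
    """
    if fmt not in ("alto", "page"):
        raise ValueError("format must be 'alto' or 'page'")
    command = ["htrvx", *files, "--format", fmt]
    # Each flag is appended only when explicitly enabled (opt-in checks).
    flags = {
        "--segmonto": segmonto,
        "--xsd": xsd,
        "--check-empty": check_empty,
        "--raise-empty": raise_empty,
        "--check-image": check_image,
        "--verbose": verbose,
        "--group": group,
    }
    command.extend(flag for flag, enabled in flags.items() if enabled)
    return command


if __name__ == "__main__":
    cmd = build_htrvx_command(
        ["./data/element.xml"], fmt="page", xsd=True, check_empty=True
    )
    # Print the command instead of running it; pass it to subprocess.run
    # once htrvx is installed.
    print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run as-is, which avoids shell quoting issues with file names.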
Verbosity levels
- minimal: shows only failing tests, no details.
- low: shows only failing tests and their details, such as which lines fail in a file.
- zen (default): shows all tests and their details, but displays only one color (red for errors).
- all: shows everything.
GitHub Action code
If you want to add this to your GitHub repository as a continuous integration workflow, add a file htrux.yml in the .github/workflows directory of your repository.
# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: HTRVX
on: [push, pull_request] # You can edit this of course !
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.8
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install htrvx
      - name: Run HTRVX
        run: |
          htrvx --verbose --group --format alto --segmonto --xsd --check-empty --raise-empty UNIX/Path/to/**/your/*.xml
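Whether ** in the path above matches nested directories depends on the shell configuration (in bash, on the globstar option). As a shell-independent alternative, the file list can be expanded in Python and then passed to htrvx. This is only a sketch: collect_xml_files is a hypothetical helper, and the pattern is the placeholder path from the workflow above.

```python
import glob
import shlex
from typing import List


def collect_xml_files(pattern: str) -> List[str]:
    """Expand a glob pattern; with recursive=True, '**' also matches
    nested directories. Hypothetical helper, not part of HTRVX."""
    return sorted(glob.glob(pattern, recursive=True))


if __name__ == "__main__":
    # Placeholder pattern taken from the workflow; adapt it to your repository.
    files = collect_xml_files("UNIX/Path/to/**/your/*.xml")
    command = ["htrvx", "--format", "alto", "--xsd", *files]
    # Printed rather than executed, so the sketch runs without htrvx installed.
    print(shlex.join(command))
```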
Logo by Alix Chagué.