Implementation of the hOCR specs
Project description
hocr-spec-python
Validation of hOCR close to the specs
Rationale
hOCR is a flavor of HTML for encoding the results of Optical Character Recognition (OCR) engines. It is supported by most OCR engines, such as tesseract, ocropus/ocropy and kraken.
The hOCR specification is at once very simple (hOCR is just HTML) and hard to implement, because it is terse and lacks up-to-date code samples. This project aims to implement the rules defined by the spec from the ground up, to serve as both a validation tool and a reference implementation. It is meant to help hOCR implementers and to support tools like hocr-tools.
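To make "hOCR is just HTML" concrete, here is a minimal sketch that reads a tiny hOCR fragment with nothing but the Python standard library. It is not part of hocr-spec: the sample markup and the names `SAMPLE`, `parse_title`, `HocrElementCollector` and `collect` are made up for illustration. It only relies on the hOCR conventions that elements carry an `ocr*` class and keep their properties (such as `bbox`) in the `title` attribute, separated by semicolons.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, written for this example only.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 20 300 50">
    <span class="ocrx_word" title="bbox 10 20 120 50; x_wconf 95">Hello</span>
  </span>
</div>
"""


def parse_title(title):
    """Split an hOCR title attribute into {property name: [values]}."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if parts:
            props[parts[0]] = parts[1:]
    return props


class HocrElementCollector(HTMLParser):
    """Collect (hOCR class, properties) for every element with an ocr* class."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_classes = (attrs.get("class") or "").split()
        hocr_classes = [c for c in css_classes if c.startswith("ocr")]
        if hocr_classes:
            props = parse_title(attrs.get("title") or "")
            self.elements.append((hocr_classes[0], props))


def collect(html):
    """Return the hOCR elements found in an HTML string, in document order."""
    collector = HocrElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


for hocr_class, props in collect(SAMPLE):
    print(hocr_class, props.get("bbox"))
```

A real validator has to go much further than this (checking which classes may nest, which properties are allowed where, and so on), which is where the terseness of the spec starts to hurt.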
Installation
Use pip:
# System-wide:
sudo pip install hocr-spec
# For current user:
pip install --user hocr-spec
From source:
git clone https://github.com/kba/hocr-spec-python
cd hocr-spec-python
# System-wide:
sudo python setup.py install
# For current user:
python setup.py install --user
Command line interface
usage: hocr-spec [-h] [--format {text,bool,ansi,xml}]
                 [--profile {relaxed,standard}]
                 [--implicit_capabilities CAPABILITY]
                 [--skip-check {attributes,classes,metadata,properties}]
                 [--parse-strict] [--silent]
                 sources [sources ...]

positional arguments:
  sources               hOCR file to check or '-' to read from STDIN

optional arguments:
  -h, --help            show this help message and exit
  --format {text,bool,ansi,xml}, -f {text,bool,ansi,xml}
                        Report format
  --profile {relaxed,standard}, -p {relaxed,standard}
                        Validation profile
  --implicit_capabilities CAPABILITY, -C CAPABILITY
                        Enable this capability. Use '*' to enable all
                        capabilities. In addition to the 'ocr*' classes, you
                        can use ['ocrp_dir', 'ocrp_font', 'ocrp_lang',
                        'ocrp_nlp', 'ocrp_poly']
  --skip-check {attributes,classes,metadata,properties}, -X {attributes,classes,metadata,properties}
                        Skip one check
  --parse-strict        Parse HTML with less tolerance for errors
  --silent, -s          Don't produce any output but signal success with exit
                        code.
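If you want to drive the command line tool from a script, the following sketch shows one way to assemble the arguments from the options listed above. The helper names `build_command` and `is_valid` are hypothetical and not part of hocr-spec. The sketch also assumes that `hocr-spec` is on your `PATH` and that an exit status of 0 with `--silent` means the file is valid, which the help text implies but does not spell out.

```python
import subprocess

# Choices copied from the usage text above.
FORMATS = ("text", "bool", "ansi", "xml")
PROFILES = ("relaxed", "standard")
CHECKS = ("attributes", "classes", "metadata", "properties")


def build_command(sources, fmt=None, profile=None, skip_check=None, silent=False):
    """Build the argv list for one hocr-spec call (hypothetical helper)."""
    cmd = ["hocr-spec"]
    if fmt is not None:
        if fmt not in FORMATS:
            raise ValueError("unknown format: %r" % (fmt,))
        cmd += ["--format", fmt]
    if profile is not None:
        if profile not in PROFILES:
            raise ValueError("unknown profile: %r" % (profile,))
        cmd += ["--profile", profile]
    if skip_check is not None:
        if skip_check not in CHECKS:
            raise ValueError("unknown check: %r" % (skip_check,))
        cmd += ["--skip-check", skip_check]
    if silent:
        cmd.append("--silent")
    cmd.extend(sources)
    return cmd


def is_valid(path):
    """Run hocr-spec silently on one file.

    Assumption: exit status 0 signals a valid file.
    """
    return subprocess.run(build_command([path], silent=True)).returncode == 0
```

For example, `build_command(["sample.hocr"], fmt="xml", profile="relaxed")` yields the argument list for `hocr-spec --format xml --profile relaxed sample.hocr`.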
API example
from hocr_spec import HocrValidator
validator = HocrValidator()
report = validator.validate('/path/to/sample.hocr')
print(report.format('xml'))
# <report valid='false'>...</report>
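Building on the example above, this sketch validates several files in one go. It uses only the calls shown in the example (`HocrValidator()`, `validate(path)` and `report.format(...)`). The helper `validate_many` is hypothetical. Passing anything other than `'xml'` as the format assumes that `report.format` accepts the same names as the `--format` option, which has not been checked here.

```python
def validate_many(paths, fmt="xml", validator=None):
    """Validate each path and return {path: formatted report}.

    Hypothetical helper: `validator` can be any object with a
    validate(path) method whose result has a format(name) method.
    When it is omitted, a HocrValidator is created as in the
    example above.
    """
    if validator is None:
        # Imported lazily so the function can be tried out with a stand-in.
        from hocr_spec import HocrValidator
        validator = HocrValidator()
    reports = {}
    for path in paths:
        report = validator.validate(path)
        reports[path] = report.format(fmt)
    return reports
```

With hocr-spec installed, `validate_many(['/path/to/a.hocr', '/path/to/b.hocr'])` returns one XML report string per file.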
Project details
File details
Details for the file hocr-spec-0.2.0.tar.gz.
File metadata
- Download URL: hocr-spec-0.2.0.tar.gz
- Upload date:
- Size: 12.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.23.3 CPython/2.7.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 0c4b5ab8a448fa090264784f6ede307483f10f700fda859fcc5bb8772130f099
MD5 | 2ed9b89a520993f6add5e00ecb55e90e
BLAKE2b-256 | 03bf8cf264162ce1cda47f95e8aa8822c7563a17aee0d1beacea0e91b4febc78
File details
Details for the file hocr_spec-0.2.0-py2.py3-none-any.whl.
File metadata
- Download URL: hocr_spec-0.2.0-py2.py3-none-any.whl
- Upload date:
- Size: 13.2 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/1.11.0 pkginfo/1.4.2 requests/2.18.4 setuptools/40.6.2 requests-toolbelt/0.8.0 tqdm/4.23.3 CPython/2.7.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | ddb72fada2e7960eadd2f3a5215e587e89f06d9be94db86e9a990eb2102baa32
MD5 | b3d44e976ec8c230e78998cf515af00c
BLAKE2b-256 | ea36ab3bc452f48469c90389a0f093284c0750df8234854ffff09a3819da9d5f