Implementation of the hOCR specs

hocr-spec-python

Validation of hOCR close to the specs

Rationale

hOCR is a flavor of HTML for encoding the results of Optical Character Recognition (OCR) engines. It is supported by most OCR engines, such as tesseract, ocropus/ocropy and kraken.
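To make "a flavor of HTML" concrete, the following is a minimal sketch that reads a tiny hOCR document with Python's standard `html.parser` and collects each element's class and bounding box. The sample markup, the `HocrElementCollector` class and the `parse_bbox` and `collect_hocr_elements` helpers are illustrative assumptions, not part of this package; it relies only on the common hOCR conventions that element types are given as `ocr*` classes and that properties such as `bbox` live in the `title` attribute.

```python
from html.parser import HTMLParser

# Hypothetical sample document, written by hand for illustration only.
SAMPLE_HOCR = """<!DOCTYPE html>
<html>
 <head>
  <meta name="ocr-system" content="example-engine 1.0"/>
  <meta name="ocr-capabilities" content="ocr_page ocr_line ocrx_word"/>
 </head>
 <body>
  <div class="ocr_page" title="bbox 0 0 800 600">
   <span class="ocr_line" title="bbox 10 20 400 50">
    <span class="ocrx_word" title="bbox 10 20 90 50">Hello</span>
    <span class="ocrx_word" title="bbox 100 20 190 50">world</span>
   </span>
  </div>
 </body>
</html>"""


def parse_bbox(title):
    """Return the bbox property as a tuple of four ints, or None if absent."""
    # Properties in the title attribute are separated by semicolons.
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(value) for value in parts[1:])
    return None


class HocrElementCollector(HTMLParser):
    """Collect (class, bbox) pairs for elements whose class starts with 'ocr'."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        css_class = attributes.get("class") or ""
        if not css_class.startswith("ocr"):
            return  # plain HTML element, not an hOCR element
        bbox = parse_bbox(attributes.get("title") or "")
        self.elements.append((css_class, bbox))


def collect_hocr_elements(markup):
    """Parse markup and return the list of (class, bbox) pairs found."""
    collector = HocrElementCollector()
    collector.feed(markup)
    collector.close()
    return collector.elements


if __name__ == "__main__":
    for css_class, bbox in collect_hocr_elements(SAMPLE_HOCR):
        print(css_class, bbox)
```

This sketch only extracts data; it performs none of the checks that hocr-spec itself implements.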

The hOCR specification is at once very simple (hOCR is just HTML) and hard to implement, owing to its terseness and lack of up-to-date code samples. This project aims to implement the rules defined by the spec from the ground up, to serve as a validation tool and reference implementation. It is meant to help hOCR implementers and to support tools like hocr-tools.

Installation

Use pip:

# System-wide:
sudo pip install hocr-spec
# For current user:
pip install --user hocr-spec

From source:

git clone https://github.com/kba/hocr-spec-python
cd hocr-spec-python
# System-wide:
sudo python setup.py install
# For current user:
python setup.py install --user

Command line interface

usage: hocr-spec [-h] [--format {text,bool,ansi,xml}]
                 [--profile {relaxed,standard}]
                 [--implicit_capabilities CAPABILITY]
                 [--skip-check {attributes,classes,metadata,properties}]
                 [--parse-strict] [--silent]
                 sources [sources ...]

positional arguments:
  sources               hOCR file to check or '-' to read from STDIN

optional arguments:
  -h, --help            show this help message and exit
  --format {text,bool,ansi,xml}, -f {text,bool,ansi,xml}
                        Report format
  --profile {relaxed,standard}, -p {relaxed,standard}
                        Validation profile
  --implicit_capabilities CAPABILITY, -C CAPABILITY
                        Enable this capability. Use '*' to enable all
                        capabilities. In addition to the 'ocr*' classes, you
                        can use ['ocrp_dir', 'ocrp_font', 'ocrp_lang',
                        'ocrp_nlp', 'ocrp_poly']
  --skip-check {attributes,classes,metadata,properties}, -X {attributes,classes,metadata,properties}
                        Skip one check
  --parse-strict        Parse HTML with less tolerance for errors
  --silent, -s          Don't produce any output but signal success with exit
                        code.
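When the command line tool is driven from a script, the options above can be assembled programmatically. The following is a minimal sketch using only flags listed in the usage text; the helper names `build_hocr_spec_command` and `is_valid_hocr` are hypothetical, the default values are this sketch's own choice rather than the tool's documented defaults, and it assumes that an exit code of 0 signals success, as the `--silent` help text suggests.

```python
import shutil
import subprocess

# Choices copied from the usage text above.
FORMATS = ("text", "bool", "ansi", "xml")
PROFILES = ("relaxed", "standard")


def build_hocr_spec_command(sources, report_format="text",
                            profile="standard", silent=False):
    """Build the argument list for a hocr-spec invocation (hypothetical helper)."""
    if not sources:
        raise ValueError("at least one source is required")
    if report_format not in FORMATS:
        raise ValueError("unknown format: %r" % (report_format,))
    if profile not in PROFILES:
        raise ValueError("unknown profile: %r" % (profile,))
    command = ["hocr-spec", "--format", report_format, "--profile", profile]
    if silent:
        command.append("--silent")
    command.extend(sources)
    return command


def is_valid_hocr(path):
    """Return True or False based on the exit code, or None if the
    hocr-spec executable is not on PATH.

    Assumption: exit code 0 means the file passed validation.
    """
    if shutil.which("hocr-spec") is None:
        return None
    result = subprocess.run(build_hocr_spec_command([path], silent=True))
    return result.returncode == 0


if __name__ == "__main__":
    print(build_hocr_spec_command(["sample.hocr"], report_format="xml"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.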

API example

from hocr_spec import HocrValidator

validator = HocrValidator()
report = validator.validate('/path/to/sample.hocr')
print(report.format('xml'))
# <report valid='false'>...</report>
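The XML report can be inspected with the standard library. The sketch below reads only the `valid` attribute of the root `report` element, which is the one detail the example output above shows; the helper name `report_is_valid` is hypothetical, the contents of the report body are not assumed, and treating `'true'` as the passing value is an assumption inferred from the `'false'` shown above.

```python
import xml.etree.ElementTree as ET


def report_is_valid(report_xml):
    """Return True if the report's root element carries valid='true'.

    Assumption: a passing report uses valid='true'; only valid='false'
    appears in the example output.
    """
    root = ET.fromstring(report_xml)
    if root.tag != "report":
        raise ValueError("unexpected root element: %r" % (root.tag,))
    return root.get("valid") == "true"


if __name__ == "__main__":
    # Hand-written stand-in for the output of report.format('xml').
    print(report_is_valid("<report valid='false'></report>"))
```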


