formHTR

Handprint text recognition in form documents.

Installation

pip

pip install formhtr

The tool also requires the zbar shared library (used by pyzbar) to be installed. PDF-related tooling additionally requires qpdf.

System dependencies:

  • macOS (Homebrew): brew install zbar qpdf
  • Debian/Ubuntu: sudo apt-get install libzbar0 qpdf
  • Fedora: sudo dnf install zbar qpdf

You can verify runtime requirements with:

formhtr doctor
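
As an independent cross-check of the two system dependencies named above, a minimal Python sketch could look like the following. It is not the implementation of `formhtr doctor`; the function name `check_system_deps` is hypothetical, and it uses only the standard library. A negative result is a hint rather than proof, since library lookup paths differ between platforms.

```python
import ctypes.util
import shutil


def check_system_deps():
    """Return a dict mapping each dependency name to True if it was found."""
    return {
        # zbar is a shared library, so look it up the way ctypes would.
        "zbar": ctypes.util.find_library("zbar") is not None,
        # qpdf is a command-line program, so look for it on PATH.
        "qpdf": shutil.which("qpdf") is not None,
    }


if __name__ == "__main__":
    for name, found in check_system_deps().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```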

conda (dev)

conda env create -f conda_env.yaml

Usage

Run formhtr --help for full CLI help.

Quickstart

# 1) Verify system dependencies
formhtr doctor

# 2) Create ROI config for a template
formhtr select-rois --pdf-file template.pdf --output-file config.json

# 3) Optionally annotate ROI types and variable names
formhtr annotate-rois --pdf-file template.pdf --config-file config.json --output-file config_annotated.json

# 4) Process a scanned logsheet into XLSX
formhtr process-logsheet \
  --pdf-logsheet scan.pdf \
  --pdf-template template.pdf \
  --config-file config_annotated.json \
  --output-file output.xlsx \
  --google google_credentials.json \
  --amazon amazon_credentials.json \
  --azure azure_credentials.json

Create ROIs

This functionality is currently split into two separate commands.

select ROIs

Find and define locations of regions of interest (ROIs) in the given PDF.

ROIs (rectangles) can be drawn manually or detected automatically. Their coordinates are stored in a JSON file.

Run the tool from the command line, since the control commands are entered there.

Control commands

  • Press q or Esc to exit editing and save the config file.
  • Press r to remove the last rectangle.

Run formhtr select-rois -h for details.

annotate ROIs

Specify the type of content for each rectangle.

The workflow lets you step through the defined ROIs and assign each one the expected type of its content, using the control commands listed here.

Control commands

  • Press q or Esc to exit editing and save the config file.
  • Press h to add "Handwritten" type to the current ROI.
  • Press c to add "Checkbox" type to the current ROI.
  • Press b to add "Barcode" type to the current ROI.
  • Press r or d to delete the type from the current ROI.
  • Press v to enter the variable name.
  • Press the left or right arrow key to navigate between ROIs (up and down are not supported yet).

Run formhtr annotate-rois -h for details.

process logsheet

Extract values from specified ROIs.

This is the core step, which combines several techniques to extract the information as precisely as possible. It processes one logsheet at a time, given the template and config files.

Run formhtr process-logsheet -h for details.
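
Because each call handles a single logsheet, a folder of scans has to be processed in a loop. The sketch below shows one way to do that from Python by invoking the CLI with the flags from the Quickstart. The helper names `build_command` and `process_all`, and the choice to write each XLSX next to its scan, are assumptions made for illustration. Whether all three credential flags are required is not stated here, so the sketch passes whichever ones are supplied.

```python
import subprocess
from pathlib import Path


def build_command(scan, template, config, output, credentials):
    """Build the argv list for one `formhtr process-logsheet` call."""
    cmd = [
        "formhtr", "process-logsheet",
        "--pdf-logsheet", str(scan),
        "--pdf-template", str(template),
        "--config-file", str(config),
        "--output-file", str(output),
    ]
    # credentials maps a service flag name ("google", "amazon", "azure")
    # to the path of its JSON credentials file.
    for service, path in credentials.items():
        cmd += [f"--{service}", str(path)]
    return cmd


def process_all(scan_dir, template, config, credentials, run=subprocess.run):
    """Process every PDF in scan_dir, writing <scan name>.xlsx next to it."""
    for scan in sorted(Path(scan_dir).glob("*.pdf")):
        output = scan.with_suffix(".xlsx")
        # check=True makes a failed run raise instead of passing silently.
        run(build_command(scan, template, config, output, credentials), check=True)
```

For example, `process_all("scans", "template.pdf", "config_annotated.json", {"google": "google_credentials.json"})` would run the command once per PDF found in the `scans` directory.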

Credentials

Processing logsheets relies on external services that require credentials. This section specifies the expected structure of each credentials file, always in JSON format.

Google

{
  "type": "service_account",
  "project_id": "theid",
  "private_key_id": "thekey",
  "private_key": "-----BEGIN PRIVATE KEY-----anotherkey-----END PRIVATE KEY-----\n",
  "client_email": "emailaddress",
  "client_id": "id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "someurl",
  "universe_domain": "googleapis.com"
}

Amazon

{
    "ACCESS_KEY": "YOUR_KEY_ID_HERE",
    "SECRET_KEY": "YOUR_SECRET_ACCESS_KEY_HERE",
    "REGION": "YOUR_REGION_NAME_HERE"
}

Microsoft (Azure)

{
    "SUBSCRIPTION_KEY": "YOURKEYHERE",
    "ENDPOINT": "https://ENDPOINT"
}
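
Before starting a run, it can help to confirm that each credentials file parses as JSON and contains the keys shown in the examples. The following is a minimal sketch of such a check, not part of formHTR. The name `missing_keys` is hypothetical, and treating every key from the examples as required is an assumption.

```python
import json

# Keys copied from the example files in this section. Whether formHTR
# itself requires every one of them is an assumption.
REQUIRED_KEYS = {
    "google": {
        "type", "project_id", "private_key_id", "private_key",
        "client_email", "client_id", "auth_uri", "token_uri",
        "auth_provider_x509_cert_url", "client_x509_cert_url",
        "universe_domain",
    },
    "amazon": {"ACCESS_KEY", "SECRET_KEY", "REGION"},
    "azure": {"SUBSCRIPTION_KEY", "ENDPOINT"},  # the Microsoft service
}


def missing_keys(service, text):
    """Parse text as JSON and return the expected keys that are absent."""
    data = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("credentials file must contain a JSON object")
    return sorted(REQUIRED_KEYS[service] - set(data))
```

Calling `missing_keys("amazon", Path("amazon_credentials.json").read_text())` (with `Path` imported from `pathlib`) returns an empty list when all expected keys are present. A stray or missing comma in the file surfaces as a `json.JSONDecodeError`.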
