formHTR

Handprint text recognition in form documents.

Installation

pip

pip install formhtr

The tool also requires the zbar shared library (used by pyzbar) to be installed. PDF-related tooling additionally requires qpdf.

System dependencies:

  • macOS (Homebrew): brew install zbar qpdf
  • Debian/Ubuntu: sudo apt-get install libzbar0 qpdf
  • Fedora: sudo dnf install zbar qpdf

You can verify runtime requirements with:

formhtr doctor
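
As a rough illustration of what such a check involves, the sketch below looks for the two system dependencies named above from Python. It is not the implementation of formhtr doctor. The function name check_system_deps is hypothetical, and ctypes.util.find_library may miss libraries installed in non-default locations on some platforms, so treat a negative result as a hint rather than proof.

```python
import shutil
from ctypes.util import find_library


def check_system_deps():
    """Return a dict mapping each dependency name to True if it was found."""
    return {
        # zbar is a shared library (used by pyzbar), so ask the system linker
        "zbar": find_library("zbar") is not None,
        # qpdf is a command-line executable, so look it up on PATH
        "qpdf": shutil.which("qpdf") is not None,
    }


if __name__ == "__main__":
    for name, found in check_system_deps().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```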

conda (dev)

conda env create -f conda_env.yaml

Usage

Run formhtr --help for full CLI help.

Quickstart

# 1) Verify system dependencies
formhtr doctor

# 2) Create ROI config for a template
formhtr select-rois --pdf-file template.pdf --output-file config.json

# 3) Optionally annotate ROI types and variable names
formhtr annotate-rois --pdf-file template.pdf --config-file config.json --output-file config_annotated.json

# 4) Process a scanned logsheet into XLSX
formhtr process-logsheet \
  --pdf-logsheet scan.pdf \
  --pdf-template template.pdf \
  --config-file config_annotated.json \
  --output-file output.xlsx \
  --google google_credentials.json \
  --amazon amazon_credentials.json \
  --azure azure_credentials.json

Create ROIs

This functionality is split (for now) into two separate commands: select-rois and annotate-rois.

select ROIs

Find and define locations of regions of interest (ROIs) in the given PDF.

ROIs (rectangles) can be drawn manually or detected automatically. Their coordinates are stored in a JSON file.

Run the tool from the command line, because the control commands are entered there.

Control commands

  • Press q or Esc to exit editing and save the config file.
  • Press r to remove the last rectangle.

Run formhtr select-rois -h for details.
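
The layout of the JSON config is not described here, so the sketch below makes no assumptions about it. It only loads a config file written by select-rois and reports its top-level shape, which can help when checking that a file was saved correctly. The function name summarize_config is hypothetical.

```python
import json


def summarize_config(path):
    """Load a JSON config and describe its top-level shape without assuming a schema."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        # JSON object: report its keys in sorted order
        return {"type": "object", "keys": sorted(data), "length": len(data)}
    if isinstance(data, list):
        # JSON array: only the number of entries is reported
        return {"type": "array", "keys": [], "length": len(data)}
    # Scalar value (string, number, boolean, null)
    return {"type": type(data).__name__, "keys": [], "length": 1}
```

For example, summarize_config("config.json") can be called after the select-rois step of the Quickstart.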

annotate ROIs

Specify the type of content for each rectangle.

You navigate through the defined ROIs and assign each one the expected type of its content by pressing the control keys listed below.

Control commands

  • Press q or Esc to exit editing and save the config file.
  • Press h to add "Handwritten" type to the current ROI.
  • Press c to add "Checkbox" type to the current ROI.
  • Press b to add "Barcode" type to the current ROI.
  • Press r or d to delete the type from the current ROI.
  • Press v to enter the variable name.
  • Press the left or right arrow key to navigate through ROIs (up and down are not supported for now).

Run formhtr annotate-rois -h for details.

process logsheet

Extract values from specified ROIs.

This is the core step: it applies several techniques to extract the information as accurately as possible. It processes one logsheet at a time, given the template and config files.

Run formhtr process-logsheet -h for details.
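
Because the command handles one logsheet per call, a folder of scans needs a loop around it. The sketch below is one possible wrapper, assuming only the flags shown in the Quickstart. The function names build_command and process_directory and the output naming scheme are hypothetical. Whether all three credential flags are mandatory is not stated here, so check formhtr process-logsheet -h before relying on it.

```python
import subprocess
from pathlib import Path


def build_command(scan, template, config, output, google, amazon, azure):
    """Assemble one process-logsheet call from the flags shown in the Quickstart."""
    return [
        "formhtr", "process-logsheet",
        "--pdf-logsheet", str(scan),
        "--pdf-template", str(template),
        "--config-file", str(config),
        "--output-file", str(output),
        "--google", str(google),
        "--amazon", str(amazon),
        "--azure", str(azure),
    ]


def process_directory(scan_dir, out_dir, **shared):
    """Run the CLI once per PDF in scan_dir, writing <scan name>.xlsx to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for scan in sorted(Path(scan_dir).glob("*.pdf")):
        output = out_dir / f"{scan.stem}.xlsx"
        # check=True raises CalledProcessError on the first failing logsheet
        subprocess.run(build_command(scan, output=output, **shared), check=True)
```

A call would pass the shared files as keyword arguments, for example process_directory("scans", "results", template="template.pdf", config="config_annotated.json", google="google_credentials.json", amazon="amazon_credentials.json", azure="azure_credentials.json").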

Credentials

Logsheet processing uses external services that require credentials. The expected structure of each credentials file, always in JSON format, is given below.

Google

{
  "type": "service_account",
  "project_id": "theid",
  "private_key_id": "thekey",
  "private_key": "-----BEGIN PRIVATE KEY-----anotherkey-----END PRIVATE KEY-----\n",
  "client_email": "emailaddress",
  "client_id": "id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "someurl",
  "universe_domain": "googleapis.com"
}

Amazon

{
    "ACCESS_KEY": "YOUR_ACCESS_KEY_ID_HERE",
    "SECRET_KEY": "YOUR_SECRET_ACCESS_KEY_HERE",
    "REGION": "YOUR_REGION_NAME_HERE"
}

Microsoft (Azure)

{
    "SUBSCRIPTION_KEY": "YOURKEYHERE",
    "ENDPOINT": "https://ENDPOINT"
}
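
A stray or missing comma is enough to make a credentials file unreadable, so it can be worth checking the files before a run. The sketch below parses a file and lists which of the keys from the example structures above are absent. The function name missing_keys and the provider labels are hypothetical, and whether formHTR itself requires every listed key is an assumption.

```python
import json

# Keys copied from the example structures above; treating all of them
# as required is an assumption, not documented formHTR behaviour.
EXPECTED_KEYS = {
    "google": {
        "type", "project_id", "private_key_id", "private_key",
        "client_email", "client_id", "auth_uri", "token_uri",
        "auth_provider_x509_cert_url", "client_x509_cert_url",
        "universe_domain",
    },
    "amazon": {"ACCESS_KEY", "SECRET_KEY", "REGION"},
    "azure": {"SUBSCRIPTION_KEY", "ENDPOINT"},
}


def missing_keys(path, provider):
    """Return the sorted expected keys that are absent from a credentials file.

    Raises json.JSONDecodeError if the file is not valid JSON and
    ValueError if its top level is not a JSON object.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError(f"{path}: expected a JSON object at the top level")
    return sorted(EXPECTED_KEYS[provider] - set(data))
```

An empty list from missing_keys("amazon_credentials.json", "amazon") means the file parses and contains every key shown in the example.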
