Skip to main content

No project description provided

Project description

form-tools

The raw data for many case management and data systems exist as paper forms. form-tools is a package to help with preprocessing scanned images of these paper forms for further analysis and / or processing. It does this by making use of a template for the form to match and align scanned versions of the document to it, before taking thumbnails of the fields in the scanned document.

Before you begin

  • form-tools makes use of the pdf2image package for converting document images stored as pdf to image files. As such, you'll need to install poppler. See the pdf2image readme for guidance on how to do so.
  • The current default OCR engine for matching pages in a form template to its scanned image is tesseract. Please follow the instructions at the link for how to install it.
  • Computer vision is performed by using the opencv library. This project makes use of the pre-compiled python library for opencv which will be installed by default but you may wish to install opencv from source instead.

On Ubuntu, you can install all the necessary packages by running

sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config poppler-utils

You will also need to specify the location of the test data for tesseract before using the library. You can do this by setting the TESSDATA_PREFIX environment variable. To locate the tessdata directory on a mac run brew list tesseract. On linux the data should be located at /usr/share/tesseract-ocr/4.00/tessdata/.

Installation

To install the library run:

pip install form-tools

Basic use

Extracting form metadata

Say you have a form with a pdf template my_form.pdf. To pre-process scanned copies of the form you'll first need to create an image directory for your template as well as a FormMetadata compliant json file.

To do this from the command line and output the metadata to my_form_meta.json and your images to a directory template_images you would run:

form-tools extract-meta my_form.pdf my_form_meta.json --form-image-directory template_images

To interact with the API directly in python you should use the built in PdfFormMetaExtractor class.

from form_tools.form_meta.extractors.pdf_form_extractor import PdfFormMetaExtractor

# Instantiate extractor
pfme = PdfFormMetaExtractor()

# Create FormMetadata object and populate
# image directory template_images
form_metadata = pfme.extract_meta(
    form_template_path="my_form.pdf",
    form_image_dir="template_images"
)

# Write FormMetadata to json file
form_metadata.to_json(
    "my_form_meta.json",
)

The output metadata should contain bounding box coordinates for each field in the form that correspond to regions in the images outputted to template_images.

Note: The output metadata will not be able to be used immediately to align a scanned image to the template as the form_identifier key and identifier key for each form_page in the metadata will need to be populated with a valid regular expression so that the correct page in the scanned image can be compared with the correct page in the template images.

Aligning scanned images to a template

Once you have a complete form metadata file for your template and a populated image directory you can attempt to align a scanned form, say my_scanned_form.pdf to the template and extract field thumbnails.

You will first need to prepare a config file to specify the opencv algorithms to use for the alignment process. An example config.yaml would be as follows:

detector:
  name: SIFT
matcher:
  id: FLANN
  args:
    - algorithm: 1
      trees: 5
    - check: 50
knn: 2
proportion: 0.7
ocr_options:
  rotation_engine: tesseract
  text_extraction_engine: tesseract
pass_directory: s3://my-bucket/pass_directory
fail_directory: s3://my-bucket/fail_directory
form_metadata_directory: metadata

This config specifies that the SIFT algorithm should be used for keypoint detection and the FLANN algorithm should be used for keypoint matching, with 70% of the best keypoints kept (using KNN to decide on which of these are best). Also, note that we've put the output metadata in a metadata subdirectory in our working directory.

To align the scanned image from the command line you would then run:

form-tools process-form my_scanned_form.pdf config.yaml

To interact with the API directly in python you would use the FormOperator class.

from form_tools.form_operators import FormOperator

form_operator = FormOperator.create_from_config("config.yaml")

_ = form_operator.run_full_pipeline(
    form_path="my_scanned_form.pdf",
    pass_dir="s3://my-bucket/pass_directory",
    fail_dir="s3://my-bucket/fail_directory",
    form_meta_directory="metadata",
)

Note: The scanned image could be stored in an AWS S3 bucket. In that case you would pass the S3 path (e.g. s3://my-bucket/my_scanned_form.pdf). Only the config and metadata directory need to be located in your local working directory.

Running documentation locally

mkdocs is used to document form-tools. To run the documentation locally, run mkdocs serve on the command line and follow the link to the local host.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

form_tools-0.2.0.tar.gz (32.8 kB view details)

Uploaded Source

Built Distribution

form_tools-0.2.0-py3-none-any.whl (38.9 kB view details)

Uploaded Python 3

File details

Details for the file form_tools-0.2.0.tar.gz.

File metadata

  • Download URL: form_tools-0.2.0.tar.gz
  • Upload date:
  • Size: 32.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.9.19 Linux/6.5.0-1024-azure

File hashes

Hashes for form_tools-0.2.0.tar.gz
Algorithm Hash digest
SHA256 c800bc76a1d7b6c0728643fab079364b1f84c2642832d77dd16f33c451043dcb
MD5 967f0c65fef13b915784fb165eb79f9e
BLAKE2b-256 b97ade339ee6de1c6e5bc37f79176e3fe835bc099fb1dc3e716abe03ea0a37d5

See more details on using hashes here.

File details

Details for the file form_tools-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: form_tools-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 38.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.9.19 Linux/6.5.0-1024-azure

File hashes

Hashes for form_tools-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0eb93787bda36993165fe92d18f582a8c0bc71f715a4fe89ce6e81f96b688325
MD5 7bcc1e2e353e8c2e7b4d6af65bd72807
BLAKE2b-256 da176d55bb2424d4ad0592107dcb448f8535a3e50acad670a1d59ea84c8b27ff

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page