
Python package to analyze scanned questionnaires and forms with AWS Textract and convert the results to an XLSX.


form-analyzer - A library that uses AWS Textract to automatically evaluate filled forms

You do not need advanced Python skills, but a basic understanding of the language is required.

Prerequisites

  • Install form-analyzer using pip
pip install form-analyzer

Example

For a comprehensive example, see the example folder in this project.

Prepare questionnaires

Before your input data can be processed, the questionnaires need to be converted to a suitable format. form-analyzer requires PNG files for the upload to AWS Textract. If your data is already in this format, make sure that the lexicographic order of the file names matches the page order of your forms, with all pages of one form grouped together.

Example:

Form1_Page1.png
Form1_Page2.png
Form1_Page3.png
Form2_Page1.png
Form2_Page2.png
Form2_Page3.png
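
Because the order is lexicographic, unpadded page numbers can sort incorrectly once a form has ten or more pages ("Page10" sorts before "Page2"). The following is a minimal sketch for checking this before the upload, using only the standard library. The helper names natural_key and order_is_safe and the file names are illustrative and not part of form-analyzer.

```python
import re


def natural_key(name: str):
    """Split a file name into text and integer chunks for human-style sorting."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]


def order_is_safe(file_names):
    """Return True if plain lexicographic order equals the human page order."""
    return sorted(file_names) == sorted(file_names, key=natural_key)


unpadded = ['Form1_Page1.png', 'Form1_Page2.png', 'Form1_Page10.png']
padded = ['Form1_Page01.png', 'Form1_Page02.png', 'Form1_Page10.png']

print(order_is_safe(unpadded))  # False: Page10 sorts before Page2
print(order_is_safe(padded))    # True: zero-padding keeps the page order
```

If the check fails, renaming the files with zero-padded page numbers is usually the simplest fix.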

Convert PDF files

form-analyzer can convert PDF input files to properly named PNG files ready for upload. Each PDF page can optionally be post-processed by a custom function to split pages.

Create a Python script like this to convert single-page PDF files (assuming that the PDFs are located in the folder "questionnaires"):

import form_analyzer

form_analyzer.pdf_to_image('questionnaires')

The following example shows how to split a single PDF page into two images and how to return only the first page:

import form_analyzer


def one_page_to_two(_: int, image):
    left = image.crop((0, 0, image.width // 2, image.height))
    right = image.crop((image.width // 2, 0, image.width, image.height))

    return [form_analyzer.ProcessedImage(left, '_1'), form_analyzer.ProcessedImage(right, '_2')]


form_analyzer.pdf_to_image('questionnaires', image_processor=one_page_to_two)

form_analyzer.pdf_to_image('questionnaires', 
                           image_processor=lambda image_index, image: [form_analyzer.ProcessedImage(image, '') if image_index == 0 else None])

The argument image_processor specifies a function that receives the current PDF page number (starting with 0) and an Image object. The function returns a list of form_analyzer.ProcessedImage objects, each of which contains an Image object and a file name suffix. The list may also contain None, in which case the entry is skipped.
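
The crop arithmetic from one_page_to_two generalizes to any number of columns. The sketch below only computes the crop boxes; the tuples follow the (left, upper, right, lower) convention of Pillow's Image.crop. The helper column_boxes is hypothetical and not part of form-analyzer.

```python
def column_boxes(width: int, height: int, columns: int):
    """Return (left, upper, right, lower) boxes that split a page into equal columns."""
    boxes = []
    for i in range(columns):
        # Integer division keeps neighbouring boxes adjacent without gaps.
        left = width * i // columns
        right = width * (i + 1) // columns
        boxes.append((left, 0, right, height))
    return boxes


print(column_boxes(2480, 3508, 2))
# [(0, 0, 1240, 3508), (1240, 0, 2480, 3508)]
```

Inside an image_processor, each box could be passed to image.crop and the result wrapped in a form_analyzer.ProcessedImage with a suffix such as '_1', '_2' and so on.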

The resulting images are stored in the same folder as the PDF source files.

AWS Textract

The converted images can now be processed by AWS Textract to extract the form data. You can either provide your AWS access key and region as parameters or set them up according to this manual.

It is also possible to upload the images to an AWS S3 bucket and analyze them from there. If that's desired, pass the S3 bucket name and an optional sub folder.

Assuming that the credentials are already set, this script will upload and process the data.

import form_analyzer

form_analyzer.run_textract('questionnaires')

The result data is saved as JSON files in the target folder. Before using AWS Textract, the function checks if result data is already present. If that is the case, the Textract call is skipped.
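
To see in advance which images would still be sent to Textract, the folder can be inspected with the standard library. This is only a sketch: it assumes that each result file has the same base name as its PNG with a .json extension, which is a guess about the naming scheme and should be checked against the files that run_textract actually writes. The function images_without_results is hypothetical.

```python
import tempfile
from pathlib import Path


def images_without_results(folder: str):
    """List PNG files in `folder` that have no JSON file with the same stem.

    Assumption: a result for 'X.png' is stored as 'X.json' in the same folder.
    """
    root = Path(folder)
    done = {path.stem for path in root.glob('*.json')}
    return sorted(path.name for path in root.glob('*.png') if path.stem not in done)


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('Form1_Page1.png', 'Form1_Page2.png', 'Form1_Page1.json'):
        (Path(tmp) / name).touch()
    print(images_without_results(tmp))  # ['Form1_Page2.png']
```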

Work with Textract only

If you do not need the form processing, you can also directly use the generated JSON files with Textract Response Parser.

import glob
import json
import trp

for file_name in glob.glob('*.json'):
    with open(file_name) as f:
        doc = trp.Document([json.load(f)])

    for block in doc.blocks[0]['Blocks']:
        print(block.get('Text'))
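
Note that the loop above prints None for blocks that carry no text, such as PAGE blocks. To get a quick overview of what a response contains, the blocks can be grouped by their BlockType. The sketch below works on the plain JSON dictionary and does not need trp; the sample dictionary is a hand-written stand-in, not real Textract output.

```python
from collections import Counter


def count_block_types(response: dict) -> Counter:
    """Count how often each Textract BlockType occurs in one response."""
    return Counter(block['BlockType'] for block in response.get('Blocks', []))


# Tiny hand-written stand-in for a Textract response.
sample = {
    'Blocks': [
        {'BlockType': 'PAGE'},
        {'BlockType': 'LINE', 'Text': 'First name'},
        {'BlockType': 'WORD', 'Text': 'First'},
        {'BlockType': 'WORD', 'Text': 'name'},
        {'BlockType': 'SELECTION_ELEMENT', 'SelectionStatus': 'SELECTED'},
    ]
}

counts = count_block_types(sample)
print(counts['WORD'])               # 2
print(counts['SELECTION_ELEMENT'])  # 1
```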

Form description

In order to convert your form to a meaningful Excel file, form-analyzer needs to know the expected form fields. A description has to be provided as a Python module.

This module needs to contain two variables:

  • form_fields: The list of form fields
  • keywords_per_page: A list of keywords to expect on each page

form_fields variable

This variable is a list of FormField objects, each of which describes a single field in the form. Each FormField object consists of a title and a Selector object. The title is the column header in the Excel file, and the Selector defines the type of the form field and its location.

Important: The form description greatly affects the result of the form analysis. AWS Textract often makes slight errors and does not yield 100% correct results. The form description needs to account for that: on the one hand, it should describe in detail where to look for each form field; on the other hand, it should keep the search strings generic enough that the correct field is still detected despite recognition errors.

Selectors

Some selectors require a key, and all of them require a filter for initialization. The key is the label of the form field, which is searched for in the extracted form data. It is recommended to use a unique part of the label rather than the full label to compensate for potential detection errors.

  • SingleSelect: Describes a list of checkboxes where only one may be marked
  • MultiSelect: Describes a list of checkboxes where none, one or several may be marked
  • TextField: Describes a text input box or input line where free text can be entered
  • TextFieldWithCheckbox: Describes a text input field with an additional checkbox
  • Number: Special case of TextField where only numbers may be entered
  • Placeholder: Results in an empty column in the Excel file

For single and multi selects, additional and alternative text fields can be given. The content of the additional field is always added to the output and can be used to handle optional free text fields. The alternative text field is used when no selection is made. Both additional and alternative fields can be either TextField, Number or TextFieldWithCheckbox.

Note that all text matching is case-insensitive and tolerates a certain fuzziness, so no exact match is required.
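
The following sketch illustrates why a short, unique key combined with fuzzy matching tolerates recognition errors. It uses difflib from the standard library and is not the matching algorithm used by form-analyzer; the function best_match_ratio and the threshold of 0.9 are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def best_match_ratio(key: str, text: str) -> float:
    """Best similarity between `key` and any equally long window of `text`, ignoring case."""
    key, text = key.lower(), text.lower()
    if len(text) <= len(key):
        return SequenceMatcher(None, key, text).ratio()
    return max(SequenceMatcher(None, key, text[i:i + len(key)]).ratio()
               for i in range(len(text) - len(key) + 1))


# Label as it might come back from OCR: "birth" was misread as "blrth".
ocr_label = 'Date of blrth (DD/MM/YYYY)'

print('date of birth' in ocr_label.lower())                # False: exact search fails
print(best_match_ratio('Date of Birth', ocr_label) > 0.9)  # True: fuzzy search succeeds
```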

See also the documentation.

Filters

Filters restrict which of the extracted form fields are considered when searching for the current form field. The fewer candidate fields remain, the higher the probability of a correct result.

Filters can be combined using the & (and) and | (or) operators.

  • Page: Restricts the search to a certain page (page numbers starting with 0, so 0 is the first page)
  • Pages: Restricts the search to a list of pages
  • Location: Restricts the search to a part of the page indicated by horizontal and vertical ranges as page fractions.
  • Selected: Restricts the search to fields which are selected checkboxes

Location filters apply to all selection possibilities for single and multi selects and to the label for text and number fields.

Note that when working with location filters and scanned form pages, the position of certain fields on the page must be similar for each scan.
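
To make the filter semantics concrete, here is a minimal, self-contained sketch of how page and location filters could be combined with & and |. It is not form-analyzer's implementation: the Filter class, the lowercase helpers page and location, and the dictionary layout of a field are hypothetical. The bounding box keys (Left, Top, Width, Height as fractions of the page) follow the Textract response format. Whether the library tests the center of a field or its whole box is not stated here, so the sketch simply uses the center.

```python
from dataclasses import dataclass
from typing import Callable

# A field is represented as a plain dict here; form-analyzer's own types differ.
Predicate = Callable[[dict], bool]


@dataclass(frozen=True)
class Filter:
    """Wraps a predicate and supports the & and | operators."""
    test: Predicate

    def __and__(self, other: 'Filter') -> 'Filter':
        return Filter(lambda field: self.test(field) and other.test(field))

    def __or__(self, other: 'Filter') -> 'Filter':
        return Filter(lambda field: self.test(field) or other.test(field))


def page(number: int) -> Filter:
    """Match fields on a zero-based page number."""
    return Filter(lambda field: field['page'] == number)


def location(horizontal=(0.0, 1.0), vertical=(0.0, 1.0)) -> Filter:
    """Match fields whose center lies inside the given page fractions."""
    def inside(field: dict) -> bool:
        box = field['box']
        x = box['Left'] + box['Width'] / 2
        y = box['Top'] + box['Height'] / 2
        return horizontal[0] <= x <= horizontal[1] and vertical[0] <= y <= vertical[1]
    return Filter(inside)


# A field near the bottom left of the second page (page index 1).
field = {'page': 1, 'box': {'Left': 0.1, 'Top': 0.8, 'Width': 0.2, 'Height': 0.05}}
combined = (page(1) & location(vertical=(.66, 1))) | (page(2) & location(vertical=(.0, .5)))

print(combined.test(field))                # True
print((page(0) & location()).test(field))  # False
```

In Python, & binds more tightly than |, so the parentheses around each & term are optional but make the intent easier to read.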

See also the documentation.

Examples

from form_analyzer.filters import *
from form_analyzer.selectors import *

# Single select on the first page with two options
single_select = SingleSelect(['First option', 'Second option'], 
                             Page(0))

# Multi select on the top half of the first page
multi_select = MultiSelect(['First option', 'Second option'],
                           Page(0) & Location(vertical=(.0, .5)))

# Text field on the upper left quarter of the first page
text_field = TextField('Field label',
                       Page(0) & Location(horizontal=(.0, .5), vertical=(.0, .5)))

# Single select on the lowest third of the second page or the top half of the third page
single_select_2 = SingleSelect(['First option', 'Second option', 'Third option'],
                               (Page(1) & Location(vertical=(.66, 1))) |
                               (Page(2) & Location(vertical=(.0, .5))))

Keywords per page

The variable keywords_per_page in the form description is used to validate that the correct form is being analyzed. It is a list of lists of strings, one inner list per page. At least one string from a page's list has to be found among the strings that Textract discovered on that page.

If the outer list is empty, no validation is performed. If the list for a single page is empty, that page is not validated.

Example

# Will search for 'welcome' on the first page and for 'future' or 'past' on the second
keywords_per_page = [['welcome'], ['future', 'past']]
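
The validation rule can be sketched as follows. This is a simplified illustration, not the library's code: it uses a plain case-insensitive substring search instead of fuzzy matching, and the functions page_is_valid and validate_form as well as the sample page texts are hypothetical.

```python
def page_is_valid(keywords, page_strings) -> bool:
    """True if no keywords are given or at least one occurs in the page's strings."""
    if not keywords:
        return True
    haystack = ' '.join(page_strings).lower()
    return any(keyword.lower() in haystack for keyword in keywords)


def validate_form(keywords_per_page, strings_per_page):
    """Return one result per page; pages without a keyword list always pass."""
    results = []
    for index, page_strings in enumerate(strings_per_page):
        keywords = keywords_per_page[index] if index < len(keywords_per_page) else []
        results.append(page_is_valid(keywords, page_strings))
    return results


keywords_per_page = [['welcome'], ['future', 'past']]
pages = [['Welcome to our survey'], ['Your plans for the present']]

print(validate_form(keywords_per_page, pages))  # [True, False]
print(validate_form([], pages))                 # [True, True]
```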

Form analysis

The data returned from AWS Textract and the form description are the inputs for the final analysis step, which tries to locate all described form fields, reads their values from the filled forms and writes them to an Excel file.

To run the analysis, call form_analyzer.analyze as shown below. The example assumes that the AWS Textract JSON files and the PNG files are located in the folder "questionnaires" and that a Python module "my_form" containing the form description exists in the Python search path (usually a file "my_form.py" in the current folder). The name of the resulting Excel file is optional.

import form_analyzer

form_analyzer.analyze('questionnaires', 'my_form', 'my_form_results')

Results

After analyzing, an Excel file is created. The first column always contains a link to the image of the first page of the form. Each uncertain field (meaning that there was some uncertainty during the analysis and the result might be incorrect) is also linked to the image of the page where the field is located.

The results usually need to be checked manually. Depending on the complexity of the form, the quality of the scans or PDFs and similar factors, the Excel file might contain errors. The number of uncertain fields found is printed after the analysis and can be used as a coarse measure of the quality of the results.
