Analyze PDF and web forms and fill in the forms

formalyzer

Motivation

I am happy to write a recommendation letter “by hand” for a student. But then each graduate school has its own lengthy, idiosyncratic form, foisting its data-entry job upon me. This is tedious work, especially with many schools and several students. Thus, I’ve wanted to automate the form-filling for quite a while.

Description

Formalyzer scrapes the text from the PDF recommendation letter and then, for each URL in url_list, it will:

  • launch a browser tab for that URL
  • fill in the form using what the LLM has gleaned from the recommendation letter
  • attach the PDF via the form’s upload/attachment button

…and do no more.

The user will need to review the page and press the Submit button manually.
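
The three steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not formalyzer's actual code: the names fill_form and run, the choice to select fields by id, and the default upload selector are all hypothetical, and the same values are used for every URL purely to keep the sketch short. The Playwright calls themselves (connect_over_cdp, new_page, goto, fill, set_input_files) are part of its public sync API.

```python
from typing import Mapping, Protocol


class PageLike(Protocol):
    """The two Playwright Page methods this sketch relies on."""

    def fill(self, selector: str, value: str) -> None: ...
    def set_input_files(self, selector: str, files: str) -> None: ...


def fill_form(page: PageLike, values: Mapping[str, str], pdf_path: str,
              upload_selector: str = 'input[type="file"]') -> None:
    """Fill each field by id, attach the PDF, and stop short of submitting.

    Assumption: `values` maps an element id to the text to enter.
    """
    for field_id, value in values.items():
        # Attribute selector avoids CSS-escaping problems with odd ids.
        page.fill(f'[id="{field_id}"]', value)
    page.set_input_files(upload_selector, pdf_path)
    # Deliberately no click on Submit: the user reviews the page first.


def run(urls: list[str], values: Mapping[str, str], pdf_path: str) -> None:
    """Open one tab per URL in an already-running Chrome and fill each form."""
    # Imported lazily so fill_form can be exercised without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # Attach to the Chrome started with --remote-debugging-port=9222.
        browser = p.chromium.connect_over_cdp("http://localhost:9222")
        context = browser.contexts[0]
        for url in urls:
            page = context.new_page()
            page.goto(url)
            fill_form(page, values, pdf_path)
```

Keeping fill_form separate from the browser setup means it can be checked against a stand-in page object before pointing it at a real form.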

Requirements

  • Either ollama installed locally or the ANTHROPIC_API_KEY environment variable set
  • beautifulsoup4, playwright, claudette, lisette, pypdf, fastcore

Technical Approach

You could try to feed raw HTML and PDF into an LLM, but that might be a waste of resources – prohibitively slow, expensive, and error-prone. Instead, formalyzer uses

  • standard packages to pre-process & reduce the inputs: bs4 for HTML, pypdf for PDF
  • the LLM only for reading the reduced input texts (+ a system prompt) and outputting values to assign to form fields.
  • another existing package (playwright) to fill in those fields.
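
To make the "reduce the inputs" idea concrete, here is a minimal sketch of boiling a page down to just its form controls before anything reaches the LLM. Note the swap: formalyzer uses bs4 for this, whereas the sketch uses the standard library's html.parser so that it runs with no extra installs. The class and function names, and the choice of which attributes to keep, are assumptions made for illustration.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect a compact description of each form control on a page."""

    CONTROLS = {"input", "select", "textarea"}

    def __init__(self) -> None:
        super().__init__()
        self.fields: list[dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.CONTROLS:
            return  # ignore layout, scripts, prose, and everything else
        a = dict(attrs)
        if a.get("type") == "hidden":
            return  # hidden inputs are not something a person fills in
        self.fields.append({
            "tag": tag,
            "id": a.get("id") or "",
            "name": a.get("name") or "",
            "type": a.get("type") or "",
        })


def reduce_form(html: str) -> list[dict[str, str]]:
    """Return only the fillable controls found in `html`."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields
```

A list like this is typically far smaller than the raw page, which is the point of pre-processing: the model sees only what it needs in order to propose values.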

Usage

On macOS, start the Chrome browser with remote debugging enabled on port 9222 by running this command in the terminal:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug
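
Before going further, it can help to confirm that something is actually listening on the debugging port. The following is a small optional check, not part of formalyzer; the function name is hypothetical, and it only tests that a TCP connection succeeds, not that the listener is Chrome.

```python
import socket


def debug_port_open(host: str = "127.0.0.1", port: int = 9222,
                    timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike.
        return False
```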

Then you can run this command:

formalyzer --debug <recc_info.txt> <recc_letter.pdf> <url_list.txt>

where recc_info.txt contains information about the recommender (name, title, address, phone number, and email) and url_list.txt is a file containing one URL per line.
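
As an illustration of the one-URL-per-line format, a reader for such a file might look like the sketch below. The function names are hypothetical, and skipping blank lines is an assumption of the sketch rather than documented formalyzer behavior.

```python
from pathlib import Path


def parse_url_list(text: str) -> list[str]:
    """Return the non-blank lines of `text`, stripped of surrounding whitespace."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # assumption: blank lines are ignored
            urls.append(line)
    return urls


def read_url_list(path: str) -> list[str]:
    """Read a file such as url_list.txt and return its URLs in order."""
    return parse_url_list(Path(path).read_text(encoding="utf-8"))
```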

Installation

Install latest from the GitHub repository:

$ pip install git+https://github.com/drscotthawley/formalyzer.git

or from PyPI:

$ pip install formalyzer

After installing, run playwright install chromium to download the browser binaries.

Demo

On macOS, run these commands in Terminal:

  1. /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug &
  2. cd example
  3. python -m http.server 8000 &
  4. export ANTHROPIC_API_KEY="__your_API_key_goes_here__"
  5. formalyzer --debug recc_info.txt sample_letter.pdf sample_urls.txt

Local LLM Execution

For FERPA compliance, running a local model is preferable so that student data is not sent to a third-party service. I recommend using ollama and starting with something medium-small like qwen2.5:14b (9 GB). Start up ollama:

ollama serve & 
ollama pull qwen2.5:14b 

Then you can use the --model CLI flag, e.g. 

formalyzer --debug --model 'ollama/qwen2.5:14b' recc_info.txt sample_letter.pdf sample_urls.txt

The quality of the form-filling depends on the quality and size of the model you choose. Smaller models like mistral (4 GB) may hallucinate many of the form field IDs, resulting in a mostly blank form in the end. For a huge (41 GB) model, try ollama/qwen2:72b.
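
One way to see how badly a given model hallucinates field IDs is to compare what it proposes against the IDs that really exist on the page. The sketch below assumes, purely for illustration, that the model's reply is a JSON object mapping field id to value; that format and the function name are hypothetical, not formalyzer's documented interface.

```python
import json


def keep_known_fields(llm_reply: str,
                      known_ids: set[str]) -> tuple[dict[str, str], list[str]]:
    """Split proposed assignments into accepted ones and unknown field ids.

    Assumption: `llm_reply` is a JSON object of {field_id: value}.
    """
    proposed = json.loads(llm_reply)
    accepted = {k: str(v) for k, v in proposed.items() if k in known_ids}
    rejected = sorted(k for k in proposed if k not in known_ids)
    return accepted, rejected
```

A long rejected list alongside a short accepted one is a sign that the model is too small for the form at hand.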

Developer Guide

Install formalyzer in Development mode

# make sure formalyzer package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to formalyzer
$ nbdev_prepare

Documentation

Documentation is hosted on this GitHub repository’s Pages site. Package-manager-specific guidelines are also available on conda and PyPI.

Limitations

Sometimes the LLM will miss certain fields – that’s just the nature of the game – so you’ll still need to fill those in by hand. But it gets most of them!
