Analyze PDF and web forms and fill in the forms

formalyzer

Motivation

I am happy to write a recommendation letter “by hand” for a student. But then each graduate school has their own lengthy, idiosyncratic form, foisting upon me their job of data entry. This is tedious work, especially with many schools and several students. Thus, I’ve wanted to automate the form-filling for quite a while.

Description

Formalyzer will scrape the text from the PDF recc letter, and for each URL in url_list, it will:

  • launch a browser tab for that url
  • fill in the form using what the LLM has gleaned from the recc letter
  • attach the PDF via the form’s upload/attachment button

…and do no more.

The user will need to review the page and press the Submit button manually.
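The per-URL steps above could be sketched with Playwright's sync API roughly as follows. This is a minimal sketch under stated assumptions, not formalyzer's actual code: the names `fill_form` and `process_urls`, the `[id="..."]` selectors, the 2-second timeout, and the guess that the upload control is an `input[type="file"]` element are all hypothetical. It also assumes Chrome is already listening on port 9222, as described under Usage.

```python
def fill_form(page, values, pdf_path, upload_selector='input[type="file"]'):
    """Fill fields by id, attach the PDF, and stop short of submitting.

    `page` is anything that behaves like a Playwright Page
    (it needs `fill` and `set_input_files`).
    `values` maps form-field ids to the strings proposed by the LLM.
    Returns the ids that could not be filled, for manual follow-up.
    """
    missed = []
    for field_id, value in values.items():
        try:
            # Short timeout so a hallucinated id fails fast instead of hanging.
            page.fill(f'[id="{field_id}"]', value, timeout=2000)
        except Exception:
            missed.append(field_id)
    page.set_input_files(upload_selector, pdf_path)
    return missed


def process_urls(urls, values_for_url, pdf_path, cdp_url="http://localhost:9222"):
    """Open one tab per URL in the already-running Chrome and fill each form.

    `values_for_url` is a hypothetical callable returning the LLM's
    field values for a given URL.
    """
    # Imported here so the helper above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(cdp_url)
        context = browser.contexts[0]  # the debug browser's default context
        for url in urls:
            page = context.new_page()
            page.goto(url)
            fill_form(page, values_for_url(url), pdf_path)
            # Deliberately no click on Submit: the user reviews the page first.
```

Keeping `fill_form` independent of how the page was opened makes it easy to exercise with a stand-in page object before pointing it at a real form.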

Requirements

  • Either ollama installed locally or ANTHROPIC_API_KEY environment variable set
  • beautifulsoup4, playwright, claudette, lisette, pypdf, fastcore
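To check the first requirement before a run, something like the following could work. It is only a sketch: the function name `available_backends` and the idea of testing for an `ollama` executable on the PATH are assumptions for illustration, not a description of how formalyzer selects its LLM.

```python
import os
import shutil


def available_backends(env=None, which=shutil.which):
    """Return which LLM backends look usable on this machine.

    `env` and `which` are injectable so the check can be tried
    without touching the real environment.
    """
    env = os.environ if env is None else env
    backends = []
    if env.get("ANTHROPIC_API_KEY"):
        backends.append("anthropic")
    if which("ollama") is not None:
        backends.append("ollama")
    return backends


if __name__ == "__main__":
    found = available_backends()
    print("Usable backends:", ", ".join(found) if found else "none")
```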

Technical Approach

You could try to feed raw HTML and PDF into an LLM, but that might be a waste of resources – prohibitively slow, expensive, and error-prone. Instead, formalyzer uses

  • standard packages to pre-process & reduce the inputs: bs4 for HTML, pypdf for PDF
  • the LLM only for reading the reduced input texts (+ a system prompt) and outputting values to assign to form fields.
  • another existing package (playwright) to fill in those fields.
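To illustrate the "reduce the inputs" step, here is a sketch that boils a page down to a short list of its fillable fields. Note the substitution: formalyzer uses bs4 for this, whereas the sketch uses the standard library's `html.parser` so that it runs without extra dependencies. The class and function names, and the choice to skip hidden inputs, are assumptions.

```python
from html.parser import HTMLParser


class FormFieldExtractor(HTMLParser):
    """Collect a compact description of each fillable form field."""

    FIELD_TAGS = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.FIELD_TAGS:
            return
        a = dict(attrs)
        # Hidden inputs (e.g. CSRF tokens) are not something the LLM should fill.
        if (a.get("type") or "").lower() == "hidden":
            return
        self.fields.append(
            {"tag": tag, "id": a.get("id"), "name": a.get("name"), "type": a.get("type")}
        )


def extract_fields(html):
    """Return the list of field descriptions found in an HTML string."""
    parser = FormFieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields
```

A list like this is typically far smaller than the raw page, which is the point of pre-processing before the LLM sees anything.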

Usage

On macOS, start Chrome with remote debugging enabled on port 9222 by running this command in the terminal:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug
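Before going further, it can help to confirm that something is actually listening on the debugging port. The following is a small sketch using only the standard library; the function name `debug_port_open` is hypothetical, and a successful connection only shows that the port is open, not that the listener is Chrome.

```python
import socket


def debug_port_open(host="127.0.0.1", port=9222, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if debug_port_open():
        print("Port 9222 is accepting connections.")
    else:
        print("Nothing is listening on port 9222; start Chrome first.")
```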

Then you can run this command:

formalyzer --debug <recc_info.txt> <recc_letter.pdf> <url_list.txt>

where recc_info.txt contains information about the recommender (name, title, address, phone number, and email), recc_letter.pdf is the recommendation letter, and url_list.txt is a file containing one URL per line.
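For reference, reading a one-URL-per-line file can be as simple as the sketch below. The function names are hypothetical, and skipping blank lines and trimming whitespace are assumptions about what a tolerant reader would do rather than documented formalyzer behavior.

```python
from pathlib import Path


def parse_url_list(text):
    """Return the non-empty lines of `text`, stripped of surrounding whitespace."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            urls.append(line)
    return urls


def read_url_list(path):
    """Read a URL list file such as url_list.txt."""
    return parse_url_list(Path(path).read_text(encoding="utf-8"))
```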

Installation

Install latest from the GitHub repository:

$ pip install git+https://github.com/drscotthawley/formalyzer.git

or from PyPI:

$ pip install formalyzer

After installing, run playwright install chromium to download the browser binaries.

Demo

On macOS, run these commands in Terminal:

  1. /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug &
  2. cd example
  3. python -m http.server 8000 &
  4. export ANTHROPIC_API_KEY="__your_API_key_goes_here__"
  5. formalyzer --debug recc_info.txt sample_letter.pdf sample_urls.txt

Local LLM Execution

For FERPA compliance, running a local model is preferable so that student data is not broadcast elsewhere. I recommend using ollama and starting with something medium-small like qwen2.5:14b (9 GB). Start up ollama:

ollama serve & 
ollama pull qwen2.5:14b 

Then you can use the --model CLI flag, e.g. 

formalyzer --debug --model 'ollama/qwen2.5:14b' recc_info.txt sample_letter.pdf sample_urls.txt

The quality of the form-filling will vary depending on the quality and size of the model you get. Smaller models like mistral (4 GB) may hallucinate many of the form field IDs, resulting in a mostly-blank form in the end. For a huge (41 GB) model, try ollama/qwen2:72b.
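One way to guard against hallucinated field IDs is to compare what the model proposes with the IDs that really exist on the page and drop the rest. The sketch below assumes, purely for illustration, that the model's reply is a JSON object mapping field IDs to values; the function name `keep_known_fields` is hypothetical and this is not necessarily how formalyzer handles it.

```python
import json


def keep_known_fields(llm_json, known_ids):
    """Split the LLM's proposed values into usable and hallucinated entries.

    Returns (kept, dropped): `kept` maps real field ids to values,
    `dropped` is a sorted list of ids that are not on the page.
    """
    proposed = json.loads(llm_json)
    known = set(known_ids)
    kept = {k: v for k, v in proposed.items() if k in known}
    dropped = sorted(k for k in proposed if k not in known)
    return kept, dropped
```

A long `dropped` list relative to `kept` is a quick signal that the chosen model is too small for the job.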

Developer Guide

Install formalyzer in Development mode

# make sure formalyzer package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to formalyzer
$ nbdev_prepare

Documentation

Documentation is hosted on this GitHub repository’s pages. Package-manager-specific guidelines are available on conda and PyPI.

Limitations

Sometimes the LLM will miss certain fields – that’s just the nature of the game – so you’ll still need to fill those in by hand. But it gets most of them!
