Analyze PDF and web forms and fill in the forms

formalyzer

Description:

Formalyzer scrapes the text from the PDF recommendation ("recc") letter. Then, for each URL in url_list, it will:

  • launch a browser tab for that url
  • fill in the form using what the LLM has gleaned from the recc letter
  • attach the PDF via the form’s upload/attachment button

…and do no more.

The user will need to review the page and press the Submit button manually.
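The fill-and-attach step described above can be sketched as follows. This is a minimal illustration, not formalyzer's actual code: the function name `fill_form`, the `RecordingPage` stand-in, and the field IDs are all hypothetical. The only assumption about the real browser object is that it behaves like a Playwright sync `Page`, whose `fill(selector, value)` and `set_input_files(selector, files)` methods are the only calls used.

```python
from typing import Mapping, Optional


def fill_form(page, field_values: Mapping[str, str],
              upload_selector: Optional[str] = None,
              pdf_path: Optional[str] = None) -> list:
    """Fill text fields and optionally attach a PDF; never submit.

    `page` is assumed to act like a Playwright sync Page. Only
    `fill` and `set_input_files` are called on it.
    """
    filled = []
    for field_id, value in field_values.items():
        # Attribute selector instead of "#id", so IDs containing
        # dots or colons need no CSS escaping.
        selector = f'[id="{field_id}"]'
        page.fill(selector, value)
        filled.append(field_id)
    if upload_selector and pdf_path:
        page.set_input_files(upload_selector, pdf_path)
    # Deliberately no click on a submit button: the user reviews
    # the page and submits manually.
    return filled


class RecordingPage:
    """Hypothetical stand-in for a browser page; records calls for a dry run."""

    def __init__(self):
        self.calls = []

    def fill(self, selector, value):
        self.calls.append(("fill", selector, value))

    def set_input_files(self, selector, files):
        self.calls.append(("set_input_files", selector, files))


if __name__ == "__main__":
    page = RecordingPage()
    fill_form(page, {"recommender_name": "A. Person"},
              upload_selector='input[type="file"]', pdf_path="letter.pdf")
    for call in page.calls:
        print(call)
```

Passing a recording object instead of a live page makes it possible to inspect what would be typed into a form before pointing the tool at a real application site.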

Requirements:

  • Either ollama installed locally or ANTHROPIC_API_KEY environment variable set
  • beautifulsoup4, playwright, claudette, lisette, pypdf, fastcore

Usage

On macOS, start the Chrome browser with remote debugging enabled on port 9222 by running this command in the terminal:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug

Then you can run this command:

formalyzer --debug <recc_info.txt> <recc_letter.pdf> <url_list.txt>

where recc_info.txt contains information about the recommender (name, title, address, phone number, and email), and url_list.txt is a file containing one URL per line.
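As an illustration of the one-URL-per-line format, a file like url_list.txt could be read as sketched below. The function names are hypothetical, and skipping blank lines and trimming whitespace are assumptions made here, not documented formalyzer behavior.

```python
from pathlib import Path


def parse_url_list(text: str) -> list:
    """Return the non-empty lines of `text`, stripped of surrounding whitespace."""
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # assumption: blank lines are ignored
            urls.append(line)
    return urls


def read_url_list(path) -> list:
    """Read a one-URL-per-line file from disk."""
    return parse_url_list(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    sample = "http://localhost:8000/form_a.html\n\nhttp://localhost:8000/form_b.html\n"
    print(parse_url_list(sample))
```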

Installation

Install latest from the GitHub repository:

$ pip install git+https://github.com/drscotthawley/formalyzer.git

or from conda

$ conda install -c drscotthawley formalyzer

or from pypi

$ pip install formalyzer

After installing, run playwright install chromium to download the browser binaries.

Demo

Using the example/ data. On macOS, from the main formalyzer package directory:

  1. Start up Chrome: /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug
  2. Launch a local web server: python -m http.server 8000 --directory example/
  3. Set your ANTHROPIC_API_KEY shell environment variable.
  4. Run the script: formalyzer --debug example/recc_info.txt example/sample_letter.pdf example/sample_urls.txt
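Before step 4, it can help to confirm that the earlier steps took effect. The sketch below is a hypothetical preflight helper and is not part of formalyzer. It checks whether ANTHROPIC_API_KEY is set and whether Chrome answers on the debugging port, using the `/json/version` HTTP endpoint that Chrome exposes when started with `--remote-debugging-port`. The function names and the default host are assumptions.

```python
import http.client
import os
import urllib.request


def cdp_version_url(port: int = 9222, host: str = "127.0.0.1") -> str:
    """URL of Chrome's DevTools version endpoint for the given debug port."""
    return f"http://{host}:{port}/json/version"


def chrome_debug_port_open(port: int = 9222, timeout: float = 2.0) -> bool:
    """True if something answers HTTP 200 on the DevTools version endpoint."""
    try:
        with urllib.request.urlopen(cdp_version_url(port), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError is a subclass of OSError; connection refused lands here.
        return False


def api_key_missing(env=None) -> bool:
    """True if ANTHROPIC_API_KEY is unset or empty in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return not env.get("ANTHROPIC_API_KEY")


if __name__ == "__main__":
    if api_key_missing():
        print("ANTHROPIC_API_KEY is not set (fine if you use a local ollama model).")
    if not chrome_debug_port_open():
        print("Chrome is not reachable on port 9222; start it as in step 1.")
```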

Local LLM Execution

For FERPA compliance, running a local model is preferable so that student data is not sent to a third-party service. I recommend using ollama and starting with something medium-small like qwen2.5:14b (9 GB). Start up ollama:

ollama serve & 
ollama pull qwen2.5:14b 

Then you can use the --model CLI flag, e.g. 

formalyzer --debug --model 'ollama/qwen2.5:14b' example/recc_info.txt example/sample_letter.pdf example/sample_urls.txt

The quality of the form-filling will vary with the quality and size of the model you use. Smaller models like mistral (4 GB) may hallucinate many of the form field IDs, resulting in a mostly blank form in the end. For a huge (41 GB) model, try ollama/qwen2:72b.
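One way to see how much a model is hallucinating is to compare the field IDs it proposes with the IDs actually present in the form's HTML. The sketch below is an illustration under stated assumptions, not formalyzer's own method: the function names are hypothetical, and it uses the standard library's `html.parser` (rather than beautifulsoup4) purely to stay self-contained.

```python
from html.parser import HTMLParser


class _FieldIdCollector(HTMLParser):
    """Collect the id attribute of every input, select, and textarea tag."""

    FIELD_TAGS = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELD_TAGS:
            field_id = dict(attrs).get("id")
            if field_id:
                self.ids.add(field_id)


def form_field_ids(html: str) -> set:
    """Return the set of form-field IDs found in `html`."""
    collector = _FieldIdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


def split_proposed_fields(proposed: dict, html: str):
    """Split model-proposed {field_id: value} pairs into real and hallucinated IDs."""
    known_ids = form_field_ids(html)
    known = {k: v for k, v in proposed.items() if k in known_ids}
    unknown = sorted(k for k in proposed if k not in known_ids)
    return known, unknown


if __name__ == "__main__":
    html = '<form><input id="name"><select id="rating"></select></form>'
    known, unknown = split_proposed_fields(
        {"name": "A. Person", "applicant_rank": "Top 5%"}, html)
    print("usable:", known)
    print("hallucinated:", unknown)
```

A high share of unknown IDs for a given model is a hint to try a larger one before trusting the filled form.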

Developer Guide

Install formalyzer in Development mode

# make sure formalyzer package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to formalyzer
$ nbdev_prepare

Documentation

Documentation is hosted on this GitHub repository’s pages. Package-manager-specific guidelines are also available on conda and pypi, respectively.

TODO:

  • Test with a less-than-superlative recc letter – to make sure it’s not just always selecting the top rating(s).
  • Enable switching from Anthropic API to local LLM and/or CoPilot API (if possible)
