Analyze PDF and web forms and fill in the forms
formalyzer
Motivation
I am happy to write a recommendation letter “by hand” for a student. But then each graduate school has their own lengthy, idiosyncratic form, foisting upon me their job of data entry. This is tedious work, especially with many schools and several students. Thus, I’ve wanted to automate the form-filling for quite a while.
Description
Formalyzer will scrape the text from the PDF recc letter, and for each URL in url_list, it will:
- launch a browser tab for that url
- fill in the form using what the LLM has gleaned from the recc letter
- attach the PDF via the form’s upload/attachment button
…and do no more.
The user will need to review the page and press the Submit button manually.
Requirements
- Either ollama installed locally or the ANTHROPIC_API_KEY environment variable set
- beautifulsoup4, playwright, claudette, lisette, pypdf, fastcore
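A quick preflight check for the "either/or" requirement might look like the sketch below. It is not part of formalyzer; the function name `available_backends` and its parameters are hypothetical, and it only checks that an `ollama` executable is on the PATH or that `ANTHROPIC_API_KEY` is non-empty.

```python
import os
import shutil

def available_backends(env=None, which=shutil.which):
    """Return the LLM backends that look usable on this machine.

    `env` and `which` are injectable for testing; this helper is a
    hypothetical illustration, not formalyzer's own code.
    """
    env = os.environ if env is None else env
    found = []
    if which("ollama"):                  # an ollama executable is on the PATH
        found.append("ollama")
    if env.get("ANTHROPIC_API_KEY"):     # the API key is set and non-empty
        found.append("anthropic")
    return found

backends = available_backends()
print(backends or "No backend found: install ollama or set ANTHROPIC_API_KEY")
```

If the list comes back empty, neither requirement is met and the form-filling step would have no model to call.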
Technical Approach
You could try to feed raw HTML and PDF into an LLM, but that might be
a waste of resources – prohibitively slow, expensive, and error-prone.
Instead, formalyzer uses
- standard packages to pre-process & reduce the inputs: bs4 for HTML, pypdf for PDF
- the LLM only for reading the reduced input texts (+ a system prompt) and outputting values to assign to form fields.
- another existing package (playwright) to fill in those fields.
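To illustrate what "reducing" the HTML means, here is a minimal sketch that strips a page down to its fillable fields before anything reaches the LLM. It uses the standard library's `html.parser` as a stand-in for bs4 so that it runs without extra installs; the class and function names, the set of skipped input types, and the sample form are all assumptions, not formalyzer's actual code.

```python
from html.parser import HTMLParser

class FormFieldCollector(HTMLParser):
    """Keep only the tags a form-filler cares about: input, select, textarea."""
    FIELD_TAGS = {"input", "select", "textarea"}
    SKIP_TYPES = {"hidden", "submit", "button"}   # nothing for the LLM to fill in

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.FIELD_TAGS:
            return
        attrs = dict(attrs)
        if attrs.get("type") in self.SKIP_TYPES:
            return
        self.fields.append({
            "tag": tag,
            "id": attrs.get("id"),
            "name": attrs.get("name"),
            "type": attrs.get("type"),
        })

def reduce_form_html(html):
    """Return a compact list of field descriptions extracted from raw HTML."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields

# Hypothetical sample page: most of the markup is irrelevant to form-filling.
sample = """
<html><body><div class="nav">lots of markup...</div>
<form>
  <input type="hidden" name="csrf" value="abc">
  <input type="text" id="rec_name" name="rec_name">
  <textarea id="comments" name="comments"></textarea>
  <input type="file" id="letter" name="letter">
  <input type="submit" value="Submit">
</form></body></html>
"""
fields = reduce_form_html(sample)
for field in fields:
    print(field)
```

The reduced list is a small fraction of the original page, which is the point of the approach: the LLM sees only field IDs, names, and types rather than the full markup.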
Usage
On macOS, start the Chrome browser listening on port 9222 by executing this command in the terminal:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug
Then you can run this command:
formalyzer --debug <recc_info.txt> <recc_letter.pdf> <url_list.txt>
where recc_info.txt contains information about the recommender (name,
title, address, phone number, and email) and
url_list.txt is a file containing one URL per line.
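As a rough sketch of the "one URL per line" format, the helper below parses such a file into a list. The function names are hypothetical, and skipping blank lines and lines starting with `#` is an assumption made here for convenience, not documented formalyzer behavior.

```python
from pathlib import Path

def parse_url_list(text):
    """Split text into URLs, one per line.

    Assumption (not confirmed by formalyzer): blank lines and lines
    starting with '#' are ignored.
    """
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls

def read_url_list(path):
    """Read a url_list.txt-style file from disk and parse it."""
    return parse_url_list(Path(path).read_text(encoding="utf-8"))

example = """
# graduate school recommendation forms
http://localhost:8000/form_a.html

http://localhost:8000/form_b.html
"""
print(parse_url_list(example))
```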
Installation
Install latest from the GitHub repository:
$ pip install git+https://github.com/drscotthawley/formalyzer.git
or from pypi:
$ pip install formalyzer
After installing, users need to run playwright install chromium to
download the browser binaries.
Demo
On macOS, run these commands in Terminal:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug &
cd example
python -m http.server 8000 &
export ANTHROPIC_API_KEY="__your_API_key_goes_here__"
formalyzer --debug recc_info.txt sample_letter.pdf sample_urls.txt
Local LLM Execution
For FERPA compliance, running a
local model is preferable so that student data is not broadcast
elsewhere. I recommend using ollama and starting
with something medium-small like qwen2.5:14b (9 GB). Start up ollama:
ollama serve &
ollama pull qwen2.5:14b
Then you can use the --model CLI flag, e.g.
formalyzer --debug --model 'ollama/qwen2.5:14b' recc_info.txt sample_letter.pdf sample_urls.txt
The quality of the form-filling will vary depending on the quality and
size of the model you use. Smaller models like mistral (4 GB) may
hallucinate many of the form field IDs, resulting in a mostly-blank form
in the end. For a huge (41 GB) model, try ollama/qwen2:72b.
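One way to see how badly a model is hallucinating field IDs is to compare its output against the IDs that actually exist on the page. The sketch below assumes the LLM returns a mapping from field ID to value; the function name and the sample data are hypothetical, and this is not necessarily how formalyzer handles the problem.

```python
def split_known_fields(llm_values, form_field_ids):
    """Separate LLM output into real form fields and hallucinated ones.

    llm_values:     dict mapping field ID -> value proposed by the LLM
    form_field_ids: iterable of IDs actually present in the form
    """
    valid_ids = set(form_field_ids)
    known = {k: v for k, v in llm_values.items() if k in valid_ids}
    unknown = sorted(set(llm_values) - valid_ids)
    return known, unknown

# Hypothetical example: the form has three fields, the model invents one.
form_ids = ["rec_name", "rec_email", "comments"]
llm_output = {
    "rec_name": "Dr. Jane Doe",
    "rec_email": "jane.doe@example.edu",
    "recommender_phone": "555-0100",   # not an ID on this form
}
known, unknown = split_known_fields(llm_output, form_ids)
print(f"fillable: {len(known)}, hallucinated: {unknown}")
```

A high share of unknown IDs is a sign that the chosen model is too small for the task and that a larger one is worth trying.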
Developer Guide
Install formalyzer in Development mode
# make sure formalyzer package is installed in development mode
$ pip install -e .
# make changes under nbs/ directory
# ...
# compile to have changes apply to formalyzer
$ nbdev_prepare
Documentation
Documentation can be found hosted on this GitHub repository’s pages. Additionally you can find package manager specific guidelines on conda and pypi respectively.
Limitations
Sometimes the LLM will miss certain fields – that’s just the nature of the game – so you’ll still need to fill those in by hand. But it gets most of them!