Vision utilities for web interaction agents
Tarsier
If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:
- How do you map LLM responses back into web elements?
- How can you mark up a page for an LLM to better understand its action space?
- How do you feed a "screenshot" to a text-only LLM?
At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects. Because of this, we're now open-sourcing this simple utility library for multimodal web agents... Tarsier! The video below demonstrates Tarsier usage by feeding a page snapshot into a LangChain agent and letting it take actions.
https://github.com/reworkd/tarsier/assets/50181239/af12beda-89b5-4add-b888-d780b353304b
How does it work?
Tarsier works by visually "tagging" interactable elements on a page via brackets + an id such as [1].
In doing this, we provide a mapping between elements and ids for GPT-4(V) to take actions upon.
We define interactable elements as buttons, links, or input fields that are visible on the page.
Tarsier can also provide a textual representation of the page, which enables deeper interaction even for LLMs that are not multimodal. This matters given the performance issues with existing vision-language models. Tarsier also provides OCR utils to convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
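To make the idea of a "whitespace-structured string" concrete, here is a minimal sketch of how OCR word boxes could be laid out on a character grid. This is not Tarsier's actual algorithm: the `OcrWord` type, the `layout_words` function, and the `char_width` / `line_height` values are all hypothetical names and assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class OcrWord:
    """Hypothetical OCR result: a word and the pixel position of its top-left corner."""
    text: str
    x: float
    y: float


def layout_words(words: List[OcrWord], char_width: float = 10.0, line_height: float = 20.0) -> str:
    """Place each word on a character grid derived from its pixel position.

    char_width and line_height are assumed average glyph dimensions in pixels.
    """
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for word in words:
        row = int(word.y // line_height)
        col = int(word.x // char_width)
        rows.setdefault(row, []).append((col, word.text))

    lines = []
    for row in range(max(rows, default=-1) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            # If the target column is already occupied, fall back to a single space.
            if line and col <= len(line):
                col = len(line) + 1
            line = line.ljust(col) + text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        OcrWord("Hacker", 0, 0),
        OcrWord("News", 70, 0),
        OcrWord("login", 300, 0),
        OcrWord("[@1]", 0, 40),
    ]
    print(layout_words(sample))
```

Words that sit far apart on the screenshot end up far apart in the string, and empty rows are preserved as blank lines, so a text-only LLM can still infer the rough layout of the page.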
Installation
pip install tarsier
Usage
Visit our cookbook for agent examples using Tarsier.
Otherwise, basic Tarsier usage might look like the following:
import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService


async def main():
    # Fill in your Google Cloud service account credentials here
    google_cloud_credentials = {}

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
Keep in mind that Tarsier tags different types of elements differently to help your LLM identify what actions are performable on each element. Specifically:
- [#ID]: text-insertable fields (e.g. textarea, input with textual type)
- [@ID]: hyperlinks (<a> tags)
- [$ID]: other interactable elements (e.g. button, select)
- [ID]: plain text (if you pass tag_text_elements=True)
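One way to map an LLM response back to a web element is to parse the tag out of the response and look it up in the tag-to-XPath mapping. The sketch below is illustrative only: `resolve_tag`, `ResolvedTag`, and the kind labels are hypothetical names, and it assumes the mapping returned by `page_to_text` is keyed by the integer id inside the brackets.

```python
import re
from typing import Dict, NamedTuple, Optional

# Prefix symbols from the tag formats above; "" means a plain-text tag.
# The kind labels on the right are hypothetical, chosen for this sketch.
TAG_KINDS = {"#": "text_input", "@": "link", "$": "interactable", "": "text"}
TAG_PATTERN = re.compile(r"\[([#@$]?)(\d+)\]")


class ResolvedTag(NamedTuple):
    kind: str
    tag_id: int
    xpath: str


def resolve_tag(llm_response: str, tag_to_xpath: Dict[int, str]) -> Optional[ResolvedTag]:
    """Return the first tag in the response that exists in the mapping, or None."""
    for symbol, digits in TAG_PATTERN.findall(llm_response):
        tag_id = int(digits)  # Assumption: the mapping is keyed by integer id
        if tag_id in tag_to_xpath:
            return ResolvedTag(TAG_KINDS[symbol], tag_id, tag_to_xpath[tag_id])
    return None


if __name__ == "__main__":
    mapping = {1: "//a[1]", 5: "//input[1]"}  # Made-up XPaths for illustration
    print(resolve_tag("I will click [@1] to open the story.", mapping))
```

With a resolved tag in hand, a Playwright locator such as `page.locator(f"xpath={resolved.xpath}")` could then be clicked or filled depending on `resolved.kind`. Returning `None` for unknown ids lets the agent re-prompt instead of acting on a tag the model hallucinated.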
Local Development
Setup
We have provided a handy setup script to get you up and running with Tarsier development.
./script/setup.sh
If you modify any TypeScript files used by Tarsier, you'll need to execute the following command. This compiles the TypeScript into JavaScript, which can then be utilized in the Python package.
npm run build
Testing
We use pytest for testing. To run the tests, simply run:
poetry run pytest .
Linting
Prior to submitting a potential PR, please run the following to format your code:
./script/format.sh
Supported OCR Services
- Google Cloud Vision
- Amazon Textract (Coming Soon)
- Microsoft Azure Computer Vision (Coming Soon)
Roadmap
- Add documentation and examples
- Clean up interfaces and add unit tests
- Launch
- Improve OCR text performance
- Add options to customize tagging styling
- Add support for other browser drivers as necessary
- Add support for other OCR services as necessary
Citations
@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}