Vision utilities for web interaction agents
Tarsier
Tried using GPT-4(V) to automate web interactions? You've probably run into issues like these:
- How do you map from an LLM's responses back to web elements?
- How do you feed a "screenshot" to a text-only LLM?
- How do you screen capture an entire page?
At Reworkd, we found ourselves reusing the same utils to solve these problems across multiple projects, so we're now open-sourcing a simple little utils library for multimodal web agents... Tarsier!
Tarsier visually tags elements on a page, allowing GPT-4V to specify by tag which element to click. Tarsier also provides OCR utils to convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
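To make the idea of a "whitespace-structured string" concrete, here is a minimal sketch of how OCR words with pixel positions could be laid out on a character grid. It is an illustration only, not Tarsier's actual implementation: the `words_to_text` function, the `(text, x, y)` input format, and the `char_width` and `line_height` values are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical input format: (text, x, y), where x and y are the pixel
# coordinates of the word's top-left corner. Real OCR output is richer.
Word = Tuple[str, int, int]


def words_to_text(words: List[Word], char_width: int = 10, line_height: int = 20) -> str:
    """Place each word on a character grid that mirrors its position on the page."""
    rows = {}
    for text, x, y in words:
        row = y // line_height  # which text line the word falls on
        col = x // char_width   # which character column it starts at
        rows.setdefault(row, []).append((col, text))

    lines = []
    for row in range(max(rows) + 1 if rows else 0):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line:
                # Pad up to the target column, keeping at least one space
                # between neighbouring words.
                line = line.ljust(max(col, len(line) + 1))
            else:
                line = " " * col
            line += text
        lines.append(line)
    return "\n".join(lines)


sample = [("Hacker", 0, 0), ("News", 70, 0), ("login", 300, 0), ("1.", 0, 40), ("Show", 30, 40)]
print(words_to_text(sample))
```

Words that sit far apart on the page end up far apart in the string, and empty rows survive as blank lines, so a text-only LLM can still infer rough layout.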
Installation
pip install tarsier
Usage
An agent using Tarsier might look like this:
import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService


async def main():
    creds = {...}  # Google Cloud credentials
    ocr_service = GoogleVisionOCRService(creds)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        driver = tarsier.create_driver(page)
        page_text, tag_to_xpath = await tarsier.page_to_text(driver)

        print(page_text)  # Text representation of the page
        print(tag_to_xpath)  # Mapping of tags to XPaths


if __name__ == "__main__":
    asyncio.run(main())
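Once the LLM has answered, its reply has to be mapped back to a page element. The sketch below shows one way to do that, assuming the model names the element with a bracketed numeric tag such as `[12]` and that `tag_to_xpath` is keyed by integer tag IDs. Both are assumptions for illustration, and `xpath_for_response` is a hypothetical helper, not part of Tarsier.

```python
import re
from typing import Dict, Optional


def xpath_for_response(response: str, tag_to_xpath: Dict[int, str]) -> Optional[str]:
    """Return the XPath for the first bracketed tag ID in an LLM reply, if known."""
    match = re.search(r"\[(\d+)\]", response)
    if match is None:
        return None  # the reply did not mention a tag
    # Unknown tag IDs also yield None instead of raising.
    return tag_to_xpath.get(int(match.group(1)))


# Hypothetical mapping and reply, for illustration only.
example_mapping = {1: "//html/body/a[1]", 2: "//html/body/a[2]"}
print(xpath_for_response("CLICK [2]", example_mapping))
```

Inside the agent loop, the returned XPath can then be handed to Playwright, for example with `await page.locator(f"xpath={xpath}").click()`. Returning `None` for a missing or unknown tag lets the caller re-prompt the model rather than crash.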
Citations
@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}
Download files
File details
Details for the file tarsier-0.2.0.tar.gz.
File metadata
- Download URL: tarsier-0.2.0.tar.gz
- Upload date:
- Size: 10.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.0 CPython/3.11.3 Darwin/23.0.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | f821b77c7de98c0334253069451b40787c0a37f83fa7bcad3b3d49d35d2eb5fd
MD5 | a64d7894b67477ab0e157b3e33ff08d8
BLAKE2b-256 | c04c9072136b78d0f22511f6594b1887863c0c21719de6068ac47e7eda6fae79
File details
Details for the file tarsier-0.2.0-py3-none-any.whl.
File metadata
- Download URL: tarsier-0.2.0-py3-none-any.whl
- Upload date:
- Size: 12.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.0 CPython/3.11.3 Darwin/23.0.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1e5eab6240420af7a36e00d83cd36d39b14ac0c9f6d7237a06792e7821f28029
MD5 | 18dbedb6f8ae35cf0a9a315dc7ed6b40
BLAKE2b-256 | af9c4fc8155c667e8f72a655d3812e87e2046bb067a4a5edfc2dfddd3be1377e