Vision utilities for web interaction agents

Tarsier

Tried using GPT-4(V) to automate web interactions? You've probably run into issues like these:

  • How do you map from an LLM's responses back to web elements?
  • How do you feed a "screenshot" to a text-only LLM?
  • How do you screen capture an entire page?

At Reworkd, we found ourselves reusing the same utils to solve these problems across multiple projects, so we're open-sourcing them as a small utility library for multimodal web agents: Tarsier!

Tarsier visually tags elements on a page, allowing GPT-4V to specify by tag which element to click. Tarsier also provides OCR utils to convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
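
To make the idea of a "whitespace-structured string" concrete, here is a minimal sketch that places OCR'd words on a character grid according to their pixel positions. It is an illustration only and not Tarsier's actual implementation: the `Word` type, the `layout_words` function, and the cell sizes (`char_width`, `line_height`) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels


def layout_words(words, char_width=10, line_height=20):
    """Place each word on a text grid at row y // line_height, column x // char_width."""
    rows = {}
    for w in words:
        rows.setdefault(w.y // line_height, []).append(w)

    lines = []
    for row in range(max(rows, default=-1) + 1):
        line = ""
        for w in sorted(rows.get(row, []), key=lambda w: w.x):
            col = w.x // char_width
            if line and len(line) >= col:
                # Target column is already occupied: keep a single separating space.
                line += " "
            else:
                # Pad with spaces up to the word's column.
                line = line.ljust(col)
            line += w.text
        lines.append(line)
    return "\n".join(lines)


# Made-up OCR results for illustration.
words = [
    Word("Hacker", x=0, y=0),
    Word("News", x=70, y=0),
    Word("login", x=300, y=0),
    Word("1.", x=0, y=40),
    Word("Show", x=30, y=40),
]
page_text = layout_words(words)
print(page_text)
```

Horizontal gaps become runs of spaces and vertical gaps become blank lines, so a text-only LLM can still infer rough page layout from the string.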

Installation

pip install tarsier

Usage

An agent using Tarsier might look like this:

import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    creds = {...}  # Google Cloud credentials
    ocr_service = GoogleVisionOCRService(creds)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        driver = tarsier.create_driver(page)
        page_text, tag_to_xpath = await tarsier.page_to_text(driver)

        print(page_text)  # Text representation of the page
        print(tag_to_xpath)  # Mapping of tags to XPaths

        await browser.close()

# Run the coroutine; without this, main() is defined but never executed.
asyncio.run(main())
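
Once the LLM replies, the tag it names has to be mapped back to an element. The sketch below shows one way to do that lookup. It assumes the model answers with a bracketed integer tag such as `[1]` and that `tag_to_xpath` is a dict keyed by integers; both are assumptions for illustration, and `resolve_tag` is a hypothetical helper, not part of Tarsier.

```python
import re


def resolve_tag(llm_reply, tag_to_xpath):
    """Return the XPath for the first [N] tag in the reply, or None if absent or unknown."""
    match = re.search(r"\[(\d+)\]", llm_reply)
    if match is None:
        return None
    return tag_to_xpath.get(int(match.group(1)))


# Hypothetical mapping in the shape of page_to_text's second return value.
tag_to_xpath = {0: "//html/body/a[1]", 1: "//html/body/a[2]"}

xpath = resolve_tag("CLICK [1]", tag_to_xpath)
print(xpath)
```

With Playwright, the resolved XPath could then be passed to a locator, for example `await page.locator(f"xpath={xpath}").click()`. Returning `None` for a missing or unknown tag lets the agent re-prompt instead of clicking the wrong element.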

Citations

@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}

Download files

Source Distribution

tarsier-0.2.0.tar.gz (10.3 kB view details)

Uploaded Source

Built Distribution

tarsier-0.2.0-py3-none-any.whl (12.5 kB view details)

Uploaded Python 3

File details

Details for the file tarsier-0.2.0.tar.gz.

File metadata

  • Download URL: tarsier-0.2.0.tar.gz
  • Upload date:
  • Size: 10.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Darwin/23.0.0

File hashes

Hashes for tarsier-0.2.0.tar.gz
Algorithm Hash digest
SHA256 f821b77c7de98c0334253069451b40787c0a37f83fa7bcad3b3d49d35d2eb5fd
MD5 a64d7894b67477ab0e157b3e33ff08d8
BLAKE2b-256 c04c9072136b78d0f22511f6594b1887863c0c21719de6068ac47e7eda6fae79

See more details on using hashes here.

File details

Details for the file tarsier-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: tarsier-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 12.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.5.0 CPython/3.11.3 Darwin/23.0.0

File hashes

Hashes for tarsier-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 1e5eab6240420af7a36e00d83cd36d39b14ac0c9f6d7237a06792e7821f28029
MD5 18dbedb6f8ae35cf0a9a315dc7ed6b40
BLAKE2b-256 af9c4fc8155c667e8f72a655d3812e87e2046bb067a4a5edfc2dfddd3be1377e
