Vision utilities for web interaction agents

Tarsier

If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:

  • How do you map LLM responses back into web elements?
  • How can you mark up a page so that an LLM better understands its action space?
  • How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utilities to solve these problems across multiple projects, so we're open-sourcing them as a simple utility library for multimodal web agents: Tarsier!

How does it work?

Tarsier works by visually "tagging" interactable elements on a page with a bracketed id such as [1]. This gives GPT-4(V) a mapping between elements and ids that it can use to take actions. We define interactable elements as buttons, links, or input fields that are visible on the page.
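
As a rough sketch of how such a mapping can be used, the snippet below pulls a bracketed id out of a model reply and looks up the matching XPath. The reply format ("CLICK [2]"), the helper name `resolve_tag`, the sample XPaths, and the assumption that the mapping is keyed by integer ids are all illustrative; only the bracketed-id convention and the idea of a tag-to-element mapping come from the description above.

```python
import re
from typing import Dict, Optional

# Matches a bracketed numeric tag such as "[12]".
TAG_PATTERN = re.compile(r"\[(\d+)\]")


def resolve_tag(llm_reply: str, tag_to_xpath: Dict[int, str]) -> Optional[str]:
    """Return the XPath for the first known tag mentioned in the reply, or None."""
    for match in TAG_PATTERN.finditer(llm_reply):
        tag_id = int(match.group(1))
        if tag_id in tag_to_xpath:
            return tag_to_xpath[tag_id]
    return None


# Hypothetical mapping of the kind a tagged page might produce.
sample_mapping = {
    1: "//html/body/a[1]",
    2: "//html/body/form/input[1]",
}

print(resolve_tag("CLICK [2]", sample_mapping))  # XPath stored under id 2
print(resolve_tag("CLICK [9]", sample_mapping))  # None: id 9 is not on the page
```

Returning None for unknown ids lets the calling agent re-prompt the model instead of acting on an element that does not exist.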

Tarsier can also provide a textual representation of the page, which enables deeper interaction even for LLMs that are not multimodal. This matters given the performance issues of existing vision-language models. To that end, Tarsier provides OCR utilities that convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
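
To make "whitespace-structured string" more concrete, here is a minimal sketch of the general idea: words recognized by OCR are placed on a character grid so that spaces and blank lines mirror the layout of the page. This is not Tarsier's implementation. The `layout_words` helper, the `(text, x, y)` input format, and the `char_width` and `line_height` constants are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Each word is (text, x, y), where x and y are the pixel position of its top-left corner.
Word = Tuple[str, int, int]


def layout_words(words: List[Word], char_width: int = 10, line_height: int = 20) -> str:
    """Place OCR words on a character grid so spacing mirrors the page layout."""
    if not words:
        return ""

    # Group words by text row, converting pixel x positions into character columns.
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        rows.setdefault(y // line_height, []).append((x // char_width, text))

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # The previous word already reaches this column: keep one space.
                line += " "
            else:
                # Pad with spaces up to the word's column.
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


# Hypothetical OCR output for a small page.
sample_words = [("Login", 0, 0), ("Sign", 200, 0), ("up", 250, 0), ("Search", 0, 40)]
print(layout_words(sample_words))
```

Words that are far apart on the page end up far apart in the string, and empty rows become blank lines, so a text-only model still sees a rough picture of the layout.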

Installation

pip install tarsier

Usage

An agent using Tarsier might look like this:

import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    # Placeholder: supply your Google Cloud credentials here
    google_cloud_credentials = {...}
    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        driver = tarsier.create_driver(page)
        page_text, tag_to_xpath = await tarsier.page_to_text(driver)

        print(page_text)  # Text representation of the page
        print(tag_to_xpath)  # Mapping of tags to XPaths

if __name__ == "__main__":
    asyncio.run(main())
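
Once an LLM has picked a tag, the mapping can be used to act on the page. The sketch below assumes `page` is a Playwright `Page` and that `tag_to_xpath` is keyed by integer ids; `click_tag` is a hypothetical helper written for this example, not part of Tarsier.

```python
from typing import Any, Dict


async def click_tag(page: Any, tag_to_xpath: Dict[int, str], tag_id: int) -> str:
    """Click the element that a tag id refers to and return the XPath used."""
    if tag_id not in tag_to_xpath:
        raise KeyError(f"Unknown tag id: {tag_id}")
    xpath = tag_to_xpath[tag_id]
    # Playwright treats selectors prefixed with "xpath=" as XPath expressions.
    await page.locator(f"xpath={xpath}").click()
    return xpath
```

Inside `main()` above, this could be called as `await click_tag(page, tag_to_xpath, 4)` after the model replies with tag 4.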

Visit our cookbook for additional examples:

  • A LangChain web agent
  • A LlamaIndex web agent

Roadmap

  • Add documentation and examples
  • Clean up interfaces and add unit tests
  • Launch
  • Improve OCR text performance
  • Add options to customize tagging
  • Add support for other browser drivers as necessary
  • Add support for other OCR services as necessary

Citations

@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}

