Vision utilities for web interaction agents

🙈 Vision utilities for web interaction agents 🙈

🔗 Main site   •   🐦 Twitter   •   📢 Discord

Tarsier

If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:

  • How do you map LLM responses back into web elements?
  • How can you mark up a page so an LLM can better understand its action space?
  • How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects, so we're open-sourcing them as a simple utility library for multimodal web agents: Tarsier!

How does it work?

Tarsier works by visually "tagging" interactable elements on a page with a bracketed id such as [1]. This gives GPT-4(V) a mapping between elements and ids that it can use to take actions. We define interactable elements as buttons, links, or input fields that are visible on the page.
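
To make the mapping concrete, here is a minimal sketch of resolving an LLM reply back to an element. The reply format (`CLICK [3]`), the helper name `resolve_tag`, and the sample mapping are assumptions for illustration and are not part of Tarsier's API; the only idea taken from Tarsier is a mapping from tag ids to XPaths.

```python
import re
from typing import Dict, Optional

# Hypothetical mapping in the shape described above: tag id -> XPath.
TAG_TO_XPATH: Dict[int, str] = {
    1: "//a[@id='login']",
    2: "//input[@name='q']",
    3: "//button[@type='submit']",
}

# Matches a bracketed numeric id such as [3].
TAG_PATTERN = re.compile(r"\[(\d+)\]")


def resolve_tag(llm_reply: str, tag_to_xpath: Dict[int, str]) -> Optional[str]:
    """Return the XPath for the first bracketed id in the reply, or None."""
    match = TAG_PATTERN.search(llm_reply)
    if match is None:
        return None
    return tag_to_xpath.get(int(match.group(1)))


if __name__ == "__main__":
    print(resolve_tag("CLICK [3]", TAG_TO_XPATH))
```

Returning `None` for a missing or unknown id lets the calling agent re-prompt the model instead of acting on a guess.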

Tarsier can also provide a textual representation of the page, which enables deeper interaction even for non-multimodal LLMs. This matters given the performance issues of existing vision-language models. To that end, Tarsier provides OCR utilities that convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
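
As an illustration of what "whitespace-structured" can mean, the sketch below places OCR words on a character grid according to their pixel positions. This is a toy layout under stated assumptions, not Tarsier's actual algorithm: the `(text, x, y)` input shape, the function name `layout_words`, and the `char_width` and `line_height` defaults are all hypothetical.

```python
from typing import Dict, List, Tuple

# (text, x, y) with x and y in pixels; a hypothetical OCR output shape.
Word = Tuple[str, int, int]


def layout_words(words: List[Word], char_width: int = 10, line_height: int = 20) -> str:
    """Place each word on a character grid based on its pixel position."""
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        rows.setdefault(y // line_height, []).append((x // char_width, text))

    lines: List[str] = []
    for row in range(max(rows) + 1 if rows else 0):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # The target column is already occupied; separate with one space.
                line += " "
            else:
                # Pad with spaces up to the target column.
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [("Home", 0, 0), ("Login", 200, 0), ("Search", 50, 40)]
    print(layout_words(sample))
```

Because horizontal and vertical gaps survive as spaces and blank lines, a text-only model can still infer which items sit side by side and which are stacked.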

Installation

pip install tarsier

Usage

An agent using Tarsier might look like this:

import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    # Fill in your Google Cloud service account credentials here
    google_cloud_credentials = {}

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        driver = tarsier.create_driver(page)
        page_text, tag_to_xpath = await tarsier.page_to_text(driver)

        print(tag_to_xpath)  # Mapping of tag ids to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
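
Once an agent has chosen a tag, the mapping can be used to act on the page. The following is a minimal sketch, assuming `tag_to_xpath` maps integer tag ids to XPath strings and that `page` behaves like a Playwright async page. The helper `click_tag` and the `RecordingPage` stand-in are hypothetical names introduced here so the example runs without a browser; they are not part of Tarsier.

```python
import asyncio
from typing import Any, Dict, List


async def click_tag(page: Any, tag_to_xpath: Dict[int, str], tag_id: int) -> None:
    """Click the element that a tag id refers to."""
    xpath = tag_to_xpath.get(tag_id)
    if xpath is None:
        raise KeyError(f"Unknown tag id: {tag_id}")
    # Playwright accepts an explicit "xpath=" selector prefix.
    await page.locator(f"xpath={xpath}").click()


class RecordingPage:
    """Stand-in for a Playwright page that records clicks instead of performing them."""

    def __init__(self) -> None:
        self.clicked: List[str] = []
        self._selector = ""

    def locator(self, selector: str) -> "RecordingPage":
        self._selector = selector
        return self

    async def click(self) -> None:
        self.clicked.append(self._selector)


if __name__ == "__main__":
    fake_page = RecordingPage()
    asyncio.run(click_tag(fake_page, {1: "//a[@id='login']"}, 1))
    print(fake_page.clicked)
```

In a real agent, the `page` from the usage example above would be passed in place of `RecordingPage`.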

Visit our cookbook for additional examples:

  • A LangChain web agent
  • A LlamaIndex web agent

Roadmap

  • Add documentation and examples

  • Clean up interfaces and add unit tests

  • Launch

  • Improve OCR text performance

  • Add options to customize tagging

  • Add support for other browser drivers as necessary

  • Add support for other OCR services as necessary

Citations

@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/bananalyzer}
}

