Vision utilities for web interaction agents

Tarsier

If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:

  • How do you map LLM responses back into web elements?
  • How can you mark up a page so that an LLM better understands its action space?
  • How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects, so we're open-sourcing them as Tarsier, a simple utility library for multimodal web agents. The video below demonstrates Tarsier by feeding a page snapshot into a LangChain agent and letting it take actions.

https://github.com/reworkd/tarsier/assets/50181239/af12beda-89b5-4add-b888-d780b353304b

How does it work?

Tarsier works by visually "tagging" interactable elements on a page with a bracketed id such as [1]. This gives GPT-4(V) a mapping between elements and ids that it can use to take actions. We define interactable elements as buttons, links, or input fields that are visible on the page.
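
To make the tag-to-element mapping concrete, here is a minimal sketch of resolving an LLM reply back to an element. The `resolve_tag` helper, the reply wording, and the sample XPaths are all assumptions made for illustration and are not part of Tarsier's API. The sketch only assumes that the mapping is a dictionary from integer ids to XPath strings.

```python
import re
from typing import Dict, Optional

# Matches a bracketed numeric tag such as "[12]".
TAG_PATTERN = re.compile(r"\[(\d+)\]")


def resolve_tag(llm_reply: str, tag_to_xpath: Dict[int, str]) -> Optional[str]:
    """Return the XPath for the first known tag mentioned in the reply, or None."""
    for match in TAG_PATTERN.finditer(llm_reply):
        tag_id = int(match.group(1))
        if tag_id in tag_to_xpath:
            return tag_to_xpath[tag_id]
    return None


# Hypothetical mapping, shaped like a tag-to-XPath dictionary.
example_mapping = {
    0: "//html/body/a[1]",
    1: "//html/body/form/input[1]",
}

print(resolve_tag("I will click [1] to focus the search box.", example_mapping))
```

Tags that the model mentions but that are absent from the mapping are skipped, so a hallucinated id yields `None` instead of an exception.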

Tarsier can also provide a textual representation of the page, which enables deeper interaction even for LLMs that are not multimodal. This matters given the performance issues of existing vision-language models. To that end, Tarsier provides OCR utilities that convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
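
The idea of a whitespace-structured string can be illustrated with a small sketch. This is not Tarsier's implementation: the `Word` type, the `layout_words` function, and the fixed `char_width` and `line_height` values are assumptions chosen for the example. It takes OCR-style words with pixel positions and places them on a character grid so that their relative layout survives as spaces and newlines.

```python
from typing import Dict, List, NamedTuple


class Word(NamedTuple):
    text: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels


def layout_words(words: List[Word], char_width: int = 10, line_height: int = 20) -> str:
    """Place each word on a character grid so relative positions survive as whitespace."""
    # Group words into text rows by their vertical position.
    rows: Dict[int, List[Word]] = {}
    for word in words:
        rows.setdefault(word.y // line_height, []).append(word)

    lines = []
    for row_index in range(max(rows, default=-1) + 1):
        line = ""
        for word in sorted(rows.get(row_index, []), key=lambda w: w.x):
            column = word.x // char_width
            if line:
                # Pad up to the word's column, keeping at least one space between words.
                line = line.ljust(max(column, len(line) + 1))
            else:
                line = " " * column
            line += word.text
        lines.append(line)
    return "\n".join(lines)


# Hypothetical OCR output for a small page fragment.
sample = [
    Word("Hacker", 0, 0),
    Word("News", 70, 0),
    Word("login", 300, 0),
    Word("1.", 0, 40),
    Word("Show", 30, 40),
]
print(layout_words(sample))
```

In this sketch, words that sit far apart on the screenshot stay far apart in the string, and empty rows are preserved as blank lines, which is what lets a text-only model infer rough page structure.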

Installation

pip install tarsier

Usage

Visit our cookbook for agent examples using Tarsier:

  • An autonomous LangChain web agent 🦜⛓️
  • An autonomous LlamaIndex web agent 🦙

Basic Tarsier usage might look like the following:

import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    # Placeholder: supply your Google Cloud credentials here
    google_cloud_credentials = {}

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tag ids to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
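
Once you have the mapping, acting on a tag is a matter of handing its XPath to the browser driver. The sketch below is a hypothetical helper, not part of Tarsier. It assumes that `page` is a Playwright async `Page` and that the mapping uses integer ids as keys.

```python
from typing import Dict


async def click_tag(page, tag_to_xpath: Dict[int, str], tag_id: int) -> None:
    """Click the element that was tagged with `tag_id`.

    `page` is expected to be a Playwright async `Page`.
    """
    xpath = tag_to_xpath[tag_id]  # raises KeyError for an unknown tag
    # Playwright treats selectors prefixed with "xpath=" as XPath expressions.
    await page.locator(f"xpath={xpath}").click()
```

Inside `main()` above, a call such as `await click_tag(page, tag_to_xpath, 3)` would click whichever element was tagged [3], assuming that id exists in the mapping.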

Roadmap

  • Add documentation and examples
  • Clean up interfaces and add unit tests
  • Launch
  • Improve OCR text performance
  • Add options to customize tagging styling
  • Add support for other browser drivers as necessary
  • Add support for other OCR services as necessary

Citations

@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}

