
Vision utilities for web interaction agents


Tarsier

If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:

  • How do you map LLM responses back into web elements?
  • How can you mark up a page so that an LLM better understands its action space?
  • How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects, so we're now open-sourcing them as a simple utility library for multimodal web agents... Tarsier! The video below demonstrates Tarsier by feeding a page snapshot into a LangChain agent and letting it take actions.

https://github.com/reworkd/tarsier/assets/50181239/af12beda-89b5-4add-b888-d780b353304b

How does it work?

Tarsier works by visually "tagging" interactable elements on a page with a bracketed id such as [1]. This gives GPT-4(V) a mapping between elements and ids that it can use to take actions. We define interactable elements as buttons, links, or input fields that are visible on the page.
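
As a rough illustration of how bracketed ids connect a model's reply back to page elements, the sketch below pulls a tag id out of a reply string and looks it up in a tag-to-XPath mapping. The function name `resolve_tag`, the reply format, and the sample mapping are assumptions made for this example; they are not part of Tarsier's API.

```python
import re
from typing import Dict, Optional

# Matches a bracketed numeric tag such as [1] or [42].
TAG_PATTERN = re.compile(r"\[(\d+)\]")


def resolve_tag(reply: str, tag_to_xpath: Dict[int, str]) -> Optional[str]:
    """Return the XPath of the first known tag id mentioned in `reply`.

    Returns None when the reply contains no tag or only unknown tags.
    """
    for match in TAG_PATTERN.finditer(reply):
        tag_id = int(match.group(1))
        if tag_id in tag_to_xpath:
            return tag_to_xpath[tag_id]
    return None


# Hypothetical mapping and model reply, for illustration only.
example_mapping = {1: "//a[1]", 2: "//input[1]"}
print(resolve_tag("I will type the query into [2]", example_mapping))
```

Returning None for unknown ids lets the caller decide whether to re-prompt the model or stop, rather than acting on an element that was never tagged.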

Tarsier can also provide a textual representation of the page, which enables deeper interaction even for LLMs that are not multimodal. This matters given the performance issues of existing vision-language models. To support this, Tarsier provides OCR utils that convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
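
To make "whitespace-structured string" more concrete, here is a minimal sketch of the general idea: OCR words with pixel coordinates are placed on a character grid so that spacing in the text mirrors the layout of the screenshot. The `Word` tuple shape, the `words_to_text` helper, and the cell sizes are assumptions for illustration; this is not Tarsier's actual implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical OCR word: (text, x, y), where x and y are the pixel
# coordinates of the word's top-left corner.
Word = Tuple[str, int, int]


def words_to_text(words: List[Word], char_width: int = 10, line_height: int = 20) -> str:
    """Place OCR words on a character grid so spacing mirrors page layout."""
    # Group words by grid row, remembering each word's grid column.
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        rows.setdefault(y // line_height, []).append((x // char_width, text))

    lines = []
    for row in range(max(rows) + 1 if rows else 0):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # The previous word already reaches this column:
                # keep a single space so the two words do not merge.
                line += " "
            else:
                # Pad with spaces up to the word's column.
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


# Two words on the first visual line and one word two lines further down.
print(words_to_text([("Login", 0, 0), ("Search", 100, 0), ("Submit", 0, 40)]))
```

Rows with no words become empty lines, so vertical gaps on the page survive as blank lines in the text.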

Installation

pip install tarsier

Usage

Visit our cookbook for agent examples using Tarsier.

Otherwise, basic Tarsier usage might look like the following:

import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    # Placeholder: supply your own Google Cloud credentials here
    google_cloud_credentials = {}

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        # Tag the page, then get its text representation and the tag mapping
        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tags to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
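
Once an agent has picked a tag, the `tag_to_xpath` mapping returned by `page_to_text` can be used to act on the page. The sketch below is one possible way to do that with Playwright's `xpath=` selector prefix; the `click_tag` helper is hypothetical and not part of Tarsier, and it assumes the mapping is keyed by integer tag ids.

```python
from typing import Any, Dict


async def click_tag(page: Any, tag_to_xpath: Dict[int, str], tag_id: int) -> None:
    """Click the element that `tag_id` refers to.

    `page` is expected to be a Playwright async Page (typed as Any here so
    the sketch has no hard dependency on Playwright).
    """
    xpath = tag_to_xpath.get(tag_id)
    if xpath is None:
        # Fail loudly instead of clicking an element that was never tagged.
        raise KeyError(f"Unknown tag id: {tag_id}")
    # Playwright treats selectors prefixed with "xpath=" as XPath expressions.
    await page.locator(f"xpath={xpath}").click()
```

Inside the `main()` coroutine above, this could be called as `await click_tag(page, tag_to_xpath, 3)` after the agent has chosen tag 3.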

Supported OCR Services

Roadmap

  • Add documentation and examples
  • Clean up interfaces and add unit tests
  • Launch
  • Improve OCR text performance
  • Add options to customize tagging styling
  • Add support for other browsers drivers as necessary
  • Add support for other OCR services as necessary

Citations

@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}
