Vision utilities for web interaction agents

Tarsier

If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:

How do you map LLM responses back into web elements?
How can you mark up a page so an LLM can better understand its action space?
How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects, so we're open-sourcing them as a simple utility library for multimodal web agents: Tarsier! The video below demonstrates Tarsier by feeding a page snapshot into a LangChain agent and letting it take actions.

https://github.com/reworkd/tarsier/assets/50181239/af12beda-89b5-4add-b888-d780b353304b

How does it work?

Tarsier works by visually "tagging" interactable elements on a page with a bracketed id such as [1]. This gives GPT-4(V) a mapping between ids and elements that it can use to take actions. We define interactable elements as buttons, links, or input fields that are visible on the page.

Tarsier can also provide a textual representation of the page, which enables deeper interaction even for LLMs that are not multimodal. This matters given the performance issues of existing vision-language models. To that end, Tarsier provides OCR utilities that convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
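To make "whitespace-structured string" more concrete, here is a minimal sketch of the general idea, not Tarsier's actual implementation. It assumes the OCR step yields words with pixel coordinates; the `Word` tuple, the `layout_words` helper, and the `char_width` and `line_height` defaults are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical OCR output: (text, x in pixels, y in pixels) of each word's top-left corner.
Word = Tuple[str, int, int]


def layout_words(words: List[Word], char_width: int = 10, line_height: int = 20) -> str:
    """Place each word on a character grid so that spacing mirrors the page layout."""
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for text, x, y in words:
        # Convert pixel coordinates into a (row, column) cell on the text grid.
        rows.setdefault(y // line_height, []).append((x // char_width, text))

    lines = []
    for row in range(max(rows) + 1 if rows else 0):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if line and len(line) >= col:
                # The previous word already reaches this column: separate with one space.
                line += " "
            else:
                # Pad with spaces up to the word's column.
                line = line.ljust(col)
            line += text
        lines.append(line)
    return "\n".join(lines)


words = [("Login", 0, 0), ("Search", 200, 0), ("Submit", 50, 40)]
print(layout_words(words))
```

Words that sit far apart on the screenshot end up far apart in the string, and empty rows are kept as blank lines, so a text-only LLM still gets a rough sense of the page's spatial structure.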

Installation

pip install tarsier

Usage

Visit our cookbook for agent examples using Tarsier:

An autonomous LangChain web agent 🦜⛓️
An autonomous LlamaIndex web agent 🦙

Otherwise, basic Tarsier usage might look like the following:

import asyncio
import json

from playwright.async_api import async_playwright
from tarsier import GoogleVisionOCRService, Tarsier

def load_google_cloud_credentials(json_file_path):
    with open(json_file_path) as f:
        credentials = json.load(f)
    return credentials

async def main():
    # To create the service account key, follow the instructions in this Stack Overflow answer: https://stackoverflow.com/a/46290808/1780891
    google_cloud_credentials = load_google_cloud_credentials('./google_service_acc_key.json')

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tag ids to XPaths
        print(page_text)  # Text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
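Once you have `tag_to_xpath`, you can map an LLM's reply back to a page element. The following is a minimal sketch under two assumptions: that the mapping is keyed by integer tag ids, and that the LLM mentions the tag it wants in a reply such as "CLICK [$3]". The helpers `extract_tag_id` and `selector_for_reply` are hypothetical and not part of Tarsier.

```python
import re
from typing import Dict, Optional

# Matches a bracketed tag such as [3], [#3], [@3], or [$3] and captures the numeric id.
TAG_ID_PATTERN = re.compile(r"\[\s*[#@$]?\s*(\d+)\s*\]")


def extract_tag_id(llm_reply: str) -> Optional[int]:
    """Return the first tag id mentioned in the reply, or None if there is none."""
    match = TAG_ID_PATTERN.search(llm_reply)
    return int(match.group(1)) if match else None


def selector_for_reply(llm_reply: str, tag_to_xpath: Dict[int, str]) -> Optional[str]:
    """Translate a reply into a Playwright XPath selector, or None if the tag is unknown."""
    tag_id = extract_tag_id(llm_reply)
    if tag_id is None or tag_id not in tag_to_xpath:
        return None
    return f"xpath={tag_to_xpath[tag_id]}"


# Assumed shape of the mapping: integer tag id -> XPath string.
example_mapping = {3: "//html/body/center/table/tbody/tr[1]/td/a"}
print(selector_for_reply("CLICK [$3]", example_mapping))
```

Inside the `async with` block of the example above, the resulting selector could then be used with `await page.locator(selector).click()`. Returning `None` for unknown tags lets the caller re-prompt the LLM instead of acting on a tag that is not on the page.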

Keep in mind that Tarsier tags different types of elements differently to help your LLM identify what actions are performable on each element. Specifically:

[#ID]: text-insertable fields (e.g. textarea, input with textual type)
[@ID]: hyperlinks (<a> tags)
[$ID]: other interactable elements (e.g. button, select)
[ID]: plain text (if you pass tag_text_elements=True)
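The tag prefixes above can also be read back programmatically, for example to check that an action proposed by an LLM fits the element type. This is a rough sketch that assumes tags appear in the page text literally in the forms listed above. The `Tag` type, the `find_tags` helper, and the kind names are hypothetical, and the regex will also match ordinary bracketed numbers (such as footnote markers) that happen to appear on the page.

```python
import re
from typing import List, NamedTuple

# Prefix -> element kind, following the list above. The kind names are illustrative.
TAG_KINDS = {"#": "text_input", "@": "link", "$": "interactable", "": "plain_text"}
TAG_PATTERN = re.compile(r"\[\s*([#@$]?)\s*(\d+)\s*\]")


class Tag(NamedTuple):
    kind: str
    tag_id: int


def find_tags(page_text: str) -> List[Tag]:
    """Collect every tag in the page text together with the kind implied by its prefix."""
    return [
        Tag(TAG_KINDS[match.group(1)], int(match.group(2)))
        for match in TAG_PATTERN.finditer(page_text)
    ]


sample = "[@1] Hacker News   [#2] search   [$3] login   [4] Welcome"
for tag in find_tags(sample):
    print(tag.tag_id, tag.kind)
```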

Local Development

Setup

We have provided a handy setup script to get you up and running with Tarsier development.

./script/setup.sh

If you modify any TypeScript files used by Tarsier, you'll need to execute the following command. This compiles the TypeScript into JavaScript, which can then be utilized in the Python package.

npm run build

Testing

We use pytest for testing. To run the tests, simply run:

poetry run pytest .

Linting

Prior to submitting a potential PR, please run the following to format your code:

./script/format.sh

Supported OCR Services

Google Cloud Vision
Amazon Textract (Coming Soon)
Microsoft Azure Computer Vision (Coming Soon)

Roadmap

Add documentation and examples
Clean up interfaces and add unit tests
Launch
Improve OCR text performance
Add options to customize tagging styling
Add support for other browsers drivers as necessary
Add support for other OCR services as necessary

Citations

@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}
