
🙈 Vision utilities for web interaction agents 🙈


🔗 Main site   •   🐦 Twitter   •   📢 Discord

Tarsier

If you've tried using GPT-4(V) to automate web interactions, you've probably run into questions like:

  • How do you map LLM responses back into web elements?
  • How can you mark up a page so an LLM can better understand its action space?
  • How do you feed a "screenshot" to a text-only LLM?

At Reworkd, we found ourselves reusing the same utility libraries to solve these problems across multiple projects. Because of this, we're now open-sourcing this simple utility library for multimodal web agents... Tarsier! The video below demonstrates Tarsier by feeding a page snapshot into a LangChain agent and letting it take actions.

https://github.com/reworkd/tarsier/assets/50181239/af12beda-89b5-4add-b888-d780b353304b

How does it work?

Tarsier works by visually "tagging" interactable elements on a page via brackets + an id such as [1]. In doing this, we provide a mapping between elements and ids for GPT-4(V) to take actions upon. We define interactable elements as buttons, links, or input fields that are visible on the page.

Tarsier can also provide a purely textual representation of the page, which means it enables deeper interaction even for non-multimodal LLMs. This is important to note given the performance issues of existing vision-language models. To do this, Tarsier provides OCR utilities that convert a page screenshot into a whitespace-structured string that an LLM without vision can understand.
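
To make this concrete, here is a purely illustrative sketch (not real Tarsier output; the exact tag style, IDs, and XPaths depend on the page and the Tarsier version) of what a tagged page and its tag-to-XPath mapping might conceptually look like:

# Purely illustrative; not actual Tarsier output.
page_text = """
[0] Hacker News    [1] new    [2] past    [3] submit

[4] An example story title (example.com)
    120 points by someone 3 hours ago | [5] 42 comments
"""

# Hypothetical XPaths, shown only to convey the shape of the mapping.
tag_to_xpath = {
    0: "//a[@href='https://news.ycombinator.com']",
    4: "//span[@class='titleline']/a",
    5: "//td[@class='subtext']//a[contains(., 'comments')]",
}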

Installation

pip install tarsier

Usage

Visit our cookbook for agent examples using Tarsier.

Otherwise, basic Tarsier usage might look like the following:

import asyncio

from playwright.async_api import async_playwright
from tarsier import Tarsier, GoogleVisionOCRService

async def main():
    google_cloud_credentials = {}  # Your Google Cloud service account credentials (a dict, e.g. loaded from a service-account JSON file)

    ocr_service = GoogleVisionOCRService(google_cloud_credentials)
    tarsier = Tarsier(ocr_service)

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto("https://news.ycombinator.com")

        page_text, tag_to_xpath = await tarsier.page_to_text(page)

        print(tag_to_xpath)  # Mapping of tag IDs to XPaths
        print(page_text)  # Tagged text representation of the page


if __name__ == '__main__':
    asyncio.run(main())
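
To close the loop from an LLM response back to a web element, here is a minimal sketch (not part of Tarsier's API; it assumes the page, page_text, and tag_to_xpath values from the example above, and that your prompt asks the model to answer with a single tag ID):

async def act_on_tag(page, tag_to_xpath, tag_id):
    # Look up the XPath recorded for this tag when the page was tagged.
    # (This assumes the mapping is keyed by the numeric tag ID; check the
    # dict returned by page_to_text for the exact key type.)
    xpath = tag_to_xpath[tag_id]
    # Playwright accepts XPath selectors via the "xpath=" prefix.
    await page.click(f"xpath={xpath}")

# For example, if the model replies "click [3]", parse out 3 and call:
#     await act_on_tag(page, tag_to_xpath, 3)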

Supported OCR Services

  • Google Cloud Vision (via GoogleVisionOCRService, as used in the example above)
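
To construct the OCR service from a Google Cloud service-account key, a hedged sketch (the file path here is hypothetical) might look like:

import json

from tarsier import GoogleVisionOCRService

# Load a Google Cloud service-account key file (hypothetical path) and pass
# the resulting dict as the credentials, as in the usage example above.
with open("google_service_account.json") as f:
    google_cloud_credentials = json.load(f)

ocr_service = GoogleVisionOCRService(google_cloud_credentials)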

Roadmap

  • Add documentation and examples
  • Clean up interfaces and add unit tests
  • Launch
  • Improve OCR text performance
  • Add options to customize tagging styling
  • Add support for other browser drivers as necessary
  • Add support for other OCR services as necessary

Citations

@misc{reworkd2023tarsier,
  title        = {Tarsier},
  author       = {Rohan Pandey and Adam Watkins and Asim Shrestha and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/tarsier}
}
