
Open source AI Agent evaluation framework for web tasks 🐒🍌


🍌 Open source AI Agent evaluations for web tasks 🍌


Banana-lyzer

Introduction

Banana-lyzer is an open source AI agent evaluation framework and dataset for web tasks, built on Playwright (and it has a banana theme, because why not). We created our own evals repo because:

  • Websites change over time, are affected by latency, and may have anti-bot protections.
  • We need a system that can reliably save and deploy historic/static snapshots of websites.
  • Web standards are loosely followed, and there are many different underlying ways to represent a single website. For an agent to generalize well, we need a diverse dataset of websites across industries and use cases.
  • We have specific evaluation criteria and agent use cases focusing on structured and direct information retrieval across websites.
  • There exist valuable web task datasets and evaluations that we'd like to unify in a single repo (Mind2Web, WebArena, etc.).

https://github.com/reworkd/bananalyzer/assets/50181239/4587615c-a5b4-472d-bca9-334594130af1

How does it work?

⚠️ Note that this repo is a work in progress. ⚠️

Banana-lyzer is a CLI tool that runs a set of evaluations against a set of example websites. The examples are defined in examples.json using a schema similar to Mind2Web and WebArena. Each example stores metadata such as the agent goal and the expected agent output, along with an MHTML snapshot of its URL so that the page does not change over time. Note that all examples currently expect structured JSON output built from data extracted directly from the page.
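To make the shape of an example more concrete, here is a minimal sketch of a single entry modeled with dataclasses. Only `url`, `get_static_url()` and `evals[...].expected` appear in the sample agent in this README; the `id`, `goal` and `mhtml_path` fields, and serving the snapshot through a `file://` URL, are assumptions made for illustration and may not match the real schema in schemas.py.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any


@dataclass
class Eval:
    # Expected structured (JSON-like) output for the example.
    expected: Any


@dataclass
class Example:
    id: str            # hypothetical identifier
    url: str           # the live URL the snapshot was taken from
    goal: str          # hypothetical: natural-language agent goal
    mhtml_path: Path   # hypothetical: location of the saved MHTML snapshot
    evals: list[Eval] = field(default_factory=list)

    def get_static_url(self) -> str:
        # Assumption: the snapshot is loaded straight from disk.
        return self.mhtml_path.resolve().as_uri()


example = Example(
    id="demo",
    url="https://example.com/product/1",
    goal="Extract the product name and price",
    mhtml_path=Path("static/demo/index.mhtml"),
    evals=[Eval(expected={"name": "Banana", "price": "0.25"})],
)
```

An agent would navigate to `example.get_static_url()` rather than `example.url`, so that every run sees the same frozen page.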

The CLI tool runs the examples sequentially against a user-defined agent by dynamically constructing a pytest test suite and executing it. As a user, you simply create a file that implements the AgentRunner interface and assigns an instance of your AgentRunner to a variable called "agent". AgentRunner receives the example and a Playwright browser context to use.
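The idea of building one test per example can be sketched in plain Python. This is not the tool's actual implementation: `EchoAgent`, `make_test` and `build_suite` are hypothetical names, examples are plain dicts, and no browser context is created. It only shows how test functions named `test_<id>` could be generated from data and then collected by a runner such as pytest.

```python
import asyncio
from typing import Any, Callable


class EchoAgent:
    """Hypothetical stand-in for an AgentRunner: returns the expected output."""

    async def run(self, context: Any, example: dict) -> Any:
        return example["expected"]


def make_test(agent: Any, example: dict) -> Callable[[], None]:
    """Build one pytest-style test function for a single example."""

    def test() -> None:
        # No real browser here; a real run would pass a Playwright context.
        result = asyncio.run(agent.run(None, example))
        assert result == example["expected"]

    test.__name__ = f"test_{example['id']}"
    return test


def build_suite(agent: Any, examples: list[dict]) -> dict[str, Callable[[], None]]:
    """Map generated test names to test functions, one per example."""
    tests = (make_test(agent, ex) for ex in examples)
    return {t.__name__: t for t in tests}


examples = [
    {"id": "a1", "expected": {"title": "Hello"}},
    {"id": "b2", "expected": {"title": "World"}},
]
suite = build_suite(EchoAgent(), examples)
for name, test in suite.items():
    test()  # raises AssertionError if the agent output does not match
```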

In the future we will support more complex evaluation methods and examples that require multiple steps to complete. The plan is to translate existing datasets like Mind2Web and WebArena into this format.

Test intents

We have defined a set of page types and test intents an agent can be evaluated on. These types are defined in the ExampleType enum in schemas.py.

  • listing: The example starts on a listing page, and the agent must scrape all detail page links and the information from those detail pages. Note that currently, we only test that all of the detail page URLs were captured.
  • detail: The example starts on a detail page and the agent must retrieve specific JSON information from the page. This is the most common test type.
  • listing_detail: The agent is on a listing page and must scrape all information from the current page. All of the required information is available on the current page. The agent need not visit the detail page.

Separately, there are specific tags that can be used to further filter test intents:

  • pagination: Must fetch data across pages. Either links or fetch for now.
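As a rough illustration of how types and tags combine, the sketch below filters a list of examples by both. The enum members mirror the three types listed above, but `ExampleStub` and `select` are hypothetical helpers, and the real ExampleType in schemas.py may be defined differently.

```python
from dataclasses import dataclass, field
from enum import Enum


class ExampleType(str, Enum):
    listing = "listing"
    detail = "detail"
    listing_detail = "listing_detail"


@dataclass
class ExampleStub:
    # Hypothetical, trimmed-down example record for illustration only.
    id: str
    type: ExampleType
    tags: set[str] = field(default_factory=set)


def select(examples, type_=None, tag=None):
    """Keep examples matching the given type and/or tag (None means no filter)."""
    return [
        ex for ex in examples
        if (type_ is None or ex.type == type_)
        and (tag is None or tag in ex.tags)
    ]


examples = [
    ExampleStub("1", ExampleType.detail),
    ExampleStub("2", ExampleType.listing, {"pagination"}),
    ExampleStub("3", ExampleType.listing_detail, {"pagination"}),
]

paginated = select(examples, tag="pagination")
listing_only = select(examples, type_=ExampleType.listing)
```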

Getting Started

Local testing installation

  • pip install bananalyzer
  • Implement the AgentRunner interface from agent_runner.py in a test file such as banalyzer.py (the name doesn't matter). Below is an example file:
import asyncio
from playwright.async_api import BrowserContext
from bananalyzer.data.schemas import Example
from bananalyzer.runner.agent_runner import AgentResult, AgentRunner


class NullAgentRunner(AgentRunner):
    """
    A test agent that returns the expected output directly
    """

    async def run(
        self,
        context: BrowserContext,
        example: Example,
    ) -> AgentResult:
        page = await context.new_page()
        # example.url holds the real URL; example.get_static_url() returns
        # the URL of the local MHTML snapshot
        await page.goto(example.get_static_url())
        await asyncio.sleep(0.5)
        # Return the expected output directly so that the tests pass
        return example.evals[0].expected


# The CLI looks for an AgentRunner instance in a variable called "agent"
agent = NullAgentRunner()
  • Run bananalyze ./tests/banalyzer.py to run the test suite
  • You can also run bananalyze . to run all tests in the current directory
  • To run local examples (from the repo's static folder) on macOS, run unix2dos static/*/*.mhtml to convert the line endings in the MHTML files to CRLF
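If unix2dos is not available, the same line-ending conversion can be approximated in Python. This is a sketch under the assumption that the snapshots live in static/<name>/<file>.mhtml as the glob above suggests; `to_crlf` and `convert_snapshots` are hypothetical helpers, not part of bananalyzer.

```python
import re
from pathlib import Path


def to_crlf(data: bytes) -> bytes:
    """Convert bare LF line endings to CRLF, leaving existing CRLF untouched."""
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)


def convert_snapshots(root: Path) -> int:
    """Rewrite every */*.mhtml file under root with CRLF endings.

    Returns the number of files that were actually changed.
    """
    changed = 0
    for path in root.glob("*/*.mhtml"):
        original = path.read_bytes()
        converted = to_crlf(original)
        if converted != original:
            path.write_bytes(converted)
            changed += 1
    return changed


# Example usage from the repo root:
# convert_snapshots(Path("static"))
```

Working on bytes rather than text avoids any implicit newline translation by the platform, and the negative lookbehind keeps the function safe to run more than once.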

Arguments

  • -h or --help: Show help
  • --headless: Run with Playwright headless mode
  • -id or --id: Run a specific test by id
  • -i or --intent: Only run tests of a particular intent (fetch, links, etc)
  • -c or --category: Only run tests of a particular category (healthcare, manufacturing, software, etc)
  • -n or --n: Number of test workers to use. The default is 1
  • -skip or --skip: A list of ids to skip tests on, separated by commas
  • -t or --type: Only run tests of a particular type (links, fetch, etc)
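For readers who want to see how these flags fit together, the following argparse sketch mirrors the list above. It is an illustration only, not the tool's actual parser: the positional `path` argument is inferred from the usage commands in this README, and the help strings and comma-splitting logic are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="bananalyze")
    parser.add_argument("path", help="Test file or directory containing the agent")
    parser.add_argument("--headless", action="store_true",
                        help="Run Playwright in headless mode")
    parser.add_argument("-id", "--id", help="Run a specific test by id")
    parser.add_argument("-i", "--intent", help="Only run tests of this intent")
    parser.add_argument("-c", "--category", help="Only run tests of this category")
    parser.add_argument("-n", "--n", type=int, default=1,
                        help="Number of test workers")
    # A comma-separated string becomes a list of ids to skip.
    parser.add_argument("-skip", "--skip", type=lambda s: s.split(","), default=[],
                        help="Comma-separated ids to skip")
    parser.add_argument("-t", "--type", help="Only run tests of this type")
    return parser


args = build_parser().parse_args(
    ["./tests/banalyzer.py", "--headless", "-n", "4",
     "-skip", "ex1,ex2", "-c", "software"]
)
```

With the real CLI, the equivalent invocation would be something like bananalyze ./tests/banalyzer.py --headless -n 4 -skip ex1,ex2 -c software (the ids ex1 and ex2 are made up).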

Contributing

Running the server

The project has a basic FastAPI server to expose example data. You can run it with the following command:

cd server
poetry run uvicorn server:app --reload

Then open http://127.0.0.1:8000/api/docs in your browser to see the API docs.

Adding examples

All current examples have been added manually by running the fetch.ipynb notebook at the root of this project. This notebook loads a site with Playwright and uses the Chrome developer API to save the page as an MHTML file.
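A snapshot of this kind can be captured with Playwright's CDP session and the Chrome DevTools Protocol command Page.captureSnapshot. The sketch below shows one way to do it; whether fetch.ipynb takes exactly this approach is an assumption, and `save_mhtml` and the output path are hypothetical.

```python
import asyncio
from pathlib import Path


async def save_mhtml(url: str, out_path: Path) -> None:
    """Load url in Chromium and write an MHTML snapshot to out_path."""
    # Imported lazily so this module can be imported without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(url)

        # Page.captureSnapshot is a Chrome DevTools Protocol command,
        # so this only works with Chromium-based browsers.
        session = await context.new_cdp_session(page)
        snapshot = await session.send("Page.captureSnapshot", {"format": "mhtml"})

        # newline="" writes the snapshot's line endings unchanged.
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with open(out_path, "w", encoding="utf-8", newline="") as f:
            f.write(snapshot["data"])

        await browser.close()


# Example (requires Playwright and a Chromium install; the path is hypothetical):
# asyncio.run(save_mhtml("https://example.com", Path("static/example/index.mhtml")))
```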

Roadmap

Launch
  • Functions to serve local MHTML sites
  • Agent interface required for running the tool
  • Pytest wrapper to enable CLI testing with additional arguments
  • Document a majority of the repo
  • Functions to serve complicated pages via HAR
Features
  • CLI param to filter tests by intent
  • Additional CLI params to select for specific tests or test categories
  • Ability to add multiple site pages to examples
  • Ability to add in-page actions to examples
  • Translate WebArena evals
  • Translate Mind2Web evals
  • Lag and bot detection emulation
  • Updated test visualization with separation of categories and outputs
Dataset updates
  • 15 additional data retrieval examples
  • 15 additional link examples
  • 15 click examples
  • 15 navigation examples
  • Tests requiring multi-step navigation
  • Tests requiring both navigation and data retrieval
  • Tests requiring pop-up closing
  • Tests requiring sign-in
  • Tests requiring captcha solving

Citations

@misc{reworkd2023bananalyzer,
  title        = {Bananalyzer},
  author       = {Asim Shrestha and Adam Watkins and Rohan Pandey and Srijan Subedi and Sunshine},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/bananalyzer}
}
