Open source AI Agent evaluation framework for web tasks 🐒🍌

Banana-lyzer

Introduction

Banana-lyzer is an open source AI Agent evaluation framework and dataset for web tasks with Playwright (and it has a banana theme because why not). We've created our own evals repo because:

Websites change over time, are affected by latency, and may have anti-bot protections.
We need a system that can reliably save and deploy historic/static snapshots of websites.
Standard web practices are loose, and there are many different underlying ways to represent a single website. For an agent to generalize well, we need a diverse dataset of websites across industries and use cases.
We have specific evaluation criteria and agent use cases focusing on structured and direct information retrieval across websites.
There exist valuable web task datasets and evaluations that we'd like to unify in a single repo (Mind2Web, WebArena, etc.).

https://github.com/reworkd/bananalyzer/assets/50181239/4587615c-a5b4-472d-bca9-334594130af1

How does it work?

⚠️ Note that this repo is a work in progress. ⚠️

Banana-lyzer is a CLI tool that runs a set of evaluations against a set of example websites. The examples are defined in examples.json using a schema similar to Mind2Web and WebArena. Each example stores metadata such as the agent goal and the expected agent output, along with an MHTML snapshot of the URL so that the page does not change over time. Note that all examples today expect structured JSON output using data extracted directly from the page.

The CLI tool runs the examples sequentially against a user-defined agent by dynamically constructing a pytest test suite and executing it. As a user, you simply create a file that implements the AgentRunner interface and defines an instance of your AgentRunner in a variable called "agent". AgentRunner exposes the example and a Playwright browser context to use.
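As a rough illustration of building one test per example at runtime, here is a minimal sketch. It is not the project's actual implementation: ToyExample, make_test, build_suite, and echo_agent are hypothetical names, the browser context is left out, and plain functions stand in for whatever the CLI really hands to pytest.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List


@dataclass
class ToyExample:
    """Hypothetical stand-in for a bananalyzer example (not the real schema)."""
    id: str
    goal: str
    expected: Any


# An agent here is any async callable that takes an example and returns its output.
Agent = Callable[[ToyExample], Awaitable[Any]]


def make_test(example: ToyExample, agent: Agent) -> Callable[[], None]:
    """Build one zero-argument test function bound to a single example."""
    def test() -> None:
        result = asyncio.run(agent(example))
        assert result == example.expected, f"{example.id}: got {result!r}"
    test.__name__ = f"test_{example.id}"
    return test


def build_suite(examples: List[ToyExample], agent: Agent) -> Dict[str, Callable[[], None]]:
    """Return a name -> test function mapping, one entry per example."""
    tests = (make_test(example, agent) for example in examples)
    return {test.__name__: test for test in tests}


async def echo_agent(example: ToyExample) -> Any:
    """Toy agent that returns the expected output, like the null agent."""
    return example.expected


examples = [
    ToyExample(id="a1", goal="Get the title", expected={"title": "Hello"}),
    ToyExample(id="b2", goal="Get the price", expected={"price": 3}),
]
suite = build_suite(examples, echo_agent)
# Placing these functions in a module's globals (globals().update(suite))
# would let pytest collect them as ordinary test functions.
for name, test in suite.items():
    test()  # raises AssertionError on a mismatch
```

Generating the functions from data keeps the example set and the test code separate, so adding an example does not require writing a new test by hand.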

In the future we will support more complex evaluation methods and examples that require multiple steps to complete. The plan is to translate existing datasets like Mind2Web and WebArena into this format.

Test intents

We have defined a set of test intents that an agent can be evaluated on. These intents are defined in the GoalType enum in examples.json.

fetch: The agent must retrieve specific JSON information from the page. This is the most common test type.
links: The agent must scrape all detail page links from a page.
links_fetch: The agent must scrape all detail page links from a page and additionally extract JSON information for each link.
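The three intents above could be modeled as a string-valued enum along these lines. This is a sketch only: the string values come from the list above, while the member names and the two helper functions are assumptions, and the real GoalType may be defined differently.

```python
from enum import Enum


class GoalType(str, Enum):
    """The three intents listed above; member names are assumed."""

    FETCH = "fetch"
    LINKS = "links"
    LINKS_FETCH = "links_fetch"


def wants_links(goal: GoalType) -> bool:
    """True when the intent asks for detail page links."""
    return goal in (GoalType.LINKS, GoalType.LINKS_FETCH)


def wants_json(goal: GoalType) -> bool:
    """True when the intent asks for extracted JSON information."""
    return goal in (GoalType.FETCH, GoalType.LINKS_FETCH)


goal = GoalType("links_fetch")  # look up an intent by its string value
print(goal.name, wants_links(goal), wants_json(goal))
```

Looking up an unknown string such as GoalType("click") raises ValueError, which gives an early failure for a mistyped intent.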

Getting Started

Local testing installation

pip install bananalyzer
Implement the agent_runner.py interface and make a banalyzer.py test file (the name doesn't matter). Below is an example file:

import asyncio
from playwright.async_api import BrowserContext
from bananalyzer.data.schemas import Example
from bananalyzer.runner.agent_runner import AgentResult, AgentRunner


class NullAgentRunner(AgentRunner):
    """
    A test agent class that just returns the expected output for each example
    """

    async def run(
        self,
        context: BrowserContext,
        example: Example,
    ) -> AgentResult:
        page = await context.new_page()
        # example.url has the real URL; example.get_static_url() returns the local MHTML file URL
        await page.goto(example.get_static_url())
        await asyncio.sleep(0.5)
        return example.evals[0].expected  # Just return the expected output directly so that tests pass


# The CLI looks for an AgentRunner instance in a variable called "agent"
agent = NullAgentRunner()

Run bananalyze ./tests/banalyzer.py to run the test suite.
You can also run bananalyze . to run all tests in the current directory.

Arguments

-h or --help: Show help
--headless: Run with Playwright headless mode
-id or --id: Run a specific test by id
-i or --intent: Only run tests of a particular intent (fetch, links, etc)
-c or --category: Only run tests of a particular category (healthcare, manufacturing, software, etc)
-n or --n: Number of test workers to use. The default is 1
-skip or --skip: A list of ids to skip tests on, separated by commas
-t or --type: Only run tests of a particular type (links, fetch, etc)
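To make the flag list concrete, the sketch below mirrors it with Python's argparse. It illustrates how such options could be parsed and is not the tool's actual parser: build_parser, the positional name "path", the help strings, and the value types (for example, splitting --skip on commas into a list) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (-h/--help is added by argparse)."""
    parser = argparse.ArgumentParser(prog="bananalyze")
    parser.add_argument("path", help="Test file or directory to run")
    parser.add_argument("--headless", action="store_true",
                        help="Run with Playwright headless mode")
    parser.add_argument("-id", "--id", default=None,
                        help="Run a specific test by id")
    parser.add_argument("-i", "--intent", default=None,
                        help="Only run tests of a particular intent")
    parser.add_argument("-c", "--category", default=None,
                        help="Only run tests of a particular category")
    parser.add_argument("-n", "--n", type=int, default=1,
                        help="Number of test workers to use")
    parser.add_argument("-skip", "--skip", type=lambda s: s.split(","), default=[],
                        help="Comma-separated ids to skip")
    parser.add_argument("-t", "--type", default=None,
                        help="Only run tests of a particular type")
    return parser


args = build_parser().parse_args(
    ["./tests/banalyzer.py", "--headless", "-i", "fetch", "-n", "4", "-skip", "a,b"]
)
print(args.path, args.headless, args.intent, args.n, args.skip)
```

The equivalent command line for this parse would be bananalyze ./tests/banalyzer.py --headless -i fetch -n 4 -skip a,b.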

Contributing

Running the server

The project has a basic FastAPI server to expose example data. You can run it with the following command:

cd server
poetry run uvicorn server:app --reload

Then navigate to http://127.0.0.1:8000/api/docs in your browser to see the API docs.

Adding examples

All current examples have been added manually by running the fetch.ipynb notebook at the root of this project. This notebook loads a site with Playwright and uses the Chrome developer API to save the page as an MHTML file.
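For readers who want to capture a page outside the notebook, the sketch below shows one way to do it with Playwright's Chromium CDP session and the DevTools Protocol method Page.captureSnapshot. The notebook's own code may differ; snapshot_path, save_mhtml, and the snapshots/ directory layout are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def snapshot_path(url: str, out_dir: str = "snapshots") -> Path:
    """Derive a file name such as snapshots/example.com.mhtml (hypothetical layout)."""
    host = urlparse(url).netloc or "page"
    return Path(out_dir) / f"{host}.mhtml"


async def save_mhtml(url: str, out_dir: str = "snapshots") -> Path:
    """Load a page in Chromium and write it to disk as a single MHTML file."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto(url)
        # Page.captureSnapshot is a Chrome DevTools Protocol method that
        # returns the page serialized as MHTML in the "data" field.
        session = await context.new_cdp_session(page)
        snapshot = await session.send("Page.captureSnapshot", {"format": "mhtml"})
        await browser.close()

    path = snapshot_path(url, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="" keeps the line endings exactly as the browser produced them.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(snapshot["data"])
    return path


# Usage (requires Playwright and a Chromium install):
#   import asyncio
#   asyncio.run(save_mhtml("https://example.com"))
print(snapshot_path("https://example.com/docs"))
```

CDP sessions are only available on Chromium-based browsers in Playwright, so this approach would not work with Firefox or WebKit.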

Roadmap

Launch

Functions to serve local MHTML sites
Agent interface required for running the tool
Pytest wrapper to enable CLI testing with additional arguments
Document a majority of the repo

Features

CLI param to filter tests by intent
Additional CLI params to select for specific tests or test categories
Ability to add multiple site pages to examples
Ability to add in-page actions to examples
Translate WebArena evals
Translate Mind2Web evals
Lag and bot detection emulation
Updated test visualization with separation of categories and outputs

Dataset updates

15 additional data retrieval examples
15 additional link examples
15 click examples
15 navigation examples
Tests requiring multi-step navigation
Tests requiring both navigation and data retrieval
Tests requiring pop-up closing
Tests requiring sign-in
Tests requiring captcha solving

Citations

@misc{reworkd2023bananalyzer,
  title        = {Bananalyzer},
  author       = {Asim Shrestha and Adam Watkins and Rohan Pandey and Srijan Subedi},
  year         = {2023},
  howpublished = {GitHub},
  url          = {https://github.com/reworkd/bananalyzer}
}
