A framework for evaluating web automation agents and LAM systems.
```
               __  __                    __
   ____ ______/ /_/ /_  ___  ____  _____/ /_
  / __ `/ ___/ __/ __ \/ _ \/ __ \/ ___/ __ \
 / /_/ / /__/ /_/ /_/ /  __/ / / / /__/ / / /
 \__,_/\___/\__/_.___/\___/_/ /_/\___/_/ /_/
```
Overview
actbench is an extensible framework designed to evaluate the performance and capabilities of web automation agents and LAM systems.
Installing actbench CLI
actbench requires Python 3.12 or higher. We recommend using pipx for a clean, isolated installation:
pipx install actbench
Usage
1. Setting API Keys
Before running benchmarks, you need to set API keys for the agents you want to use.
actbench set-key --agent raccoonai
You can list the supported agents and check which API keys are stored:
actbench agents list
2. Listing Available Tasks
actbench provides a built-in dataset of web automation tasks, created by merging and refining tasks from the webarena and webvoyager datasets.
Duplicate tasks have been removed, and the queries have been updated to reflect current information.
Task IDs trace back to the original datasets, so you can compare a modified task with its source side by side.
To see all the tasks currently available, just run this command:
actbench tasks list
3. Running Benchmarks
The run command is the heart of actbench. It allows you to execute tasks against specified agents.
Basic Usage
actbench run --agent raccoonai --task 256 --task 424
This command runs tasks with IDs 256 and 424 using the raccoonai agent.
Running All Tasks
actbench run --agent raccoonai --all-tasks
This runs all available tasks using the raccoonai agent.
Running Random Tasks
actbench run --agent raccoonai --random 5
This runs a random sample of 5 tasks using the raccoonai agent.
Running with All Agents
actbench run --all-agents --all-tasks
This runs all tasks with all configured agents (for which API keys are stored).
Controlling Parallelism
actbench run --agent raccoonai --all-tasks --parallel 4
This runs all tasks using the raccoonai agent, executing up to 4 tasks concurrently.
Setting a Rate Limit
actbench run --agent raccoonai --all-tasks --rate-limit 0.5
This adds a 0.5-second delay between task submissions.
Disabling Scoring
actbench run --agent raccoonai --all-tasks --no-scoring
This disables LLM-powered scoring and assigns all tasks a score of -1.
Combined Options
You can combine these options for more complex benchmark configurations:
actbench run --agent raccoonai --agent anotheragent --task 1 --task 2 --random 3 --parallel 2 --rate-limit 0.2
This command runs tasks 1 and 2, plus 3 random tasks, using both raccoonai and anotheragent (assuming API keys are set), with a parallelism of 2 and a rate limit of 0.2 seconds.
4. Viewing Results
The results command group allows you to manage and view benchmark results.
Listing Results
actbench results list
You can filter results by agent or run ID:
actbench results list --agent raccoonai
actbench results list --run-id <run_id>
Exporting Results
You can export results to JSON or CSV files:
actbench results export --format json --output results.json
actbench results export --format csv --output results.csv --agent raccoonai
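Exported files are plain JSON or CSV, so they can be post-processed with ordinary tooling. As a minimal sketch — the export schema is not documented here, so the field names (`agent`, `task_id`, `score`) and the sample records below are assumptions — an exported JSON file could be summarized per agent like this:

```python
import json
from collections import defaultdict

# Hypothetical records standing in for a real `actbench results export` file;
# the field names (agent, task_id, score) are assumed, not a documented schema.
sample = [
    {"agent": "raccoonai", "task_id": 256, "score": 0.8},
    {"agent": "raccoonai", "task_id": 424, "score": 1.0},
]
with open("results.json", "w") as f:
    json.dump(sample, f)

# Average scores per agent, skipping unscored entries (-1 under --no-scoring).
with open("results.json") as f:
    results = json.load(f)

scores_by_agent = defaultdict(list)
for record in results:
    if record["score"] != -1:
        scores_by_agent[record["agent"]].append(record["score"])

for agent, scores in scores_by_agent.items():
    print(f"{agent}: mean score {sum(scores) / len(scores):.2f} over {len(scores)} tasks")
# raccoonai: mean score 0.90 over 2 tasks
```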
Here's a complete table detailing the actbench CLI commands, their flags (options), and explanations:

| Command | Flag(s) / Option(s) | Explanation |
|---|---|---|
| actbench run | --task / -t | Specifies one or more task IDs to run. Can be used multiple times. If omitted, another task selection flag (--random or --all-tasks) must be used. |
| | --agent / -a | Specifies one or more agents to use. Can be used multiple times. If omitted, --all-agents must be used. |
| | --random / -r | Runs a specified number of random tasks. Takes an integer argument (e.g., --random 5). |
| | --all-tasks | Runs all available tasks. |
| | --all-agents | Runs with all configured agents (for which API keys have been set). |
| | --parallel / -p | Sets the number of tasks to run concurrently. Takes an integer argument (e.g., --parallel 4). Defaults to 1 (no parallelism). |
| | --rate-limit / -l | Sets the delay (in seconds) between task submissions. Takes a float argument (e.g., --rate-limit 0.5). Defaults to 0.1. |
| | --no-scoring / -ns | Disables LLM-based scoring. Results will have a score of -1. |
| actbench tasks list | None | Lists all available tasks in the dataset, showing their ID, query, URL, complexity, and whether they require login. |
| actbench set-key | --agent / -a | Sets the API key for a specified agent. Prompts the user to enter the key securely. Example: actbench set-key --agent raccoonai |
| actbench agents list | None | Lists all supported agents and shows which agents have API keys stored. |
| actbench results list | --agent / -a | Filters the results to show only those for a specific agent. |
| | --run-id / -r | Filters the results to show only those for a specific run ID. |
| actbench results export | --agent / -a | Filters the results to be exported for a specific agent. |
| | --run-id / -r | Filters the results to be exported for a specific run ID. |
| | --format / -f | Specifies the export format. Must be one of json or csv. Defaults to json. |
| | --output / -o | Specifies the output file path. Required. |
| actbench | None | Prints the help message for the CLI. |
| actbench --version | None | Prints the actbench version number. |
Extending actbench
Adding New Agents
- Create a new client class: Create a new Python file in the `actbench/clients/` directory (e.g., `my_agent.py`).
- Implement the `BaseClient` interface: Your class should inherit from `actbench.clients.BaseClient` and implement the `set_api_key()` and `run()` methods.
- Register your client: Add your client class to the `_CLIENT_REGISTRY` in `actbench/clients/__init__.py`.
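The steps above can be sketched as follows. Because `BaseClient`'s exact signatures are not shown here, the abstract base below is a local stand-in for `actbench.clients.BaseClient`, and the `run()` argument and return shapes are assumptions:

```python
from abc import ABC, abstractmethod


class BaseClient(ABC):
    """Local stand-in for actbench.clients.BaseClient (the real base ships with actbench)."""

    @abstractmethod
    def set_api_key(self, api_key: str) -> None: ...

    @abstractmethod
    def run(self, task: dict) -> dict: ...


class MyAgentClient(BaseClient):
    """Hypothetical client that would live in actbench/clients/my_agent.py."""

    def __init__(self) -> None:
        self.api_key = None

    def set_api_key(self, api_key: str) -> None:
        self.api_key = api_key

    def run(self, task: dict) -> dict:
        # A real client would call the agent's API here; this stub just
        # echoes the task so the control flow is visible.
        if self.api_key is None:
            raise RuntimeError("API key not set; run `actbench set-key` first")
        return {"task_id": task["id"], "response": f"visited {task['url']}"}


client = MyAgentClient()
client.set_api_key("example-key")
result = client.run({"id": 256, "url": "https://example.com"})
print(result["task_id"])  # 256
```

The registration step is then a one-line dictionary entry in `actbench/clients/__init__.py` (the key name here is hypothetical), e.g. `_CLIENT_REGISTRY["my_agent"] = MyAgentClient`.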
Adding New Datasets
- Create a new dataset class: Create a new Python file in the `actbench/datasets/` directory (e.g., `my_dataset.py`).
- Implement the `BaseDataset` interface: Your class should inherit from `actbench.datasets.BaseDataset` and implement the `load_task_data()`, `get_all_task_ids()`, and `get_all_tasks()` methods.
- Provide your dataset file: Place your dataset file (e.g., `my_dataset.jsonl`) in the `src/actbench/dataset/` directory.
- Update `_DATASET_INSTANCE`: If you want to use this dataset by default, update the `_DATASET_INSTANCE` variable in `src/actbench/datasets/__init__.py`.
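A dataset following the steps above might look like the sketch below. The local `BaseDataset` is a stand-in for `actbench.datasets.BaseDataset`, and the JSONL record fields (`id`, `query`, `url`) and method return types are assumptions:

```python
import json
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Local stand-in for actbench.datasets.BaseDataset."""

    @abstractmethod
    def load_task_data(self, task_id: int) -> dict: ...

    @abstractmethod
    def get_all_task_ids(self) -> list: ...

    @abstractmethod
    def get_all_tasks(self) -> list: ...


class MyDataset(BaseDataset):
    """Hypothetical JSONL-backed dataset (would live in actbench/datasets/my_dataset.py)."""

    def __init__(self, path: str) -> None:
        # One JSON object per line; an integer "id" field is assumed.
        with open(path) as f:
            self._tasks = {task["id"]: task for task in map(json.loads, f)}

    def load_task_data(self, task_id: int) -> dict:
        return self._tasks[task_id]

    def get_all_task_ids(self) -> list:
        return list(self._tasks)

    def get_all_tasks(self) -> list:
        return list(self._tasks.values())


# Write a tiny sample file so the sketch is self-contained.
with open("my_dataset.jsonl", "w") as f:
    f.write('{"id": 1, "query": "find the docs", "url": "https://example.com"}\n')
    f.write('{"id": 2, "query": "log in", "url": "https://example.org"}\n')

ds = MyDataset("my_dataset.jsonl")
print(ds.get_all_task_ids())  # [1, 2]
```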
Adding New Evaluation Metrics
You can customize the evaluation process by modifying the `Evaluator` class in `actbench/executor/evaluator.py` or by creating a new evaluator and integrating it into the `TaskExecutor`.
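For example, a deterministic metric can stand in for LLM scoring. The sketch below is not actbench's actual `Evaluator` interface — the class name, method signature, and scoring rule are all assumptions:

```python
class ExactMatchEvaluator:
    """Hypothetical evaluator scoring 1.0 for a case-insensitive exact match, else 0.0."""

    def evaluate(self, expected: str, actual: str) -> float:
        # Normalize whitespace and case before comparing.
        return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


evaluator = ExactMatchEvaluator()
print(evaluator.evaluate("Paris", "paris"))   # 1.0
print(evaluator.evaluate("Paris", "London"))  # 0.0
```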
Contributing
Contributions are welcome! Please follow these simple guidelines:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Write clear and concise code with appropriate comments.
- Submit a pull request.
File details
Details for the file actbench-0.0.1a5.tar.gz.
File metadata
- Download URL: actbench-0.0.1a5.tar.gz
- Upload date:
- Size: 130.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 43aa62f898b422b5ba39d9bc8e9321b6477be36747b2995796dc5561c9c58200 |
| MD5 | e83b484e3ca4efb67fc1e02b0d2aae0f |
| BLAKE2b-256 | 2f8e4ae8aef6448c65a85d685087dc37e96751d35fd8674b5952001510d4c883 |
File details
Details for the file actbench-0.0.1a5-py3-none-any.whl.
File metadata
- Download URL: actbench-0.0.1a5-py3-none-any.whl
- Upload date:
- Size: 19.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3989b7200dadab618129b5d49c31d5c365b5ebfb211ad983b984c6180ceb58c6 |
| MD5 | 84ad0089839aaa0594fa7aeadb950aac |
| BLAKE2b-256 | ff351404ea9cb34fd225a3ae3b8e73fbee9bc75d549a5c1aa401fb57c21f0fea |