A framework for evaluating web automation agents and LAM systems.
```
               __  __                    __
   ____ ______/ /_/ /_  ___  ____  _____/ /_
  / __ `/ ___/ __/ __ \/ _ \/ __ \/ ___/ __ \
 / /_/ / /__/ /_/ /_/ /  __/ / / / /__/ / / /
 \__,_/\___/\__/_.___/\___/_/ /_/\___/_/ /_/
```
Overview
actbench is an extensible framework designed to evaluate the performance and capabilities of web automation agents and LAM systems.
Installing actbench CLI
actbench requires Python 3.12 or higher. We recommend using pipx for a clean, isolated installation:
pipx install actbench
Usage
1. Setting API Keys
Before running benchmarks, you need to set API keys for the agents you want to use.
actbench set-key --agent raccoonai
You can list the supported agents and check which API keys are stored:
actbench agents list
2. Listing Available Tasks
actbench provides a built-in dataset of web automation tasks, created by merging and refining tasks from the webarena and webvoyager datasets.
Duplicate tasks have been removed, and the queries have been updated to reflect current information.
Task IDs trace back to the original datasets, so you can compare a modified task with its source side by side.
To see all the tasks currently available, just run this command:
actbench tasks list
3. Running Benchmarks
The run command is the heart of actbench. It allows you to execute tasks against specified agents.
Basic Usage
actbench run --agent raccoonai --task 256 --task 424
This command runs tasks with IDs 256 and 424 using the raccoonai agent.
Running All Tasks
actbench run --agent raccoonai --all-tasks
This runs all available tasks using the raccoonai agent.
Running Random Tasks
actbench run --agent raccoonai --random 5
This runs a random sample of 5 tasks using the raccoonai agent.
Running with All Agents
actbench run --all-agents --all-tasks
This runs all tasks with all configured agents (for which API keys are stored).
Controlling Parallelism
actbench run --agent raccoonai --all-tasks --parallel 4
This runs all tasks using the raccoonai agent, executing up to 4 tasks concurrently.
Setting a Rate Limit
actbench run --agent raccoonai --all-tasks --rate-limit 0.5
This adds a 0.5-second delay between task submissions.
Disabling Scoring
actbench run --agent raccoonai --all-tasks --no-scoring
This disables LLM-powered scoring and assigns all tasks a score of -1.
Combined Options
You can combine these options for more complex benchmark configurations:
actbench run --agent raccoonai --agent anotheragent --task 1 --task 2 --random 3 --parallel 2 --rate-limit 0.2
This command runs tasks 1 and 2, plus 3 random tasks, using both raccoonai and anotheragent (assuming API keys are set), with a parallelism of 2 and a rate limit of 0.2 seconds.
4. Viewing Results
The results command group allows you to manage and view benchmark results.
Listing Results
actbench results list
You can filter results by agent or run ID:
actbench results list --agent raccoonai
actbench results list --run-id <run_id>
Exporting Results
You can export results to JSON or CSV files:
actbench results export --format json --output results.json
actbench results export --format csv --output results.csv --agent raccoonai
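Exported files are plain JSON or CSV, so they can be post-processed with ordinary tooling. As a minimal sketch — the export schema is not documented here, so the field names (`agent`, `task_id`, `score`) and the sample records below are assumptions — an exported JSON file could be summarized per agent like this:

```python
import json
from collections import defaultdict

# Hypothetical records standing in for a real `actbench results export` file;
# the field names (agent, task_id, score) are assumed, not a documented schema.
sample = [
    {"agent": "raccoonai", "task_id": 256, "score": 0.8},
    {"agent": "raccoonai", "task_id": 424, "score": 1.0},
]
with open("results.json", "w") as f:
    json.dump(sample, f)

# Average scores per agent, skipping unscored entries (-1 under --no-scoring).
with open("results.json") as f:
    results = json.load(f)

scores_by_agent = defaultdict(list)
for record in results:
    if record["score"] != -1:
        scores_by_agent[record["agent"]].append(record["score"])

for agent, scores in scores_by_agent.items():
    print(f"{agent}: mean score {sum(scores) / len(scores):.2f} over {len(scores)} tasks")
# raccoonai: mean score 0.90 over 2 tasks
```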
Here's a complete table detailing the actbench CLI commands, their flags (options), and explanations:

| Command | Flag(s) / Option(s) | Explanation |
|---|---|---|
| actbench run | --task / -t | Specifies one or more task IDs to run. Can be used multiple times. If omitted, another task selection flag (--random or --all-tasks) must be used. |
| | --agent / -a | Specifies one or more agents to use. Can be used multiple times. If omitted, --all-agents must be used. |
| | --random / -r | Runs a specified number of random tasks. Takes an integer argument (e.g., --random 5). |
| | --all-tasks | Runs all available tasks. |
| | --all-agents | Runs with all configured agents (for which API keys have been set). |
| | --parallel / -p | Sets the number of tasks to run concurrently. Takes an integer argument (e.g., --parallel 4). Defaults to 1 (no parallelism). |
| | --rate-limit / -l | Sets the delay (in seconds) between task submissions. Takes a float argument (e.g., --rate-limit 0.5). Defaults to 0.1. |
| | --no-scoring / -ns | Disables LLM-based scoring. Results will have a score of -1. |
| actbench tasks list | None | Lists all available tasks in the dataset, showing their ID, query, URL, complexity, and whether they require login. |
| actbench set-key | --agent / -a | Sets the API key for a specified agent. Prompts the user to enter the key securely. Example: actbench set-key --agent raccoonai |
| actbench agents list | None | Lists all supported agents and shows which agents have API keys stored. |
| actbench results list | --agent / -a | Filters the results to show only those for a specific agent. |
| | --run-id / -r | Filters the results to show only those for a specific run ID. |
| actbench results export | --agent / -a | Filters the results to be exported for a specific agent. |
| | --run-id / -r | Filters the results to be exported for a specific run ID. |
| | --format / -f | Specifies the export format. Must be one of json or csv. Defaults to json. |
| | --output / -o | Specifies the output file path. Required. |
| actbench | None | Prints the help message for the CLI. |
| actbench --version | None | Prints the actbench version number. |
Extending actbench
Adding New Agents
- Create a new client class: Create a new Python file in the `actbench/clients/` directory (e.g., `my_agent.py`).
- Implement the `BaseClient` interface: Your class should inherit from `actbench.clients.BaseClient` and implement the `set_api_key()` and `run()` methods.
- Register your client: Add your client class to the `_CLIENT_REGISTRY` in `actbench/clients/__init__.py`.
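The steps above can be sketched as follows. Because `BaseClient`'s exact signatures are not shown here, the abstract base below is a local stand-in for `actbench.clients.BaseClient`, and the `run()` argument and return shapes are assumptions:

```python
from abc import ABC, abstractmethod


class BaseClient(ABC):
    """Local stand-in for actbench.clients.BaseClient (the real base ships with actbench)."""

    @abstractmethod
    def set_api_key(self, api_key: str) -> None: ...

    @abstractmethod
    def run(self, task: dict) -> dict: ...


class MyAgentClient(BaseClient):
    """Hypothetical client that would live in actbench/clients/my_agent.py."""

    def __init__(self) -> None:
        self.api_key = None

    def set_api_key(self, api_key: str) -> None:
        self.api_key = api_key

    def run(self, task: dict) -> dict:
        # A real client would call the agent's API here; this stub just
        # echoes the task so the control flow is visible.
        if self.api_key is None:
            raise RuntimeError("API key not set; run `actbench set-key` first")
        return {"task_id": task["id"], "response": f"visited {task['url']}"}


client = MyAgentClient()
client.set_api_key("example-key")
result = client.run({"id": 256, "url": "https://example.com"})
print(result["task_id"])  # 256
```

The registration step is then a one-line dictionary entry in `actbench/clients/__init__.py` (the key name here is hypothetical), e.g. `_CLIENT_REGISTRY["my_agent"] = MyAgentClient`.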
Adding New Datasets
- Create a new dataset class: Create a new Python file in the `actbench/datasets/` directory (e.g., `my_dataset.py`).
- Implement the `BaseDataset` interface: Your class should inherit from `actbench.datasets.BaseDataset` and implement the `load_task_data()`, `get_all_task_ids()`, and `get_all_tasks()` methods.
- Provide your dataset file: Place your dataset file (e.g., `my_dataset.jsonl`) in the `src/actbench/dataset/` directory.
- Update `_DATASET_INSTANCE`: If you want to use this dataset by default, update the `_DATASET_INSTANCE` variable in `src/actbench/datasets/__init__.py`.
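A dataset following the steps above might look like the sketch below. The local `BaseDataset` is a stand-in for `actbench.datasets.BaseDataset`, and the JSONL record fields (`id`, `query`, `url`) and method return types are assumptions:

```python
import json
from abc import ABC, abstractmethod


class BaseDataset(ABC):
    """Local stand-in for actbench.datasets.BaseDataset."""

    @abstractmethod
    def load_task_data(self, task_id: int) -> dict: ...

    @abstractmethod
    def get_all_task_ids(self) -> list: ...

    @abstractmethod
    def get_all_tasks(self) -> list: ...


class MyDataset(BaseDataset):
    """Hypothetical JSONL-backed dataset (would live in actbench/datasets/my_dataset.py)."""

    def __init__(self, path: str) -> None:
        # One JSON object per line; an integer "id" field is assumed.
        with open(path) as f:
            self._tasks = {task["id"]: task for task in map(json.loads, f)}

    def load_task_data(self, task_id: int) -> dict:
        return self._tasks[task_id]

    def get_all_task_ids(self) -> list:
        return list(self._tasks)

    def get_all_tasks(self) -> list:
        return list(self._tasks.values())


# Write a tiny sample file so the sketch is self-contained.
with open("my_dataset.jsonl", "w") as f:
    f.write('{"id": 1, "query": "find the docs", "url": "https://example.com"}\n')
    f.write('{"id": 2, "query": "log in", "url": "https://example.org"}\n')

ds = MyDataset("my_dataset.jsonl")
print(ds.get_all_task_ids())  # [1, 2]
```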
Adding New Evaluation Metrics
You can customize the evaluation process by modifying the `Evaluator` class in `actbench/executor/evaluator.py` or by creating a new evaluator and integrating it into the `TaskExecutor`.
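For example, a deterministic metric can stand in for LLM scoring. The sketch below is not actbench's actual `Evaluator` interface — the class name, method signature, and scoring rule are all assumptions:

```python
class ExactMatchEvaluator:
    """Hypothetical evaluator scoring 1.0 for a case-insensitive exact match, else 0.0."""

    def evaluate(self, expected: str, actual: str) -> float:
        # Normalize whitespace and case before comparing.
        return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


evaluator = ExactMatchEvaluator()
print(evaluator.evaluate("Paris", "paris"))   # 1.0
print(evaluator.evaluate("Paris", "London"))  # 0.0
```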
Contributing
Contributions are welcome! Please follow these simple guidelines:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Write clear and concise code with appropriate comments.
- Submit a pull request.
File details
Details for the file actbench-0.0.1a5.tar.gz.
File metadata
- Download URL: actbench-0.0.1a5.tar.gz
- Upload date:
- Size: 130.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 43aa62f898b422b5ba39d9bc8e9321b6477be36747b2995796dc5561c9c58200 |
| MD5 | e83b484e3ca4efb67fc1e02b0d2aae0f |
| BLAKE2b-256 | 2f8e4ae8aef6448c65a85d685087dc37e96751d35fd8674b5952001510d4c883 |
File details
Details for the file actbench-0.0.1a5-py3-none-any.whl.
File metadata
- Download URL: actbench-0.0.1a5-py3-none-any.whl
- Upload date:
- Size: 19.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.6.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3989b7200dadab618129b5d49c31d5c365b5ebfb211ad983b984c6180ceb58c6 |
| MD5 | 84ad0089839aaa0594fa7aeadb950aac |
| BLAKE2b-256 | ff351404ea9cb34fd225a3ae3b8e73fbee9bc75d549a5c1aa401fb57c21f0fea |