Platform for creating computer-use verifiable environments and training VLM agents to use them.
cua-bench
A set of tools for creating verifiable environments for computer automation tasks, evaluation, and training. It supports real Windows, Linux, macOS, and Android VM environments, as well as HTML-based webtop environments that can visually emulate macos, win11, win10, ios, android, and more.
Installation
uv tool install -e .
playwright install chromium
Docker Setup (for batch jobs and dataset processing)
Build the cua-bench Docker image:
docker build -t cua-bench:latest .
Quick Start
Create an environment:
cb create-task tasks/my_env
Run the environment:
cb interact tasks/my_env
CLI Usage
Install an environment
cb install tasks/click_env
List tasks
# List all environments
cb tasks
# List tasks in specific environment
cb tasks tasks/click_env
Interact with a task
Interact with a task in the browser. This is useful for debugging and testing.
cb interact tasks/click_env --task-id 0 --solve --screenshot output.png
Evaluate agents on tasks
# Evaluate agent on tasks/click_env
cb eval tasks/click_env --model anthropic/claude-3-5-sonnet-20240620
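To evaluate several environments or models in one go, the command above can be driven from a short script. The following is a minimal sketch, not part of cua-bench: `eval_sweep` and `dry_run` are hypothetical helper names, and the only CLI syntax used is the `cb eval <env> --model <model>` form shown above. What exit code `cb eval` returns on a failed evaluation is not documented here, so treat the collected codes as "the process ran", not as a score.

```python
import itertools
import subprocess


def dry_run(argv):
    """Print the command instead of executing it and report success."""
    print(" ".join(argv))
    return subprocess.CompletedProcess(argv, 0)


def eval_sweep(env_paths, models, runner=subprocess.run):
    """Run `cb eval <env> --model <model>` for every (env, model) pair.

    `runner` is injectable so the sweep can be previewed without cua-bench
    installed. Returns a dict mapping (env_path, model) to the exit code.
    """
    results = {}
    for env_path, model in itertools.product(env_paths, models):
        argv = ["cb", "eval", env_path, "--model", model]
        completed = runner(argv)
        results[(env_path, model)] = completed.returncode
    return results


if __name__ == "__main__":
    # Preview only: pass runner=subprocess.run (the default) to execute.
    eval_sweep(
        ["tasks/click_env", "tasks/my_env"],
        ["anthropic/claude-3-5-sonnet-20240620"],
        runner=dry_run,
    )
```

Keeping the runner as a parameter means the same function serves both as a preview and as the real launcher.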
Run tasks with batch processing
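Run a batch of cua-bench tasks on GCP Batch or locally in Docker. Use cb dump-solution for multi-step trajectories (setup + solve + evaluate) and cb dump-setup for single-step trajectories (setup + evaluate). Pass --local to run in Docker; without it, the job is submitted to GCP Batch.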
# Build Docker image first (required for local batch)
docker build -t cua-bench:latest .
# Local (Docker) - Run 4 tasks from click_env (setup + solve + evaluate)
cb dump-solution tasks/click_env 4 --local
# Local (Docker) - Run 4 tasks from click_env (setup + evaluate)
cb dump-setup tasks/click_env 4 --local --output-dir ./outputs
# GCP Batch - Run 16 tasks from click_env (setup + solve + evaluate)
cb dump-solution tasks/click_env 16 --parallelism 8
# GCP Batch - Run 16 tasks from click_env (setup + evaluate)
cb dump-setup tasks/click_env 16 --parallelism 8 --output-dir ./outputs
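The four commands above differ only in the subcommand and a few flags, so a script that launches many batches can assemble them from parameters. This is a sketch under assumptions: `build_dump_command` is a hypothetical helper, it emits only the flags that appear in the commands above (`--local`, `--parallelism`, `--output-dir`), and whether every flag combination is accepted by both subcommands has not been verified here.

```python
import shlex


def build_dump_command(kind, env_path, count, *, local=False,
                       parallelism=None, output_dir=None):
    """Return the argv list for `cb dump-solution` or `cb dump-setup`.

    kind is "solution" (setup + solve + evaluate) or "setup" (setup + evaluate).
    Only flags shown in the README commands are emitted.
    """
    if kind not in ("solution", "setup"):
        raise ValueError(f"unknown kind: {kind!r}")
    argv = ["cb", f"dump-{kind}", env_path, str(count)]
    if local:
        argv.append("--local")  # run in local Docker instead of GCP Batch
    if parallelism is not None:
        argv += ["--parallelism", str(parallelism)]
    if output_dir is not None:
        argv += ["--output-dir", output_dir]
    return argv


if __name__ == "__main__":
    cmd = build_dump_command("setup", "tasks/click_env", 4,
                             local=True, output_dir="./outputs")
    print(shlex.join(cmd))
    # To launch it for real: subprocess.run(cmd, check=True)
```

Returning a list instead of a single string lets the result be passed straight to `subprocess.run` without shell quoting concerns.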
Process snapshots into a training dataset for UI grounding
cb process converts a directory of snapshots into a UI-grounding dataset using action augmentation. Pass a count to limit how many snapshots are processed, or omit it to process all of them.
# Process 5 snapshots using 'aguvis' action augmentation
cb process ./outputs 5
# Process all snapshots and push to Hugging Face Hub
cb process ./outputs --push-to-hub --repo-id username/repo
Programmatic Interface
import cua_bench as cb
# Create an environment
env = cb.make("tasks/click_env")
# Setup and get initial screenshot
screenshot, task_cfg = env.reset() # optionally pass task_id
# Execute a step
screenshot = env.step('page.click("#submit")')
# Run the solution
screenshot = env.solve()
# Evaluate the result
rewards = env.evaluate()
# Clean up
env.close()
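In the snippet above, `env.close()` is skipped if any earlier call raises. One way to guarantee cleanup is to wrap the environment in `contextlib.closing`, as sketched below. `StubEnv` is a stand-in defined here so the example runs without cua-bench installed; it mirrors only the method names used above, and its return values, along with passing `task_id` as a keyword argument, are assumptions, not the library's documented behavior. `run_episode` is likewise a hypothetical helper.

```python
from contextlib import closing


class StubEnv:
    """Stand-in exposing only the methods used in the README snippet.

    Return values are placeholders (assumptions), not cua-bench's real types.
    """

    def __init__(self):
        self.closed = False
        self.steps = []

    def reset(self, task_id=0):
        return b"screenshot-0", {"task_id": task_id}

    def step(self, action):
        self.steps.append(action)
        return f"screenshot-{len(self.steps)}".encode()

    def evaluate(self):
        return [1.0] if self.steps else [0.0]

    def close(self):
        self.closed = True


def run_episode(env, actions, task_id=0):
    """Reset, apply each action string in order, and return the rewards.

    closing() calls env.close() on exit, even if a step raises.
    """
    with closing(env):
        env.reset(task_id=task_id)
        for action in actions:
            env.step(action)
        return env.evaluate()


if __name__ == "__main__":
    # With cua-bench installed, replace StubEnv() with cb.make("tasks/click_env").
    env = StubEnv()
    print(run_episode(env, ['page.click("#submit")']))
```

The same pattern applies to batch scripts that open many environments in sequence, where a leaked browser session per failure adds up quickly.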
File details
Details for the file cua_bench-0.2.0.tar.gz.
File metadata
- Download URL: cua_bench-0.2.0.tar.gz
- Upload date:
- Size: 88.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4fb9745b210bef148163c8cde5cbad01c41159015644a7fc0802ec47fc331be8 |
| MD5 | cafb482cbbdb0e88ffa45380978dad6b |
| BLAKE2b-256 | 14caf89fa41f4a251e51fcd0726cbf8f54d1e6aedef97e05fec460a26adc44db |
File details
Details for the file cua_bench-0.2.0-py3-none-any.whl.
File metadata
- Download URL: cua_bench-0.2.0-py3-none-any.whl
- Upload date:
- Size: 87.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.20
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5033bc86054baae43822a787b42c8c9cf419daa1e4fa9c3fe6e19cf9eca2f533 |
| MD5 | 8d56ad54ed29fefbd799172a9ad2a8a5 |
| BLAKE2b-256 | 36500035b559c08e8bdc09f3f6a1f5a153638e694ede727af0a406f4eeedffe6 |