Platform for creating verifiable computer-use environments and training VLM agents to use them.

cua-bench

A set of tools for creating verifiable environments for computer-automation tasks, evaluation, and training. It supports real Windows, Linux, macOS, and Android VM environments, as well as HTML-based webtop environments that visually emulate macos, win11, win10, ios, android, and more.

Installation

uv tool install -e .
playwright install chromium

Docker Setup (for batch jobs and dataset processing)

Build the cua-bench Docker image:

docker build -t cua-bench:latest .

Quick Start

Create an environment:

cb create-task tasks/my_env

Run the environment:

cb interact tasks/my_env

CLI Usage

Install an environment

cb install tasks/click_env

List tasks

# List all environments
cb tasks

# List tasks in a specific environment
cb tasks tasks/click_env

Interact with a task

Interact with a task in the browser. This is useful for debugging and testing.

cb interact tasks/click_env --task-id 0 --solve --screenshot output.png

Evaluate agents on tasks

# Evaluate agent on tasks/click_env
cb eval tasks/click_env --model anthropic/claude-3-5-sonnet-20240620

Run tasks with batch processing

Run a batch of cua-bench tasks on GCP Batch or locally in Docker. Use cb dump-solution for multi-step trajectories (setup + solve + evaluate) and cb dump-setup for single-step trajectories (setup + evaluate).

# Build Docker image first (required for local batch)
docker build -t cua-bench:latest .

# Local (Docker) - Run 4 tasks from click_env (setup + solve + evaluate)
cb dump-solution tasks/click_env 4 --local

# Local (Docker) - Run 4 tasks from click_env (setup + evaluate)
cb dump-setup tasks/click_env 4 --local --output-dir ./outputs

# GCP Batch - Run 16 tasks from click_env (setup + solve + evaluate)
cb dump-solution tasks/click_env 16 --parallelism 8

# GCP Batch - Run 16 tasks from click_env (setup + evaluate)
cb dump-setup tasks/click_env 16 --parallelism 8 --output-dir ./outputs
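When batch jobs are launched from a script, it can help to assemble the command line in one place. The sketch below is a hypothetical helper (build_dump_command is not part of cua-bench) that only reproduces the argument order and flags shown in the examples above. It assumes the flags can be combined the same way for both subcommands, which the examples do not confirm.

```python
def build_dump_command(subcommand, env_path, num_tasks, *, local=False,
                       parallelism=None, output_dir=None):
    """Return the argv list for `cb dump-solution` or `cb dump-setup`.

    Hypothetical helper: it mirrors the CLI examples in this README and
    does not validate anything against the real `cb` command.
    """
    if subcommand not in ("dump-solution", "dump-setup"):
        raise ValueError(f"unsupported subcommand: {subcommand}")

    # Positional arguments: environment path, then the number of tasks.
    argv = ["cb", subcommand, env_path, str(num_tasks)]

    if local:
        # Run in local Docker instead of submitting to GCP Batch.
        argv.append("--local")
    if parallelism is not None:
        argv += ["--parallelism", str(parallelism)]
    if output_dir is not None:
        argv += ["--output-dir", str(output_dir)]
    return argv
```

The resulting list can be passed to subprocess.run(argv, check=True), which avoids shell quoting issues for paths that contain spaces.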

Process snapshots into a training dataset for UI grounding

Given a directory of snapshots, cb process converts them into a training dataset for UI grounding using action augmentation.

# Process 5 snapshots using 'aguvis' action augmentation
cb process ./outputs 5

# Process all snapshots and push to Hugging Face Hub
cb process ./outputs --push-to-hub --repo-id username/repo

Programmatic Interface

import cua_bench as cb

# Create an environment
env = cb.make("tasks/click_env")

# Set up and get the initial screenshot
screenshot, task_cfg = env.reset()  # optionally pass task_id

# Execute a step
screenshot = env.step('page.click("#submit")')

# Run the solution
screenshot = env.solve()

# Evaluate the result
rewards = env.evaluate()

# Clean up
env.close()
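The calls above can be wrapped so that the environment is always closed, even when a step raises. This is a minimal sketch, not part of cua-bench: run_episode is a hypothetical helper, and the environment factory is passed in as make_env (in real use this would be cb.make) so the function can also be exercised with a stand-in object. It uses only the methods shown above and assumes reset() returns a (screenshot, task_cfg) pair as in that example.

```python
def run_episode(make_env, env_path, actions=None):
    """Reset an environment, act in it, and return the evaluation result.

    make_env: a callable such as cua_bench.make (assumption: it takes the
              environment path as its only required argument).
    actions:  optional list of action strings for env.step(); when omitted,
              the task's own solution is run via env.solve().
    """
    env = make_env(env_path)
    try:
        screenshot, task_cfg = env.reset()
        if actions is None:
            env.solve()                 # run the bundled solution
        else:
            for action in actions:      # e.g. 'page.click("#submit")'
                env.step(action)
        return env.evaluate()
    finally:
        env.close()                     # release resources even on failure
```

With cua-bench installed, run_episode(cb.make, "tasks/click_env") would run the bundled solution, and passing actions=['page.click("#submit")'] would replay explicit steps instead.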

Download files


Source Distribution

cua_bench-0.2.0.tar.gz (88.1 MB view details)

Uploaded Source

Built Distribution


cua_bench-0.2.0-py3-none-any.whl (87.7 MB view details)

Uploaded Python 3

File details

Details for the file cua_bench-0.2.0.tar.gz.

File metadata

  • Download URL: cua_bench-0.2.0.tar.gz
  • Upload date:
  • Size: 88.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.20

File hashes

Hashes for cua_bench-0.2.0.tar.gz
Algorithm Hash digest
SHA256 4fb9745b210bef148163c8cde5cbad01c41159015644a7fc0802ec47fc331be8
MD5 cafb482cbbdb0e88ffa45380978dad6b
BLAKE2b-256 14caf89fa41f4a251e51fcd0726cbf8f54d1e6aedef97e05fec460a26adc44db


File details

Details for the file cua_bench-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: cua_bench-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 87.7 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.7.20

File hashes

Hashes for cua_bench-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 5033bc86054baae43822a787b42c8c9cf419daa1e4fa9c3fe6e19cf9eca2f533
MD5 8d56ad54ed29fefbd799172a9ad2a8a5
BLAKE2b-256 36500035b559c08e8bdc09f3f6a1f5a153638e694ede727af0a406f4eeedffe6
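A downloaded file can be checked against the SHA256 digests listed above with the Python standard library. The sketch below is illustrative only; sha256_of and matches_published_hash are hypothetical helper names, and the file is read in chunks because the archives are tens of megabytes.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published_hash("cua_bench-0.2.0.tar.gz", "4fb9745b210bef148163c8cde5cbad01c41159015644a7fc0802ec47fc331be8") should return True for an intact copy of the source distribution.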

