
Platform for creating computer-use verifiable environments and training VLM agents to use them.


cua-bench

A framework for machine learning on computer automation tasks. It features an HTML-based desktop environment with a semantic design system that can visually emulate macOS, Windows 11, Windows 10, iOS, Android, and more.

Installation

uv pip install -e .
playwright install chromium
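
After installing, it can help to confirm that both the Python package and the CLI are visible. The following is a minimal sketch, assuming the package is importable as cua_bench (as in the Programmatic Interface section) and that the CLI is exposed as td (as in the commands in this README). The check_install helper is hypothetical and not part of cua-bench.

```python
import importlib.util
import shutil


def check_install(module="cua_bench", cli="td"):
    """Report whether the Python package and the CLI entry point are visible.

    Hypothetical helper, not part of cua-bench. The default names are taken
    from this README and are assumptions about how the package is exposed.
    """
    return {
        # find_spec returns None when a top-level module cannot be located.
        "module_importable": importlib.util.find_spec(module) is not None,
        # shutil.which returns None when the executable is not on PATH.
        "cli_on_path": shutil.which(cli) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_install().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

This only checks that the names resolve. It does not verify that the Chromium build for Playwright was downloaded.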

Docker Setup (for batch processing)

Build the cua-bench Docker image:

docker build -t cua-bench:latest .

Quick Start

Create an environment:

td create-task tasks/my_env

Run the environment:

td interact tasks/my_env

CLI Usage

Install an environment

td install tasks/click_env

List tasks

# List all environments
td tasks

# List tasks in a specific environment
td tasks tasks/click_env

Interact with a task

Interact with a task in the browser. This is useful for debugging and testing.

td interact tasks/click_env --task-id 0 --solve --screenshot output.png

Run tasks with batch processing

Run a batch of cua-bench tasks on GCP Batch or locally in Docker. Use td dump-solution for multi-step trajectories (setup, solve, evaluate) and td dump-setup for single-step trajectories (setup, evaluate).

# Build Docker image first (required for local batch)
docker build -t cua-bench:latest .

# Local (Docker) - Run 4 tasks from click_env (setup + solve + evaluate)
td dump-solution tasks/click_env 4 --local

# Local (Docker) - Run 4 tasks from click_env (setup + evaluate)
td dump-setup tasks/click_env 4 --local --output-dir ./outputs

# GCP Batch - Run 16 tasks from click_env (setup + solve + evaluate)
td dump-solution tasks/click_env 16 --parallelism 8

# GCP Batch - Run 16 tasks from click_env (setup + evaluate)
td dump-setup tasks/click_env 16 --parallelism 8 --output-dir ./outputs
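
When batch runs are launched from a script, the commands above can be assembled programmatically. The following is a minimal sketch that only builds the argument list. It uses the subcommands and flags shown in this README (--local, --parallelism, --output-dir) and makes no assumption about other options. The batch_command helper is hypothetical and not part of cua-bench, and whether every flag combination is accepted by td is not verified here.

```python
import shlex


def batch_command(mode, env_path, num_tasks, local=False,
                  parallelism=None, output_dir=None):
    """Build the argv list for `td dump-solution` or `td dump-setup`.

    Hypothetical helper: it mirrors the command lines shown in the README
    and does not call into cua-bench itself.
    """
    if mode not in ("dump-solution", "dump-setup"):
        raise ValueError(f"unknown mode: {mode!r}")
    argv = ["td", mode, env_path, str(num_tasks)]
    if local:
        argv.append("--local")
    if parallelism is not None:
        argv += ["--parallelism", str(parallelism)]
    if output_dir is not None:
        argv += ["--output-dir", output_dir]
    return argv


cmd = batch_command("dump-setup", "tasks/click_env", 16,
                    parallelism=8, output_dir="./outputs")
print(shlex.join(cmd))
# td dump-setup tasks/click_env 16 --parallelism 8 --output-dir ./outputs
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch the run.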

Process snapshots into a training dataset for UI grounding

Given a directory of snapshots, cua-bench offers a simple way to process them into a dataset for UI grounding using action augmentation.

# Process 5 snapshots using 'aguvis' action augmentation
td process ./outputs 5

# Process all snapshots and push to Hugging Face Hub
td process ./outputs --push-to-hub --repo-id username/repo

Programmatic Interface

import cua_bench as cb

# Create an environment
env = cb.make("tasks/click_env")

# Setup and get initial screenshot
screenshot, task_cfg = env.setup()  # optionally pass task_id

# Execute a step
screenshot = env.step('page.click("#submit")')

# Run the solution
screenshot = env.solve()

# Evaluate the result
rewards = env.evaluate()

# Clean up
env.close()
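
If a step raises, the environment in the snippet above is never closed. The following is a minimal sketch of a wrapper that always calls close(), using only the setup, step, evaluate, and close methods shown above. The run_episode helper and the FakeEnv stand-in are hypothetical. FakeEnv exists only so the sketch runs without a browser, and its return values are not what cua-bench returns.

```python
def run_episode(env, actions):
    """Set up a task, apply each action string, evaluate, and always close.

    Hypothetical helper: `env` is any object exposing setup(), step(action),
    evaluate(), and close() as in the snippet above.
    """
    try:
        screenshot, task_cfg = env.setup()
        screenshots = [screenshot]
        for action in actions:
            screenshots.append(env.step(action))
        rewards = env.evaluate()
        return screenshots, task_cfg, rewards
    finally:
        # Runs on success and on error, so the environment is not leaked.
        env.close()


class FakeEnv:
    """Stand-in environment for illustration only (not part of cua-bench)."""

    def __init__(self):
        self.steps = []
        self.closed = False

    def setup(self):
        return "shot-0", {"task_id": 0}

    def step(self, action):
        self.steps.append(action)
        return f"shot-{len(self.steps)}"

    def evaluate(self):
        return [1.0]

    def close(self):
        self.closed = True


fake = FakeEnv()
shots, cfg, rewards = run_episode(fake, ['page.click("#submit")'])
print(len(shots), fake.closed)
```

With a real environment, the call would look like run_episode(cb.make("tasks/click_env"), actions), assuming the object returned by cb.make behaves as shown above.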
