A Comprehensive Benchmark for Vision-Language-Action Models in Robotic Manipulation

🤖 VLA-Arena: A Comprehensive Benchmark for Vision-Language-Action Models

VLA-Arena is an open-source benchmark for the systematic evaluation of Vision-Language-Action (VLA) models. It provides a full toolchain covering scene modeling, demonstration collection, model training, and evaluation. It features 150+ tasks across 11 specialized suites, hierarchical difficulty levels (L0-L2), and metrics for assessing safety, generalization, and efficiency.

VLA-Arena focuses on four key domains:

  • Safety: Operate reliably and safely in the physical world.
  • Distractors: Maintain stable performance when facing environmental unpredictability.
  • Extrapolation: Generalize learned knowledge to novel situations.
  • Long Horizon: Combine long sequences of actions to achieve a complex goal.

📰 News

2025.09.29: VLA-Arena is officially released!

🔥 Highlights

  • 🚀 End-to-End & Out-of-the-Box: We provide a complete and unified toolchain covering everything from scene modeling and behavior collection to model training and evaluation. Paired with comprehensive docs and tutorials, you can get started in minutes.
  • 🔌 Plug-and-Play Evaluation: Seamlessly integrate and benchmark your own VLA models. Our framework is designed with a unified API, making the evaluation of new architectures straightforward with minimal code changes.
  • 🛠️ Effortless Task Customization: Use the Constrained Behavior Domain Definition Language (CBDDL) to define new tasks and safety constraints quickly. Because CBDDL is declarative, you describe the scene, goal, and constraints rather than writing simulation code, which makes broad scenario coverage practical with little effort.
  • 📊 Systematic Difficulty Scaling: Systematically assess model capabilities across three distinct difficulty levels (L0→L1→L2). Isolate specific skills and pinpoint failure points, from basic object manipulation to complex, long-horizon tasks.

If you find VLA-Arena useful, please cite it in your publications.

@misc{vla-arena2025,
  title={VLA-Arena},
  author={Jiahao Li and Borong Zhang and Jiachen Shen and Jiaming Ji and Yaodong Yang},
  howpublished={GitHub repository},
  year={2025}
}

Quick Start

1. Installation

Install from PyPI (Recommended)

# 1. Install VLA-Arena
pip install vla-arena

# 2. Download task suites (required)
vla-arena-download-tasks install-all --repo vla-arena/tasks

📦 Important: To reduce PyPI package size, task suites and asset files must be downloaded separately after installation (~850 MB).
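
One quick way to confirm that the pip step succeeded is to query the installed package metadata from Python. The sketch below uses only the standard library; the helper name installed_version is illustrative and is not part of VLA-Arena.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "vla-arena") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("vla-arena is not installed in this environment")
    else:
        print(f"vla-arena version: {found}")
```

This only checks the Python package. The task suites and assets still have to be fetched with the download command shown above.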

Install from Source

# Clone repository (includes all tasks and assets)
git clone https://github.com/PKU-Alignment/VLA-Arena.git
cd VLA-Arena

# Create environment
conda create -n vla-arena python=3.10
conda activate vla-arena

# Install requirements
pip install -r requirements.txt

# Install VLA-Arena
pip install -e .

Notes

  • The mujoco.dll file may be missing from the robosuite/utils directory. If it is, obtain it from mujoco/mujoco.dll.
  • On Windows, change the MuJoCo rendering backend in robosuite\utils\binding_utils.py:
    if _SYSTEM == "Darwin":
      os.environ["MUJOCO_GL"] = "cgl"
    else:
      os.environ["MUJOCO_GL"] = "wgl"    # Change "egl" to "wgl"
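
The edit above hard-codes the Windows value in the else branch. As an alternative, the backend can be chosen from the detected platform so the same code still uses the original value elsewhere. This is a minimal sketch, not robosuite's own code: select_mujoco_gl is a hypothetical helper, and the three values are the ones quoted in the note.

```python
import os
import platform


def select_mujoco_gl(system: str) -> str:
    """Map an OS name (as returned by platform.system()) to a MUJOCO_GL value."""
    if system == "Darwin":
        return "cgl"
    if system == "Windows":
        return "wgl"  # the value the note above says to use on Windows
    return "egl"      # the original default that the note replaces on Windows


if __name__ == "__main__":
    # setdefault keeps any value the user has already exported.
    os.environ.setdefault("MUJOCO_GL", select_mujoco_gl(platform.system()))
    print(os.environ["MUJOCO_GL"])
```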
    

2. Basic Evaluation

# Evaluate a trained model
python scripts/evaluate_policy.py \
    --task_suite safety_static_obstacles \
    --task_level 0 \
    --n-episode 10 \
    --policy openvla \
    --model_ckpt /path/to/checkpoint
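
To evaluate the same checkpoint at every difficulty level, the command above can be generated in a loop. The sketch below only builds and prints the commands; the flags are copied from the example above, build_eval_command is a hypothetical helper, and it assumes that --task_level accepts 1 and 2 in the same way it accepts 0.

```python
import shlex
import sys


def build_eval_command(task_suite, task_level, n_episodes, policy, model_ckpt):
    """Build the argument list for scripts/evaluate_policy.py (flags as in the command above)."""
    return [
        sys.executable, "scripts/evaluate_policy.py",
        "--task_suite", task_suite,
        "--task_level", str(task_level),
        "--n-episode", str(n_episodes),
        "--policy", policy,
        "--model_ckpt", model_ckpt,
    ]


if __name__ == "__main__":
    for level in (0, 1, 2):  # L0, L1, L2
        cmd = build_eval_command(
            "safety_static_obstacles", level, 10, "openvla", "/path/to/checkpoint"
        )
        print(shlex.join(cmd))
        # To actually launch the run: subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems when the checkpoint path contains spaces.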

3. Data Collection

# Collect demonstration data
python scripts/collect_demonstration.py --bddl-file tasks/your_task.bddl

For detailed instructions, see our Documentation section.

Task Suites Overview

VLA-Arena provides 11 specialized task suites with 150+ tasks total, organized into four domains:

🛡️ Safety (5 suites, 75 tasks)

| Suite | Description | L0 | L1 | L2 | Total |
| --- | --- | --- | --- | --- | --- |
| static_obstacles | Static collision avoidance | 5 | 5 | 5 | 15 |
| cautious_grasp | Safe grasping strategies | 5 | 5 | 5 | 15 |
| hazard_avoidance | Hazard area avoidance | 5 | 5 | 5 | 15 |
| state_preservation | Object state preservation | 5 | 5 | 5 | 15 |
| dynamic_obstacles | Dynamic collision avoidance | 5 | 5 | 5 | 15 |

🔄 Distractor (2 suites, 30 tasks)

| Suite | Description | L0 | L1 | L2 | Total |
| --- | --- | --- | --- | --- | --- |
| static_distractors | Cluttered scene manipulation | 5 | 5 | 5 | 15 |
| dynamic_distractors | Dynamic scene manipulation | 5 | 5 | 5 | 15 |

🎯 Extrapolation (3 suites, 45 tasks)

| Suite | Description | L0 | L1 | L2 | Total |
| --- | --- | --- | --- | --- | --- |
| preposition_combinations | Spatial relationship understanding | 5 | 5 | 5 | 15 |
| task_workflows | Multi-step task planning | 5 | 5 | 5 | 15 |
| unseen_objects | Unseen object recognition | 5 | 5 | 5 | 15 |

📈 Long Horizon (1 suite, 20 tasks)

| Suite | Description | L0 | L1 | L2 | Total |
| --- | --- | --- | --- | --- | --- |
| long_horizon | Long-horizon task planning | 10 | 5 | 5 | 20 |

Difficulty Levels:

  • L0: Basic tasks with clear objectives
  • L1: Intermediate tasks with increased complexity
  • L2: Advanced tasks with challenging scenarios
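
The per-suite counts in the tables above can be tallied to check the domain totals and the overall suite and task counts. The sketch below copies the (L0, L1, L2) counts from those tables; the dictionary layout and the function name are illustrative only.

```python
# (L0, L1, L2) task counts per suite, copied from the tables above.
SUITES = {
    "Safety": {
        "static_obstacles": (5, 5, 5),
        "cautious_grasp": (5, 5, 5),
        "hazard_avoidance": (5, 5, 5),
        "state_preservation": (5, 5, 5),
        "dynamic_obstacles": (5, 5, 5),
    },
    "Distractor": {
        "static_distractors": (5, 5, 5),
        "dynamic_distractors": (5, 5, 5),
    },
    "Extrapolation": {
        "preposition_combinations": (5, 5, 5),
        "task_workflows": (5, 5, 5),
        "unseen_objects": (5, 5, 5),
    },
    "Long Horizon": {
        "long_horizon": (10, 5, 5),
    },
}


def domain_totals(suites):
    """Sum the task counts of every suite in each domain."""
    return {
        domain: sum(sum(levels) for levels in members.values())
        for domain, members in suites.items()
    }


if __name__ == "__main__":
    totals = domain_totals(SUITES)
    print(totals)
    print("suites:", sum(len(members) for members in SUITES.values()))
    print("tasks:", sum(totals.values()))
```

The tally gives 11 suites and 170 tasks, which is consistent with the "150+ tasks" figure quoted above.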

Suite Visualizations

[Figures: L0, L1, and L2 visualizations of each Safety, Distractor, Extrapolation, and Long Horizon suite.]

Installation

System Requirements

  • OS: Ubuntu 20.04+ or macOS 12+
  • Python: 3.9 or higher
  • CUDA: 11.8+ (for GPU acceleration)
  • RAM: 8GB minimum, 16GB recommended

Installation Steps

# Clone repository
git clone https://github.com/PKU-Alignment/VLA-Arena.git
cd VLA-Arena

# Create environment
conda create -n vla-arena python=3.10
conda activate vla-arena

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .

Documentation

VLA-Arena provides comprehensive documentation for all aspects of the framework. Choose the guide that best fits your needs:

📖 Core Guides

🏗️ Scene Construction Guide | Chinese version

Build custom task scenarios using CBDDL.

  • CBDDL file structure
  • Object and region definitions
  • State and goal specifications
  • Constraints, safety predicates and costs
  • Scene visualization

📊 Data Collection Guide | Chinese version

Collect demonstrations in custom scenes.

  • Interactive simulation environment
  • Keyboard controls for robotic arm
  • Data format conversion
  • Dataset creation and optimization

🔧 Model Fine-tuning Guide | Chinese version

Fine-tune VLA models using VLA-Arena generated datasets.

  • OpenVLA fine-tuning
  • Training scripts and configuration
  • Model evaluation

🎯 Model Evaluation Guide | Chinese version

Evaluate VLA models and add custom models to VLA-Arena.

  • Quick start evaluation
  • Supported models (OpenVLA)
  • Custom model integration
  • Configuration options

🔜 Quick Reference

Fine-tuning Scripts

Documentation Index

Leaderboard

OpenVLA-OFT Results (150,000 training steps, fine-tuned on VLA-Arena L0 datasets)

Overall Performance Summary

| Model | L0 Success | L1 Success | L2 Success | Avg Success |
| --- | --- | --- | --- | --- |
| OpenVLA-OFT | 76.4% | 36.3% | 16.7% | 36.5% |

🛡️ Safety Performance

| Task Suite | L0 Success | L1 Success | L2 Success | Avg Success |
| --- | --- | --- | --- | --- |
| static_obstacles | 100.0% | 20.0% | 20.0% | 46.7% |
| cautious_grasp | 60.0% | 50.0% | 0.0% | 36.7% |
| hazard_avoidance | 36.0% | 0.0% | 20.0% | 18.7% |
| state_preservation | 100.0% | 76.0% | 20.0% | 65.3% |
| dynamic_obstacles | 80.0% | 56.0% | 10.0% | 48.7% |
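
The Avg Success values in the per-suite rows above match the unweighted mean of the L0, L1, and L2 success rates, rounded to one decimal place. The source does not state the formula, so the sketch below is an assumption that happens to reproduce the reported numbers; the data is copied from the table and the function name is illustrative.

```python
from statistics import mean

# (L0, L1, L2) success rates in percent, copied from the Safety table above.
SAFETY_SUCCESS = {
    "static_obstacles": (100.0, 20.0, 20.0),
    "cautious_grasp": (60.0, 50.0, 0.0),
    "hazard_avoidance": (36.0, 0.0, 20.0),
    "state_preservation": (100.0, 76.0, 20.0),
    "dynamic_obstacles": (80.0, 56.0, 10.0),
}


def avg_success(levels):
    """Unweighted mean over difficulty levels, rounded to one decimal place."""
    return round(mean(levels), 1)


if __name__ == "__main__":
    for suite, levels in SAFETY_SUCCESS.items():
        print(f"{suite}: {avg_success(levels)}%")
```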

🛡️ Safety Cost Analysis

| Task Suite | L1 Total Cost | L2 Total Cost | Avg Total Cost |
| --- | --- | --- | --- |
| static_obstacles | 45.40 | 49.00 | 47.20 |
| cautious_grasp | 6.34 | 2.12 | 4.23 |
| hazard_avoidance | 22.91 | 14.71 | 18.81 |
| state_preservation | 7.60 | 4.60 | 6.10 |
| dynamic_obstacles | 3.66 | 1.84 | 2.75 |

🔄 Distractor Performance

| Task Suite | L0 Success | L1 Success | L2 Success | Avg Success |
| --- | --- | --- | --- | --- |
| robustness_static_distractors | 100.0% | 0.0% | 20.0% | 40.0% |
| robustness_dynamic_distractors | 100.0% | 54.0% | 40.0% | 64.7% |

🎯 Extrapolation Performance

| Task Suite | L0 Success | L1 Success | L2 Success | Avg Success |
| --- | --- | --- | --- | --- |
| preposition_combinations | 62.0% | 18.0% | 0.0% | 26.7% |
| task_workflows | 74.0% | 0.0% | 0.0% | 24.7% |
| unseen_objects | 60.0% | 40.0% | 20.0% | 40.0% |

📈 Long Horizon Performance

| Task Suite | L0 Success | L1 Success | L2 Success | Avg Success |
| --- | --- | --- | --- | --- |
| long_horizon | 80.0% | 0.0% | 0.0% | 26.7% |

License

This project is licensed under the Apache 2.0 license - see LICENSE for details.

Acknowledgments

  • RoboSuite, LIBERO, and VLABench teams for the framework
  • OpenVLA, UniVLA, Openpi, and lerobot teams for pioneering VLA research
  • All contributors and the robotics community

VLA-Arena: Advancing Vision-Language-Action Models Through Comprehensive Evaluation
Made with ❤️ by the VLA-Arena Team
