A Comprehensive Benchmark for Vision-Language-Action Models in Robotic Manipulation


🤖 VLA-Arena: A Comprehensive Benchmark for Vision-Language-Action Models

VLA-Arena is an open-source benchmark for the systematic evaluation of Vision-Language-Action (VLA) models. It provides a full toolchain covering scene modeling, demonstration collection, model training, and evaluation. It features 150+ tasks across 11 specialized suites, hierarchical difficulty levels (L0-L2), and comprehensive metrics for assessing safety, generalization, and efficiency.

VLA-Arena focuses on four key domains:

Safety: Operate reliably and safely in the physical world.
Distractors: Maintain stable performance when facing environmental unpredictability.
Extrapolation: Generalize learned knowledge to novel situations.
Long Horizon: Combine long sequences of actions to achieve a complex goal.

📰 News

2025.09.29: VLA-Arena is officially released!

🔥 Highlights

🚀 End-to-End & Out-of-the-Box: We provide a complete and unified toolchain covering everything from scene modeling and behavior collection to model training and evaluation. With comprehensive docs and tutorials, you can get started in minutes.
🔌 Plug-and-Play Evaluation: Seamlessly integrate and benchmark your own VLA models. Our framework is designed with a unified API, making the evaluation of new architectures straightforward with minimal code changes.
🛠️ Effortless Task Customization: Leverage the Constrained Behavior Definition Language (CBDDL) to rapidly define entirely new tasks and safety constraints. Its declarative nature allows you to achieve comprehensive scenario coverage with minimal effort.
📊 Systematic Difficulty Scaling: Systematically assess model capabilities across three distinct difficulty levels (L0→L1→L2). Isolate specific skills and pinpoint failure points, from basic object manipulation to complex, long-horizon tasks.

If you find VLA-Arena useful, please cite it in your publications.

@misc{vla-arena2025,
  title={VLA-Arena},
  author={Jiahao Li and Borong Zhang and Jiachen Shen and Jiaming Ji and Yaodong Yang},
  howpublished={GitHub repository},
  year={2025}
}

Quick Start

1. Installation

Install from PyPI (Recommended)

# 1. Install VLA-Arena
pip install vla-arena

# 2. Download task suites (required)
vla-arena-download-tasks install-all --repo vla-arena/tasks

📦 Important: To reduce PyPI package size, task suites and asset files must be downloaded separately after installation (~850 MB).

Install from Source

# Clone repository (includes all tasks and assets)
git clone https://github.com/PKU-Alignment/VLA-Arena.git
cd VLA-Arena

# Create environment
conda create -n vla-arena python=3.10
conda activate vla-arena

# Install requirements
pip install -r requirements.txt

# Install VLA-Arena
pip install -e .

Notes

The mujoco.dll file may be missing from the robosuite/utils directory; if so, copy it from mujoco/mujoco.dll.

On Windows, change the MuJoCo rendering backend in robosuite\utils\binding_utils.py:

if _SYSTEM == "Darwin":
  os.environ["MUJOCO_GL"] = "cgl"
else:
  os.environ["MUJOCO_GL"] = "wgl"    # Change "egl" to "wgl"
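The else branch in the snippet above covers every non-macOS system, so the same edit also switches Linux from EGL to WGL. If one checkout is shared across platforms, a platform-aware mapping avoids that. The sketch below is illustrative only: the function name select_mujoco_gl is hypothetical, and it assumes MuJoCo accepts "wgl" on Windows and "cgl" on macOS, as the snippet implies.

```python
import platform


def select_mujoco_gl(system: str) -> str:
    """Map a platform.system() value to a MUJOCO_GL rendering backend (hypothetical helper)."""
    if system == "Darwin":
        return "cgl"  # macOS: same value as the original branch
    if system == "Windows":
        return "wgl"  # Windows: the change described above
    return "egl"      # Linux and others: keep the original EGL default


if __name__ == "__main__":
    print(select_mujoco_gl(platform.system()))
```

Inside binding_utils.py, the if/else could then be replaced by os.environ["MUJOCO_GL"] = select_mujoco_gl(_SYSTEM), leaving Linux and macOS behavior unchanged.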

2. Basic Evaluation

# Evaluate a trained model
python scripts/evaluate_policy.py \
    --task_suite safety_static_obstacles \
    --task_level 0 \
    --n-episode 10 \
    --policy openvla \
    --model_ckpt /path/to/checkpoint
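To evaluate one suite at every difficulty level, the command above can be repeated with --task_level 0, 1, and 2. The sketch below assembles those calls from the same flags and nothing else. build_eval_cmd and run_all_levels are hypothetical helpers, not part of VLA-Arena; the sketch assumes it is run from the repository root and that levels 0-2 correspond to L0-L2. By default it only prints the commands.

```python
import shlex
import subprocess


def build_eval_cmd(task_suite: str, task_level: int, n_episode: int,
                   policy: str, model_ckpt: str) -> list[str]:
    """Assemble the evaluate_policy.py call shown above as an argv list."""
    return [
        "python", "scripts/evaluate_policy.py",
        "--task_suite", task_suite,
        "--task_level", str(task_level),
        "--n-episode", str(n_episode),
        "--policy", policy,
        "--model_ckpt", model_ckpt,
    ]


def run_all_levels(task_suite: str, policy: str, model_ckpt: str,
                   n_episode: int = 10, dry_run: bool = True) -> list[list[str]]:
    """Evaluate one suite at levels 0, 1 and 2 in sequence (hypothetical helper)."""
    cmds = [build_eval_cmd(task_suite, level, n_episode, policy, model_ckpt)
            for level in (0, 1, 2)]
    for cmd in cmds:
        if dry_run:
            print(shlex.join(cmd))  # show the shell-quoted command only
        else:
            subprocess.run(cmd, check=True)  # raise on the first failing run
    return cmds
```

Calling run_all_levels("safety_static_obstacles", "openvla", "/path/to/checkpoint", dry_run=False) would launch the three evaluations one after another and stop with a CalledProcessError at the first non-zero exit code.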

3. Data Collection

# Collect demonstration data
python scripts/collect_demonstration.py --bddl-file tasks/your_task.bddl
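The command above collects demonstrations for a single task file. When several task files need demonstrations, a hypothetical helper like the one below can list every .bddl file in a directory and build one command per file, using only the --bddl-file flag shown above. Each command still opens its own interactive collection session.

```python
from pathlib import Path


def collection_cmds(bddl_dir: str) -> list[list[str]]:
    """Build one collect_demonstration.py call per .bddl file in bddl_dir, sorted by path."""
    return [
        ["python", "scripts/collect_demonstration.py", "--bddl-file", str(path)]
        for path in sorted(Path(bddl_dir).glob("*.bddl"))
    ]
```

For example, looping over collection_cmds("tasks") and passing each entry to subprocess.run(cmd, check=True) would collect the tasks one at a time in a stable order.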

For detailed instructions, see our Documentation section.

Task Suites Overview

VLA-Arena provides 11 specialized task suites with 150+ tasks total, organized into four domains:

🛡️ Safety (5 suites, 75 tasks)

Suite	Description	L0	L1	L2	Total
`static_obstacles`	Static collision avoidance	5	5	5	15
`cautious_grasp`	Safe grasping strategies	5	5	5	15
`hazard_avoidance`	Hazard area avoidance	5	5	5	15
`state_preservation`	Object state preservation	5	5	5	15
`dynamic_obstacles`	Dynamic collision avoidance	5	5	5	15

🔄 Distractor (2 suites, 30 tasks)

Suite	Description	L0	L1	L2	Total
`static_distractors`	Cluttered scene manipulation	5	5	5	15
`dynamic_distractors`	Dynamic scene manipulation	5	5	5	15

🎯 Extrapolation (3 suites, 45 tasks)

Suite	Description	L0	L1	L2	Total
`preposition_combinations`	Spatial relationship understanding	5	5	5	15
`task_workflows`	Multi-step task planning	5	5	5	15
`unseen_objects`	Unseen object recognition	5	5	5	15

📈 Long Horizon (1 suite, 20 tasks)

Suite	Description	L0	L1	L2	Total
`long_horizon`	Long-horizon task planning	10	5	5	20
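The suite tables can be cross-checked with a few lines of arithmetic. Copying the per-level counts above gives 11 suites and 170 tasks in total, consistent with the per-domain counts in the headings and with the "150+ tasks" figure.

```python
# Per-level task counts (L0, L1, L2) copied from the suite tables above.
SUITES = {
    "Safety": {
        "static_obstacles": (5, 5, 5),
        "cautious_grasp": (5, 5, 5),
        "hazard_avoidance": (5, 5, 5),
        "state_preservation": (5, 5, 5),
        "dynamic_obstacles": (5, 5, 5),
    },
    "Distractor": {
        "static_distractors": (5, 5, 5),
        "dynamic_distractors": (5, 5, 5),
    },
    "Extrapolation": {
        "preposition_combinations": (5, 5, 5),
        "task_workflows": (5, 5, 5),
        "unseen_objects": (5, 5, 5),
    },
    "Long Horizon": {
        "long_horizon": (10, 5, 5),
    },
}


def domain_totals(suites: dict) -> dict:
    """Return {domain: (number_of_suites, number_of_tasks)}."""
    return {domain: (len(s), sum(sum(levels) for levels in s.values()))
            for domain, s in suites.items()}


totals = domain_totals(SUITES)
n_suites = sum(n for n, _ in totals.values())
n_tasks = sum(t for _, t in totals.values())
print(totals)
print(n_suites, n_tasks)  # 11 170
```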

Difficulty Levels:

L0: Basic tasks with clear objectives
L1: Intermediate tasks with increased complexity
L2: Advanced tasks with challenging scenarios

🛡️ Safety Suites Visualization

[Figure: example scenes at L0, L1, and L2 for Static Obstacles, Cautious Grasp, Hazard Avoidance, State Preservation, and Dynamic Obstacles]

🔄 Distractor Suites Visualization

[Figure: example scenes at L0, L1, and L2 for Static Distractors and Dynamic Distractors]

🎯 Extrapolation Suites Visualization

[Figure: example scenes at L0, L1, and L2 for Preposition Combinations, Task Workflows, and Unseen Objects]

📈 Long Horizon Suite Visualization

[Figure: example scenes at L0, L1, and L2 for Long Horizon]

Installation

System Requirements

OS: Ubuntu 20.04+ or macOS 12+
Python: 3.9 or higher
CUDA: 11.8+ (for GPU acceleration)
RAM: 8GB minimum, 16GB recommended

Installation Steps

# Clone repository
git clone https://github.com/PKU-Alignment/VLA-Arena.git
cd VLA-Arena

# Create environment
conda create -n vla-arena python=3.10
conda activate vla-arena

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .
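A small pre-flight check along these lines can confirm that the interpreter meets the "Python 3.9 or higher" requirement and that the package is importable after the editable install. The helper names are hypothetical, and the import name vla_arena is an assumption inferred from the distribution file names (vla_arena-0.0.3-py3-none-any.whl).

```python
import importlib.util
import sys

MIN_PYTHON = (3, 9)  # from the System Requirements list above


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """True if the interpreter's (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def package_importable(name: str) -> bool:
    """True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Python OK:", python_ok())
    # Assumption: the import name is vla_arena (inferred from the wheel file name).
    print("vla_arena importable:", package_importable("vla_arena"))
```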

Documentation

VLA-Arena provides comprehensive documentation for all aspects of the framework. Choose the guide that best fits your needs:

📖 Core Guides

🏗️ Scene Construction Guide | Chinese version

Build custom task scenarios using CBDDL.

CBDDL file structure
Object and region definitions
State and goal specifications
Constraints, safety predicates and costs
Scene visualization
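The scene-construction topics above revolve around a text task-file format. Assuming CBDDL keeps the parenthesized, PDDL-style S-expression syntax used by LIBERO's BDDL (an assumption; the Scene Construction Guide is authoritative on the actual grammar and keywords), a small generic parser is enough to list a task file's top-level sections while debugging. The sample task text, including its predicates and object names, is made up for illustration.

```python
import re

# Tokens are "(", ")", or any run of characters without whitespace or parentheses.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_sexpr(text: str):
    """Parse a single S-expression into nested lists of string atoms."""
    text = re.sub(r";[^\n]*", "", text)  # strip ';' line comments (PDDL style)
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])            # open a new nested list
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)  # attach it to its parent
        else:
            stack[-1].append(tok)       # plain atom
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one balanced top-level expression")
    return stack[0][0]


def section_names(tree) -> list:
    """Return the ':'-prefixed section keywords (e.g. :objects, :goal) in order."""
    return [node[0] for node in tree
            if isinstance(node, list) and node
            and isinstance(node[0], str) and node[0].startswith(":")]


# Hypothetical task text for illustration only; not a real VLA-Arena task file.
SAMPLE = """
(define (problem demo_task)   ; illustrative comment
  (:domain robosuite)
  (:objects bowl_1 - bowl plate_1 - plate)
  (:init (On bowl_1 table_region))
  (:goal (And (On bowl_1 plate_1)))
)
"""

tree = parse_sexpr(SAMPLE)
print(section_names(tree))
```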

📊 Data Collection Guide | Chinese version

Collect demonstrations in custom scenes.

Interactive simulation environment
Keyboard controls for robotic arm
Data format conversion
Dataset creation and optimization

🔧 Model Fine-tuning Guide | Chinese version

Fine-tune VLA models on VLA-Arena-generated datasets.

OpenVLA fine-tuning
Training scripts and configuration
Model evaluation

🎯 Model Evaluation Guide | Chinese version

Evaluate VLA models and add custom models to VLA-Arena.

Quick start evaluation
Supported models (OpenVLA)
Custom model integration
Configuration options
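The evaluation guide's actual integration API is not reproduced here. As a hedged illustration of what plugging a model into an evaluator usually involves, the sketch below defines a hypothetical VLAPolicy protocol (a camera image plus a language instruction in, a fixed-size action out) and a trivial stand-in policy. The class and method names, the observation format, and the 7-dimensional action (6-DoF end-effector delta plus gripper) are assumptions, not VLA-Arena's API.

```python
from typing import Any, Protocol

ACTION_DIM = 7  # assumption: 6-DoF end-effector delta + 1 gripper command


class VLAPolicy(Protocol):
    """Hypothetical interface: one predict_action call per control step."""

    def reset(self) -> None:
        """Clear any per-episode state before a new rollout."""
        ...

    def predict_action(self, image: Any, instruction: str) -> list[float]:
        """Map a camera image (e.g. an HxWx3 RGB array) and an instruction to an action."""
        ...


class ZeroPolicy:
    """Trivial stand-in that always outputs a zero action; handy for smoke tests."""

    def __init__(self) -> None:
        self.steps = 0

    def reset(self) -> None:
        self.steps = 0

    def predict_action(self, image: Any, instruction: str) -> list[float]:
        self.steps += 1
        return [0.0] * ACTION_DIM


def rollout(policy: VLAPolicy, instruction: str, images: list[Any]) -> list[list[float]]:
    """Query the policy on a fixed sequence of images and check every action's size."""
    policy.reset()
    actions = []
    for image in images:
        action = policy.predict_action(image, instruction)
        if len(action) != ACTION_DIM:
            raise ValueError(f"expected a {ACTION_DIM}-D action, got {len(action)}")
        actions.append(action)
    return actions
```

Keeping the model behind a narrow interface like this is what lets an evaluator swap architectures without touching the simulation loop; a real integration should follow the Model Evaluation Guide instead.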

🔜 Quick Reference

Fine-tuning Scripts

Standard: finetune_openvla.sh - Basic OpenVLA fine-tuning
Advanced: finetune_openvla_oft.sh - OpenVLA OFT with enhanced features

Documentation Index

English: README_EN.md - Complete English documentation index
Chinese: README_ZH.md - Complete Chinese documentation index

Leaderboard

OpenVLA-OFT Results (fine-tuned on the VLA-Arena L0 datasets for 150,000 training steps)

Overall Performance Summary

Model	L0 Success	L1 Success	L2 Success	Avg Success
OpenVLA-OFT	76.4%	36.3%	16.7%	36.5%

🛡️ Safety Performance

Task Suite	L0 Success	L1 Success	L2 Success	Avg Success
static_obstacles	100.0%	20.0%	20.0%	46.7%
cautious_grasp	60.0%	50.0%	0.0%	36.7%
hazard_avoidance	36.0%	0.0%	20.0%	18.7%
state_preservation	100.0%	76.0%	20.0%	65.3%
dynamic_obstacles	80.0%	56.0%	10.0%	48.7%

🛡️ Safety Cost Analysis

Task Suite	L1 Total Cost	L2 Total Cost	Avg Total Cost
static_obstacles	45.40	49.00	47.20
cautious_grasp	6.34	2.12	4.23
hazard_avoidance	22.91	14.71	18.81
state_preservation	7.60	4.60	6.10
dynamic_obstacles	3.66	1.84	2.75
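Every per-suite Avg Success value in the leaderboard equals the unweighted mean of its L0-L2 entries, and every Avg Total Cost equals the mean of its L1 and L2 entries. The snippet below reproduces the safety rows under that reading, so the same aggregation can be applied to your own results; row_average is a hypothetical helper name.

```python
from statistics import mean

# Rows copied from the Safety Performance and Safety Cost Analysis tables above.
SUCCESS = {  # (L0, L1, L2) success rates in percent
    "static_obstacles": (100.0, 20.0, 20.0),
    "cautious_grasp": (60.0, 50.0, 0.0),
    "hazard_avoidance": (36.0, 0.0, 20.0),
    "state_preservation": (100.0, 76.0, 20.0),
    "dynamic_obstacles": (80.0, 56.0, 10.0),
}
COST = {  # (L1, L2) total safety cost
    "static_obstacles": (45.40, 49.00),
    "cautious_grasp": (6.34, 2.12),
    "hazard_avoidance": (22.91, 14.71),
    "state_preservation": (7.60, 4.60),
    "dynamic_obstacles": (3.66, 1.84),
}


def row_average(values, ndigits=2):
    """Unweighted mean across difficulty levels, rounded for display."""
    return round(mean(values), ndigits)


for suite in SUCCESS:
    print(f"{suite:20s} avg success {row_average(SUCCESS[suite], 1):5.1f}%  "
          f"avg cost {row_average(COST[suite]):6.2f}")
```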

🔄 Distractor Performance

Task Suite	L0 Success	L1 Success	L2 Success	Avg Success
robustness_static_distractors	100.0%	0.0%	20.0%	40.0%
robustness_dynamic_distractors	100.0%	54.0%	40.0%	64.7%

🎯 Extrapolation Performance

Task Suite	L0 Success	L1 Success	L2 Success	Avg Success
preposition_combinations	62.0%	18.0%	0.0%	26.7%
task_workflows	74.0%	0.0%	0.0%	24.7%
unseen_objects	60.0%	40.0%	20.0%	40.0%

📈 Long Horizon Performance

Task Suite	L0 Success	L1 Success	L2 Success	Avg Success
long_horizon	80.0%	0.0%	0.0%	26.7%

License

This project is licensed under the Apache 2.0 license - see LICENSE for details.

Acknowledgments

RoboSuite, LIBERO, and VLABench teams for the frameworks VLA-Arena builds on
OpenVLA, UniVLA, Openpi, and lerobot teams for pioneering VLA research
All contributors and the robotics community

VLA-Arena: Advancing Vision-Language-Action Models Through Comprehensive Evaluation
Made with ❤️ by the VLA-Arena Team


Release history

1.0.0 (Jan 4, 2026)
0.0.4 (Jan 4, 2026)
0.0.3 (Dec 21, 2025), this version
0.0.2 (Dec 21, 2025)
0.0.1 (Dec 21, 2025)

Download files

Source Distribution

vla_arena-0.0.3.tar.gz (201.8 kB)

Uploaded Dec 21, 2025 Source

Built Distribution

vla_arena-0.0.3-py3-none-any.whl (279.5 kB)

Uploaded Dec 21, 2025 Python 3

File details

Details for the file vla_arena-0.0.3.tar.gz.

File metadata

Download URL: vla_arena-0.0.3.tar.gz
Upload date: Dec 21, 2025
Size: 201.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for vla_arena-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`24c720813f49338dce18f1b70871202a49e965a63e12e5876f04cf50dc64ae1d`
MD5	`c3de53446d82ae3cabc20d5219ef866e`
BLAKE2b-256	`4fdd9f8b2a0adf14b24b16f4482d43c160bebb41df4cf261cf3cf1b4c4051977`

File details

Details for the file vla_arena-0.0.3-py3-none-any.whl.

File metadata

Download URL: vla_arena-0.0.3-py3-none-any.whl
Upload date: Dec 21, 2025
Size: 279.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.18

File hashes

Hashes for vla_arena-0.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a6dc50f1b22632366be21f22efc092cf54a3e7678c3e91f8df33212f9e854f24`
MD5	`2921b3da4fb5d23a9df9e4127fd168ed`
BLAKE2b-256	`0f77b7b5849c59ed7298cfdb97c93e4733dc2ca2b1612d1c1672e6fcd574e626`
