
Evaluation harness for Vision-Language-Action models


vla-evaluation-harness

CI pypi License: Apache 2.0 Python 3.8+ Ruff Docker Images

Benchmarks: LIBERO, SimplerEnv, CALVIN, ManiSkill2, LIBERO-Pro, RoboCasa, VLABench, MIKASA-Robo, RoboTwin, RLBench, RoboCerebra, LIBERO-Mem, BEHAVIOR-1K, Kinetix, RoboMME, FurnitureBench
Models (official): OpenVLA, π₀, π₀-FAST, GR00T N1.6, OFT, X-VLA, CogACT, RTC, MemVLA
Models (dexbotic): DB-CogACT
Models (starVLA): QwenGR00T, QwenOFT, QwenPI, QwenFAST

One framework to evaluate any VLA model on any robot simulation benchmark.

Why vla-evaluation-harness?

Batch Parallel Evaluation Episode sharding + batched GPU inference → 47× throughput (2,000 LIBERO episodes in 18 min on 1× H100). Details
Zero Setup Benchmarks in Docker, model servers as single-file uv scripts — no dependency conflicts.
AI-Assisted Integration Built-in Claude Code skills for adding benchmarks and model servers — scaffold new integrations in minutes, not hours.
Leaderboard The largest unified VLA comparison — 500+ models × 17 benchmarks, aggregated from 1,700+ papers.

Motivation

VLA models are evaluated on LIBERO, CALVIN, SimplerEnv, ManiSkill, and others — but each benchmark has its own dependencies, observation format, and evaluation protocol. In practice, every research team ends up maintaining private eval forks per benchmark. Results diverge. Bug fixes don't propagate. No one tests under real-time conditions where the environment keeps moving during inference.

vla-evaluation-harness integrates the model once, integrates the benchmark once, and the full cross-evaluation matrix fills itself.

How: our abstraction layer fully decouples models from benchmarks.

  • Benchmarks run inside Docker — no dependency hell, exact reproducibility.
  • Model servers are standalone uv scripts with inline dependency declarations — zero manual setup.

See Architecture for how the pieces connect.
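The "standalone uv scripts with inline dependency declarations" pattern refers to PEP 723 inline script metadata, which `uv run` reads to build an isolated environment on the fly. A minimal sketch of the pattern follows; the stub server and its `predict` function are illustrative, not the harness's actual files:

```python
# /// script
# requires-python = ">=3.10"
# dependencies = []   # a real model server lists its model's packages here
# ///
# Hypothetical single-file model server: `uv run server.py` reads the
# comment header above and resolves the listed dependencies automatically,
# so no manual environment setup is needed. The harness's real servers
# additionally expose a network endpoint; this stub only shows the pattern.

def predict(observation: list[float]) -> list[float]:
    """Dummy policy: map any observation to a fixed 7-DoF zero action."""
    return [0.0] * 7

if __name__ == "__main__":
    action = predict([0.0, 1.0])
    print(len(action))  # 7
```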


Installation

pip install vla-eval

Or from source:

git clone https://github.com/allenai/vla-evaluation-harness.git
cd vla-evaluation-harness
uv sync --python 3.11 --all-extras --dev

Quick Start

Two terminals: one for the model server (GPU), one for the benchmark client.

# Terminal 1 — model server (runs on host with GPU)
vla-eval serve --config configs/model_servers/dexbotic_cogact_libero.yaml

# Terminal 2 — run evaluation (benchmark runs in Docker by default)
vla-eval run --config configs/libero_smoke_test.yaml

Results are saved to results/ as JSON. The benchmark runs inside Docker by default — pass --no-docker for local development.
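Since results land as JSON, post-processing is a few lines of Python. A hedged sketch of computing a success rate; the `"episodes"` and `"success"` keys below are assumptions, not the harness's documented schema:

```python
# Illustrative post-processing of a results file. The actual JSON schema
# is defined by the harness; the field names here are assumed for the demo.
def success_rate(results: dict) -> float:
    """Fraction of successful episodes in a results dict."""
    episodes = results["episodes"]
    return sum(1 for ep in episodes if ep["success"]) / len(episodes)

if __name__ == "__main__":
    demo = {"episodes": [{"success": True}, {"success": False}]}
    print(success_rate(demo))  # 0.5
```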

Full Evaluation

For full evaluation (10 tasks × 50 episodes):

vla-eval run --config configs/libero_spatial.yaml

See Reproduction Reports for verified scores and per-model details.

Need faster runs? See Batch Parallel Evaluation: 2,000 LIBERO episodes in ~18 min (47× vs sequential).


Batch Parallel Evaluation

A full evaluation takes hours sequentially. Two layers of parallelism bring this down to minutes:

Wall-clock evaluation time: sequential vs batch parallel across LIBERO (47×), CALVIN (16×), SimplerEnv (12×)

Episode sharding splits (task, episode) pairs across N independent processes (RFC-0006). Each shard connects to the same model server, where a BatchPredictModelServer batches their inference requests into a single forward pass. The two axes multiply together.

Episode Sharding (environment parallelism)

# Option A: use the helper script (launches all shards + auto-merges)
./scripts/run_sharded.sh -c configs/libero_spatial.yaml -n 50

# Option B: manual launch
vla-eval run -c configs/libero_spatial.yaml --shard-id 0 --num-shards 4 &
vla-eval run -c configs/libero_spatial.yaml --shard-id 1 --num-shards 4 &
# ... (each shard is a separate process)
wait
vla-eval merge -c configs/libero_spatial.yaml -o results/libero_spatial.json

Each shard gets a deterministic slice via round-robin. Results merge with episode-level deduplication — if a shard fails, re-run only that shard.
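The round-robin slice is simple to reason about. A sketch of the assignment rule described above (the real logic lives in the harness per RFC-0006; `shard_episodes` is an illustrative name):

```python
# Round-robin episode sharding: pair i goes to shard (i mod num_shards),
# so every shard gets a deterministic, disjoint slice of the workload.
def shard_episodes(pairs, shard_id, num_shards):
    return [p for i, p in enumerate(pairs) if i % num_shards == shard_id]

pairs = [(t, e) for t in range(10) for e in range(50)]      # 500 (task, episode) pairs
shards = [shard_episodes(pairs, s, 4) for s in range(4)]
assert sum(len(s) for s in shards) == len(pairs)            # slices cover everything
assert len(set().union(*map(set, shards))) == len(pairs)    # and never overlap
```

Determinism is what makes shard-level retry safe: re-running shard 2 reproduces exactly the same slice it had before.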

Batch Model Server (GPU parallelism)

Enable batching in the model server config by setting max_batch_size > 1:

args:
  max_batch_size: 16    # max observations per GPU forward pass (>1 enables batching)
  max_wait_time: 0.05   # seconds to wait before dispatching a partial batch
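The two knobs trade latency for GPU utilization: the server collects requests until the batch is full or the wait budget expires. An illustrative batch-collection loop in the spirit of BatchPredictModelServer (the parameter names mirror the config keys above; the harness's real implementation may differ):

```python
import queue
import time

def collect_batch(q, max_batch_size=16, max_wait_time=0.05):
    """Drain up to max_batch_size requests from q, but never wait longer
    than max_wait_time after the first request before dispatching a
    partial batch."""
    batch = [q.get()]                          # block until one request arrives
    deadline = time.monotonic() + max_wait_time
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                              # wait budget exhausted
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break                              # no more requests in time
    return batch
```

The server then runs one forward pass over the collected batch and routes each result back to the shard that sent it.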

Tuning & Combined Effect

We tune parallelism with a demand/supply methodology: demand λ(N) measures environment throughput as a function of shard count N, and supply μ(B) measures model throughput as a function of batch size B. The operating point (N, B*) is chosen so that λ(N) < 0.8 · μ(B*), leaving headroom to prevent queue buildup.
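The selection rule above reduces to a small search over measured curves. A toy sketch with made-up throughput numbers (the real λ and μ come from the benchmarking tools in experiments/; `pick_operating_point` is an illustrative helper, not harness API):

```python
def pick_operating_point(demand, supply, headroom=0.8):
    """Return (N, B): the largest shard count N whose demand λ(N) fits
    under headroom * μ(B) for some batch size B, paired with the
    smallest such B."""
    best = None
    for N, lam in demand.items():
        feasible = [(B, mu) for B, mu in supply.items() if lam < headroom * mu]
        if feasible:
            B, _ = min(feasible, key=lambda bm: bm[0])   # smallest sufficient batch
            if best is None or N > best[0]:
                best = (N, B)
    return best

demand = {10: 100.0, 25: 240.0, 50: 470.0}   # λ(N): obs/s the shards generate
supply = {4: 180.0, 8: 330.0, 16: 610.0}     # μ(B): obs/s the server sustains
print(pick_operating_point(demand, supply))  # → (50, 16)
```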

Demand/supply throughput for LIBERO + CogACT on H100

Sharding and batching multiply together (DB-CogACT 7B, LIBERO Spatial, 1× H100-80GB):

             Sequential   Batch Parallel (50 shards, B=16)
Wall-clock   ~14 h        ~18 min
Throughput   ~11 obs/s    ~486 obs/s

2,000 episodes, 47× faster. The included benchmarking tools (experiments/bench_demand.py, experiments/bench_supply.py) measure λ and μ for any model + benchmark combination. See the Tuning Guide for worked examples and max_wait_time derivation.


Docker Images

All benchmark environments are packaged as standalone Docker images built on a shared base image.

Image         Size      Benchmark      Python   Base
base          3.3 GB    -              -        nvidia/cuda:12.1.1-runtime-ubuntu22.04
rlbench       4.7 GB    RLBench        3.8      base
simpler       4.9 GB    SimplerEnv     3.10     base
libero        6.0 GB    LIBERO         3.8      base
libero-pro    6.2 GB    LIBERO-Pro     3.8      base
robocerebra   6.4 GB    RoboCerebra    3.8      base
calvin        9.6 GB    CALVIN         3.8      base
kinetix       10.0 GB   Kinetix        3.11     base
maniskill2    9.8 GB    ManiSkill2     3.10     base
mikasa-robo   10.1 GB   MIKASA-Robo    3.10     base
libero-mem    11.3 GB   LIBERO-Mem     3.8      base
robomme       17.0 GB   RoboMME        3.11     base
vlabench      17.7 GB   VLABench       3.10     base
robotwin      28.6 GB   RoboTwin 2.0   3.10     base
robocasa      35.6 GB   RoboCasa       3.11     base

Pull (recommended):

docker pull ghcr.io/allenai/vla-evaluation-harness/libero:latest

Build locally (see docker/build.sh):

docker/build.sh          # build all (base first, then benchmarks)
docker/build.sh libero   # build one

Documentation

Document Description
Architecture Component descriptions, protocol, episode flow, configuration
Contributing Dev setup, adding benchmarks/models, PR workflow
Reproduction Reports Per-model evaluation results and reproducibility verdicts
RFCs Design proposals with rationale and status tracking
Design Philosophy Freshness, Convenience, Layered Abstraction, Quality, Reproducibility, Openness

Contributing

See CONTRIBUTING.md for dev setup and PR workflow.

PRs for any 🔜 item in the support matrix are welcome.


Citation

If you find this work useful, please cite:

@article{choi2026vlaeval,
  title={vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models},
  author={Choi, Suhwan and Lee, Yunsung and Park, Yubeen and Kim, Chris Dongjoo and Krishna, Ranjay and Fox, Dieter and Yu, Youngjae},
  journal={arXiv preprint arXiv:2603.13966},
  year={2026}
}

License

Apache 2.0
