VLA evaluation harness across simulators with hard-fail spec contracts and hierarchical-mode evaluation

roboeval

License: BSD-3-Clause | Python 3.10+

roboeval is a CLI-driven evaluation harness for running VLAs against simulator backends through isolated HTTP services. It provides an ActionObsSpec compatibility gate before episode execution, per-component virtual environments for dependency isolation, sharded result collection, and built-in support for LITEN-style hierarchical evaluation in which a VLM planner issues subtask instructions to a low-level VLA.

Method / Contracts

roboeval treats each VLA and simulator as an independently launched component. The orchestrator communicates with a VLA policy server and a simulator worker over HTTP/JSON, validates their declared contracts, and records episode-level results from a reproducible YAML run config.
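As a rough illustration of what talking to a component over HTTP/JSON can look like, the sketch below probes a policy server before use. The /health and /info routes are the ones named under Extending; the helper names, the response shapes, and the port in the usage note are assumptions made for this example, not roboeval's actual client code.

```python
import json
from typing import Callable
from urllib.request import urlopen


def http_get_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def probe_component(
    base_url: str,
    fetch: Callable[[str], dict] = http_get_json,
) -> dict:
    """Confirm that /health answers, then return the /info payload.

    Hypothetical helper: the payload contents are not specified here
    and are assumed to be JSON objects.
    """
    base = base_url.rstrip("/")
    fetch(f"{base}/health")  # raises if the component is unreachable
    return fetch(f"{base}/info")
```

With a policy server listening on a local port (5100 is an arbitrary choice here), `probe_component("http://localhost:5100")` would return whatever that server reports under /info. The `fetch` parameter exists only so the helper can be exercised without a live server.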

The main contract surfaces are:

| Surface | Role |
| --- | --- |
| ActionObsSpec gate | VLA and simulator components declare action format, dimensionality, value range, camera roles, image format, state layout, and language inputs. Under the default strict mode, incompatible declarations stop the run before episode 1. |
| Host-process isolation | VLA servers, simulator workers, and optional VLM proxy processes run in separate .venvs/ environments. This allows different Python and CUDA dependency stacks to coexist without a monolithic runtime. |
| Dependency isolation | Each VLA and simulator keeps its upstream package pins, Python version, CUDA assumptions, and optional micromamba/uv environment separate. This is a design choice: adding a new backend should not force the orchestrator or other backends onto the same dependency closure. |
| LITEN-style hierarchical evaluation | The hierarchical mode integrates the VLM-planner method introduced by Shah et al. (Learning Affordances at Inference-Time for Vision-Language-Action Models). The planner emits subtask calls that are executed by the same VLA server interface used for direct evaluation. roboeval is, to our knowledge, the first public VLA evaluation harness to ship a working LITEN integration. |
| Result records | roboeval run writes JSON with harness version, config snapshot, per-episode metadata, success flags, and optional shard metadata. |
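To make the idea of a hard-fail compatibility gate concrete, here is a minimal sketch of comparing two declarations before any episode runs. `SpecSketch`, its fields, and the example values are hypothetical simplifications chosen for illustration; the real ActionObsSpec schema and comparison rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecSketch:
    """Hypothetical, simplified stand-in for an ActionObsSpec declaration."""
    action_format: str                 # illustrative value: "delta_eef"
    action_dim: int
    action_range: tuple[float, float]
    camera_roles: frozenset[str]
    image_format: str                  # illustrative value: "rgb_uint8"


def find_mismatches(vla: SpecSketch, sim: SpecSketch) -> list[str]:
    """List every way the VLA's declaration disagrees with the simulator's."""
    problems = []
    if vla.action_format != sim.action_format:
        problems.append(f"action_format: {vla.action_format} != {sim.action_format}")
    if vla.action_dim != sim.action_dim:
        problems.append(f"action_dim: {vla.action_dim} != {sim.action_dim}")
    if vla.action_range != sim.action_range:
        problems.append(f"action_range: {vla.action_range} != {sim.action_range}")
    if vla.image_format != sim.image_format:
        problems.append(f"image_format: {vla.image_format} != {sim.image_format}")
    # The simulator may offer extra cameras; the VLA's required ones must exist.
    missing = vla.camera_roles - sim.camera_roles
    if missing:
        problems.append(f"missing camera roles: {sorted(missing)}")
    return problems


def gate(vla: SpecSketch, sim: SpecSketch, strict: bool = True) -> list[str]:
    """Raise in strict mode if the pair is incompatible; otherwise report."""
    problems = find_mismatches(vla, sim)
    if problems and strict:
        raise ValueError("incompatible specs: " + "; ".join(problems))
    return problems
```

Running the check once, before the first reset, is what lets a mismatch such as a 7-dimensional policy paired with a 14-dimensional simulator surface as a configuration error rather than as a run of failed episodes.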

Documentation map

For a compact system overview, design rationale, supported-pair notes, tuning guidance, related systems, and decision records, see architecture, design, supported pairs, tuning, related work, and the RFC index.

Installation

For full prerequisites, platform notes, and per-component dependency details, see docs/install.md.

git clone https://github.com/KE7/roboeval.git
cd roboeval
roboeval setup pi05 libero

roboeval setup provisions the orchestrator plus the requested VLA and simulator environments under .venvs/.

Quickstart

roboeval setup pi05 libero
roboeval serve --vla pi05 --sim libero --headless
roboeval test --validate -c configs/libero_spatial_pi05_smoke.yaml
roboeval run -c configs/libero_spatial_pi05_smoke.yaml

serve launches the selected VLA and simulator workers. run executes the YAML configuration, including the declared VLA/simulator pair, task suite, episode count, server URLs, output directory, and optional LITEN endpoint. Additional examples are in docs/quickstart.md.
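When a LITEN endpoint is configured, the run uses the hierarchical mode described under Method / Contracts: a planner issues subtask instructions and the low-level VLA executes them. The sketch below shows only that planner/executor split. The function names, the callable signatures, and the stop-on-first-failure policy are assumptions for illustration; it does not reproduce the LITEN method or roboeval's internals.

```python
from typing import Callable, Sequence

# Hypothetical interfaces for this sketch:
Planner = Callable[[str], Sequence[str]]   # task goal -> subtask instructions
Executor = Callable[[str, int], bool]      # (instruction, max_steps) -> finished?


def run_hierarchical_episode(
    goal: str,
    planner: Planner,
    execute_subtask: Executor,
    task_succeeded: Callable[[], bool],
    max_steps_per_subtask: int = 100,
) -> dict:
    """Ask the planner for subtasks, then hand each one to the low-level policy.

    Execution stops at the first subtask that does not finish. Overall
    success comes from a separate check, standing in for a simulator's
    success query.
    """
    subtasks = list(planner(goal))
    completed = []
    for instruction in subtasks:
        if not execute_subtask(instruction, max_steps_per_subtask):
            break
        completed.append(instruction)
    return {
        "goal": goal,
        "subtasks": subtasks,
        "completed": completed,
        "success": bool(task_succeeded()),
    }
```

Because the executor is just a callable taking a language instruction, the same low-level policy interface can serve both direct evaluation (one instruction per episode) and hierarchical evaluation (one instruction per subtask).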

Supported VLAs and Simulators

The table describes shipped coverage. It is a support matrix, not a benchmark table; supported pairs are tested end-to-end.

| VLA | Simulator | Coverage | Example config |
| --- | --- | --- | --- |
| Pi0.5 | LIBERO | direct, LITEN | configs/libero_spatial_pi05_smoke.yaml, configs/libero_spatial_pi05_liten_smoke.yaml |
| Pi0.5 | LIBERO-Pro | direct, LITEN | configs/libero_pro_pi05_smoke.yaml, configs/libero_pro_pi05_liten_smoke.yaml |
| Pi0.5 | LIBERO-Infinity | direct, LITEN | configs/libero_infinity_pi05_smoke.yaml, configs/libero_infinity_pi05_liten_smoke.yaml |
| SmolVLA | LIBERO | direct, LITEN | configs/libero_object_smolvla_smoke.yaml, configs/libero_object_smolvla_liten_smoke.yaml |
| OpenVLA | LIBERO | direct, LITEN | configs/libero_spatial_openvla_smoke.yaml, configs/libero_spatial_openvla_liten_smoke.yaml |
| GR00T | LIBERO | direct, LITEN | configs/libero_spatial_groot_smoke.yaml, configs/libero_spatial_groot_liten_smoke.yaml |
| InternVLA | RoboTwin | direct, LITEN | configs/robotwin_internvla_smoke.yaml, configs/robotwin_internvla_liten_smoke.yaml |
| ACT | ALOHA Gym | direct, LITEN | configs/aloha_gym_act_smoke.yaml, configs/aloha_gym_act_liten_smoke.yaml |
| Diffusion Policy | gym-pusht | direct | configs/gym_pusht_diffusion_policy_smoke.yaml |
| VQ-BeT | gym-pusht | direct | configs/gym_pusht_vqbet_smoke.yaml |
| TDMPC2 | Meta-World | direct | configs/metaworld_tdmpc2_smoke.yaml |
| InternVLA | ALOHA Gym | CI smoke | configs/ci/aloha_gym_internvla_smoke.yaml |
| ManiSkill2 | ManiSkill2 backend | backend scaffold; x86_64 execution path | setup target maniskill2 |
| RoboCasa | RoboCasa backend | simulator backend and registry support | setup target robocasa |

Supported VLA launch names are pi05, vqbet, tdmpc2, smolvla, openvla, cosmos, groot, and internvla. Supported simulator launch names are libero, libero_pro, libero_infinity, robocasa, robotwin, aloha_gym, gym_pusht, maniskill2, and metaworld.

Current limitations

  • ManiSkill2 is platform-blocked on aarch64 because the required SAPIEN 2.x wheels are x86_64-only.
  • bridge_octo is platform-blocked on aarch64 by its current TensorFlow/dlimp dependency chain and does not ship in the v0.1.0 support matrix.
  • Some technically expressible pairs, such as RoboCasa x GR00T, fall outside current capabilities and do not ship root configs.

Planned features

  • Multi-architecture CI matrix. aarch64 is currently the primary CI path; x86_64 execution paths exist but are not in the CI matrix.
  • Additional VLAs as their checkpoints become available.
  • More simulators. Community contributions are welcome; see docs/extending.md.

Extending

Extension cost. Adding a new VLA averages ~200 SLOC; adding a new simulator backend averages ~230 SLOC (across the v0.1.0 release; excludes blank lines, comments, and docstrings).

  • Add a VLA by implementing a policy server with /health, /info, /reset, and /predict, then registering it with roboeval serve.
  • Add a simulator by implementing a SimBackendBase backend with /init, /reset, /step, /obs, /success, and /info support through the sim worker.
  • Add a new compatibility path by declaring ActionObsSpec records on both sides and adding a smoke config under configs/.
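For a sense of the shape of a policy server, the sketch below exposes the four routes from the first bullet using only the Python standard library. The HTTP method chosen for each route, the JSON payloads, the 7-dimensional zero action, and the handler and factory names are all assumptions for illustration; consult docs/extending.md for the actual interface and registration steps.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

ACTION_DIM = 7  # hypothetical action dimensionality for this toy policy


class PolicyHandler(BaseHTTPRequestHandler):
    """Toy policy server with /health, /info, /reset, and /predict."""

    def _send(self, payload: dict, status: int = 200) -> None:
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def _read_json(self) -> dict:
        length = int(self.headers.get("Content-Length", 0))
        return json.loads(self.rfile.read(length) or b"{}")

    def do_GET(self) -> None:
        if self.path == "/health":
            self._send({"status": "ok"})
        elif self.path == "/info":
            self._send({"name": "zero_policy", "action_dim": ACTION_DIM})
        else:
            self._send({"error": "not found"}, 404)

    def do_POST(self) -> None:
        self._read_json()  # a real policy would use the observation/instruction
        if self.path == "/reset":
            self._send({"status": "reset"})
        elif self.path == "/predict":
            # Placeholder policy: always return a zero action.
            self._send({"action": [0.0] * ACTION_DIM})
        else:
            self._send({"error": "not found"}, 404)

    def log_message(self, format: str, *args) -> None:
        pass  # silence per-request logging


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), PolicyHandler)
```

Calling `make_server(5100).serve_forever()` (the port is arbitrary) starts the toy server. A real integration would replace the zero action with model inference and declare its ActionObsSpec so the compatibility gate can validate it.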

See docs/extending.md for the extension architecture and step-by-step entry points.

Citations

If you use roboeval in your research, please cite us.

@software{elmaaroufi2026roboeval,
  title   = {roboeval: A reproducible evaluation harness for Vision-Language-Action models},
  author  = {Elmaaroufi, Karim and OMAR and Seshia, Sanjit A. and Zaharia, Matei},
  version = {0.1.0},
  date    = {2026-04-29},
  url     = {https://github.com/KE7/roboeval},
  license = {BSD-3-Clause}
}

License

roboeval is released under the BSD-3-Clause License.
