Embodied Agent Interface (EAI): Benchmarking LLMs for Embodied Decision Making
Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, Jiajun Wu
Stanford Vision and Learning Lab, Stanford University
Dataset Highlights
- Standardized goal specifications.
- Standardized modules and interfaces.
- Broad coverage of evaluation and fine-grained metrics.
- Please find our dataset at this link.
- PDDL files for both BEHAVIOR (domain file, problem files) and VirtualHome (domain file, problem files).
Overview
We aim to evaluate Large Language Models (LLMs) for embodied decision-making. While many works leverage LLMs for decision-making in embodied environments, a systematic understanding of their performance is still lacking. These models are applied in different domains, for various purposes, and with diverse inputs and outputs. Current evaluations tend to rely on final success rates alone, making it difficult to pinpoint where LLMs fall short and how to leverage them effectively in embodied AI systems.
To address this gap, we propose the Embodied Agent Interface (EAI), which unifies:
- A broad set of embodied decision-making tasks involving both state and temporally extended goals.
- Four commonly used LLM-based modules: goal interpretation, subgoal decomposition, action sequencing, and transition modeling.
- Fine-grained evaluation metrics, identifying errors such as hallucinations, affordance issues, and planning mistakes.
Our benchmark provides a comprehensive assessment of LLM performance across different subtasks, identifying their strengths and weaknesses in embodied decision-making contexts.
Installation
1. Create and activate a conda environment:

   ```bash
   conda create -n eai-eval python=3.8 -y
   conda activate eai-eval
   ```

2. Install `eai-eval`. You can install it from pip:

   ```bash
   pip install eai-eval
   ```

   Or, install from source:

   ```bash
   git clone https://github.com/embodied-agent-interface/embodied-agent-interface.git
   cd embodied-agent-interface
   pip install -e .
   ```
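   As a quick sanity check, you can confirm the package installed and is importable (the `eai_eval` module name matches the download utility used in the Quick Start below):

   ```bash
   pip show eai-eval
   python -c "import eai_eval; print('eai_eval imported successfully')"
   ```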
3. (Optional) Install iGibson for BEHAVIOR evaluation:

   If you need to use `behavior_eval`, install iGibson. Follow these steps to minimize installation issues:

   1. Make sure you are using Python 3.8 and meet the minimum system requirements in the iGibson installation guide.

   2. Install CMake using conda (do not use pip):

      ```bash
      conda install cmake
      ```

   3. Install iGibson. We provide an installation script:

      ```bash
      python -m behavior_eval.utils.install_igibson_utils
      ```

      Alternatively, install it manually:

      ```bash
      git clone https://github.com/embodied-agent-interface/iGibson.git --recursive
      cd iGibson
      pip install -e .
      ```

   4. Download assets:

      ```bash
      python -m behavior_eval.utils.download_utils
      ```

   We have successfully tested installation on Linux, Windows 10+, and macOS.
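   After these steps, a quick import check can surface missing assets or build problems early (`igibson` is the standard import name for iGibson; adjust if your build differs):

   ```bash
   python -c "import igibson; print('iGibson OK')"
   python -c "import behavior_eval; print('behavior_eval OK')"
   ```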
Quick Start
- Arguments:

  ```bash
  eai-eval \
    --dataset {virtualhome,behavior} \
    --mode {generate_prompts,evaluate_results} \
    --eval-type {action_sequencing,transition_modeling,goal_interpretation,subgoal_decomposition} \
    --llm-response-path <path_to_responses> \
    --output-dir <output_directory> \
    --num-workers <number_of_workers>
  ```

  Run the following command for further information:

  ```bash
  eai-eval --help
  ```
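  For instance, a fully specified call might look like this (the two paths are placeholders; point them at your own response and output directories):

  ```bash
  eai-eval \
    --dataset behavior \
    --mode evaluate_results \
    --eval-type action_sequencing \
    --llm-response-path ./llm_responses \
    --output-dir ./output \
    --num-workers 4
  ```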
- Examples:

  - Evaluate results:

    Make sure to download our results first if you don't want to specify `<path_to_responses>`:

    ```bash
    python -m eai_eval.utils.download_utils
    ```

    Then, run the commands below:

    ```bash
    eai-eval --dataset virtualhome --eval-type action_sequencing --mode evaluate_results
    eai-eval --dataset virtualhome --eval-type transition_modeling --mode evaluate_results
    eai-eval --dataset virtualhome --eval-type goal_interpretation --mode evaluate_results
    eai-eval --dataset virtualhome --eval-type subgoal_decomposition --mode evaluate_results
    eai-eval --dataset behavior --eval-type action_sequencing --mode evaluate_results
    eai-eval --dataset behavior --eval-type transition_modeling --mode evaluate_results
    eai-eval --dataset behavior --eval-type goal_interpretation --mode evaluate_results
    eai-eval --dataset behavior --eval-type subgoal_decomposition --mode evaluate_results
    ```
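    Since these commands follow one pattern, a small shell loop covers every dataset/eval-type combination; this is just a convenience sketch equivalent to running them one by one (the same loop works with `--mode generate_prompts`):

    ```bash
    # Run every dataset x eval-type combination in evaluation mode.
    for dataset in virtualhome behavior; do
      for eval_type in action_sequencing transition_modeling goal_interpretation subgoal_decomposition; do
        eai-eval --dataset "$dataset" --eval-type "$eval_type" --mode evaluate_results
      done
    done
    ```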
  - Generate prompts:

    To generate prompts, you can run:

    ```bash
    eai-eval --dataset virtualhome --eval-type action_sequencing --mode generate_prompts
    eai-eval --dataset virtualhome --eval-type transition_modeling --mode generate_prompts
    eai-eval --dataset virtualhome --eval-type goal_interpretation --mode generate_prompts
    eai-eval --dataset virtualhome --eval-type subgoal_decomposition --mode generate_prompts
    eai-eval --dataset behavior --eval-type action_sequencing --mode generate_prompts
    eai-eval --dataset behavior --eval-type transition_modeling --mode generate_prompts
    eai-eval --dataset behavior --eval-type goal_interpretation --mode generate_prompts
    eai-eval --dataset behavior --eval-type subgoal_decomposition --mode generate_prompts
    ```
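    Between `generate_prompts` and `evaluate_results`, you query your own LLM on the generated prompts and save its responses to the directory passed as `--llm-response-path`. Below is a minimal Python sketch of that glue step; the `prompts/` and `llm_responses/` paths, the JSON schema, and the `query_llm` stub are all illustrative assumptions rather than part of the `eai-eval` API, so adapt them to the files your run actually produces.

    ```python
    # Hypothetical glue between generate_prompts and evaluate_results.
    # Check your --output-dir for the actual prompt file layout.
    import json
    from pathlib import Path

    def query_llm(prompt: str) -> str:
        # Stub: replace with a call to your LLM of choice.
        return "placeholder response"

    prompt_dir = Path("prompts")          # assumed prompt location
    response_dir = Path("llm_responses")  # later passed via --llm-response-path
    response_dir.mkdir(exist_ok=True)

    for prompt_file in sorted(prompt_dir.glob("*.json")):
        data = json.loads(prompt_file.read_text())
        response = query_llm(data["prompt"])  # assumed "prompt" field
        (response_dir / prompt_file.name).write_text(
            json.dumps({"response": response}, indent=2)
        )
    ```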
  - Simulation:

    To see the effect of our magic actions, refer to this notebook.
  - Evaluate all modules in one command:

    To evaluate all modules with default parameters, use the command below:

    ```bash
    eai-eval --all
    ```

    This command will automatically traverse all unspecified parameter options.

    Example usage:

    ```bash
    eai-eval --all --dataset virtualhome
    ```

    This will run both `generate_prompts` and `evaluate_results` for all modules in the `virtualhome` dataset. Make sure to download our results first (see Evaluate results above) if you don't want to specify `<path_to_responses>`.
Docker
We provide a ready-to-use Docker image for easy installation and usage.
First, pull the Docker image from Docker Hub:

```bash
docker pull jameskrw/eai-eval
```

Next, run the Docker container interactively:

```bash
docker run -it jameskrw/eai-eval
```
Test the Docker setup by running the CLI inside the container:

```bash
eai-eval
```

By default, this will start generating prompts for goal interpretation in BEHAVIOR.
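To keep results on the host rather than inside the container, you can mount a local directory and point `--output-dir` at it. This is a sketch under two assumptions: the image accepts a command override (as the interactive usage above suggests), and the container-side path below is arbitrary; use whatever path you pass to `--output-dir`:

```bash
docker run -it -v "$(pwd)/eai-output:/data/output" jameskrw/eai-eval \
  eai-eval --all --output-dir /data/output
```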
BibTeX

If you find our work helpful, please consider citing it:

```bibtex
@inproceedings{li2024embodied,
  title={Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making},
  author={Li, Manling and Zhao, Shiyu and Wang, Qineng and Wang, Kangrui and Zhou, Yu and Srivastava, Sanjana and Gokmen, Cem and Lee, Tony and Li, Li Erran and Zhang, Ruohan and others},
  booktitle={NeurIPS 2024},
  year={2024}
}
```