Embodied Agent Interface (EAI): Benchmarking LLMs for Embodied Decision Making

arXiv · Website · EmbodiedAgentInterface Dataset (Hugging Face) · Docker · Docs · License: MIT

Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, Weiyu Liu, Percy Liang, Li Fei-Fei, Jiayuan Mao, Jiajun Wu

Stanford Vision and Learning Lab, Stanford University

(Figures: EAI teaser and dataset highlights)

Overview

We aim to evaluate Large Language Models (LLMs) for embodied decision-making. While many works leverage LLMs for decision-making in embodied environments, a systematic understanding of their performance is still lacking. These models are applied in different domains, for various purposes, and with diverse inputs and outputs. Current evaluations tend to rely on final success rates alone, making it difficult to pinpoint where LLMs fall short and how to leverage them effectively in embodied AI systems.

To address this gap, we propose the Embodied Agent Interface (EAI), which unifies:

  1. A broad set of embodied decision-making tasks involving both state and temporally extended goals.
  2. Four commonly used LLM-based modules: goal interpretation, subgoal decomposition, action sequencing, and transition modeling.
  3. Fine-grained evaluation metrics, identifying errors such as hallucinations, affordance issues, and planning mistakes.

Our benchmark provides a comprehensive assessment of LLM performance across different subtasks, identifying their strengths and weaknesses in embodied decision-making contexts.
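
As a concrete illustration, here is how a single household task might flow through the four modules. This is a sketch only: the predicate names and dictionary layout below are illustrative stand-ins, not the dataset's actual schema (the two action strings are taken from the planner test in the Installation section).

    # Illustrative only: hypothetical input/output shapes for the four EAI
    # modules on a toy "switch on the light" task. Predicate names and the
    # schema are assumptions; see the docs for the dataset's real format.
    task = "Turn on the living room light."

    goal_interpretation = {          # natural language -> symbolic goal conditions
        "input": task,
        "output": ["toggled_on(light)"],
    }
    subgoal_decomposition = {        # goal -> intermediate state subgoals
        "input": ["toggled_on(light)"],
        "output": ["next_to(character, light)", "toggled_on(light)"],
    }
    action_sequencing = {            # goal -> executable action plan
        "input": ["toggled_on(light)"],
        "output": ["walk_towards character light", "switch_on character light"],
    }
    transition_modeling = {          # action -> PDDL-style preconditions/effects
        "input": "switch_on",
        "output": {"precondition": "(next_to character light)",
                   "effect": "(toggled_on light)"},
    }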

Installation

  1. Create and Activate a Conda Environment:

    conda create -n eai-eval python=3.8 -y 
    conda activate eai-eval
    
  2. Install eai:

    You can install it from PyPI:

    pip install eai-eval
    

    Or, install from source:

    git clone https://github.com/embodied-agent-interface/embodied-agent-interface.git
    cd embodied-agent-interface
    pip install -e .
    
  3. (Optional) Test the PDDL planner for transition modeling: If you want to evaluate transition_modeling, we highly recommend first testing that the PDDL planner is installed correctly. You can test it by running:

    python examples/pddl_tester.py
    

    If the output is Results: ['walk_towards character light', 'switch_on character light'], the installation is successful. Otherwise, refer to the BUILD.md under pddlgym_planners/ for more instructions.
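
    For reference, here is a minimal sketch of how such planners are typically invoked, assuming pddlgym is available and the bundled pddlgym_planners/ copy follows the upstream ronuchit/pddlgym_planners API; examples/pddl_tester.py remains the authoritative check for this repo.

    import pddlgym
    from pddlgym_planners.ff import FF  # Fast-Forward; pddlgym_planners.fd.FD also exists

    # Plan in any pddlgym environment; EAI's own tester targets a household domain.
    env = pddlgym.make("PDDLEnvBlocks-v0")
    state, _ = env.reset()
    planner = FF()
    plan = planner(env.domain, state)  # returns a list of ground actions
    print(plan)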

  4. (Optional) Install iGibson for behavior evaluation:

    If you need to use behavior_eval, install iGibson. Follow these steps to minimize installation issues:

    • Make sure you are using Python 3.8 and meet the minimum system requirements in the iGibson installation guide.

    • Install CMake using Conda (do not use pip):

      conda install cmake
      
    • Install iGibson: We provide an installation script:

      python -m behavior_eval.utils.install_igibson_utils
      

      Alternatively, install it manually:

      git clone https://github.com/embodied-agent-interface/iGibson.git --recursive
      cd iGibson
      pip install -e .
      
    • Download assets:

      python -m behavior_eval.utils.download_utils
      

    We have successfully tested installation on Linux, Windows 10+, and macOS.
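
As a quick post-install sanity check, you can confirm that the two top-level packages invoked by the commands in this README import cleanly. This snippet is a convenience sketch, not part of the package; behavior_eval will only import once iGibson is set up (step 4).

    import importlib

    # eai_eval and behavior_eval are the module names used elsewhere in this README.
    for module in ["eai_eval", "behavior_eval"]:
        try:
            importlib.import_module(module)
            print(f"{module}: OK")
        except ImportError as exc:
            print(f"{module}: not available ({exc})")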

Quick Start

  1. Arguments:

    eai-eval \
      --dataset {virtualhome,behavior} \
      --mode {generate_prompts,evaluate_results} \
      --eval-type {action_sequencing,transition_modeling,goal_interpretation,subgoal_decomposition} \
      --llm-response-path <path_to_responses> \
      --output-dir <output_directory> \
      --num-workers <number_of_workers>
    

    Run the following command for further information:

    eai-eval --help
    
  2. Examples:

  • Evaluate Results

    If you don't want to specify <path_to_responses>, make sure to download our results first:

    python -m eai_eval.utils.download_utils
    

    Then, run the commands below:

    eai-eval --dataset virtualhome --eval-type action_sequencing --mode evaluate_results
    eai-eval --dataset virtualhome --eval-type transition_modeling --mode evaluate_results
    eai-eval --dataset virtualhome --eval-type goal_interpretation --mode evaluate_results
    eai-eval --dataset virtualhome --eval-type subgoal_decomposition --mode evaluate_results
    eai-eval --dataset behavior --eval-type action_sequencing --mode evaluate_results
    eai-eval --dataset behavior --eval-type transition_modeling --mode evaluate_results
    eai-eval --dataset behavior --eval-type goal_interpretation --mode evaluate_results
    eai-eval --dataset behavior --eval-type subgoal_decomposition --mode evaluate_results
    
  • Generate Prompts

    To generate prompts, you can run the commands below (an end-to-end sketch connecting them to evaluate_results follows the Simulation example):

    eai-eval --dataset virtualhome --eval-type action_sequencing --mode generate_prompts
    eai-eval --dataset virtualhome --eval-type transition_modeling --mode generate_prompts
    eai-eval --dataset virtualhome --eval-type goal_interpretation --mode generate_prompts
    eai-eval --dataset virtualhome --eval-type subgoal_decomposition --mode generate_prompts
    eai-eval --dataset behavior --eval-type action_sequencing --mode generate_prompts
    eai-eval --dataset behavior --eval-type transition_modeling --mode generate_prompts
    eai-eval --dataset behavior --eval-type goal_interpretation --mode generate_prompts
    eai-eval --dataset behavior --eval-type subgoal_decomposition --mode generate_prompts
    
  • Simulation

    To see the effect of our magic actions, refer to this notebook.
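
  • End-to-End Sketch

    The two modes are designed to run back to back: generate_prompts writes prompts, you query an LLM of your choice, and evaluate_results scores the saved responses. Below is a hedged sketch of that glue code. The prompt/response file layout and the query_llm helper are assumptions, not the package's API; check the files generate_prompts actually writes in your output directory before adapting it.

    import json
    import subprocess

    OUTPUT_DIR = "output"

    def query_llm(prompt: str) -> str:
        """Hypothetical stand-in for your model call (API or local)."""
        raise NotImplementedError

    # 1. Generate prompts for one module.
    subprocess.run(["eai-eval", "--dataset", "virtualhome",
                    "--eval-type", "action_sequencing",
                    "--mode", "generate_prompts",
                    "--output-dir", OUTPUT_DIR], check=True)

    # 2. Fill in LLM responses (file names below are placeholders).
    with open(f"{OUTPUT_DIR}/prompts.json") as f:
        prompts = json.load(f)
    responses = [{"id": p["id"], "response": query_llm(p["prompt"])}
                 for p in prompts]
    with open(f"{OUTPUT_DIR}/responses.json", "w") as f:
        json.dump(responses, f)

    # 3. Score the saved responses.
    subprocess.run(["eai-eval", "--dataset", "virtualhome",
                    "--eval-type", "action_sequencing",
                    "--mode", "evaluate_results",
                    "--llm-response-path", f"{OUTPUT_DIR}/responses.json",
                    "--output-dir", OUTPUT_DIR], check=True)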

  3. Evaluate All Modules in One Command

    To evaluate all modules with default parameters, use the command below:

    eai-eval --all
    

    This command will automatically traverse all unspecified parameter options (see the sketch at the end of this section).

    Example Usage:

    eai-eval --all --dataset virtualhome
    

    This will run both generate_prompts and evaluate_results for all modules on the virtualhome dataset. If you don't want to specify <path_to_responses>, make sure to download our results first.
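
    Conceptually, --all expands to the cross product of the documented option lists. Here is a small sketch of the equivalent enumeration (the real traversal order may differ):

    from itertools import product

    # Option lists as documented under "Arguments" above.
    datasets = ["virtualhome"]  # fixed here by --dataset; omit the flag to cover behavior too
    modes = ["generate_prompts", "evaluate_results"]
    eval_types = ["action_sequencing", "transition_modeling",
                  "goal_interpretation", "subgoal_decomposition"]

    for dataset, mode, eval_type in product(datasets, modes, eval_types):
        print(f"eai-eval --dataset {dataset} --eval-type {eval_type} --mode {mode}")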

Docker

We provide a ready-to-use Docker image for easy installation and usage.

First, pull the Docker image from Docker Hub:

docker pull jameskrw/eai-eval

Next, run the Docker container interactively:

docker run -it jameskrw/eai-eval

Test the Docker container:

eai-eval

By default, this will start generating prompts for goal interpretation in Behavior.

BibTeX

If you find our work helpful, please consider citing it:

@inproceedings{li2024embodied,
  title={Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making},
  author={Li, Manling and Zhao, Shiyu and Wang, Qineng and Wang, Kangrui and Zhou, Yu and Srivastava, Sanjana and Gokmen, Cem and Lee, Tony and Li, Li Erran and Zhang, Ruohan and others},
  booktitle={NeurIPS 2024},
  year={2024}
}

