Plancraft: an evaluation dataset for planning with LLM agents

plancraft

Paper | Website

Plancraft is a Minecraft environment and agent for studying planning with LLM agents, and it includes an oracle RAG retriever.

You can install the package by running the following command:

pip install plancraft

The package provides a multimodal environment and dataset for evaluating planning agents. The dataset consists of examples of crafting tasks in Minecraft, where the agent must craft a target object from a set of initial items. The environment is a simplified version of Minecraft where the agent can move items between slots in an inventory and smelt items to create new items. The agent must craft the target object by moving or smelting items around in the inventory.

Usage

The package provides a PlancraftEnvironment class that can be used to interact with the environment. Here is an example of how to use it:

from plancraft.environments.env import PlancraftEnvironment
# Assumed import path for the action classes; adjust it to your installed version.
from plancraft.environments.actions import MoveAction, SmeltAction


def main():
    # Create the environment: slot 10 holds 10 iron ores, slot 23 holds 23 oak logs
    env = PlancraftEnvironment(
        inventory={
            10: dict(type="iron_ore", quantity=10),
            23: dict(type="oak_log", quantity=23),
        }
    )
    # Move one log from slot 23 to slot 1
    move_action = MoveAction(from_slot=23, to_slot=1, quantity=1)
    observation = env.step(move_action)
    # observation["inventory"] contains the updated symbolic inventory
    # observation["image"] contains the updated image of the inventory

    # Smelt one iron ore from slot 10 into slot 11
    smelt_action = SmeltAction(from_slot=10, to_slot=11, quantity=1)
    observation = env.step(smelt_action)

    # No-op: step without an action
    observation = env.step()


if __name__ == "__main__":
    main()

Note that the environment is deterministic and stateful: from the same inventory state, the same action always produces the same observation, and the environment keeps track of the inventory state between steps.
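To make "deterministic and stateful" concrete, here is a toy sketch in plain Python. It is not Plancraft's implementation: the `move` function and its rule that invalid moves leave the inventory unchanged are assumptions made for illustration. It only reuses the slot-to-item dictionary shape from the example above.

```python
import copy


def move(inventory, from_slot, to_slot, quantity):
    """Toy move (hypothetical helper): return a new inventory with
    `quantity` items moved from one slot to another.
    Assumption: an invalid move returns the inventory unchanged."""
    new = copy.deepcopy(inventory)
    src = new.get(from_slot)
    if src is None or quantity <= 0 or src["quantity"] < quantity:
        return new
    dst = new.get(to_slot)
    if dst is not None and dst["type"] != src["type"]:
        return new  # cannot stack different item types
    if dst is None:
        new[to_slot] = {"type": src["type"], "quantity": quantity}
    else:
        dst["quantity"] += quantity
    src["quantity"] -= quantity
    if src["quantity"] == 0:
        del new[from_slot]
    return new


start = {23: {"type": "oak_log", "quantity": 23}}

# Deterministic: the same action from the same state gives the same result.
first = move(start, 23, 1, 1)
second = move(start, 23, 1, 1)

# Stateful: applying the action again to the result keeps accumulating.
after_two = move(first, 23, 1, 1)
```

The real environment keeps the inventory inside the `PlancraftEnvironment` object instead of returning a copy, but the same two properties apply to each call to `env.step`.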

Evaluator

The package also provides an Evaluator class that can be used to evaluate the performance of an agent on our specific dataset. Here is an example of how to use it:

from plancraft.evaluator import Evaluator
from plancraft.config import EvalConfig
# Assumed import path for get_model; adjust it to your installed version.
from plancraft.models import get_model


def main():
    # Create the config
    config = EvalConfig(...)
    # Create the model. You can also create your own model by subclassing PlancraftBaseModel
    model = get_model(config)
    # Create the evaluator
    evaluator = Evaluator(config, model=model)
    # Evaluate the agent on all seeds
    evaluator.eval_all_seeds()

The Evaluator class handles the environment loop and the interaction with the model. It creates the environment from the configuration, loads the examples from the dataset, and initializes the environment with each example's inventory. It is also responsible for early stopping and for verifying that the target object has been crafted. Finally, it saves the results of the evaluation and the images generated along the way.

The Evaluator interactive loop

The evaluator loop for each example is as follows:

# Initialize success and non-environment actions counter
success = False
num_non_env_actions = 0

# Reset the environment and example
reset(example)
# No action yet: the first iteration steps the environment without a model action
action = None

# Run the evaluation loop
while not history.check_stuck() and history.num_steps < max_steps:
    if action == StopAction:  # StopAction ends the episode
        success = example.impossible  # Success if task is impossible
        break
    elif isinstance(action, str) and num_non_env_actions < 3:  
        # Handle external tool action (str message)
        observation = {"message": action}
        num_non_env_actions += 1
    else:  
        # Handle environment action
        if isinstance(action, str):  
            # Handle invalid case (exceeded non-env action limit)
            observation = environment.step()
        else:
            history.add_action_to_history(action)  # Add action to history
            observation = environment.step(action)

        # Convert observation to message and reset non-env counter
        observation["target"] = example.target
        observation["message"] = convert_observation_to_message(observation)
        num_non_env_actions = 0

        # Check if episode is complete
        success = check_done(observation["inventory"], example.target)

    # Update history with observation and message
    history.add_observation_to_history(observation)
    history.add_message_to_history(content=observation["message"], role="user")

    if success:  # Exit loop if success
        break

    # Model predicts next action
    raw_action = model.step(observation, dialogue_history=history)

    # Update history with predicted action
    history.add_message_to_history(content=raw_action, role="assistant")

    # Parse raw action into a structured format
    action = parse_raw_model_response(raw_action)

# Return results after evaluation
return {
    "success": success,
    "recipe_type": example.recipe_type,
    "complexity": example.complexity,
    "number_of_steps": history.num_steps,
    "model_trace": history.trace(),
    "example_id": example.id,
    "impossible": example.impossible,
}
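The handling of string (non-environment) actions in the loop above can be isolated into a small sketch. The `ActionRouter` class and its return labels are hypothetical names introduced for illustration, not part of the package. Only the limit of 3 consecutive string actions and the reset of the counter come from the loop.

```python
from dataclasses import dataclass

MAX_NON_ENV_ACTIONS = 3  # limit used in the loop above


@dataclass
class ActionRouter:
    """Hypothetical helper that tracks consecutive non-environment actions."""

    consecutive_non_env: int = 0

    def route(self, action) -> str:
        # A string action is passed back as a message while under the limit.
        if isinstance(action, str) and self.consecutive_non_env < MAX_NON_ENV_ACTIONS:
            self.consecutive_non_env += 1
            return "message"
        # Otherwise the environment is stepped and the counter resets:
        # a string over the limit becomes a no-op, anything else is a real action.
        self.consecutive_non_env = 0
        return "noop" if isinstance(action, str) else "env"


router = ActionRouter()
routes = [router.route("think about the recipe") for _ in range(4)]
```

Under these assumptions, a model can emit at most three string actions in a row. The fourth is replaced by an environment no-op, which still counts as a step and resets the counter.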

Observation

The observation returned by the PlancraftEnvironment class is a dictionary with two keys: inventory and image. The inventory key contains a dictionary that maps each slot number to the item in that slot (e.g. {"type": "iron_ingot", "quantity": 2}). The image key contains a NumPy array representing the image of the inventory.
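As a minimal sketch of working with the symbolic inventory, the snippet below totals item quantities across slots. The `count_items` helper and the sample inventory are hypothetical. The only thing taken from the description above is the shape of the dictionary: slot number mapped to a dict with type and quantity.

```python
from collections import Counter


def count_items(inventory: dict) -> Counter:
    """Hypothetical helper: total quantity per item type across all slots."""
    totals = Counter()
    for item in inventory.values():
        totals[item["type"]] += item["quantity"]
    return totals


# Sample data in the same shape as observation["inventory"]
inventory = {
    10: {"type": "iron_ore", "quantity": 10},
    11: {"type": "iron_ingot", "quantity": 2},
    23: {"type": "oak_log", "quantity": 23},
    24: {"type": "oak_log", "quantity": 1},
}

totals = count_items(inventory)
print(totals["oak_log"])  # 24
```

A Counter returns 0 for item types that are absent, so looking up an item that is not in the inventory does not raise a KeyError.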

The observation returned by the Evaluator class is a dictionary with four keys: inventory, image, message, and target. The message key contains a string that describes the environment state as text (we follow the annotation scheme described in our paper). The target key contains a string naming the target object to be crafted.

Implementing a Model

To implement a model, you need to subclass the PlancraftBaseModel class and implement the step and reset methods. See the plancraft.models.dummy module for an example of how to implement a basic model.
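The sketch below shows the shape such a model might take, based only on how the evaluator loop calls it: `model.step(observation, dialogue_history=history)` returns a raw string. The `CountingModel` class is a hypothetical stand-in that does not import plancraft; in real use it would subclass PlancraftBaseModel, whose exact signatures should be checked in the package. The returned text is a placeholder, since the action syntax expected by the parser is not shown here.

```python
class CountingModel:
    """Hypothetical stand-in for a PlancraftBaseModel subclass.

    Assumption: step() receives the evaluator's observation dict and the
    dialogue history, and returns a raw string that the evaluator parses.
    """

    def __init__(self) -> None:
        self.num_calls = 0

    def reset(self) -> None:
        # Clear per-episode state before a new example starts.
        self.num_calls = 0

    def step(self, observation: dict, dialogue_history=None) -> str:
        # Placeholder policy: report what was seen instead of acting.
        self.num_calls += 1
        return f"Step {self.num_calls}: the target is {observation['target']}."


model = CountingModel()
reply = model.step({"target": "iron_ingot", "message": "..."})
```

Keeping per-episode state in attributes that reset clears means state from one example cannot leak into the next.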

You should then be able to use the Evaluator class to evaluate it.

Reproducing the Results tables in the paper

To reproduce the results tables in the paper, you can use the exps.sh script in the root directory. The script runs the evaluation for all the models and seeds specified in the paper. The results are saved in the output directory, and also logged to wandb if you have an account and set the WANDB_API_KEY environment variable.

Docker

A Docker image bundles the latest code and its dependencies. I build it by running the following command:

docker buildx build --platform linux/amd64,linux/arm64 -t gautierdag/plancraft --push .

The image is available on Docker Hub. Note that, unlike the package, the docker image includes everything in the repo.

To Do

Non-exhaustive list of things to do from highest to lowest priority:

  • Add a Minecraft wiki scrape and non-oracle search for pages
  • Improve the planner to bring it closer to optimal (the oracle planner does not consider future crafting steps when moving items; see the paper for more details)
  • Rerun image models with a better bounding box model
    • Track bounding box accuracy
  • Implement a version of the image environment entirely on CUDA/PyTorch rather than on CPU

PRs Welcomed

If you would like to contribute to the project, please feel free to open a PR. I am happy to review and merge PRs that improve the project. If you have any questions, feel free to create an issue or reach out to me directly.

Citation

@misc{dagan2024plancraftevaluationdatasetplanning,
      title={Plancraft: an evaluation dataset for planning with LLM agents}, 
      author={Gautier Dagan and Frank Keller and Alex Lascarides},
      year={2024},
      eprint={2412.21033},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.21033}, 
}
