Plancraft: an evaluation dataset for planning with LLM agents

plancraft

Paper | Website

Plancraft is a Minecraft environment and agent for evaluating planning with LLM agents, and it includes an oracle RAG retriever.

You can install the package by running the following command:

pip install plancraft

Or:

uv add plancraft

The package provides a multimodal environment and dataset for evaluating planning agents. The dataset consists of examples of crafting tasks in Minecraft, where the agent must craft a target object from a set of initial items. The environment is a simplified version of Minecraft where the agent can move items between slots in an inventory and smelt items to create new items. The agent must craft the target object by moving or smelting items around in the inventory.

Usage

The package provides a PlancraftEnvironment class that can be used to interact with the environment. Here is an example of how to use it:

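from plancraft.environments.env import PlancraftEnvironment
# Assumed import path for the action classes; adjust it if your installed version differs
from plancraft.environments.actions import MoveAction, SmeltAction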


def main():
    # Create the environment with an inventory containing 10 iron ores and 23 oak logs
    env = PlancraftEnvironment(
        inventory={
          10: dict(type="iron_ore", quantity=10),
          23: dict(type="oak_log", quantity=23)
        }
    )
    # move one log to slot 1
    move_action = MoveAction(from_slot=23, to_slot=1, quantity=1)
    observation = env.step(move_action)
    # observation["inventory"] contains the updated symbolic inventory
    # observation["image"] contains the updated image of the inventory

    # smelt one iron ore
    smelt_action = SmeltAction(from_slot=10, to_slot=11, quantity=1)
    observation = env.step(smelt_action)

    # no op
    observation = env.step()

Note that the environment is deterministic and stateful: applying the same action to the same inventory state always produces the same observation, and the environment keeps track of the inventory between steps.
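
To make the "deterministic and stateful" point concrete, here is a minimal toy sketch in plain Python. It does not use the plancraft package: ToyInventory and its move method are hypothetical stand-ins that only mimic the slot-to-item dictionary format used above, and the handling of invalid moves is an assumption of this toy rather than a description of PlancraftEnvironment.

```python
import copy


class ToyInventory:
    """Hypothetical stand-in for a stateful inventory; not the plancraft API."""

    def __init__(self, inventory):
        # slot number -> {"type": str, "quantity": int}
        self.state = copy.deepcopy(inventory)

    def move(self, from_slot, to_slot, quantity):
        """Move `quantity` items between slots and return a copy of the new state."""
        source = self.state.get(from_slot)
        target = self.state.get(to_slot)
        invalid = (
            from_slot == to_slot
            or source is None
            or source["quantity"] < quantity
            or (target is not None and target["type"] != source["type"])
        )
        if invalid:
            # Assumption of this toy: an invalid move leaves the state unchanged
            return copy.deepcopy(self.state)
        item_type = source["type"]
        source["quantity"] -= quantity
        if source["quantity"] == 0:
            del self.state[from_slot]
        if target is None:
            self.state[to_slot] = {"type": item_type, "quantity": quantity}
        else:
            target["quantity"] += quantity
        return copy.deepcopy(self.state)


start = {23: {"type": "oak_log", "quantity": 23}}

# Deterministic: two fresh inventories given the same action agree.
first = ToyInventory(start).move(23, 1, 1)
second = ToyInventory(start).move(23, 1, 1)

# Stateful: repeating the action on one inventory keeps accumulating.
inv = ToyInventory(start)
inv.move(23, 1, 1)
after_two_moves = inv.move(23, 1, 1)
```

In this toy, first and second are equal, while after_two_moves differs from both because the second move starts from the state left by the first.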

Evaluator

The package also provides an Evaluator class that can be used to evaluate the performance of an agent on our specific dataset. Here is an example of how to use it:

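from plancraft.evaluator import Evaluator
# Assumed import path for get_model; adjust it if your installed version differs
from plancraft.models import get_model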

def main():
    # create model -- Note you can create your own model by subclassing PlancraftBaseModel
    model = get_model("dummy")
    # Create the evaluator
    evaluator = Evaluator(run_name="dummy", model=model)
    # Evaluate the agent
    evaluator.eval_all_examples()

The Evaluator class handles the environment loop and the interaction with the model. For each example in the dataset, it creates the environment from the configuration and initializes it with the example's inventory. It is also responsible for early stopping and for verifying that the target object has been crafted. Finally, it saves the results of the evaluation and the images generated during it.

The Evaluator interactive loop

The evaluator loop for each example is as follows:

# Initialize success and non-environment actions counter
success = False
num_non_env_actions = 0

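# Reset the environment and example
reset(example)
# No action has been predicted yet (assumed initial value), so the
# first iteration falls through to the environment branch below
action = None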

# Run the evaluation loop
while not history.check_stuck() and history.num_steps < max_steps:
    if action == StopAction:  # StopAction ends the episode
        success = example.impossible  # Success if task is impossible
        break
    elif isinstance(action, str) and num_non_env_actions < 3:  
        # Handle external tool action (str message)
        observation = {"message": action}
        num_non_env_actions += 1
    else:  
        # Handle environment action
        if isinstance(action, str):  
            # Handle invalid case (exceeded non-env action limit)
            observation = environment.step()
        else:
            observation = environment.step(action)

        # Convert observation to message and reset non-env counter
        observation["target"] = example.target
        observation["message"] = convert_observation_to_message(observation)
        num_non_env_actions = 0

        # Check if episode is complete
        success = check_done(observation["inventory"], example.target)

    if success:  # Exit loop if success
        break

    # Update history with observation and message
    history.add_observation_to_history(observation)
    history.add_message_to_history(content=observation["message"], role="user")
    # Model predicts next action
    raw_action = model.step(observation, dialogue_history=history)
    # Update history with predicted action
    history.add_message_to_history(content=raw_action, role="assistant")
    # Parse raw action into a structured format
    action = parse_raw_model_response(raw_action)

# Return results after evaluation
return {
    "success": success,
    "recipe_type": example.recipe_type,
    "complexity": example.complexity,
    "number_of_steps": history.num_steps,
    "model_trace": history.trace(),
    "example_id": example.id,
}
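
The branching on the action type in the loop above can be isolated into a small runnable sketch. This is an illustration under stated assumptions, not the Evaluator's code: route_action and the branch labels are hypothetical names, StopAction is a local stand-in class, and the limit of 3 is taken from the pseudocode.

```python
class StopAction:
    """Stand-in for the real stop action; hypothetical, for illustration only."""


def route_action(action, num_non_env_actions, limit=3):
    """Return (branch, updated_counter) following the loop's branching rules."""
    if isinstance(action, StopAction):
        return "stop", num_non_env_actions
    if isinstance(action, str) and num_non_env_actions < limit:
        # Text-only action (e.g. an external tool message): no environment step
        return "message", num_non_env_actions + 1
    if isinstance(action, str):
        # Too many consecutive text-only actions: force a no-op environment step
        return "noop_step", 0
    # Structured action: step the environment and reset the counter
    return "env_step", 0


counter = 0
branches = []
for action in ["think", "think", "think", "think", {"move": 1}, StopAction()]:
    branch, counter = route_action(action, counter)
    branches.append(branch)
```

Here the fourth consecutive text-only action exceeds the limit and is routed to a no-op environment step, which also resets the counter.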

Observation

The observation returned by the PlancraftEnvironment class is a dictionary with two keys: inventory and image. The inventory key contains a dictionary that maps each slot number to the item in that slot (e.g., {"type": "iron_ingot", "quantity": 2}). The image key contains a numpy array representing the image of the inventory.

The observation returned by the Evaluator class is a dictionary with the following keys: inventory, image, message, and target. The message key contains a string representing the environment formatted in text (we follow the annotation scheme described in our paper). The target key contains a string representing the target object to be crafted.
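
As a small illustration of working with the symbolic inventory, the sketch below sums quantities per item type and checks whether a target item is present. count_items and contains_target are hypothetical helpers written for this example; they assume only the slot-to-item dictionary format described above and are not the package's own completion check.

```python
from collections import Counter


def count_items(inventory):
    """Sum quantities per item type across all slots."""
    totals = Counter()
    for item in inventory.values():
        totals[item["type"]] += item["quantity"]
    return dict(totals)


def contains_target(inventory, target):
    """Return True if at least one `target` item is anywhere in the inventory."""
    return count_items(inventory).get(target, 0) > 0


# Example inventory in the format described above (values are made up)
inventory = {
    10: {"type": "iron_ore", "quantity": 9},
    11: {"type": "iron_ingot", "quantity": 1},
    12: {"type": "iron_ingot", "quantity": 1},
}
totals = count_items(inventory)
```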

Implementing a Model

To implement a model, you need to subclass the PlancraftBaseModel class and implement the step and reset methods. See the plancraft.models.dummy module for an example of how to implement a basic model.
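
The shape of such a subclass might look like the sketch below. To keep it self-contained, it defines BaseModelSketch as a hypothetical stand-in for PlancraftBaseModel. The step signature mirrors the call shown in the evaluator loop (model.step(observation, dialogue_history=history)), while the argument-free reset is an assumption, so check the real base class before copying this. The response strings are placeholders, not valid Plancraft actions.

```python
from abc import ABC, abstractmethod


class BaseModelSketch(ABC):
    """Hypothetical stand-in for PlancraftBaseModel; real signatures may differ."""

    @abstractmethod
    def step(self, observation, dialogue_history=None):
        """Return the model's raw text response for the current observation."""

    @abstractmethod
    def reset(self):
        """Clear any per-episode state before a new example starts."""


class ScriptedModel(BaseModelSketch):
    """Replays a fixed list of responses, then repeats the last one."""

    def __init__(self, responses):
        if not responses:
            raise ValueError("responses must not be empty")
        self.responses = list(responses)
        self.index = 0

    def step(self, observation, dialogue_history=None):
        # Clamp the index so the final response is repeated once the script ends
        response = self.responses[min(self.index, len(self.responses) - 1)]
        self.index += 1
        return response

    def reset(self):
        self.index = 0


model = ScriptedModel(["first placeholder response", "second placeholder response"])
outputs = [model.step({"message": "..."}) for _ in range(3)]
model.reset()
after_reset = model.step({"message": "..."})
```

A scripted model like this can be handy for exercising an evaluation loop without calling an LLM, since its output depends only on how many steps have been taken since the last reset.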

You should then be able to use the Evaluator class to evaluate it.

Reproducing the Results tables in the paper

To reproduce the results tables in the paper, you can use the exps.sh script in the root directory. The script runs the evaluation for all the models and seeds specified in the paper. The results are saved in the output directory and, if you have an account and set the WANDB_API_KEY environment variable, also logged to wandb.

Docker

There is a docker image built to incorporate the latest code and its dependencies. I build it by running the following command:

docker buildx build --platform linux/amd64,linux/arm64 -t gautierdag/plancraft --push .

The image is available on Docker Hub. Note that, unlike the package, the docker image includes everything in the repo.

To Do

Non-exhaustive list of things to do from highest to lowest priority:

  • Add Minecraft wiki scrape and non-oracle search for pages
  • Improve planner to bring closer to optimal (the oracle planner does not consider future crafting steps when moving items -- see paper for more details)
  • Rerun image models with better bounding box model
    • Track bounding box accuracy
  • Implement a version of the image environment entirely on cuda/pytorch rather than cpu

PRs Welcomed

If you would like to contribute to the project, please feel free to open a PR. I am happy to review and merge PRs that improve the project. If you have any questions, feel free to create an issue or reach out to me directly.

Citation

@misc{dagan2024plancraftevaluationdatasetplanning,
      title={Plancraft: an evaluation dataset for planning with LLM agents}, 
      author={Gautier Dagan and Frank Keller and Alex Lascarides},
      year={2024},
      eprint={2412.21033},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.21033}, 
}
