Plancraft: an evaluation dataset for planning with LLM agents

plancraft

Plancraft is a Minecraft environment and agent for evaluating planning with LLM agents, with an oracle RAG retriever.

You can install the package by running the following command:

pip install plancraft

Or:

uv add plancraft

The package provides a multimodal environment and dataset for evaluating planning agents. The dataset consists of examples of crafting tasks in Minecraft, where the agent must craft a target object from a set of initial items. The environment is a simplified version of Minecraft where the agent can move items between slots in an inventory and smelt items to create new items. The agent must craft the target object by moving or smelting items around in the inventory.

Usage

The package provides a PlancraftEnvironment class that can be used to interact with the environment. Here is an example of how to use it:

from plancraft.environments.env import PlancraftEnvironment
# Assumption: the action classes are importable from this module;
# adjust the path if your installed version differs.
from plancraft.environments.actions import MoveAction, SmeltAction


def main():
    # Create the environment with 10 iron ores in slot 10 and 23 oak logs in slot 23
    env = PlancraftEnvironment(
        inventory={
            10: dict(type="iron_ore", quantity=10),
            23: dict(type="oak_log", quantity=23),
        }
    )
    # Move one log from slot 23 to slot 1
    move_action = MoveAction(from_slot=23, to_slot=1, quantity=1)
    observation = env.step(move_action)
    # observation["inventory"] contains the updated symbolic inventory
    # observation["image"] contains the updated image of the inventory

    # Smelt one iron ore from slot 10 into slot 11
    smelt_action = SmeltAction(from_slot=10, to_slot=11, quantity=1)
    observation = env.step(smelt_action)

    # No-op: step without an action
    observation = env.step()


if __name__ == "__main__":
    main()

Note that the environment is deterministic and stateful, so the same action will always lead to the same observation and the environment will keep track of the state of the inventory.

Evaluator

The package also provides an Evaluator class that can be used to evaluate the performance of an agent on our specific dataset. Here is an example of how to use it:

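from plancraft.evaluator import Evaluator
# Assumption: get_model is importable from plancraft.models;
# adjust the path if your installed version differs.
from plancraft.models import get_model


def main():
    # Create the model. You can create your own model by subclassing PlancraftBaseModel
    model = get_model("dummy")
    # Create the evaluator
    evaluator = Evaluator(run_name="dummy", model=model)
    # Evaluate the agent on all examples
    evaluator.eval_all_examples()


if __name__ == "__main__":
    main()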

The Evaluator class handles the environment loop and model interaction. The environment is created based on the configuration, and the examples are loaded from the dataset. For each example, the Evaluator initializes the environment with the example's inventory. It is also responsible for early stopping and for verifying that the target object has been crafted. Finally, it saves the results of the evaluation and the images generated during the evaluation.

The Evaluator interactive loop

The evaluator loop for each example is as follows:

# Initialize success flag, non-environment actions counter, and first action
success = False
num_non_env_actions = 0
action = None  # no action has been predicted before the first iteration

# Reset the environment and example
reset(example)

# Run the evaluation loop
while not history.check_stuck() and history.num_steps < max_steps:
    if isinstance(action, StopAction):  # StopAction ends the episode
        success = example.impossible  # Success if task is impossible
        break
    elif isinstance(action, str) and num_non_env_actions < 3:  
        # Handle external tool action (str message)
        observation = {"message": action}
        num_non_env_actions += 1
    else:  
        # Handle environment action
        if isinstance(action, str):  
            # Handle invalid case (exceeded non-env action limit)
            observation = environment.step()
        else:
            observation = environment.step(action)

        # Convert observation to message and reset non-env counter
        observation["target"] = example.target
        observation["message"] = convert_observation_to_message(observation)
        num_non_env_actions = 0

        # Check if episode is complete
        success = check_done(observation["inventory"], example.target)

    if success:  # Exit loop if success
        break

    # Update history with observation and message
    history.add_observation_to_history(observation)
    history.add_message_to_history(content=observation["message"], role="user")
    # Model predicts next action
    raw_action = model.step(observation, dialogue_history=history)
    # Update history with predicted action
    history.add_message_to_history(content=raw_action, role="assistant")
    # Parse raw action into a structured format
    action = parse_raw_model_response(raw_action)

# Return results after evaluation
return {
    "success": success,
    "recipe_type": example.recipe_type,
    "complexity": example.complexity,
    "number_of_steps": history.num_steps,
    "model_trace": history.trace(),
    "example_id": example.id,
}
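The routing of string actions in the loop above can be sketched in isolation. This is a minimal, self-contained illustration rather than the package's code: `route_actions` and `FakeMove` are hypothetical names, and `StopAction` handling is left out. It only shows how the `num_non_env_actions < 3` check caps consecutive non-environment actions before the loop falls back to a no-op environment step.

```python
MAX_NON_ENV_ACTIONS = 3  # mirrors the `< 3` check in the loop above


def route_actions(actions):
    """Label each parsed action the way the loop above would route it."""
    routed = []
    num_non_env_actions = 0
    for action in actions:
        if isinstance(action, str) and num_non_env_actions < MAX_NON_ENV_ACTIONS:
            # String actions are tool messages and do not touch the environment
            routed.append("message")
            num_non_env_actions += 1
        elif isinstance(action, str):
            # Budget exhausted: the loop falls back to a no-op environment step
            routed.append("no-op step")
            num_non_env_actions = 0
        else:
            # Any non-string action is passed to the environment
            routed.append("env step")
            num_non_env_actions = 0
    return routed


class FakeMove:
    """Hypothetical stand-in for a structured environment action."""


labels = route_actions(["think", "search", "think", "think", FakeMove(), "think"])
print(labels)
```

The fourth consecutive string is routed to a no-op step, and any environment step resets the counter, so the final string counts as a fresh message.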

Observation

The observation returned by the PlancraftEnvironment class is a dictionary with two keys: inventory and image. The inventory key contains a dictionary with the slot number as the key and the item in the slot as the value (e.g. {"type": "iron_ingot", "quantity": 2}). The image key contains a NumPy array representing the image of the inventory.
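As a small illustration of working with the symbolic inventory, the sketch below sums quantities per item type across slots. It assumes only the shape described above (slot number mapped to a dictionary with type and quantity); `count_items` is a hypothetical helper, not part of the package, and the inventory is built by hand so the snippet runs without an environment.

```python
from collections import Counter


def count_items(inventory):
    """Sum quantities per item type across all slots."""
    totals = Counter()
    for item in inventory.values():
        totals[item["type"]] += item["quantity"]
    return totals


# Hand-written inventory in the documented shape: slot number -> item
inventory = {
    1: {"type": "oak_log", "quantity": 1},
    10: {"type": "iron_ore", "quantity": 10},
    23: {"type": "oak_log", "quantity": 22},
}
totals = count_items(inventory)
print(totals["oak_log"])  # 23
```

In practice the same function could be applied to observation["inventory"] after each step.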

The observation returned by the Evaluator class is a dictionary with the following keys: inventory, image, message, and target. The message key contains a string representing the environment formatted in text (we follow the annotation scheme described in our paper). The target key contains a string representing the target object to be crafted.

Implementing a Model

To implement a model, you need to subclass the PlancraftBaseModel class and implement the step and reset methods. See the plancraft.models.dummy module for an example of how to implement a basic model.

You should then be able to use the Evaluator class to evaluate it.
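The shape of such a model can be sketched as follows. This is an assumption-laden illustration: `FixedResponseModel` is a hypothetical name, it is written as a plain class so the snippet runs without the package installed (in real use it would subclass PlancraftBaseModel), and the step signature is taken from the call `model.step(observation, dialogue_history=history)` in the evaluator loop. The response string is a placeholder, not a valid action format.

```python
class FixedResponseModel:
    """Toy model that always answers with the same raw string.

    In real use this would subclass PlancraftBaseModel; a plain class is
    used here so the sketch runs without the package installed.
    """

    def __init__(self, response):
        self.response = response
        self.num_calls = 0

    def reset(self):
        # Clear per-episode state (assumed to be called between examples)
        self.num_calls = 0

    def step(self, observation, dialogue_history=None):
        # Return the raw text response; the evaluator parses it afterwards
        self.num_calls += 1
        return self.response


model = FixedResponseModel(response="<raw action text>")
raw_action = model.step({"message": "inventory description", "target": "iron_ingot"})
calls_before_reset = model.num_calls
model.reset()
```

A real model would read observation["message"] (and optionally observation["image"]) and the dialogue history to decide on its next action.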

Reproducing the Results tables in the paper

To reproduce the results tables in the paper, you can use the exps.sh script in the root directory. The script runs the evaluation for all the models and seeds specified in the paper. The results are saved in the output directory, and are also logged to wandb if you have an account and set the WANDB_API_KEY environment variable.
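Once per-example results are available, a success-rate table can be computed along these lines. This is only a sketch: it assumes each result is a dictionary with the keys returned by the evaluator loop (success, complexity, recipe_type), `success_rate_by` is a hypothetical helper, and the complexity values are made-up placeholders. Loading results from the output directory is left out because the on-disk format is not described here.

```python
from collections import defaultdict


def success_rate_by(results, key):
    """Group result dictionaries by `key` and return the success rate per group."""
    counts = defaultdict(lambda: [0, 0])  # group value -> [successes, total]
    for result in results:
        bucket = counts[result[key]]
        bucket[0] += int(result["success"])
        bucket[1] += 1
    return {value: successes / total for value, (successes, total) in counts.items()}


# Made-up results with placeholder complexity labels
results = [
    {"success": True, "complexity": "easy"},
    {"success": False, "complexity": "easy"},
    {"success": True, "complexity": "hard"},
]
rates = success_rate_by(results, "complexity")
print(rates)  # {'easy': 0.5, 'hard': 1.0}
```

The same helper can group by recipe_type by passing a different key.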

Docker

There is a docker image built to incorporate the latest code and its dependencies. I build it by running the following command:

docker buildx build --platform linux/amd64,linux/arm64 -t gautierdag/plancraft --push .

The image is available on Docker Hub. Note that, unlike the package, the docker image includes everything in the repo.

To Do

Non-exhaustive list of things to do from highest to lowest priority:

- Add Minecraft wiki scrape and non-oracle search for pages
- Improve the planner to bring it closer to optimal (the oracle planner does not consider future crafting steps when moving items; see the paper for more details)
- Rerun image models with a better bounding box model
  - Track bounding box accuracy
- Implement a version of the image environment entirely on CUDA/PyTorch rather than on CPU

PRs Welcomed

If you would like to contribute to the project, please feel free to open a PR. I am happy to review and merge PRs that improve the project. If you have any questions, feel free to create an issue or reach out to me directly.

Citation

@misc{dagan2024plancraftevaluationdatasetplanning,
      title={Plancraft: an evaluation dataset for planning with LLM agents}, 
      author={Gautier Dagan and Frank Keller and Alex Lascarides},
      year={2024},
      eprint={2412.21033},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.21033}, 
}
