Craftax LM
A wrapper around the Craftax agent benchmark, for evaluating digital agents over extremely long time horizons.
Craftax-Classic
| LM | Algorithm | Score (% max) | Code |
|---|---|---|---|
| claude-3-7-sonnet-latest (default) | ReAct | 18.0 | |
| claude-3-5-sonnet-20241022 | ReAct | 17.8 | |
| claude-3-5-sonnet-20240620 | ReAct | 15.7 | |
| o3-mini | ReAct | 12.6 | |
| gpt-4o | ReAct | 7.0 | |
- Note: this is a limited evaluation in which each trajectory is terminated after 30 API calls, or roughly 150 in-game steps. Ten trajectories are rolled out, yielding a log-weighted score as per the Crafter paper. Reproducible code is forthcoming.
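As a rough illustration of the termination rule in the note above, here is a minimal sketch of a rollout capped by API calls. It does not use the craftaxlm API: `propose_actions` and `step` are hypothetical callables standing in for one LM call and one environment step, and the idea that each LM call returns several actions is an assumption inferred from the 30 calls to roughly 150 steps ratio.

```python
from typing import Callable, List, Tuple

MAX_API_CALLS = 30  # cap stated in the note above


def run_capped_rollout(
    propose_actions: Callable[[], List[str]],  # hypothetical: one LM call -> batch of actions
    step: Callable[[str], bool],               # hypothetical: apply one action, return True if episode ended
    max_api_calls: int = MAX_API_CALLS,
) -> Tuple[int, int]:
    """Run one trajectory until the API-call budget is spent or the episode ends.

    Returns (api_calls_used, env_steps_taken).
    """
    api_calls = 0
    env_steps = 0
    while api_calls < max_api_calls:
        actions = propose_actions()
        api_calls += 1
        for action in actions:
            env_steps += 1
            if step(action):
                # Episode finished before the budget ran out.
                return api_calls, env_steps
    return api_calls, env_steps
```

With an agent that emits five actions per call and an episode that never ends early, this loop stops at 30 calls and 150 environment steps, matching the figures in the note.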
Usage
First, install the package with pip install craftaxlm. Next, import the agent-computer interface of your choice:
from craftaxlm import CraftaxACI, CraftaxClassicACI
This package is early in development, so for implementation examples, please refer to the baseline ReAct implementation.
Leaderboard
To keep experiments reasonable to run across a range of LMs, the leaderboard currently evaluates agents as follows:
- Five rollouts are sampled from the agent, with a hard cap of 300 actions per rollout.
- The agent is evaluated using a modified version of the original Crafter score:
  sum(ln(1 + P(1_achievement_obtained)) for achievement in achievements) / (ln(2) * len(achievements))
  where P(1_achievement_obtained) is the probability of the achievement being obtained in a single rollout. The key idea is that incremental progress towards difficult achievements ought to weigh more heavily in the score.
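The score described above can be sketched in a few lines of Python. This is an illustrative sketch, not the leaderboard's actual code: the function name and the input shape (one set of obtained achievement names per rollout) are assumptions, P is estimated as the fraction of rollouts in which an achievement was obtained, and the denominator is read as ln(2) times the number of achievements so that a perfect agent scores 1.0.

```python
import math
from typing import Iterable, Sequence, Set


def log_weighted_score(
    rollouts: Sequence[Set[str]],   # assumed shape: achievements obtained in each rollout
    achievements: Iterable[str],    # full list of achievements being scored
) -> float:
    """Normalized log-weighted score in [0, 1]."""
    achievements = list(achievements)
    if not rollouts or not achievements:
        raise ValueError("need at least one rollout and one achievement")

    total = 0.0
    for name in achievements:
        # Empirical probability of obtaining this achievement in a single rollout.
        p = sum(1 for obtained in rollouts if name in obtained) / len(rollouts)
        total += math.log(1.0 + p)

    # ln(2) per achievement is the maximum contribution (p == 1).
    return total / (math.log(2.0) * len(achievements))


if __name__ == "__main__":
    # Hypothetical data: two rollouts, two achievements.
    example = [{"collect_wood"}, {"collect_wood", "place_table"}]
    print(log_weighted_score(example, ["collect_wood", "place_table"]))
```

If the leaderboard's "% max" column is this ratio expressed as a percentage, multiplying the result by 100 would give a comparable number, though the source does not confirm that.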
Craftax-Full
| LM | Algorithm | Score (% max) | Code |
|---|---|---|---|
Dev Instructions
pyenv virtualenv craftax_env
poetry install
When in doubt
from jax import debug
...
debug.breakpoint()
📚 Citation
To learn more about Craftax, check out the paper website here. To cite the underlying Craftax environment, see:
@inproceedings{matthews2024craftax,
author={Michael Matthews and Michael Beukman and Benjamin Ellis and Mikayel Samvelyan and Matthew Jackson and Samuel Coward and Jakob Foerster},
title = {Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning},
booktitle = {International Conference on Machine Learning ({ICML})},
year = {2024}
}
To cite the Crafter benchmark, see:
@article{hafner2021crafter,
title={Benchmarking the Spectrum of Agent Capabilities},
author={Danijar Hafner},
year={2021},
journal={arXiv preprint arXiv:2109.06780},
}
Contributing
Setup
uv venv craftaxlm-dev
source craftaxlm-dev/bin/activate
uv sync
uv run ruff format .
Help Wanted
- General code quality suggestions or improvements, especially those that improve speed or reduce token usage.
- PRs to fix issues or add affordances that help your LM agent perform well.
- Leaderboard submissions that demonstrate improved performance using algorithms for learning from data.