Craftax LM

A wrapper around the Craftax agent benchmark, for evaluating digital agents over extremely long time horizons.

Craftax-Classic

LM | Algorithm | Score (% max) | Code
claude-3-7-sonnet-latest (default) | ReAct | 18.0 |
claude-3-5-sonnet-20241022 | ReAct | 17.8 |
claude-3-5-sonnet-20240620 | ReAct | 15.7 |
o3-mini | ReAct | 12.6 |
gpt-4o | ReAct | 7.0 |
  • Note: this is a limited evaluation in which each trajectory is terminated after 30 API calls, or roughly 150 in-game steps. Ten trajectories are rolled out, yielding a log-weighted score as per the Crafter paper. Reproducible code is forthcoming.

Usage

First, install the package with pip install craftaxlm. Next, import the agent-computer interface (ACI) of your choice:

from craftaxlm import CraftaxACI, CraftaxClassicACI

This package is early in development, so for implementation examples, please refer to the baseline ReAct implementation.

Leaderboard

To keep experiments affordable to run across a range of LMs, the leaderboard currently evaluates agents as follows:

  1. Five rollouts are sampled from the agent, with a hard cap of 300 actions per rollout.
  2. The agent is evaluated using a modified version of the original Crafter score:
    sum(ln(1 + P(1_achievement_obtained)) for achievement in achievements) / (ln(2) * len(achievements))
    
    where P(1_achievement_obtained) is the probability of the achievement being obtained in a single rollout. The denominator normalizes the score so that obtaining every achievement in every rollout yields 1. The key idea is that incremental progress towards difficult achievements ought to weigh more heavily in the score.
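The scoring rule above can be sketched in Python as follows. This is an illustrative sketch, not the package's own scoring code: the function name leaderboard_score and its input format (one collection of unlocked achievement names per rollout) are assumptions, and P is estimated as the fraction of rollouts in which the achievement was obtained.

```python
import math
from typing import Iterable, Sequence


def leaderboard_score(
    rollouts: Sequence[Iterable[str]],
    achievements: Sequence[str],
) -> float:
    """Log-weighted achievement score, normalized to the range [0, 1].

    Hypothetical helper: `rollouts` holds, for each rollout, the names of
    the achievements unlocked during that rollout.
    """
    if not rollouts or not achievements:
        raise ValueError("need at least one rollout and one achievement")

    unlocked_per_rollout = [set(r) for r in rollouts]
    n_rollouts = len(unlocked_per_rollout)

    total = 0.0
    for achievement in achievements:
        # Empirical probability of obtaining this achievement in one rollout.
        p = sum(achievement in u for u in unlocked_per_rollout) / n_rollouts
        total += math.log1p(p)  # ln(1 + p)

    # ln(2) per achievement is the maximum, reached when p == 1 everywhere.
    return total / (math.log(2) * len(achievements))
```

For example, with two achievements and four rollouts, where one achievement is obtained in every rollout and the other in only one, the score is about 0.66. Multiply by 100 to express it as a percentage of the maximum, as in the leaderboard tables.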

Craftax-Full

LM | Algorithm | Score (% max) | Code

Dev Instructions

pyenv virtualenv craftax_env
poetry install

When in doubt, set a breakpoint with jax.debug:

from jax import debug
...
debug.breakpoint()

📚 Citation

To learn more about Craftax, check out the paper website here. To cite the underlying Craftax environment, see:

@inproceedings{matthews2024craftax,
    author={Michael Matthews and Michael Beukman and Benjamin Ellis and Mikayel Samvelyan and Matthew Jackson and Samuel Coward and Jakob Foerster},
    title = {Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning},
    booktitle = {International Conference on Machine Learning ({ICML})},
    year = {2024}
}

To cite the Crafter benchmark, see:

@article{hafner2021crafter,
  title={Benchmarking the Spectrum of Agent Capabilities},
  author={Danijar Hafner},
  year={2021},
  journal={arXiv preprint arXiv:2109.06780},
}

Contributing

Setup

uv venv craftaxlm-dev
source craftaxlm-dev/bin/activate
uv sync
uv run ruff format .

Help Wanted

  • General code quality suggestions or improvements, especially those that improve speed or reduce token usage.
  • PRs that fix issues or add affordances that help your LM agent perform well.
  • Leaderboard submissions that demonstrate improved performance using algorithms for learning from data.
