
CheeseBench: An LLM benchmark over 9 rodent behavioral neuroscience paradigms


CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

A benchmark for evaluating Large Language Models (LLMs) — and Vision-Language Models (VLMs) when run with image observations — on 9 classical behavioral neuroscience paradigms, each grounded in published rodent protocols with quantitative animal baselines. The default protocol is text-only (ASCII renderings of the environment), so any chat-completion model can be evaluated; vision is supported as an optional input mode.

Key Design Principles

  1. Unified Protocol: Identical system prompt for ALL tasks — no task-specific hints
  2. Published Baselines: Every environment maps to a real rodent experiment with peer-reviewed success rates
  3. Cognitive Taxonomy: 6 cognitive dimensions (spatial learning, navigation, working memory, instrumental conditioning, avoidance learning, associative learning) mapped to neural circuits
  4. Multi-Action: The model outputs 1 to 8 actions per call, together with an explicit LEARNINGS field that serves as its working memory

Quick Start

# Install from PyPI
pip install cheesebench

# Run against any OpenAI-compatible endpoint
cheesebench --model gpt-oss:120b \
    --api-url http://localhost:11434/v1/chat/completions \
    --api-format openai \
    --num-trials 20

# Quick smoke test (1 trial × 1 view mode)
cheesebench --num-trials 1 --view-modes ASCII_2D

# See all options
cheesebench --help

📊 Live leaderboard: https://huggingface.co/spaces/zachz/cheesebench-leaderboard

Development install

git clone https://github.com/stef41/CheeseBench
cd CheeseBench
pip install -e .

# Re-run analysis on your results
python analysis.py results/benchmark_results.json

Project Structure

cheesebench/
├── benchmark.py           # Main benchmark runner (CLI)
├── config.py              # Centralized configuration
├── analysis.py            # Cognitive profiling & analysis pipeline
├── task_definitions.json  # Task specs with paper citations & animal baselines
├── visualize.py           # Publication-quality figures
├── environments/          # 9 behavioral paradigms
│   ├── base_env.py        # Shared engine (rendering, sessions, actions)
│   ├── morris_water_maze.py
│   ├── t_maze.py
│   ├── barnes_maze.py
│   ├── radial_arm_maze.py
│   ├── operant_chamber.py
│   ├── shuttle_box.py
│   ├── place_preference.py
│   ├── star_maze.py
│   └── dnms_task.py
└── README.md

Environments & Cognitive Taxonomy

| Environment | Cognitive Dimension | Animal Baseline | Citation |
|---|---|---|---|
| Morris Water Maze | Allocentric Spatial Learning | 85% (session 5) | PMC2895266, Vorhees & Williams 2006 |
| Barnes Maze | Allocentric Spatial Learning | 80% (session 5) | PMC6126525, Vale et al. 2018 |
| T-Maze | Egocentric Nav + Working Memory | 80% (session 4) | PMC3399492, Shoji et al. 2012 |
| Star Maze | Allocentric + Egocentric | 80% (session 10) | PMC4112136, Rondi-Reig et al. 2006 |
| Radial Arm Maze | Working Memory | 70% (session 6) | PMC4030456, Penley et al. 2013 |
| Operant Chamber | Instrumental Conditioning | 90% (session 5) | PMC4598097, Martin & Iceberg 2015 |
| Shuttle Box | Avoidance Learning | 70% (session 10) | PMC4692667, Happel et al. 2015 |
| Place Preference | Associative Learning | 75% (session 6) | PMC6101638, Blanco-Gandía et al. 2018 |
| DNMS Task | Working Memory | 80% (session 3) | PMC3982138, Oomen et al. 2013 |
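
One way to read the baselines in the table is to express a model's success rate as a fraction of the rodent baseline for the same paradigm. The sketch below is illustrative only: the percentages in `ANIMAL_BASELINES` are copied from the table, but the function name `relative_to_baseline` and the ratio metric itself are assumptions, not part of CheeseBench's analysis code.

```python
# Animal baseline success rates copied from the table (as fractions).
ANIMAL_BASELINES = {
    "Morris Water Maze": 0.85,
    "Barnes Maze": 0.80,
    "T-Maze": 0.80,
    "Star Maze": 0.80,
    "Radial Arm Maze": 0.70,
    "Operant Chamber": 0.90,
    "Shuttle Box": 0.70,
    "Place Preference": 0.75,
    "DNMS Task": 0.80,
}


def relative_to_baseline(environment: str, model_success_rate: float) -> float:
    """Return model success as a fraction of the rodent baseline (hypothetical metric)."""
    if not 0.0 <= model_success_rate <= 1.0:
        raise ValueError("model_success_rate must be in [0, 1]")
    return model_success_rate / ANIMAL_BASELINES[environment]


if __name__ == "__main__":
    # Made-up model score, used only to show the call.
    print(round(relative_to_baseline("Morris Water Maze", 0.425), 2))
```

A value near 1.0 would mean the model matches the trained animal on that paradigm. Note that the baselines are reported at different sessions, so ratios are not directly comparable across environments.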

View Modes

| Mode | Description | Information Content |
|---|---|---|
| ASCII_2D | Top-down bird's-eye map | Full spatial layout |
| ASCII_2D_FPV | Rotated first-person 2D | Egocentric partial view |
| ASCII_3D | Pseudo-3D ASCII perspective | Depth cues, limited FOV |
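
To make the top-down mode concrete, here is a minimal sketch of a grid world rendered in an ASCII_2D-like style, with the agent drawn as one of the eight arrow glyphs used in the system prompt. It is not the code in base_env.py: the function names `step` and `render_top_down`, the 45-degree rotation step, and the rule that diagonal moves are allowed are all assumptions made for illustration.

```python
# Arrow glyphs from the system prompt, clockwise starting at "up".
ARROWS = ["↑", "↗", "→", "↘", "↓", "↙", "←", "↖"]
# (row, col) offsets matching each arrow; rows grow downward.
OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]


def step(grid, pos, heading, action):
    """Apply one egocentric action and return the new (pos, heading)."""
    if action == "ROTATE_LEFT":
        return pos, (heading - 1) % 8  # assumed 45-degree turn
    if action == "ROTATE_RIGHT":
        return pos, (heading + 1) % 8
    if action == "FORWARD":
        dr, dc = OFFSETS[heading]
        r, c = pos[0] + dr, pos[1] + dc
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[r])
        if inside and grid[r][c] != "#":  # walls block movement
            return (r, c), heading
        return pos, heading
    if action == "STAY":
        return pos, heading
    raise ValueError(f"unknown action: {action}")


def render_top_down(grid, pos, heading):
    """Draw the grid with the agent shown as an arrow."""
    rows = []
    for r, line in enumerate(grid):
        chars = list(line)
        if r == pos[0]:
            chars[pos[1]] = ARROWS[heading]
        rows.append("".join(chars))
    return "\n".join(rows)


if __name__ == "__main__":
    grid = ["#####", "#...#", "#...#", "#####"]
    pos, heading = (2, 1), 0  # start facing up
    for action in ["FORWARD", "ROTATE_RIGHT", "ROTATE_RIGHT", "FORWARD"]:
        pos, heading = step(grid, pos, heading, action)
    print(render_top_down(grid, pos, heading))
```

The first-person and pseudo-3D modes would show the same state from the agent's point of view, which is why they carry less spatial information than the full map.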

System Prompt (Unified — Identical for ALL Tasks)

The model receives no task-specific instructions. It must discover the goal from observation and reward feedback alone:

You are an embodied agent placed in a behavioral experiment.
Your only goal is to maximize cumulative reward.

PERCEPTION:
- ASCII rendering (top-down, FPV, or pseudo-3D)
- Position/orientation shown by arrow: ↑ ↗ → ↘ ↓ ↙ ← ↖
- Walls (#, █) block movement. Open spaces are traversable.

ACTIONS (egocentric):
- FORWARD, ROTATE_LEFT, ROTATE_RIGHT, STAY

RESPONSE FORMAT:
LEARNINGS: <working memory — position, strategy, hypotheses>
ACTIONS: <1-8 comma-separated actions>
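
A reply in the RESPONSE FORMAT above has to be split into its LEARNINGS text and a list of valid actions before the environment can step. The following is a minimal parsing sketch, not CheeseBench's own parser: the function name `parse_response`, the choice to silently drop unknown tokens, and the truncation to 8 actions are assumptions.

```python
import re

VALID_ACTIONS = {"FORWARD", "ROTATE_LEFT", "ROTATE_RIGHT", "STAY"}
MAX_ACTIONS = 8  # the prompt allows 1-8 actions per call


def parse_response(text: str):
    """Extract (learnings, actions) from a model reply in the prompt's format."""
    # LEARNINGS runs until the ACTIONS line or the end of the reply.
    learnings_match = re.search(r"LEARNINGS:\s*(.*?)(?=\n\s*ACTIONS:|\Z)", text, re.S)
    # ACTIONS is read from a single line.
    actions_match = re.search(r"ACTIONS:\s*(.*)", text)
    learnings = learnings_match.group(1).strip() if learnings_match else ""
    actions = []
    if actions_match:
        for token in actions_match.group(1).split(","):
            name = token.strip().upper()
            if name in VALID_ACTIONS:  # unknown tokens are dropped (assumption)
                actions.append(name)
    return learnings, actions[:MAX_ACTIONS]


if __name__ == "__main__":
    reply = "LEARNINGS: wall ahead, try right\nACTIONS: ROTATE_RIGHT, FORWARD, FORWARD"
    print(parse_response(reply))
```

A stricter harness might instead reject the whole reply when any token is invalid; which behavior is appropriate depends on how malformed replies should count toward a trial.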

Analysis Pipeline

The analysis module computes:

  • Cognitive profiles — radar chart scores across 6 dimensions
  • Learning curves — rolling-window and block-based success rates
  • Strategy metrics — action entropy, forward ratio, rotation ratio, repetition rate
  • Wilson score CIs — 95% confidence intervals on all success rates
  • Animal comparison — model profiles overlaid with rodent baselines
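
The strategy metrics in the list can be computed from a trial's action sequence alone. The sketch below shows one plausible set of definitions; the README does not spell them out, so the function name `strategy_metrics` and each formula (Shannon entropy in bits, simple frequency ratios, and repetition as "same as the previous action") are assumptions rather than the definitions used in analysis.py.

```python
import math
from collections import Counter


def strategy_metrics(actions):
    """Summarize one action sequence; definitions here are illustrative assumptions."""
    n = len(actions)
    if n == 0:
        raise ValueError("actions must not be empty")
    counts = Counter(actions)
    # Shannon entropy (bits) of the empirical action distribution.
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    rotations = counts["ROTATE_LEFT"] + counts["ROTATE_RIGHT"]
    # Number of steps that repeat the immediately preceding action.
    repeats = sum(1 for prev, cur in zip(actions, actions[1:]) if prev == cur)
    return {
        "action_entropy": entropy,
        "forward_ratio": counts["FORWARD"] / n,
        "rotation_ratio": rotations / n,
        "repetition_rate": repeats / (n - 1) if n > 1 else 0.0,
    }


if __name__ == "__main__":
    print(strategy_metrics(["FORWARD", "FORWARD", "ROTATE_LEFT", "ROTATE_RIGHT"]))
```

Under these definitions, low entropy combined with a high repetition rate would point to a stereotyped policy, such as spinning in place or walking into a wall.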

Run the pipeline on a results file:

python analysis.py results/benchmark_results.json
# Outputs: results/benchmark_results_analysis.json
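
For reference, the Wilson score interval mentioned in the list can be computed with the standard formula shown below. This is a generic sketch of that formula, not a copy of the code in analysis.py; the function name `wilson_interval` is hypothetical, and `z=1.96` is the usual normal quantile for a 95% interval.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # 17 successes in 20 trials (20 is the default --num-trials).
    low, high = wilson_interval(17, 20)
    print(f"{low:.3f} to {high:.3f}")
```

With only 20 trials per environment the interval is wide, so small differences between models on a single paradigm should be read with caution.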

Configuration

All parameters are defined in config.py and can be overridden with CLI flags or environment variables:

export CHEESEBENCH_MODEL=gpt-oss:120b
export CHEESEBENCH_API_URL=http://localhost:11434/api/chat
export CHEESEBENCH_TIMEOUT=120

| Parameter | Default | Description |
|---|---|---|
| --model | gpt-oss:120b | Model name (LLM or VLM) |
| --num-trials | 20 | Trials per environment |
| --max-steps | 200 | Max steps per trial |
| --seed | 42 | Random seed |
| --output-dir | results/ | Output directory |
| --quiet | false | Suppress verbose output |
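
The layering of defaults, environment variables, and CLI flags can be sketched with `argparse`. This is an assumption about how such overrides are commonly wired, not the contents of config.py: the precedence order (CLI flag, then environment variable, then built-in default), the helper name `build_parser`, and the program name are hypothetical, and only `CHEESEBENCH_MODEL` is wired up here.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose --model default comes from CHEESEBENCH_MODEL when set."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="cheesebench-sketch")
    # Environment variable replaces the built-in default; a CLI flag beats both.
    parser.add_argument("--model", default=env.get("CHEESEBENCH_MODEL", "gpt-oss:120b"))
    parser.add_argument("--num-trials", type=int, default=20)
    parser.add_argument("--max-steps", type=int, default=200)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output-dir", default="results/")
    parser.add_argument("--quiet", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--num-trials", "1"])
    print(args.model, args.num_trials, args.max_steps)
```

Passing the environment as a parameter keeps the sketch easy to test without modifying the real process environment.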

Citation

If you use CheeseBench in your research, please cite:

@inproceedings{cheesebench2025,
  title={CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms},
  author={},
  booktitle={NeurIPS Datasets and Benchmarks Track},
  year={2025}
}

License

MIT
