CheeseBench: An LLM benchmark over 9 rodent behavioral neuroscience paradigms
CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
A benchmark for evaluating Large Language Models (LLMs) — and Vision-Language Models (VLMs) when run with image observations — on 9 classical behavioral neuroscience paradigms, each grounded in published rodent protocols with quantitative animal baselines. The default protocol is text-only (ASCII renderings of the environment), so any chat-completion model can be evaluated; vision is supported as an optional input mode.
Key Design Principles
- Unified Protocol: Identical system prompt for ALL tasks — no task-specific hints
- Published Baselines: Every environment maps to a real rodent experiment with peer-reviewed success rates
- Cognitive Taxonomy: 6 cognitive dimensions (spatial learning, navigation, working memory, instrumental conditioning, avoidance learning, associative learning) mapped to neural circuits
- Multi-Action: The model outputs 1 to 8 actions per call, together with an explicit LEARNINGS field that serves as its working memory
Quick Start
# Install from PyPI
pip install cheesebench
# Run against any OpenAI-compatible endpoint
cheesebench --model gpt-oss:120b \
--api-url http://localhost:11434/v1/chat/completions \
--api-format openai \
--num-trials 20
# Quick smoke test (1 trial × 1 view mode)
cheesebench --num-trials 1 --view-modes ASCII_2D
# See all options
cheesebench --help
📊 Live leaderboard: https://huggingface.co/spaces/zachz/cheesebench-leaderboard
Development install
git clone https://github.com/stef41/CheeseBench
cd CheeseBench
pip install -e .
# Re-run analysis on your results
python analysis.py results/benchmark_results.json
Project Structure
cheesebench/
├── benchmark.py # Main benchmark runner (CLI)
├── config.py # Centralized configuration
├── analysis.py # Cognitive profiling & analysis pipeline
├── task_definitions.json # Task specs with paper citations & animal baselines
├── visualize.py # Publication-quality figures
├── environments/ # 9 behavioral paradigms
│ ├── base_env.py # Shared engine (rendering, sessions, actions)
│ ├── morris_water_maze.py
│ ├── t_maze.py
│ ├── barnes_maze.py
│ ├── radial_arm_maze.py
│ ├── operant_chamber.py
│ ├── shuttle_box.py
│ ├── place_preference.py
│ ├── star_maze.py
│ └── dnms_task.py
└── README.md
Environments & Cognitive Taxonomy
| Environment | Cognitive Dimension | Animal Baseline (success rate) | Citation |
|---|---|---|---|
| Morris Water Maze | Allocentric Spatial Learning | 85% (session 5) | PMC2895266 — Vorhees & Williams 2006 |
| Barnes Maze | Allocentric Spatial Learning | 80% (session 5) | PMC6126525 — Vale et al. 2018 |
| T-Maze | Egocentric Nav + Working Memory | 80% (session 4) | PMC3399492 — Shoji et al. 2012 |
| Star Maze | Allocentric + Egocentric | 80% (session 10) | PMC4112136 — Rondi-Reig et al. 2006 |
| Radial Arm Maze | Working Memory | 70% (session 6) | PMC4030456 — Penley et al. 2013 |
| Operant Chamber | Instrumental Conditioning | 90% (session 5) | PMC4598097 — Martin & Iceberg 2015 |
| Shuttle Box | Avoidance Learning | 70% (session 10) | PMC4692667 — Happel et al. 2015 |
| Place Preference | Associative Learning | 75% (session 6) | PMC6101638 — Blanco-Gandía et al. 2018 |
| DNMS Task | Working Memory | 80% (session 3) | PMC3982138 — Oomen et al. 2013 |
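As a rough illustration of how the baselines in this table can be used, the sketch below computes the gap between a model's success rate and the rodent baseline per environment. The baseline values are copied from the table; the names `ANIMAL_BASELINES` and `baseline_gaps` are hypothetical and not part of the CheeseBench API, and the model rates in the usage example are placeholders, not measured results.

```python
# Rodent success rates copied from the table above (as fractions).
ANIMAL_BASELINES = {
    "Morris Water Maze": 0.85,
    "Barnes Maze": 0.80,
    "T-Maze": 0.80,
    "Star Maze": 0.80,
    "Radial Arm Maze": 0.70,
    "Operant Chamber": 0.90,
    "Shuttle Box": 0.70,
    "Place Preference": 0.75,
    "DNMS Task": 0.80,
}


def baseline_gaps(model_success):
    """Return model minus rodent success rate for each environment given.

    Negative values mean the model underperforms the animal baseline.
    """
    return {
        env: round(rate - ANIMAL_BASELINES[env], 3)
        for env, rate in model_success.items()
    }


# Placeholder model rates, for illustration only.
print(baseline_gaps({"T-Maze": 0.5, "Operant Chamber": 0.95}))
```

A signed difference keeps the direction of the gap visible. A ratio would be an equally reasonable choice.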
View Modes
| Mode | Description | Information Content |
|---|---|---|
| `ASCII_2D` | Top-down bird's-eye map | Full spatial layout |
| `ASCII_2D_FPV` | Rotated first-person 2D | Egocentric partial view |
| `ASCII_3D` | Pseudo-3D ASCII perspective | Depth cues, limited FOV |
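To make the difference between a fixed top-down map and a rotated first-person map concrete, here is a minimal sketch. It is not CheeseBench's renderer: the function names, the grid encoding, and the restriction to four headings are assumptions made for illustration, and it omits the cropping that a partial view would need.

```python
# Hypothetical sketch: none of these names come from CheeseBench itself.
ARROWS = {"N": "↑", "E": "→", "S": "↓", "W": "←"}
# Number of 90-degree counter-clockwise turns that make each heading point up.
CCW_TURNS = {"N": 0, "E": 1, "S": 2, "W": 3}


def rotate_ccw(rows):
    """Rotate a rectangular list of strings 90 degrees counter-clockwise."""
    return ["".join(col) for col in zip(*rows)][::-1]


def render_top_down(grid, pos, heading):
    """Allocentric view: fixed map orientation, agent drawn as a heading arrow."""
    r, c = pos
    rows = [list(row) for row in grid]
    rows[r][c] = ARROWS[heading]
    return ["".join(row) for row in rows]


def render_rotated(grid, pos, heading):
    """Egocentric view: rotate the map so the agent always faces up."""
    r, c = pos
    rows = [list(row) for row in grid]
    rows[r][c] = "@"  # temporary marker that survives rotation
    view = ["".join(row) for row in rows]
    for _ in range(CCW_TURNS[heading]):
        view = rotate_ccw(view)
    return [row.replace("@", ARROWS["N"]) for row in view]


grid = ["####", "#..#", "####"]
print("\n".join(render_top_down(grid, (1, 1), "E")))
print()
print("\n".join(render_rotated(grid, (1, 1), "E")))
```

In the rotated view the open cell that lay east of the agent appears directly above it, which is what makes the view egocentric.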
System Prompt (Unified — Identical for ALL Tasks)
The model receives no task-specific instructions. It must discover the goal from observation and reward feedback alone:
You are an embodied agent placed in a behavioral experiment.
Your only goal is to maximize cumulative reward.
PERCEPTION:
- ASCII rendering (top-down, FPV, or pseudo-3D)
- Position/orientation shown by arrow: ↑ ↗ → ↘ ↓ ↙ ← ↖
- Walls (#, █) block movement. Open spaces are traversable.
ACTIONS (egocentric):
- FORWARD, ROTATE_LEFT, ROTATE_RIGHT, STAY
RESPONSE FORMAT:
LEARNINGS: <working memory — position, strategy, hypotheses>
ACTIONS: <1-8 comma-separated actions>
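A reply in this format can be parsed with a couple of regular expressions. The sketch below is an assumption about how such parsing might look, not the benchmark's actual parser: `parse_response` is a hypothetical name, and dropping unknown actions and truncating to 8 are illustrative policy choices.

```python
import re

# Action names and the 8-action cap are taken from the system prompt above.
VALID_ACTIONS = {"FORWARD", "ROTATE_LEFT", "ROTATE_RIGHT", "STAY"}
MAX_ACTIONS = 8


def parse_response(text):
    """Split a model reply into (learnings, actions).

    Assumed policy: unknown actions are dropped, and anything beyond
    MAX_ACTIONS valid actions is ignored.
    """
    learnings_match = re.search(
        r"LEARNINGS:\s*(.*?)\s*(?=ACTIONS:|\Z)", text, re.S
    )
    actions_match = re.search(r"ACTIONS:\s*(.*)", text)

    learnings = learnings_match.group(1) if learnings_match else ""
    raw = actions_match.group(1).split(",") if actions_match else []
    actions = [a.strip().upper() for a in raw]
    actions = [a for a in actions if a in VALID_ACTIONS][:MAX_ACTIONS]
    return learnings, actions


reply = "LEARNINGS: wall ahead, try turning\nACTIONS: rotate_left, FORWARD, jump"
print(parse_response(reply))
```

Normalizing case and filtering against a fixed action set keeps a malformed reply from crashing a trial. Whether such a reply should instead count as a failed step is a design decision this sketch leaves open.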
Analysis Pipeline
The analysis module computes:
- Cognitive profiles — radar chart scores across 6 dimensions
- Learning curves — rolling-window and block-based success rates
- Strategy metrics — action entropy, forward ratio, rotation ratio, repetition rate
- Wilson score CIs — 95% confidence intervals on all success rates
- Animal comparison — model profiles overlaid with rodent baselines
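The strategy metrics in the list above could be computed along the following lines. The exact definitions used by `analysis.py` are not stated here, so this sketch assumes Shannon entropy in bits over the action distribution and defines repetition rate as the fraction of consecutive action pairs that are identical. `strategy_metrics` is a hypothetical name.

```python
import math
from collections import Counter


def strategy_metrics(actions):
    """Summarize one trial's action sequence (assumed metric definitions)."""
    n = len(actions)
    if n == 0:
        return {
            "entropy_bits": 0.0,
            "forward_ratio": 0.0,
            "rotation_ratio": 0.0,
            "repetition_rate": 0.0,
        }
    counts = Counter(actions)
    # Shannon entropy of the empirical action distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    rotations = counts["ROTATE_LEFT"] + counts["ROTATE_RIGHT"]
    # Count adjacent pairs where the same action is issued twice in a row.
    repeats = sum(a == b for a, b in zip(actions, actions[1:]))
    return {
        "entropy_bits": entropy,
        "forward_ratio": counts["FORWARD"] / n,
        "rotation_ratio": rotations / n,
        "repetition_rate": repeats / (n - 1) if n > 1 else 0.0,
    }


print(strategy_metrics(["FORWARD", "FORWARD", "ROTATE_LEFT", "ROTATE_RIGHT"]))
```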
python analysis.py results/benchmark_results.json
# Outputs: results/benchmark_results_analysis.json
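For readers unfamiliar with the Wilson score interval mentioned above, here is a minimal standalone sketch of the standard formula. The function name `wilson_interval` is hypothetical and the code is not taken from `analysis.py`.

```python
import math


def wilson_interval(successes, trials, z=1.959964):
    """Wilson score interval for a binomial proportion.

    z=1.959964 corresponds to a two-sided 95% interval.
    """
    if trials == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# 16 successes out of the default 20 trials.
low, high = wilson_interval(16, 20)
print(f"{low:.3f} to {high:.3f}")
```

With 16 successes in 20 trials the interval runs from roughly 0.58 to 0.92, which shows how wide the uncertainty remains at the default trial count. Unlike the normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when the observed rate is 0% or 100%.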
Configuration
All parameters are in config.py and overridable via CLI or environment variables:
export CHEESEBENCH_MODEL=gpt-oss:120b
export CHEESEBENCH_API_URL=http://localhost:11434/api/chat
export CHEESEBENCH_TIMEOUT=120
| Parameter | Default | Description |
|---|---|---|
| `--model` | `gpt-oss:120b` | Model name (LLM or VLM) |
| `--num-trials` | 20 | Trials per environment |
| `--max-steps` | 200 | Max steps per trial |
| `--seed` | 42 | Random seed |
| `--output-dir` | `results/` | Output directory |
| `--quiet` | false | Suppress verbose output |
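One common way to layer CLI flags over environment variables over built-in defaults is to feed the environment value into the `argparse` default. The sketch below assumes that precedence (CLI, then environment, then default); the actual resolution logic in `config.py` may differ, and `build_parser` is a hypothetical name. The flags and variable names are the ones shown in this README.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose defaults come from environment variables.

    Assumed precedence: CLI flag > environment variable > built-in default.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="cheesebench-sketch")
    parser.add_argument(
        "--model", default=env.get("CHEESEBENCH_MODEL", "gpt-oss:120b")
    )
    # No built-in default URL is documented, so fall back to None.
    parser.add_argument("--api-url", default=env.get("CHEESEBENCH_API_URL"))
    parser.add_argument("--num-trials", type=int, default=20)
    return parser


# The environment value is used when the flag is absent...
args = build_parser(env={"CHEESEBENCH_MODEL": "my-model"}).parse_args([])
print(args.model)
# ...and an explicit flag overrides it.
args = build_parser(env={"CHEESEBENCH_MODEL": "my-model"}).parse_args(
    ["--model", "other-model"]
)
print(args.model)
```

Passing `env` as a parameter keeps the sketch testable without mutating the real process environment.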
Citation
If you use CheeseBench in your research, please cite:
@inproceedings{cheesebench2025,
title={CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms},
author={},
booktitle={NeurIPS Datasets and Benchmarks Track},
year={2025}
}
License
MIT