CheeseBench: An LLM benchmark over 9 rodent behavioral neuroscience paradigms

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

CheeseBench evaluates Large Language Models (LLMs), and Vision-Language Models (VLMs) when run with image observations, on 9 classical behavioral neuroscience paradigms. Each paradigm is grounded in a published rodent protocol with a quantitative animal baseline. The default protocol is text-only (ASCII renderings of the environment), so any chat-completion model can be evaluated; vision is supported as an optional input mode.

Key Design Principles

Unified Protocol: Identical system prompt for ALL tasks — no task-specific hints
Published Baselines: Every environment maps to a real rodent experiment with peer-reviewed success rates
Cognitive Taxonomy: 6 cognitive dimensions (spatial learning, navigation, working memory, instrumental conditioning, avoidance learning, associative learning) mapped to neural circuits
Multi-Action: The model outputs up to 8 actions per call with explicit learnings/working memory

Quick Start

# Install from PyPI
pip install cheesebench

# Run against any OpenAI-compatible endpoint
cheesebench --model gpt-oss:120b \
    --api-url http://localhost:11434/v1/chat/completions \
    --api-format openai \
    --num-trials 20

# Quick smoke test (1 trial × 1 view mode)
cheesebench --num-trials 1 --view-modes ASCII_2D

# See all options
cheesebench --help

📊 Live leaderboard: https://huggingface.co/spaces/zachz/cheesebench-leaderboard

Development install

git clone https://github.com/stef41/CheeseBench
cd CheeseBench
pip install -e .

# Re-run analysis on your results
python analysis.py results/benchmark_results.json

Project Structure

cheesebench/
├── benchmark.py           # Main benchmark runner (CLI)
├── config.py              # Centralized configuration
├── analysis.py            # Cognitive profiling & analysis pipeline
├── task_definitions.json  # Task specs with paper citations & animal baselines
├── visualize.py           # Publication-quality figures
├── environments/          # 9 behavioral paradigms
│   ├── base_env.py        # Shared engine (rendering, sessions, actions)
│   ├── morris_water_maze.py
│   ├── t_maze.py
│   ├── barnes_maze.py
│   ├── radial_arm_maze.py
│   ├── operant_chamber.py
│   ├── shuttle_box.py
│   ├── place_preference.py
│   ├── star_maze.py
│   └── dnms_task.py
└── README.md

Environments & Cognitive Taxonomy

Environment	Cognitive Dimension	Animal Baseline	Citation
Morris Water Maze	Allocentric Spatial Learning	85% (session 5)	PMC2895266 — Vorhees & Williams 2006
Barnes Maze	Allocentric Spatial Learning	80% (session 5)	PMC6126525 — Vale et al. 2018
T-Maze	Egocentric Nav + Working Memory	80% (session 4)	PMC3399492 — Shoji et al. 2012
Star Maze	Allocentric + Egocentric	80% (session 10)	PMC4112136 — Rondi-Reig et al. 2006
Radial Arm Maze	Working Memory	70% (session 6)	PMC4030456 — Penley et al. 2013
Operant Chamber	Instrumental Conditioning	90% (session 5)	PMC4598097 — Martin & Iceberg 2015
Shuttle Box	Avoidance Learning	70% (session 10)	PMC4692667 — Happel et al. 2015
Place Preference	Associative Learning	75% (session 6)	PMC6101638 — Blanco-Gandía et al. 2018
DNMS Task	Working Memory	80% (session 3)	PMC3982138 — Oomen et al. 2013
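
The animal baselines in the table can serve as reference points when reading a model's results. The following is a minimal sketch, not part of the package: `baseline_gaps` is a hypothetical helper, the dictionary keys are the display names from the table (the identifiers used in task_definitions.json may differ), and the model success rates in the example are made-up placeholders rather than measured results.

```python
# Animal baselines copied from the table above, as fractions of successful trials.
ANIMAL_BASELINES = {
    "Morris Water Maze": 0.85,
    "Barnes Maze": 0.80,
    "T-Maze": 0.80,
    "Star Maze": 0.80,
    "Radial Arm Maze": 0.70,
    "Operant Chamber": 0.90,
    "Shuttle Box": 0.70,
    "Place Preference": 0.75,
    "DNMS Task": 0.80,
}


def baseline_gaps(model_rates):
    """Return (model success rate - animal baseline) for each environment in both dicts."""
    return {
        env: round(model_rates[env] - base, 4)
        for env, base in ANIMAL_BASELINES.items()
        if env in model_rates
    }


# Placeholder success rates for illustration only (not real benchmark results).
example_rates = {"T-Maze": 0.55, "Operant Chamber": 0.95}
print(baseline_gaps(example_rates))
```

A negative gap means the model falls short of the rodent baseline for that paradigm. Note that each baseline refers to a specific training session, so the comparison is only as meaningful as the match between trial counts and sessions.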

View Modes

Mode	Description	Information Content
`ASCII_2D`	Top-down bird's-eye map	Full spatial layout
`ASCII_2D_FPV`	Rotated first-person 2D	Egocentric partial view
`ASCII_3D`	Pseudo-3D ASCII perspective	Depth cues, limited FOV

System Prompt (Unified — Identical for ALL Tasks)

The model receives no task-specific instructions. It must discover the goal from observation and reward feedback alone:

You are an embodied agent placed in a behavioral experiment.
Your only goal is to maximize cumulative reward.

PERCEPTION:
- ASCII rendering (top-down, FPV, or pseudo-3D)
- Position/orientation shown by arrow: ↑ ↗ → ↘ ↓ ↙ ← ↖
- Walls (#, █) block movement. Open spaces are traversable.

ACTIONS (egocentric):
- FORWARD, ROTATE_LEFT, ROTATE_RIGHT, STAY

RESPONSE FORMAT:
LEARNINGS: <working memory — position, strategy, hypotheses>
ACTIONS: <1-8 comma-separated actions>
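
The response format above is simple enough to parse with a couple of regular expressions. The sketch below is an illustration only: `parse_response` is a hypothetical helper, and its handling of malformed replies (dropping unknown tokens, truncating to 8 actions, accepting lowercase) is an assumption, not necessarily what the benchmark runner does.

```python
import re

VALID_ACTIONS = {"FORWARD", "ROTATE_LEFT", "ROTATE_RIGHT", "STAY"}
MAX_ACTIONS = 8  # the prompt allows 1-8 actions per call


def parse_response(text):
    """Split a model reply into (learnings, actions).

    Assumed policy: unknown action tokens are dropped and anything past
    MAX_ACTIONS is truncated.
    """
    # LEARNINGS runs until the ACTIONS line (or end of text if there is none).
    learnings_match = re.search(
        r"LEARNINGS:\s*(.*?)(?=\n\s*ACTIONS:|\Z)", text, re.DOTALL
    )
    # ACTIONS is a single comma-separated line.
    actions_match = re.search(r"ACTIONS:\s*(.*)", text)

    learnings = learnings_match.group(1).strip() if learnings_match else ""
    raw = actions_match.group(1) if actions_match else ""
    tokens = [tok.strip().upper() for tok in raw.split(",")]
    actions = [tok for tok in tokens if tok in VALID_ACTIONS][:MAX_ACTIONS]
    return learnings, actions


reply = "LEARNINGS: wall ahead, try turning\nACTIONS: ROTATE_LEFT, FORWARD, forward, JUMP"
print(parse_response(reply))
```

Here the invalid token `JUMP` is discarded and `forward` is normalized to `FORWARD`; a stricter runner could instead reject the whole reply.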

Analysis Pipeline

The analysis module computes:

Cognitive profiles — radar chart scores across 6 dimensions
Learning curves — rolling-window and block-based success rates
Strategy metrics — action entropy, forward ratio, rotation ratio, repetition rate
Wilson score CIs — 95% confidence intervals on all success rates
Animal comparison — model profiles overlaid with rodent baselines
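
To make the strategy metrics concrete, here is one plausible way to compute them from a trial's action sequence. The exact definitions used by analysis.py are not given here, so the formulas below are assumptions: entropy is the Shannon entropy of the empirical action distribution, and repetition rate is the fraction of steps that repeat the previous action. `strategy_metrics` is a hypothetical function name.

```python
import math
from collections import Counter


def strategy_metrics(actions):
    """Summarize an action sequence (assumed definitions, see lead-in)."""
    n = len(actions)
    if n == 0:
        return {
            "entropy_bits": 0.0,
            "forward_ratio": 0.0,
            "rotation_ratio": 0.0,
            "repetition_rate": 0.0,
        }
    counts = Counter(actions)
    # Shannon entropy of the empirical action distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    rotations = counts["ROTATE_LEFT"] + counts["ROTATE_RIGHT"]
    # Number of steps that repeat the immediately preceding action.
    repeats = sum(1 for prev, cur in zip(actions, actions[1:]) if prev == cur)
    return {
        "entropy_bits": entropy,
        "forward_ratio": counts["FORWARD"] / n,
        "rotation_ratio": rotations / n,
        "repetition_rate": repeats / (n - 1) if n > 1 else 0.0,
    }


print(strategy_metrics(["FORWARD", "FORWARD", "ROTATE_LEFT", "FORWARD"]))
```

With four available actions, entropy ranges from 0 bits (a single action repeated) to 2 bits (all four used equally often).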

python analysis.py results/benchmark_results.json
# Outputs: results/benchmark_results_analysis.json
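
For reference, the Wilson score interval mentioned above can be computed in a few lines. This is a standalone sketch of the standard formula, not the code from analysis.py; the function name `wilson_interval` and the example counts are illustrative.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Example counts: 16 successes out of 20 trials (20 is the default --num-trials).
low, high = wilson_interval(16, 20)
print(f"16/20 successes: 95% CI = [{low:.3f}, {high:.3f}]")
```

At 20 trials per environment the interval is wide, so small differences between models or against an animal baseline should be read with caution.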

Configuration

All parameters are defined in config.py and can be overridden via CLI flags or environment variables:

export CHEESEBENCH_MODEL=gpt-oss:120b
export CHEESEBENCH_API_URL=http://localhost:11434/api/chat
export CHEESEBENCH_TIMEOUT=120

Parameter	Default	Description
`--model`	`gpt-oss:120b`	Model name (LLM or VLM)
`--num-trials`	20	Trials per environment
`--max-steps`	200	Max steps per trial
`--seed`	42	Random seed
`--output-dir`	`results/`	Output directory
`--quiet`	false	Suppress verbose output
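
The layering of defaults, environment variables, and CLI flags can be sketched with `argparse`. This is an illustration under stated assumptions, not the contents of config.py: `resolve_config` is a hypothetical function, the precedence order (CLI flag over environment variable over built-in default) is assumed, and only the flags from the table above plus the `CHEESEBENCH_MODEL` variable are modeled.

```python
import argparse
import os


def resolve_config(argv=None, env=None):
    """Hypothetical resolver: CLI flag > environment variable > built-in default."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="cheesebench-sketch")
    # The environment variable, if set, replaces the built-in default;
    # an explicit --model flag still wins over both.
    parser.add_argument("--model", default=env.get("CHEESEBENCH_MODEL", "gpt-oss:120b"))
    parser.add_argument("--num-trials", type=int, default=20)
    parser.add_argument("--max-steps", type=int, default=200)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output-dir", default="results/")
    parser.add_argument("--quiet", action="store_true")
    return parser.parse_args(argv)


# "my-model" is a placeholder model name for illustration.
cfg = resolve_config(["--num-trials", "5"], env={"CHEESEBENCH_MODEL": "my-model"})
print(cfg.model, cfg.num_trials, cfg.seed)
```

Passing `argv` and `env` explicitly keeps the resolver easy to test without touching the real process environment.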

Citation

If you use CheeseBench in your research, please cite:

@inproceedings{cheesebench2025,
  title={CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms},
  author={},
  booktitle={NeurIPS Datasets and Benchmarks Track},
  year={2025}
}

License

MIT

Release history

0.2.0 (May 15, 2026)
0.1.2 (May 15, 2026, this version)
0.1.1 (May 14, 2026)
0.1.0 (May 14, 2026)

Download files

Source Distribution

cheesebench-0.1.2.tar.gz (114.7 kB)

Uploaded May 15, 2026 Source

Built Distribution

cheesebench-0.1.2-py3-none-any.whl (124.2 kB)

Uploaded May 15, 2026 Python 3

File details

Details for the file cheesebench-0.1.2.tar.gz.

File metadata

Download URL: cheesebench-0.1.2.tar.gz
Upload date: May 15, 2026
Size: 114.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for cheesebench-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`1e2709c08ecae46a15066ceb522365f9c64785cf87d8aa7875d29537bb0473b2`
MD5	`c9f28f08ad5830e56bf876eae846498c`
BLAKE2b-256	`5f3dcfdf741eee303ba520467bec2e0bd98b6988494566318b0c6b0d6eb55b8e`

File details

Details for the file cheesebench-0.1.2-py3-none-any.whl.

File metadata

Download URL: cheesebench-0.1.2-py3-none-any.whl
Upload date: May 15, 2026
Size: 124.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for cheesebench-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c3b81f6732fd3320a7ce3f0aad74c620d482a0ec3b70f3f9bbd29caf075abef8`
MD5	`a276071cd1018d7c9d87f99a25a1ac29`
BLAKE2b-256	`4440899aa9f1986f63f90b73693bae749e6013ca1192f0b1158842a9c2f58533`

cheesebench 0.1.2
