CheeseBench: A VLM benchmark over 9 rodent behavioral neuroscience paradigms
CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?
A benchmark for evaluating Vision-Language Models (VLMs) on 9 classical behavioral neuroscience paradigms, each grounded in published rodent protocols with quantitative animal baselines.
Key Design Principles
- Unified Protocol: Identical system prompt for ALL tasks — no task-specific hints
- Published Baselines: Every environment maps to a real rodent experiment with peer-reviewed success rates
- Cognitive Taxonomy: 6 cognitive dimensions (spatial learning, navigation, working memory, instrumental conditioning, avoidance learning, associative learning) mapped to neural circuits
- Multi-Action: VLM outputs up to 8 actions per call with explicit learnings/working memory
Quick Start
```bash
# Install
pip install -r requirements.txt

# Run benchmark (requires an LLM API endpoint)
python benchmark.py --model gpt-oss:120b --num-trials 20

# Quick test (2 trials)
python benchmark.py --num-trials 2

# Custom API endpoint
python benchmark.py --api-url http://localhost:11434/api/chat

# Analyze results
python analysis.py results/benchmark_results.json
```
Project Structure
```
cheesebench/
├── benchmark.py            # Main benchmark runner (CLI)
├── config.py               # Centralized configuration
├── analysis.py             # Cognitive profiling & analysis pipeline
├── task_definitions.json   # Task specs with paper citations & animal baselines
├── visualize.py            # Publication-quality figures
├── environments/           # 9 behavioral paradigms
│   ├── base_env.py         # Shared engine (rendering, sessions, actions)
│   ├── morris_water_maze.py
│   ├── t_maze.py
│   ├── barnes_maze.py
│   ├── radial_arm_maze.py
│   ├── operant_chamber.py
│   ├── shuttle_box.py
│   ├── place_preference.py
│   ├── star_maze.py
│   └── dnms_task.py
└── README.md
```
Environments & Cognitive Taxonomy
| Environment | Cognitive Dimension | Animal Baseline | Citation |
|---|---|---|---|
| Morris Water Maze | Allocentric Spatial Learning | 85% (session 5) | PMC2895266 — Vorhees & Williams 2006 |
| Barnes Maze | Allocentric Spatial Learning | 80% (session 5) | PMC6126525 — Vale et al. 2018 |
| T-Maze | Egocentric Nav + Working Memory | 80% (session 4) | PMC3399492 — Shoji et al. 2012 |
| Star Maze | Allocentric + Egocentric | 80% (session 10) | PMC4112136 — Rondi-Reig et al. 2006 |
| Radial Arm Maze | Working Memory | 70% (session 6) | PMC4030456 — Penley et al. 2013 |
| Operant Chamber | Instrumental Conditioning | 90% (session 5) | PMC4598097 — Martin & Iceberg 2015 |
| Shuttle Box | Avoidance Learning | 70% (session 10) | PMC4692667 — Happel et al. 2015 |
| Place Preference | Associative Learning | 75% (session 6) | PMC6101638 — Blanco-Gandía et al. 2018 |
| DNMS Task | Working Memory | 80% (session 3) | PMC3982138 — Oomen et al. 2013 |
View Modes
| Mode | Description | Information Content |
|---|---|---|
| `ASCII_2D` | Top-down bird's-eye map | Full spatial layout |
| `ASCII_2D_FPV` | Rotated first-person 2D | Egocentric partial view |
| `ASCII_3D` | Pseudo-3D ASCII perspective | Depth cues, limited FOV |
System Prompt (Unified — Identical for ALL Tasks)
The VLM receives no task-specific instructions. It must discover the goal from observation and reward feedback alone:
```
You are an embodied agent placed in a behavioral experiment.
Your only goal is to maximize cumulative reward.

PERCEPTION:
- ASCII rendering (top-down, FPV, or pseudo-3D)
- Position/orientation shown by arrow: ↑ ↗ → ↘ ↓ ↙ ← ↖
- Walls (#, █) block movement. Open spaces are traversable.

ACTIONS (egocentric):
- FORWARD, ROTATE_LEFT, ROTATE_RIGHT, STAY

RESPONSE FORMAT:
LEARNINGS: <working memory — position, strategy, hypotheses>
ACTIONS: <1-8 comma-separated actions>
```
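A reply in this format could be parsed roughly as follows. This is a minimal sketch, not the parser shipped in `benchmark.py`: the function name `parse_response` is hypothetical, and the choices to drop unrecognized action names and to truncate after 8 actions are assumptions about how malformed replies might be handled.

```python
import re

# Action vocabulary and per-call limit taken from the system prompt.
VALID_ACTIONS = {"FORWARD", "ROTATE_LEFT", "ROTATE_RIGHT", "STAY"}
MAX_ACTIONS = 8


def parse_response(text: str) -> tuple[str, list[str]]:
    """Split a model reply into (learnings, actions).

    Hypothetical helper: unknown action names are skipped and the
    list is cut to MAX_ACTIONS (both are assumptions).
    """
    learnings = ""
    actions: list[str] = []

    # Everything between "LEARNINGS:" and "ACTIONS:" (or end of text).
    match = re.search(r"LEARNINGS:(.*?)(?=ACTIONS:|\Z)", text, flags=re.DOTALL)
    if match:
        learnings = match.group(1).strip()

    # The rest of the line that starts the ACTIONS field.
    match = re.search(r"ACTIONS:(.*)", text)
    if match:
        for token in match.group(1).split(","):
            name = token.strip().upper()
            if name in VALID_ACTIONS:
                actions.append(name)

    return learnings, actions[:MAX_ACTIONS]


reply = "LEARNINGS: wall ahead, try turning\nACTIONS: ROTATE_LEFT, forward, JUMP, FORWARD"
learnings, actions = parse_response(reply)
```

Returning an empty action list for an unparseable reply leaves it to the caller to decide whether that counts as `STAY`, a retry, or a failed step.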
Analysis Pipeline
The analysis module computes:
- Cognitive profiles — radar chart scores across 6 dimensions
- Learning curves — rolling-window and block-based success rates
- Strategy metrics — action entropy, forward ratio, rotation ratio, repetition rate
- Wilson score CIs — 95% confidence intervals on all success rates
- Animal comparison — VLM profiles overlaid with rodent baselines
```bash
python analysis.py results/benchmark_results.json
# Outputs: results/benchmark_results_analysis.json
```
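Two of the metrics listed above can be illustrated with a short sketch. The function names `wilson_interval` and `action_entropy` are hypothetical and are not taken from `analysis.py`. The Wilson formula is the standard one; measuring action entropy as Shannon entropy in bits is an assumption about how the strategy metric is defined.

```python
import math
from collections import Counter


def wilson_interval(successes: int, trials: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def action_entropy(actions: list[str]) -> float:
    """Shannon entropy (bits) of the empirical action distribution."""
    if not actions:
        return 0.0
    total = len(actions)
    counts = Counter(actions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Example: 17 successes out of 20 trials.
low, high = wilson_interval(17, 20)
print(f"success rate 0.85, 95% CI [{low:.3f}, {high:.3f}]")
print(action_entropy(["FORWARD", "ROTATE_LEFT", "ROTATE_RIGHT", "STAY"]))
```

The Wilson interval is a common choice at small trial counts such as the default of 20 per environment, because it stays inside [0, 1] and behaves sensibly when the observed rate is 0% or 100%.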
Configuration
All parameters are defined in `config.py` and can be overridden with CLI flags or environment variables:
```bash
export CHEESEBENCH_MODEL=gpt-oss:120b
export CHEESEBENCH_API_URL=http://localhost:11434/api/chat
export CHEESEBENCH_TIMEOUT=120
```
| Parameter | Default | Description |
|---|---|---|
| `--model` | `gpt-oss:120b` | VLM model name |
| `--num-trials` | 20 | Trials per environment |
| `--max-steps` | 200 | Max steps per trial |
| `--seed` | 42 | Random seed |
| `--output-dir` | `results/` | Output directory |
| `--quiet` | false | Suppress verbose output |
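One way such layered configuration can work is sketched below. This is an illustration, not the contents of `config.py`: the function `resolve_settings` is hypothetical, and the precedence (CLI flag over environment variable over built-in default) is an assumption. The fallback values for the API URL and the timeout are likewise assumed, copied from the examples above.

```python
import argparse
import os
from collections.abc import Mapping, Sequence


def resolve_settings(argv: Sequence[str], env: Mapping[str, str]) -> dict:
    """Merge defaults, environment variables, and CLI flags (assumed precedence)."""
    parser = argparse.ArgumentParser(description="CheeseBench settings sketch")
    # Environment variables, when set, replace the built-in defaults.
    parser.add_argument("--model", default=env.get("CHEESEBENCH_MODEL", "gpt-oss:120b"))
    parser.add_argument(
        "--api-url",
        default=env.get("CHEESEBENCH_API_URL", "http://localhost:11434/api/chat"),
    )
    parser.add_argument("--num-trials", type=int, default=20)
    # Explicit CLI flags win over both of the above.
    settings = vars(parser.parse_args(list(argv)))
    # No timeout flag is documented, so it is read from the environment only.
    settings["timeout"] = int(env.get("CHEESEBENCH_TIMEOUT", "120"))
    return settings


# Typical call: resolve_settings(sys.argv[1:], os.environ)
print(resolve_settings(["--num-trials", "2"], os.environ))
```

Passing `argv` and `env` explicitly, rather than reading `sys.argv` and `os.environ` inside the function, keeps the resolution logic easy to test.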
Citation
If you use CheeseBench in your research, please cite:
```bibtex
@inproceedings{cheesebench2025,
  title={CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?},
  author={},
  booktitle={NeurIPS Datasets and Benchmarks Track},
  year={2025}
}
```
License
MIT