CheeseBench: A VLM benchmark over 9 rodent behavioral neuroscience paradigms

CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?

A benchmark for evaluating Vision-Language Models (VLMs) on 9 classical behavioral neuroscience paradigms, each grounded in published rodent protocols with quantitative animal baselines.

Key Design Principles

  1. Unified Protocol: Identical system prompt for ALL tasks — no task-specific hints
  2. Published Baselines: Every environment maps to a real rodent experiment with peer-reviewed success rates
  3. Cognitive Taxonomy: 6 cognitive dimensions (spatial learning, navigation, working memory, instrumental conditioning, avoidance learning, associative learning) mapped to neural circuits
  4. Multi-Action: The VLM outputs 1 to 8 actions per call, plus a LEARNINGS field that serves as its explicit working memory

Quick Start

# Install
pip install -r requirements.txt

# Run benchmark (requires an LLM API endpoint)
python benchmark.py --model gpt-oss:120b --num-trials 20

# Quick test (2 trials)
python benchmark.py --num-trials 2

# Custom API endpoint
python benchmark.py --api-url http://localhost:11434/api/chat

# Analyze results
python analysis.py results/benchmark_results.json

Project Structure

cheesebench/
├── benchmark.py           # Main benchmark runner (CLI)
├── config.py              # Centralized configuration
├── analysis.py            # Cognitive profiling & analysis pipeline
├── task_definitions.json  # Task specs with paper citations & animal baselines
├── visualize.py           # Publication-quality figures
├── environments/          # 9 behavioral paradigms
│   ├── base_env.py        # Shared engine (rendering, sessions, actions)
│   ├── morris_water_maze.py
│   ├── t_maze.py
│   ├── barnes_maze.py
│   ├── radial_arm_maze.py
│   ├── operant_chamber.py
│   ├── shuttle_box.py
│   ├── place_preference.py
│   ├── star_maze.py
│   └── dnms_task.py
└── README.md

Environments & Cognitive Taxonomy

| Environment | Cognitive Dimension | Animal Baseline | Citation |
|---|---|---|---|
| Morris Water Maze | Allocentric Spatial Learning | 85% (session 5) | PMC2895266, Vorhees & Williams 2006 |
| Barnes Maze | Allocentric Spatial Learning | 80% (session 5) | PMC6126525, Vale et al. 2018 |
| T-Maze | Egocentric Nav + Working Memory | 80% (session 4) | PMC3399492, Shoji et al. 2012 |
| Star Maze | Allocentric + Egocentric | 80% (session 10) | PMC4112136, Rondi-Reig et al. 2006 |
| Radial Arm Maze | Working Memory | 70% (session 6) | PMC4030456, Penley et al. 2013 |
| Operant Chamber | Instrumental Conditioning | 90% (session 5) | PMC4598097, Martin & Iceberg 2015 |
| Shuttle Box | Avoidance Learning | 70% (session 10) | PMC4692667, Happel et al. 2015 |
| Place Preference | Associative Learning | 75% (session 6) | PMC6101638, Blanco-Gandía et al. 2018 |
| DNMS Task | Working Memory | 80% (session 3) | PMC3982138, Oomen et al. 2013 |

View Modes

| Mode | Description | Information Content |
|---|---|---|
| ASCII_2D | Top-down bird's-eye map | Full spatial layout |
| ASCII_2D_FPV | Rotated first-person 2D | Egocentric partial view |
| ASCII_3D | Pseudo-3D ASCII perspective | Depth cues, limited FOV |
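
As a rough illustration of how a top-down map relates to a rotated first-person one, the sketch below turns a rectangular ASCII grid so that the agent's heading points up. It is not taken from CheeseBench: the names `rotate_ccw` and `to_egocentric` are hypothetical, only the four cardinal arrows are handled, and the field-of-view cropping implied by "partial view" is not modeled.

```python
# Hypothetical helper names; a sketch, not the benchmark's renderer.
# Counter-clockwise quarter turns needed to make each cardinal heading point up.
QUARTER_TURNS = {"↑": 0, "→": 1, "↓": 2, "←": 3}


def rotate_ccw(grid: list[str]) -> list[str]:
    """Rotate a rectangular grid of equal-length rows by 90 degrees counter-clockwise."""
    return ["".join(column) for column in list(zip(*grid))[::-1]]


def to_egocentric(grid: list[str]) -> list[str]:
    """Return the grid rotated so the agent arrow faces up (cardinal headings only)."""
    heading = next((ch for row in grid for ch in row if ch in QUARTER_TURNS), None)
    if heading is None:
        return list(grid)  # no cardinal arrow found: leave the map unchanged
    for _ in range(QUARTER_TURNS[heading]):
        grid = rotate_ccw(grid)
    # After rotation the agent always faces the top of the view.
    return [row.replace(heading, "↑") for row in grid]


top_down = [
    "####",
    "#→G#",
    "####",
]
for row in to_egocentric(top_down):
    print(row)
```

In this toy map the agent faces east toward `G`, so after one counter-clockwise quarter turn `G` sits directly above the `↑` arrow.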

System Prompt (Unified — Identical for ALL Tasks)

The VLM receives no task-specific instructions. It must discover the goal from observation and reward feedback alone:

You are an embodied agent placed in a behavioral experiment.
Your only goal is to maximize cumulative reward.

PERCEPTION:
- ASCII rendering (top-down, FPV, or pseudo-3D)
- Position/orientation shown by arrow: ↑ ↗ → ↘ ↓ ↙ ← ↖
- Walls (#, █) block movement. Open spaces are traversable.

ACTIONS (egocentric):
- FORWARD, ROTATE_LEFT, ROTATE_RIGHT, STAY

RESPONSE FORMAT:
LEARNINGS: <working memory — position, strategy, hypotheses>
ACTIONS: <1-8 comma-separated actions>
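
A harness has to turn this free-text reply into executable actions. The following is a minimal sketch of such a parser, not the code in benchmark.py: the name `parse_response`, the case-insensitive matching, the silent dropping of unknown actions, and the fallback to STAY are all assumptions made for illustration.

```python
import re

# Action names and the 8-action cap come from the system prompt above.
VALID_ACTIONS = {"FORWARD", "ROTATE_LEFT", "ROTATE_RIGHT", "STAY"}
MAX_ACTIONS = 8


def parse_response(text: str) -> tuple[str, list[str]]:
    """Split a model reply into (learnings, actions). Hypothetical helper."""
    # LEARNINGS runs until a line starting with ACTIONS: or the end of the text.
    learnings_match = re.search(
        r"LEARNINGS:\s*(.*?)(?=^[ \t]*ACTIONS:|\Z)", text, re.S | re.M
    )
    # ACTIONS is a single line of comma-separated tokens.
    actions_match = re.search(r"^[ \t]*ACTIONS:[ \t]*(.*)$", text, re.M)

    learnings = learnings_match.group(1).strip() if learnings_match else ""
    actions: list[str] = []
    if actions_match:
        for token in actions_match.group(1).split(","):
            name = token.strip().upper()
            if name in VALID_ACTIONS:  # assumption: unknown tokens are ignored
                actions.append(name)

    # Assumption: truncate to 8 actions and fall back to STAY if none are valid.
    return learnings, actions[:MAX_ACTIONS] or ["STAY"]


reply = "LEARNINGS: reward seems to be north\nACTIONS: FORWARD, rotate_left, JUMP"
print(parse_response(reply))
```

Falling back to STAY keeps a trial running when the model produces malformed output; a stricter harness could instead count such replies as failures.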

Analysis Pipeline

The analysis module computes:

  • Cognitive profiles — radar chart scores across 6 dimensions
  • Learning curves — rolling-window and block-based success rates
  • Strategy metrics — action entropy, forward ratio, rotation ratio, repetition rate
  • Wilson score CIs — 95% confidence intervals on all success rates
  • Animal comparison — VLM profiles overlaid with rodent baselines
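
The strategy metrics can be defined in more than one way. The sketch below uses one plausible set of definitions: Shannon entropy in bits over the action distribution, the share of FORWARD and rotation actions, and the fraction of steps that repeat the previous action. These definitions and the `strategy_metrics` name are assumptions and may differ from what analysis.py computes.

```python
import math
from collections import Counter


def strategy_metrics(actions: list[str]) -> dict[str, float]:
    """Summarize an action sequence (hypothetical definitions, see lead-in)."""
    n = len(actions)
    if n == 0:
        return {
            "action_entropy": 0.0,
            "forward_ratio": 0.0,
            "rotation_ratio": 0.0,
            "repetition_rate": 0.0,
        }
    counts = Counter(actions)
    # Shannon entropy in bits: 0 for a single repeated action, 2 for uniform use of 4 actions.
    entropy = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    # Fraction of consecutive pairs where the action is unchanged.
    repeats = sum(a == b for a, b in zip(actions, actions[1:]))
    return {
        "action_entropy": entropy,
        "forward_ratio": counts["FORWARD"] / n,
        "rotation_ratio": (counts["ROTATE_LEFT"] + counts["ROTATE_RIGHT"]) / n,
        "repetition_rate": repeats / (n - 1) if n > 1 else 0.0,
    }


print(strategy_metrics(["FORWARD", "FORWARD", "ROTATE_LEFT", "STAY"]))
```

Low entropy combined with a high repetition rate would suggest a stereotyped policy, such as spinning in place or walking into a wall.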

python analysis.py results/benchmark_results.json
# Outputs: results/benchmark_results_analysis.json
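
For reference, the Wilson score interval for a binomial success rate can be computed as below. This is the standard formula written as a standalone sketch; the function name `wilson_interval` and the example counts are illustrative and not taken from analysis.py or from any benchmark run.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: 14 successes in 20 trials, compared with a 70% rodent baseline.
low, high = wilson_interval(14, 20)
baseline = 0.70
print(f"95% CI: [{low:.3f}, {high:.3f}]; baseline inside: {low <= baseline <= high}")
```

With only 20 trials per environment the interval is wide (here approximately 0.48 to 0.85), so a model's rate and the rodent baseline can differ noticeably without the difference being distinguishable.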

Configuration

All parameters are defined in config.py and can be overridden via CLI flags or environment variables:

export CHEESEBENCH_MODEL=gpt-oss:120b
export CHEESEBENCH_API_URL=http://localhost:11434/api/chat
export CHEESEBENCH_TIMEOUT=120

| Parameter | Default | Description |
|---|---|---|
| --model | gpt-oss:120b | VLM model name |
| --num-trials | 20 | Trials per environment |
| --max-steps | 200 | Max steps per trial |
| --seed | 42 | Random seed |
| --output-dir | results/ | Output directory |
| --quiet | false | Suppress verbose output |
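
One common way to layer these sources is to let environment variables supply the argparse defaults, so that an explicit CLI flag always wins. The sketch below assumes that precedence (CLI flag, then environment variable, then built-in default); the README does not state the actual order, and `build_parser` is a hypothetical name rather than a function from config.py.

```python
import argparse
import os
from typing import Mapping, Optional


def build_parser(env: Optional[Mapping[str, str]] = None) -> argparse.ArgumentParser:
    """Build a parser whose defaults come from the environment (assumed precedence)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="CheeseBench runner (illustrative sketch)")
    # Environment variables fill in defaults; flags given on the command line override them.
    parser.add_argument("--model", default=env.get("CHEESEBENCH_MODEL", "gpt-oss:120b"))
    # No built-in default URL is documented, so None is used here as a placeholder.
    parser.add_argument("--api-url", default=env.get("CHEESEBENCH_API_URL"))
    parser.add_argument("--num-trials", type=int, default=20)
    parser.add_argument("--max-steps", type=int, default=200)
    parser.add_argument("--seed", type=int, default=42)
    return parser


# The environment sets the model, but the CLI flag takes priority.
fake_env = {"CHEESEBENCH_MODEL": "env-model"}
args = build_parser(fake_env).parse_args(["--model", "cli-model", "--num-trials", "2"])
print(args.model, args.num_trials, args.seed)
```

Passing the environment in as a mapping keeps the function easy to test without mutating `os.environ`.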

Citation

If you use CheeseBench in your research, please cite:

@inproceedings{cheesebench2025,
  title={CheeseBench: Do Vision-Language Models Exhibit Rodent-Level Cognition?},
  author={},
  booktitle={NeurIPS Datasets and Benchmarks Track},
  year={2025}
}

License

MIT
