Behavioral metacognition benchmark under genuine inter-model disagreement

MEDLEY-BENCH

Package: v0.5.0 (beta) · Dataset: v1.0

Behavioral Metacognition Under Social Pressure


MEDLEY-BENCH measures behavioural metacognition in large language models -- the capacity to monitor, evaluate, and control one's own reasoning under escalating social-epistemic pressure. Unlike accuracy-focused benchmarks, it evaluates how models behave when challenged, not whether they know the answer.

⚠️ Beta release. The medley-bench package is published as v0.5.0 (beta): APIs, prompts, and scoring weights may change before the stable 1.0 line. The dataset is frozen at v1.0 and is reproducible as released.

⏱️ Expect long runs. A single model on the full 130-instance dataset issues several hundred API calls (3 target calls/instance × 130 = 390, plus 130 judge calls = 520 total). Wall-clock time depends entirely on provider latency: ~1 hour on fast hosted APIs (Gemini Flash, Claude Haiku, GPT-4.1-mini, or Ollama cloud), several hours on slower ones, and many hours on local Ollama with mid-size open-weight models (Step B-Social alone runs 2–3 min/instance on a 4B-class local model). Plan accordingly — the runner saves results incrementally and is resumable.


Installation

pip install medley-bench

Supported Providers

| Model ID pattern | Provider | Example |
|---|---|---|
| claude-* | Anthropic (direct) | claude-haiku-4.5 |
| gpt-*, o1-*, o3-* | OpenAI (direct) | gpt-4.1, gpt-5.4-mini |
| gemini-* | Google (direct) | gemini-2.5-flash |
| ollama/model | Ollama (local or cloud) | ollama/gemma3:12b, ollama/gpt-oss:20b-cloud |
| org/model | OpenRouter | anthropic/claude-haiku-4.5 |

Set the corresponding API key as an environment variable (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, OPENROUTER_API_KEY). Ollama requires no key for local models.

Quick Start

Run benchmark on a model

# Cloud model via OpenRouter (one API key for all providers)
export OPENROUTER_API_KEY="sk-or-..."
medley-bench benchmark --models "anthropic/claude-haiku-4.5" --data data/metacognition/v1.0/

# Local Ollama model
medley-bench benchmark --models "ollama/gemma3:12b" --data data/metacognition/v1.0/

Run benchmark with a live judge

By default, the benchmark scores only the deterministic measures (T1 + most of T2). To also score the judge-dependent measures (T3), pass a judge model:

# Recommended judge: Gemini 2.5 Flash (fast, cheap, excellent structured output)
export GOOGLE_API_KEY="AI..."
medley-bench benchmark \
  --models "ollama/gemma3:12b" \
  --data data/metacognition/v1.0/ \
  --judge-model gemini-2.5-flash \
  --judge-base-url https://generativelanguage.googleapis.com/v1beta/openai/ \
  --judge-api-key $GOOGLE_API_KEY

# No separate judge key: route an Ollama cloud model as judge through the local Ollama endpoint
medley-bench benchmark \
  --models "ollama/gemma3:12b" \
  --data data/metacognition/v1.0/ \
  --judge-model gemma4:31b-cloud \
  --judge-base-url http://localhost:11434/v1

# Smoke test: limit to first N instances per domain
medley-bench benchmark \
  --models "ollama/gemma3:12b" \
  --data data/metacognition/v1.0/ \
  --judge-model gemini-2.5-flash \
  --n-instances 3

Any OpenAI-compatible endpoint works as a judge. Reasoning models (gpt-oss, glm-4.6, DeepSeek v3.1, etc.) are supported — the library reads the reasoning / reasoning_content fields when content is empty.
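The reasoning-field fallback can be sketched roughly as follows; the field names (reasoning, reasoning_content) come from the text above, while the function and message shape are illustrative, not the library's actual code:

```python
# Sketch: extract judge text from an OpenAI-compatible chat message,
# falling back to reasoning fields when `content` is empty (as some
# reasoning models leave it).
def extract_text(message: dict) -> str:
    content = message.get("content") or ""
    if content.strip():
        return content
    # Reasoning models may put the final answer in one of these fields.
    for key in ("reasoning_content", "reasoning"):
        value = message.get(key) or ""
        if value.strip():
            return value
    return ""

msg = {"content": "", "reasoning_content": "Claim 3 is unsupported."}
print(extract_text(msg))  # -> Claim 3 is unsupported.
```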

View leaderboard

medley-bench leaderboard --results results/

Note on lm-eval-harness

MEDLEY-BENCH cannot run as a native lm-eval-harness task because the three-step protocol requires sequential, state-dependent API calls (Step B's prompt depends on Step A's output). This is a common limitation for multi-turn behavioural benchmarks — AlpacaEval, MT-Bench, and Arena-Hard use the same approach.
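A minimal sketch of why the protocol is stateful; the prompt strings and the call_model stub are illustrative stand-ins, not the package's actual prompts or client:

```python
# Step B prompts are built from Step A's output, so the three calls per
# instance must run sequentially and cannot be batched independently.
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for any chat-completion client.
    return f"<answer to: {prompt[:30]}...>"

def run_instance(vignette: str, analyst_opinions: list[str]) -> dict:
    step_a = call_model(vignette)                    # Step A: solo analysis
    step_b_private = call_model(                     # Step B-Private: self-review
        f"{vignette}\n\nYour earlier analysis:\n{step_a}\n"
        "Review your reasoning and revise if needed."
    )
    step_b_social = call_model(                      # Step B-Social: peer pressure
        f"{vignette}\n\nYour earlier analysis:\n{step_a}\n"
        "Analyst opinions:\n" + "\n".join(analyst_opinions)
    )
    return {"A": step_a, "B_private": step_b_private, "B_social": step_b_social}
```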


Three-Step Decomposition

Every benchmark instance runs three model calls in isolated contexts:

| Step | What the model sees | What it isolates |
|---|---|---|
| Step A (Solo) | Problem vignette only | Independent analysis + confidence calibration |
| Step B-Private | Own Step A + self-review nudge | Self-revision capacity |
| Step B-Social | 8 analyst opinions + consensus | Social updating quality |

Delta(A -> B-Private) = self-revision
Delta(B-Private -> B-Social) = pure social influence
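If each step reports a scalar confidence, the two deltas can be computed directly; the confidence values and the rounding are illustrative assumptions:

```python
# Sketch: separate self-revision from pure social influence using the
# confidence a model reports at each step (values are illustrative).
def decompose_update(conf_a: float, conf_b_private: float, conf_b_social: float):
    self_revision = round(conf_b_private - conf_a, 4)        # Delta(A -> B-Private)
    social_influence = round(conf_b_social - conf_b_private, 4)  # Delta(B-Private -> B-Social)
    return self_revision, social_influence

rev, soc = decompose_update(0.80, 0.75, 0.55)
# Small private revision, much larger drop under social pressure.
print(rev, soc)
```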

Scoring Framework

Scores

| Score | What it measures | Composition |
|---|---|---|
| MMS (Medley Metacognition Score) | Articulation quality | T1 Reflective Updating + T2 Social Robustness + T3 Epistemic Articulation (equal weights) |
| MAS (Medley Ability Score) | Behavioural competence | Mean of Monitoring, Control, Evaluation, Self-regulation |

Three-Tier Aggregation

| Tier | Weight | Measures | Method |
|---|---|---|---|
| T1: Reflective Updating | 33% | Proportionality, selectivity, volatility, uncertainty localisation, Brier change | Deterministic |
| T2: Social Robustness | 33% | Private-vs-social delta, epistemic cowardice, resistance appropriateness, majority pressure, capitulation quality, normative/informational | Mixed |
| T3: Epistemic Articulation | 33% | Content engagement, steelmanning, argument specificity, synthesis necessity, attribution depth, intellectual courage, error acknowledgement + 6 more | Mixed |

75% of the scoring weight is deterministic (rule-based behavioural deltas); the remaining 25% uses an LLM judge with an anti-rhetoric rubric.
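As a sanity check on the equal-weight aggregation, the published tier scores reproduce the headline MMS for the top-ranked model, assuming MMS is the plain mean of T1-T3 (which is what the weights above imply; whether the released pipeline rounds the same way is an assumption):

```python
# Sketch of the equal-weight tier aggregation described above.
WEIGHTS = {"T1": 1 / 3, "T2": 1 / 3, "T3": 1 / 3}

def mms(tier_scores: dict) -> float:
    return sum(WEIGHTS[t] * tier_scores[t] for t in WEIGHTS)

# Claude Haiku 4.5's published tier scores from the results table.
score = mms({"T1": 61.1, "T2": 56.3, "T3": 69.2})
print(round(score, 1))  # -> 62.2, the published MMS
```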

Anti-Gaming Controls

  • Consensus masking (directional labels, not raw numbers)
  • Anonymised analysts in prompts
  • 30 known-answer instances with verified-wrong claims
  • Per-claim ground truth from consensus verification
  • Circularity-aware judge rotation (no model judged by own family)
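The family-exclusion rule can be sketched as a simple filter; the family heuristic below (strip any provider prefix, take the first token of the model id) is an illustrative assumption, not the library's actual implementation:

```python
# Sketch: never let a model be judged by a model from its own family.
def family(model_id: str) -> str:
    # e.g. "anthropic/claude-haiku-4.5" -> "claude", "ollama/gemma3:12b" -> "gemma3"
    return model_id.split("/")[-1].split("-")[0].split(":")[0]

def pick_judge(target: str, judge_pool: list[str]) -> str:
    eligible = [j for j in judge_pool if family(j) != family(target)]
    if not eligible:
        raise ValueError(f"No out-of-family judge available for {target}")
    return eligible[0]

pool = ["claude-sonnet-4.5", "gemini-2.5-flash", "gpt-4.1"]
print(pick_judge("claude-haiku-4.5", pool))  # -> gemini-2.5-flash
```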

Dataset

130 instances across 5 domains:

| Domain | Instances | Reasoning type |
|---|---|---|
| Medical Diagnosis | 27 | Evidential -- contradictory clinical evidence |
| System Troubleshooting | 26 | Causal -- root cause through layers |
| Code Review | 27 | Contextual -- severity depends on threat model |
| Architecture Design | 25 | Tradeoff -- no single right answer |
| Statistical Reasoning | 25 | Formal -- same data, different frameworks |

Each instance includes a vignette, 5 claims with disagreement scores, 8 analyst responses (from 28-model pool), jackknife consensus, and per-claim verified-wrong labels.
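For orientation, one instance might be shaped roughly like the following; every key name and value here is an assumption derived from the field list above, not the released schema:

```python
# Hypothetical shape of a single dataset instance (illustrative only).
instance = {
    "domain": "medical_diagnosis",
    "vignette": "A 54-year-old presents with ...",
    "claims": [  # 5 claims in the real data
        {"text": "Claim 1 ...", "disagreement": 0.62, "verified_wrong": False},
    ],
    "analyst_responses": [  # 8 responses drawn from the 28-model pool
        "Analyst 1: ...",
        "Analyst 2: ...",
    ],
    "jackknife_consensus": {"claim_1": "agree"},
}
assert {"domain", "vignette", "claims"} <= set(instance)
```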

The dataset is also available on Kaggle: farhadabtahi/medley-bench-data

Benchmark Modes

Normal Mode (Kaggle-compatible)

3 calls per instance x 130 = 390 API calls. Standard three-step protocol.

Progressive Mode (5-stage stress test)

| Stage | Analysts | Instances | Purpose |
|---|---|---|---|
| Baseline | 0 | 130 | Solo calibration |
| Mild | 2 | 130 | Basic social responsiveness |
| Moderate | 4 | 130 | Proportional updating |
| Strong | 6 | 50 | Argument discrimination |
| Adversarial | 6 (wrong consensus) | 30 | Intellectual courage under max pressure |
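Assuming one target call per instance per stage (the progressive runner's exact call pattern is an assumption, not documented here), the stage table implies the following per-run instance count:

```python
# Stage -> instance count, taken from the progressive-mode table above.
STAGES = {
    "baseline": 130,
    "mild": 130,
    "moderate": 130,
    "strong": 50,
    "adversarial": 30,
}

total_instances = sum(STAGES.values())
print(total_instances)  # -> 470 stage-instances per progressive run
```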

Kaggle vs Local Scoring

The Kaggle competition framework (kbench) imposes limitations on judge scoring compared to the local benchmark:

| Feature | Local Benchmark | Kaggle (kbench) |
|---|---|---|
| Judge scale | Graded 0-3 per sub-criterion | Binary pass/fail |
| Family exclusion | No model judged by own family | Not available |
| T3 resolution | Fine-grained (30 sub-criteria x 4 levels) | Compressed (30 x 2 levels) |
| Score offset | Reference | +2-4 pts higher (compressed T3) |
| Rank correlation | Reference | rho > 0.97 (rankings preserved) |

Why rankings are preserved: 75% of MMS comes from deterministic rule-based measures (T1 + T2) that are identical on both platforms. The judge limitations only affect T3 (25% of score).

Recommendation: Use the local benchmark (pip install medley-bench) for research. Use Kaggle notebooks for competition submission and quick model comparison.

Results: 35 Models

| Rank | Model | MMS | MAS | T1 | T2 | T3 |
|---|---|---|---|---|---|---|
| 1 | Claude Haiku 4.5 | 62.2 | 61.8 | 61.1 | 56.3 | 69.2 |
| 2 | Gemma 3 27B | 61.1 | 62.0 | 60.1 | 55.8 | 67.5 |
| 3 | Qwen 3.5 397B | 61.0 | 59.2 | 59.8 | 56.5 | 66.7 |
| 4 | Gemini 3 Flash | 60.7 | 60.2 | 59.5 | 56.0 | 66.5 |
| 5 | Claude Sonnet 4.5 | 60.4 | 60.7 | 59.3 | 55.7 | 66.3 |
| 6 | Gemma 3 12B | 60.1 | 61.5 | 58.9 | 55.5 | 65.9 |

Full results for all 35 models are available in the GitHub repository.

Key Findings

  1. Scale buys Evaluation, not Control. Evaluation ability scales with model size, but Control (social robustness) shows no scaling -- GPT-4.1-Nano achieves the best T2 score.

  2. Argument-evaluators vs. statistics-followers. Two behavioural profiles invisible to standard benchmarks, predicted by a single judge dimension (Normative/Informational, rho = -0.82).

  3. Universal evaluation deficit. Under ipsative scoring, Evaluation is every model's weakest relative ability.

  4. Non-monotonic scale returns. Gemma family: 4B(30) -> 9B(50) -> 12B(60) -> 27B(61) -> Gen4-31B(57).

Citation

@article{abtahi2026medleybench,
  title={MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition},
  author={Abtahi, Farhad and Karbalaie, Abdolamir and Illueca-Fernandez, Eduardo and Seoane, Fernando},
  year={2026},
  note={Preprint}
}

License

Apache License 2.0
