CogBench — Verifiable Cognitive Constraint Benchmark for LLM Question Generation

CogBench evaluates whether large language models can generate educationally appropriate questions at specific Bloom's Taxonomy cognitive levels while satisfying 28 deterministic constraints.

Features

  • 6 Bloom's Taxonomy Levels: Remember, Understand, Apply, Analyze, Evaluate, Create
  • 8 Academic Subjects: Biology, Chemistry, Physics, Mathematics, Psychology, Economics, History, Computer Science
  • 28 Deterministic Constraints: No LLM-as-judge — every constraint is verifiable and reproducible
  • Adversarial Mode: Tests robustness by using mismatched cognitive-level vocabulary
  • 120 Passages: From OpenStax open-access textbooks
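To make "deterministic" concrete, here is a minimal sketch of what a rule-based check can look like. The two checks and their names (`ends_with_question_mark`, `within_word_limit`) are hypothetical illustrations, not constraints taken from CogBench's actual set of 28. The point is only that the same input always yields the same verdict, with no model in the loop.

```python
from typing import Callable, Dict

# Hypothetical checks for illustration only; these are NOT CogBench's real constraints.
def ends_with_question_mark(question: str) -> bool:
    """Pass if the text ends with a question mark (ignoring trailing whitespace)."""
    return question.rstrip().endswith("?")

def within_word_limit(question: str, max_words: int = 40) -> bool:
    """Pass if the whitespace-separated word count does not exceed max_words."""
    return len(question.split()) <= max_words

CHECKS: Dict[str, Callable[[str], bool]] = {
    "ends_with_question_mark": ends_with_question_mark,
    "within_word_limit": within_word_limit,
}

def run_checks(question: str) -> Dict[str, bool]:
    """Apply every check; the result depends only on the input string."""
    return {name: check(question) for name, check in CHECKS.items()}

result = run_checks("Which organelle produces most of a cell's ATP?")
print(result)
```

Because each check is a pure function of the question text, rerunning it on stored generations gives identical results.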

Installation

pip install cogbench

For NLP features (key concept extraction with spaCy):

pip install "cogbench[nlp]"
python -m spacy download en_core_web_sm

Quick Start

# Show benchmark configuration
cogbench info

# Run benchmark for a single model (requires Ollama)
cogbench run --model qwen2.5:14b --mode standard

# Run all local models in both modes
cogbench run --local-all --mode both

# Re-evaluate existing results with updated constraints
cogbench evaluate

# Generate leaderboard data
cogbench leaderboard

CLI Commands

Command               Description
cogbench info         Show benchmark config, models, and passage stats
cogbench run          Generate questions and evaluate models
cogbench evaluate     Re-evaluate existing generation files
cogbench leaderboard  Populate leaderboard data.json from results
cogbench scrape       Scrape passages from OpenStax textbooks
cogbench submit       Submit results to the CogBench leaderboard via GitHub

Submitting Results

After running the benchmark, submit your results to the public leaderboard:

# Submit all completed models
cogbench submit --name "Your Name"

# Submit a specific model
cogbench submit --model qwen2.5:14b --name "Your Name"

The submit command requires the GitHub CLI (gh). It creates a GitHub issue containing your results; the maintainers review the issue and add the results to the leaderboard.
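Since submission depends on `gh` being installed, a small pre-flight check can save a failed run. The sketch below is an assumption about how you might script this yourself. The helper names `gh_available` and `build_submit_command` are hypothetical and not part of CogBench; the sketch uses only the `--name` and `--model` flags shown above.

```python
import shutil
from typing import List, Optional

def gh_available() -> bool:
    """Return True if a `gh` executable can be found on PATH."""
    return shutil.which("gh") is not None

def build_submit_command(name: str, model: Optional[str] = None) -> List[str]:
    """Build the argv list for `cogbench submit` (hypothetical helper)."""
    cmd = ["cogbench", "submit"]
    if model is not None:
        cmd += ["--model", model]
    cmd += ["--name", name]
    return cmd

if gh_available():
    # Pass the list to subprocess.run(cmd, check=True) to actually submit.
    print("gh found; ready to run:", build_submit_command("Your Name"))
else:
    print("GitHub CLI (gh) not found on PATH; install it before submitting.")
```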

Python API

from cogbenchv2.passages.processor import load_all_passages
from cogbenchv2.evaluation.evaluate import evaluate_generations
from cogbenchv2.evaluation.metrics import compute_metrics

# Load the 120 bundled passages
passages = load_all_passages()

# Evaluate a generation file
evaluations = evaluate_generations("gen_qwen2_5_14b_standard.json", passages)

# Compute metrics
metrics = compute_metrics(evaluations)
print(f"Prompt-level strict: {metrics['prompt_level_strict']['rate']:.1%}")
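The key `prompt_level_strict` is not defined in this README. One plausible reading, borrowed from instruction-following benchmarks, is that a generation counts as a pass only when every constraint passes. The sketch below illustrates that reading on toy data. The function name and the constraint names `c1` and `c2` are hypothetical, and CogBench's actual computation may differ.

```python
from typing import Dict, List

def prompt_level_strict_rate(results: List[Dict[str, bool]]) -> float:
    """Fraction of generations for which every constraint passed."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if all(r.values()))
    return passed / len(results)

# Toy data: each dict maps a hypothetical constraint name to pass/fail.
toy = [
    {"c1": True, "c2": True},
    {"c1": True, "c2": False},
    {"c1": True, "c2": True},
    {"c1": False, "c2": False},
]
rate = prompt_level_strict_rate(toy)
print(f"Prompt-level strict: {rate:.1%}")  # prints "Prompt-level strict: 50.0%"
```

Under this reading, a single failed constraint fails the whole generation, so the strict rate is never higher than the pass rate of any individual constraint.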

Environment Variables

Variable              Default         Description
COGBENCH_RESULTS_DIR  ./data/results  Where benchmark results are saved
OPENAI_API_KEY        (none)          For OpenAI API models
GOOGLE_API_KEY        (none)          For Google Gemini models
TOGETHER_API_KEY      (none)          For Together.ai models
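If you script around the benchmark, you may want to resolve these variables the same way. The sketch below is an assumption, not CogBench's own code. The helpers `resolve_results_dir` and `missing_api_keys` are hypothetical; they only mirror the variable names and the default listed above.

```python
import os
from pathlib import Path
from typing import List, Sequence

DEFAULT_RESULTS_DIR = "./data/results"  # default listed for COGBENCH_RESULTS_DIR

def resolve_results_dir() -> Path:
    """Return COGBENCH_RESULTS_DIR if set, otherwise the documented default."""
    return Path(os.environ.get("COGBENCH_RESULTS_DIR", DEFAULT_RESULTS_DIR))

def missing_api_keys(
    names: Sequence[str] = ("OPENAI_API_KEY", "GOOGLE_API_KEY", "TOGETHER_API_KEY"),
) -> List[str]:
    """List the API key variables that are unset or empty."""
    return [n for n in names if not os.environ.get(n)]

print("Results directory:", resolve_results_dir())
print("Unset API keys:", missing_api_keys())
```

The API keys matter only for the corresponding hosted providers, so an unset key is not necessarily a problem.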

License

Apache 2.0
