CogBench — Verifiable Cognitive Constraint Benchmark for LLM Question Generation

CogBench evaluates whether large language models can generate educationally appropriate questions at specific Bloom's Taxonomy cognitive levels while satisfying 28 deterministic constraints.

Features

  • 6 Bloom's Taxonomy Levels: Remember, Understand, Apply, Analyze, Evaluate, Create
  • 8 Academic Subjects: Biology, Chemistry, Physics, Mathematics, Psychology, Economics, History, Computer Science
  • 28 Deterministic Constraints: No LLM-as-judge — every constraint is verifiable and reproducible
  • Adversarial Mode: Tests robustness by using mismatched cognitive-level vocabulary
  • 120 Passages: From OpenStax open-access textbooks
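To illustrate what "deterministic" means here, the following is a minimal sketch of two rule-based checks that need no LLM judge. The verb lists and function names are hypothetical and are not CogBench's actual constraints; they only show how a constraint can be verified reproducibly from the question text alone.

```python
import re

# Hypothetical verb lists for two Bloom levels; CogBench's real lists may differ.
BLOOM_VERBS = {
    "remember": {"define", "list", "name", "recall"},
    "analyze": {"compare", "contrast", "differentiate", "examine"},
}


def ends_with_question_mark(question: str) -> bool:
    """Pass if the text ends with a question mark (ignoring trailing spaces)."""
    return question.rstrip().endswith("?")


def uses_level_verb(question: str, level: str) -> bool:
    """Pass if any verb listed for the target level appears as a whole word."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return bool(words & BLOOM_VERBS[level])


question = "How would you compare mitosis and meiosis?"
print(ends_with_question_mark(question))      # True
print(uses_level_verb(question, "analyze"))   # True
print(uses_level_verb(question, "remember"))  # False
```

Because each check is a pure function of the text, running it twice on the same question always gives the same verdict.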

Installation

pip install cogbench

For NLP features (key concept extraction with spaCy):

pip install "cogbench[nlp]"
python -m spacy download en_core_web_sm

Quick Start

# Show benchmark configuration
cogbench info

# Run benchmark for a single model (requires Ollama)
cogbench run --model qwen2.5:14b --mode standard

# Run all local models in both modes
cogbench run --local-all --mode both

# Re-evaluate existing results with updated constraints
cogbench evaluate

# Generate leaderboard data
cogbench leaderboard

CLI Commands

Command               Description
cogbench info         Show benchmark config, models, and passage stats
cogbench run          Generate questions and evaluate models
cogbench evaluate     Re-evaluate existing generation files
cogbench leaderboard  Populate leaderboard data.json from results
cogbench scrape       Scrape passages from OpenStax textbooks
cogbench submit       Submit results to the CogBench leaderboard via GitHub

Submitting Results

After running the benchmark, submit your results to the public leaderboard:

# Submit all completed models
cogbench submit --name "Your Name"

# Submit a specific model
cogbench submit --model qwen2.5:14b --name "Your Name"

This creates a GitHub issue containing your results, which the maintainers review before adding them to the leaderboard. The command requires the GitHub CLI (gh).
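Since submission depends on the GitHub CLI, it can help to confirm that gh is on your PATH before submitting. The helper below is a small sketch using only the Python standard library; gh_available is a hypothetical name and is not part of CogBench.

```python
import shutil


def gh_available() -> bool:
    """Return True if an executable named `gh` is found on PATH."""
    return shutil.which("gh") is not None


if not gh_available():
    print("GitHub CLI (gh) not found; install it before running `cogbench submit`.")
```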

Python API

from cogbenchv2.passages.processor import load_all_passages
from cogbenchv2.evaluation.evaluate import evaluate_generations
from cogbenchv2.evaluation.metrics import compute_metrics

# Load the 120 bundled passages
passages = load_all_passages()

# Evaluate a generation file
evaluations = evaluate_generations("gen_qwen2_5_14b_standard.json", passages)

# Compute metrics
metrics = compute_metrics(evaluations)
print(f"Prompt-level strict: {metrics['prompt_level_strict']['rate']:.1%}")
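The exact definition of prompt_level_strict is not spelled out above. Assuming it follows the common convention that a prompt counts as passed only when every applicable constraint passes, the arithmetic can be sketched as follows. The record structure and constraint names are hypothetical; the objects returned by evaluate_generations may look different.

```python
from typing import Dict, List

# Hypothetical per-question results: constraint name -> passed?
results: List[Dict[str, bool]] = [
    {"ends_with_question_mark": True, "uses_level_verb": True},
    {"ends_with_question_mark": True, "uses_level_verb": False},
    {"ends_with_question_mark": True, "uses_level_verb": True},
    {"ends_with_question_mark": False, "uses_level_verb": False},
]


def prompt_level_strict_rate(records: List[Dict[str, bool]]) -> float:
    """Fraction of records in which every constraint passed."""
    if not records:
        return 0.0
    return sum(all(r.values()) for r in records) / len(records)


def constraint_level_rate(records: List[Dict[str, bool]]) -> float:
    """Fraction of individual constraint checks that passed."""
    total = sum(len(r) for r in records)
    if total == 0:
        return 0.0
    return sum(sum(r.values()) for r in records) / total


print(f"Prompt-level strict: {prompt_level_strict_rate(results):.1%}")  # 50.0%
print(f"Constraint-level:    {constraint_level_rate(results):.1%}")     # 62.5%
```

The strict rate is never higher than the constraint-level rate, because a single failed constraint fails the whole prompt.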

Environment Variables

Variable              Default         Description
COGBENCH_RESULTS_DIR  ./data/results  Where benchmark results are saved
OPENAI_API_KEY        (none)          For OpenAI API models
GOOGLE_API_KEY        (none)          For Google Gemini models
TOGETHER_API_KEY      (none)          For Together.ai models
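The variables above can be inspected from your own scripts before starting a run. This is a minimal sketch that mirrors the documented default for COGBENCH_RESULTS_DIR; results_dir and missing_api_keys are hypothetical helpers, and CogBench's own resolution logic may differ.

```python
import os
from pathlib import Path

API_KEY_VARS = ("OPENAI_API_KEY", "GOOGLE_API_KEY", "TOGETHER_API_KEY")


def results_dir() -> Path:
    """Resolve the results directory, falling back to the documented default."""
    return Path(os.environ.get("COGBENCH_RESULTS_DIR", "./data/results"))


def missing_api_keys(names=API_KEY_VARS):
    """List the API key variables that are unset or empty."""
    return [name for name in names if not os.environ.get(name)]


print(f"Results directory: {results_dir()}")
print(f"Unset API keys: {missing_api_keys()}")
```

An API key is only needed for the provider whose models you intend to run; local Ollama models need none of them.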

License

Apache 2.0
