CogBench — Verifiable Cognitive Constraint Benchmark for LLM Question Generation
CogBench evaluates whether large language models can generate educationally appropriate questions at specific Bloom's Taxonomy cognitive levels while satisfying 28 deterministic constraints.
Features
- 6 Bloom's Taxonomy Levels: Remember, Understand, Apply, Analyze, Evaluate, Create
- 8 Academic Subjects: Biology, Chemistry, Physics, Mathematics, Psychology, Economics, History, Computer Science
- 28 Deterministic Constraints: No LLM-as-judge — every constraint is verifiable and reproducible
- Adversarial Mode: Tests robustness by using mismatched cognitive-level vocabulary
- 120 Passages: From OpenStax open-access textbooks
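To make "deterministic constraint" concrete, the sketch below shows the general shape such a check could take: a pure function of the generated question that returns the same verdict on every run, with no model in the loop. The two checks, the 40-word limit, and the `passes_strict` helper are hypothetical illustrations and are not taken from CogBench's 28 actual constraints.

```python
from typing import Callable, Dict


# Hypothetical checks for illustration only; not CogBench's real constraints.
def ends_with_question_mark(question: str) -> bool:
    """True if the text, ignoring surrounding whitespace, ends with '?'."""
    return question.strip().endswith("?")


def within_word_limit(question: str, max_words: int = 40) -> bool:
    """True if the text has at least one word and at most max_words words."""
    return 0 < len(question.split()) <= max_words


CHECKS: Dict[str, Callable[[str], bool]] = {
    "ends_with_question_mark": ends_with_question_mark,
    "within_word_limit": within_word_limit,
}


def run_checks(question: str) -> Dict[str, bool]:
    """Apply every check; identical input always yields identical output."""
    return {name: check(question) for name, check in CHECKS.items()}


def passes_strict(question: str) -> bool:
    """A question passes strictly only if every single check passes."""
    return all(run_checks(question).values())


if __name__ == "__main__":
    print(run_checks("Which organelle produces most of a cell's ATP?"))
```

Because each check is an ordinary function of the text, rerunning the evaluation on the same generation file should reproduce the same verdicts, which is the property the feature list emphasizes.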
Installation
pip install cogbench
For NLP features (key concept extraction with spaCy):
pip install "cogbench[nlp]"
python -m spacy download en_core_web_sm
Quick Start
# Show benchmark configuration
cogbench info
# Run benchmark for a single model (requires Ollama)
cogbench run --model qwen2.5:14b --mode standard
# Run all local models in both modes
cogbench run --local-all --mode both
# Re-evaluate existing results with updated constraints
cogbench evaluate
# Generate leaderboard data
cogbench leaderboard
CLI Commands
| Command | Description |
|---|---|
| `cogbench info` | Show benchmark config, models, and passage stats |
| `cogbench run` | Generate questions and evaluate models |
| `cogbench evaluate` | Re-evaluate existing generation files |
| `cogbench leaderboard` | Populate leaderboard data.json from results |
| `cogbench scrape` | Scrape passages from OpenStax textbooks |
| `cogbench submit` | Submit results to the CogBench leaderboard via GitHub |
Submitting Results
After running the benchmark, submit your results to the public leaderboard:
# Submit all completed models
cogbench submit --name "Your Name"
# Submit a specific model
cogbench submit --model qwen2.5:14b --name "Your Name"
This creates a GitHub issue containing your results. The maintainers review the issue and add the results to the leaderboard. The command requires the GitHub CLI (`gh`) to be installed.
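Since submission depends on the GitHub CLI, it can help to confirm that `gh` is on your `PATH` before starting. The following is a minimal pre-flight sketch using only the standard library; the `find_gh` helper is a hypothetical name and is not part of CogBench.

```python
import shutil
from typing import Optional


def find_gh(executable: str = "gh") -> Optional[str]:
    """Return the full path to the GitHub CLI, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_gh()
    if path is None:
        print("GitHub CLI (gh) not found; install it before running `cogbench submit`.")
    else:
        print(f"Found gh at {path}")
```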
Python API
from cogbenchv2.passages.processor import load_all_passages
from cogbenchv2.evaluation.evaluate import evaluate_generations
from cogbenchv2.evaluation.metrics import compute_metrics
# Load the 120 bundled passages
passages = load_all_passages()
# Evaluate a generation file
evaluations = evaluate_generations("gen_qwen2_5_14b_standard.json", passages)
# Compute metrics
metrics = compute_metrics(evaluations)
print(f"Prompt-level strict: {metrics['prompt_level_strict']['rate']:.1%}")
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `COGBENCH_RESULTS_DIR` | `./data/results` | Where benchmark results are saved |
| `OPENAI_API_KEY` | (none) | For OpenAI API models |
| `GOOGLE_API_KEY` | (none) | For Google Gemini models |
| `TOGETHER_API_KEY` | (none) | For Together.ai models |
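The sketch below shows one way a script could resolve these variables before a run, falling back to the documented default for the results directory and reporting which API keys are unset. The helper names `results_dir` and `missing_api_keys` are hypothetical, and how CogBench reads these variables internally is an assumption.

```python
import os
from pathlib import Path
from typing import List, Mapping

# Variable names taken from the table above.
API_KEY_VARS = ("OPENAI_API_KEY", "GOOGLE_API_KEY", "TOGETHER_API_KEY")


def results_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Return COGBENCH_RESULTS_DIR, or the documented default if it is unset."""
    return Path(env.get("COGBENCH_RESULTS_DIR", "./data/results"))


def missing_api_keys(env: Mapping[str, str] = os.environ) -> List[str]:
    """List the API key variables that are unset or empty."""
    return [name for name in API_KEY_VARS if not env.get(name)]


if __name__ == "__main__":
    print(f"Results directory: {results_dir()}")
    for name in missing_api_keys():
        print(f"{name} is not set; models from that provider cannot be called.")
```

Passing the environment as a mapping keeps both helpers easy to test without modifying the real process environment.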
License
Apache 2.0