itemwise
LLM-based evaluation of multiple-choice items against item-writing guidelines.
Evaluate the quality of multiple-choice questions (MCQs) using the 43 item-writing rules from Haladyna & Downing (1989), powered by any LLM provider via litellm.
Installation
pip install git+https://github.com/kikagaku/itemwise.git
Or with uv:
uv add git+https://github.com/kikagaku/itemwise.git
Requires Python 3.12+.
Quick Start
from itemwise import evaluate

result = evaluate(
    item={
        "stem": "Which of the following is NOT a characteristic of mammals?",
        "options": [
            "They are warm-blooded",
            "They lay eggs",
            "They have hair or fur",
            "They produce milk",
        ],
        "correct": 1,
    },
    model="azure/gpt-5.1-chat",
)
print(result.score) # 0.95 (fraction of rules passed)
print(result.violations) # [RuleResult(rule_id=22, ...)]
Features
- Evaluate MCQs against 41 item-writing rules (43 total, 2 batch-level rules excluded by default)
- Sync and async API (evaluate, async_evaluate)
- Batch evaluation with tqdm progress bar (evaluate_batch, async_evaluate_batch)
- Structured JSON output via response_format for reliable LLM responses
- Token usage and cost tracking (UsageInfo)
- Automatic retry on JSON parse failures
- Any LLM provider supported through litellm
- CLI with flexible parameter passthrough
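The retry-on-parse-failure behavior listed above can be pictured with a small sketch. This is not itemwise's actual implementation: `call_with_json_retry`, `max_attempts`, and the `call_llm` callable are hypothetical names chosen for illustration, and the real retry policy may differ.

```python
import json
from typing import Any, Callable


def call_with_json_retry(call_llm: Callable[[], str], max_attempts: int = 3) -> Any:
    """Call `call_llm` until its output parses as JSON, up to `max_attempts` times."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc  # remember the failure and try again
    raise ValueError(f"no valid JSON after {max_attempts} attempts") from last_error


# Stand-in for an LLM call (hypothetical): fails once, then returns valid JSON.
_replies = iter(["not json", '{"rule_id": 22, "passed": false}'])
parsed = call_with_json_retry(lambda: next(_replies))
print(parsed["rule_id"])
```

Requesting structured output via response_format should make such retries rare, but a bounded retry loop remains a cheap safety net.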
Usage
Library API
from itemwise import evaluate, evaluate_batch, async_evaluate_batch
# Single item evaluation (default: 41 rules)
result = evaluate(item=item, model="azure/gpt-5.1-chat")
# Select specific rules by ID
result = evaluate(item=item, model="azure/gpt-5.1-chat", rules=[22, 28, 37])
# Batch evaluation with progress bar
results = evaluate_batch(items=[item1, item2, ...], model="azure/gpt-5.1-chat")
# Disable progress bar
results = evaluate_batch(items=items, model="azure/gpt-5.1-chat", progress=False)
# Async batch evaluation (parallel LLM calls)
results = await async_evaluate_batch(items=items, model="azure/gpt-5.1-chat")
# Pass any LLM parameters through to litellm
result = evaluate(item=item, model="azure/gpt-5.1-chat", reasoning_effort="low")
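If a provider enforces rate limits, it may help to cap the number of concurrent requests. The README does not say whether async_evaluate_batch does this internally, so the following is only a generic sketch: `evaluate_with_limit`, `limit`, and `fake_evaluate` are hypothetical names, and the stand-in coroutine replaces a real LLM call so the example runs offline.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def evaluate_with_limit(
    evaluate_one: Callable[[dict], Awaitable[Any]],
    items: list[dict],
    limit: int = 5,
) -> list[Any]:
    """Run `evaluate_one` over `items` with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: dict) -> Any:
        async with semaphore:
            return await evaluate_one(item)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_evaluate(item: dict) -> str:
    """Stand-in for a real LLM call (hypothetical); echoes the stem after a pause."""
    await asyncio.sleep(0.01)
    return item["stem"]


items = [{"stem": f"Question {i}"} for i in range(4)]
results = asyncio.run(evaluate_with_limit(fake_evaluate, items, limit=2))
print(results)
```

With itemwise, `evaluate_one` could wrap async_evaluate, assuming it accepts the same keyword arguments as evaluate; the README does not show its signature.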
Token Usage and Cost
result = evaluate(item=item, model="azure/gpt-5.1-chat")
print(result.usage.prompt_tokens) # 304
print(result.usage.completion_tokens) # 226
print(result.usage.total_tokens) # 530
print(result.usage.cost) # 0.00264
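For batch runs, the per-item usage figures can be summed to estimate the cost of a whole evaluation. The sketch below assumes every result in a batch exposes the same `usage` fields shown above, which the README does not state explicitly; `summarize_usage` is a hypothetical helper, and the stand-in objects replace real results so the example runs without an API key.

```python
from types import SimpleNamespace


def summarize_usage(results) -> dict:
    """Sum token counts and cost over objects exposing a `.usage` attribute."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0, "cost": 0.0}
    for result in results:
        for field in totals:
            totals[field] += getattr(result.usage, field)
    return totals


# Stand-ins shaped like the result shown above (the second one is made up);
# a real run would pass the list returned by evaluate_batch instead.
fake_results = [
    SimpleNamespace(usage=SimpleNamespace(
        prompt_tokens=304, completion_tokens=226, total_tokens=530, cost=0.00264)),
    SimpleNamespace(usage=SimpleNamespace(
        prompt_tokens=310, completion_tokens=190, total_tokens=500, cost=0.00250)),
]
summary = summarize_usage(fake_results)
print(summary["total_tokens"])  # 1030
```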
CLI
# Evaluate items from a JSON file
itemwise evaluate questions.json --model azure/gpt-5.1-chat
# Select specific rules
itemwise evaluate questions.json --model azure/gpt-5.1-chat --rules 22,28,37
# Pass LLM parameters
itemwise evaluate questions.json --model azure/gpt-5.1-chat --param reasoning_effort=low
# Show version
itemwise --version
Input JSON format:
[
  {
    "stem": "Question text",
    "options": ["Option A", "Option B", "Option C", "Option D"],
    "correct": 0
  }
]
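Checking a file against this format before spending tokens on it can catch simple mistakes early. The following is a minimal sketch, not part of itemwise: `validate_items` is a hypothetical helper, and it assumes `correct` is a zero-based index into `options`, as the Quick Start example suggests (index 1 selects "They lay eggs").

```python
import json


def validate_items(items: object) -> list[str]:
    """Return human-readable problems; an empty list means the input looks valid."""
    if not isinstance(items, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: must be an object")
            continue
        stem = item.get("stem")
        options = item.get("options")
        correct = item.get("correct")
        if not isinstance(stem, str) or not stem.strip():
            problems.append(f"item {i}: 'stem' must be a non-empty string")
        if not isinstance(options, list) or not all(isinstance(o, str) for o in options):
            problems.append(f"item {i}: 'options' must be a list of strings")
            continue  # the index check below needs a valid options list
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(correct, int) or isinstance(correct, bool):
            problems.append(f"item {i}: 'correct' must be an integer index")
        elif not 0 <= correct < len(options):
            problems.append(f"item {i}: 'correct' index {correct} is out of range")
    return problems


sample = json.loads(
    '[{"stem": "Question text", "options": ["A", "B", "C", "D"], "correct": 4}]'
)
print(validate_items(sample))
```

For a real file, the same function could be applied to the result of `json.load` on questions.json.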
LLM Configuration
itemwise calls the LLM backend through litellm, so model names and parameters follow litellm conventions.
For Azure OpenAI, set the following environment variables:
export AZURE_API_KEY=your-key
export AZURE_API_BASE=https://your-resource.cognitiveservices.azure.com/
export AZURE_API_VERSION=2024-12-01-preview
See the litellm documentation for other providers (OpenAI, Anthropic, Google, etc.).
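A missing environment variable typically surfaces only as an authentication error at the first LLM call, so a quick pre-flight check can save a failed run. The sketch below is an illustration, not part of itemwise or litellm: `missing_env_vars` is a hypothetical helper, and the variable names are the three Azure ones listed above.

```python
import os

REQUIRED_AZURE_VARS = ("AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION")


def missing_env_vars(names=REQUIRED_AZURE_VARS, environ=None) -> list[str]:
    """Return the names that are unset or empty in `environ` (defaults to os.environ)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping so the check is reproducible.
example_env = {"AZURE_API_KEY": "your-key", "AZURE_API_BASE": ""}
print(missing_env_vars(environ=example_env))
```

Calling `missing_env_vars()` with no arguments checks the real process environment.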
Item-Writing Rules
itemwise evaluates MCQs against the 43 rules from Haladyna & Downing (1989), organized into 6 categories:
| Category | Rules | Description |
|---|---|---|
| General (Procedural) | 1-7 | Item format, grammar, readability |
| General (Content) | 8-17 | Educational objectives, vocabulary level, higher-order thinking |
| Stem Construction | 18-23 | Stem clarity, positive wording, central idea placement |
| General Option | 24-35 | Option count, order, homogeneity, length consistency |
| Correct Option | 36-37 | Answer position distribution, uniqueness |
| Distractor | 38-43 | Plausibility, common errors, avoiding humor |
Rules 11 (item independence) and 36 (correct answer position distribution) require cross-item analysis and are excluded from default evaluation. They can be explicitly included via the rules parameter, but single-item evaluation accuracy is limited for these rules.
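The category ranges and the default exclusions can be expressed as a small lookup, which is handy for grouping violations by category or for building a custom rules list. This is a sketch transcribed from the table above, not itemwise's internal data structure; `CATEGORIES`, `category_of`, and `default_rule_ids` are hypothetical names.

```python
# Category ranges transcribed from the table above (range ends are exclusive).
CATEGORIES = {
    "General (Procedural)": range(1, 8),
    "General (Content)": range(8, 18),
    "Stem Construction": range(18, 24),
    "General Option": range(24, 36),
    "Correct Option": range(36, 38),
    "Distractor": range(38, 44),
}
BATCH_LEVEL_RULES = {11, 36}  # these need cross-item analysis


def category_of(rule_id: int) -> str:
    """Return the category name for a rule ID between 1 and 43."""
    for name, ids in CATEGORIES.items():
        if rule_id in ids:
            return name
    raise ValueError(f"unknown rule ID: {rule_id}")


def default_rule_ids() -> list[int]:
    """All rule IDs except the two batch-level rules."""
    all_ids = [rid for ids in CATEGORIES.values() for rid in ids]
    return [rid for rid in all_ids if rid not in BATCH_LEVEL_RULES]


print(len(default_rule_ids()))  # 41
print(category_of(22))          # Stem Construction
```

A list such as `sorted(default_rule_ids() + [11, 36])` could then be passed as the rules parameter to request all 43 rules, with the accuracy caveat noted above.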
References
- Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2(1), 37-50.
- Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309-333.
License
MIT