Bayesian updating and Item Response Theory (MCQ 3PL, graded/continuous QA) for candidate skill assessment
This project has been archived.
The maintainers of this project have marked this project as archived. No new releases are expected.
hb-irt scores candidate responses — multiple-choice questions and open-ended
questions graded on a 0-10 or 0-100 scale — into a single latent ability
estimate with a calibrated uncertainty range, and supports adaptive,
multi-stage testing on top of it.
Features
| Feature | Description |
|---|---|
| Item models | 3PL (multiple-choice), GRM (0-10 graded), CRM (0-100 continuous) |
| Ability estimation | EAP (Gauss-Hermite quadrature) and MAP, with posterior variance and 95% credible intervals |
| Sequential updating | Each test stage's posterior becomes the next stage's prior — no re-scoring from scratch |
| Item calibration | Marginal Maximum Likelihood (MMLE/EM) fit from raw response data |
| Score aggregation | Precision-weighted combination of per-topic estimates into a 0-100 score ± margin of error |
| Adaptive testing | Information-maximizing module selection with exposure control, and configurable stopping rules |
Responses from any mix of item types combine into a single ability estimate:
| Response type | Model | value represents |
|---|---|---|
| Multiple-choice | 3PL | 0 / 1 |
| Open-ended, scored 0-10 | GRM | integer category 0–10 |
| Open-ended, scored 0-100 | CRM | continuous score 0–100 |
Installation
pip install hb-irt
or, with uv:
uv add hb-irt
Requires Python 3.12+. Depends on numpy and scipy.
Quickstart
from hb_irt.types import Item
from hb_irt.models.threepl import ThreePLModel
from hb_irt.bayes.estimation import eap_estimate
from hb_irt.bayes.sequential import sequential_update
from hb_irt.scoring import build_subskill_score
# Define a small item bank (discrimination, difficulty, guessing)
items = [Item(item_id=f"q{i}", a=1.2, b=b, c=0.2)
         for i, b in enumerate([-1.0, -0.3, 0.4, 1.0, 1.6])]
models = [ThreePLModel(item) for item in items]
# Stage 1: estimate ability from a prior N(0, 1)
posterior = eap_estimate(list(zip(models, [1, 1, 0, 1, 0])), prior_mu=0.0, prior_sigma=1.0)
# Stage 2: the previous posterior becomes the new prior
posterior = sequential_update(posterior, list(zip(models, [1, 1, 1, 0, 1])))
# Rescale to a 0-100 score with a 95% margin of error
score = build_subskill_score(
    "python_debugging", posterior, items_administered=10, modules_completed=2
)
print(f"{score.score_0_100:.1f} ± {score.margin_error_95:.1f}")
Concepts
- Ability (θ) is represented on a logit scale internally, typically in the range [-4, 4]. Use hb_irt.scoring.rescale_0_100 to convert a posterior to a 0-100 score with a margin of error whenever you need to display it.
- Posterior(mu, variance) is the shared representation of a belief about a candidate's ability throughout the library. It exposes .sem (standard error) and .credible_interval(level).
- Item models (3PL, GRM, CRM) share a minimal common interface, loglik(value, theta) and info(theta), which is what lets responses of every type combine into a single ability estimate.
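To make the Posterior bullet concrete, here is a minimal stand-alone sketch of a normal belief with a standard error and a symmetric credible interval. The class name NormalBelief is hypothetical and is not hb-irt's Posterior; the sketch assumes the belief is summarized as a normal distribution, which is what the (mu, variance) representation suggests.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist


@dataclass(frozen=True)
class NormalBelief:
    """Hypothetical stand-in for a (mu, variance) ability belief."""

    mu: float
    variance: float

    @property
    def sem(self) -> float:
        # The standard error is the standard deviation of the belief.
        return math.sqrt(self.variance)

    def credible_interval(self, level: float = 0.95) -> tuple[float, float]:
        # Two-sided normal quantile (about 1.96 for level=0.95).
        z = NormalDist().inv_cdf(0.5 + level / 2.0)
        return (self.mu - z * self.sem, self.mu + z * self.sem)


belief = NormalBelief(mu=0.2, variance=0.25)
low, high = belief.credible_interval(0.95)
print(f"theta = {belief.mu:.2f}, sem = {belief.sem:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Under the normal assumption the interval is always symmetric around mu; a library that works from the full quadrature posterior could report a slightly asymmetric one.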
Usage guide
Multiple-choice items (3PL)
from hb_irt.types import Item
from hb_irt.models.threepl import ThreePLModel
item = ThreePLModel(Item(item_id="q1", a=1.2, b=0.3, c=0.2))
item.probability(theta=0.3) # probability of a correct response at theta=0.3
item.loglik(value=1, theta=0.3)
item.info(theta=0.3) # Fisher information at theta=0.3
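For readers who want the formulas behind these calls, the following is a sketch of the textbook 3PL response curve and its Fisher information. It assumes the plain logistic form without the 1.7 scaling constant; whether hb-irt applies that constant is not stated here, and the function names p_3pl and info_3pl are illustrative only.

```python
import math


def p_3pl(theta: float, a: float, b: float, c: float) -> float:
    """P(correct) = c + (1 - c) * logistic(a * (theta - b))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def info_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Fisher information of a 3PL item: a^2 * (Q / P) * ((P - c) / (1 - c))^2."""
    p = p_3pl(theta, a, b, c)
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


# Same parameters as the item above: a=1.2, b=0.3, c=0.2.
p = p_3pl(0.3, a=1.2, b=0.3, c=0.2)
info = info_3pl(0.3, a=1.2, b=0.3, c=0.2)
print(f"P(correct | theta=b) = {p:.2f}")   # halfway between c and 1: 0.60
print(f"info at theta=b      = {info:.2f}")  # 1.44 * (0.4 / 0.6) * 0.25 = 0.24
```

At theta equal to the difficulty b, the probability sits halfway between the guessing floor c and 1, which is why a guessing parameter lowers the information an item can provide.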
Open-ended answers scored 0-10 (Graded Response Model)
from hb_irt.models.grm import GRMItem, GRMModel
# 10 boundaries define 11 ordered categories (scores 0..10)
item = GRMModel(GRMItem(item_id="qa1", a=1.0, boundaries=(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7)))
item.category_probabilities(theta=0.5) # probability of each of the 11 scores
item.loglik(value=7, theta=0.5) # value is the observed 0-10 score
item.info(theta=0.5)
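The category probabilities of a graded response model come from differences of adjacent cumulative logistic curves. The sketch below shows that construction for the same boundaries as above; it assumes the standard Samejima form with a single discrimination a, and grm_category_probs is a hypothetical helper, not the library's method.

```python
import math


def grm_category_probs(theta: float, a: float, boundaries: tuple[float, ...]) -> list[float]:
    """Probabilities of scores 0..K given K increasing boundaries."""
    # Cumulative curves P(score >= k) for k = 1..K, padded with 1 (k = 0) and 0 (k = K + 1).
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in boundaries]
    cumulative.append(0.0)
    # Each category probability is the drop between adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(boundaries) + 1)]


probs = grm_category_probs(0.5, a=1.0, boundaries=(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7))
most_likely = max(range(len(probs)), key=probs.__getitem__)
print(len(probs), "categories, most likely score:", most_likely)
```

With theta = 0.5 lying between the third and fourth boundaries (0 and 1), the most likely score under this form is 3. The boundaries must be strictly increasing, otherwise some differences would be negative.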
Open-ended answers scored 0-100 (Continuous Response Model)
from hb_irt.models.crm import CRMItem, CRMModel
item = CRMModel(CRMItem(item_id="qa2", a=1.0, b=0.0, max_score=100.0))
item.loglik(value=72.0, theta=0.4)
item.info(theta=0.4)
Ability estimation (EAP / MAP) across mixed item types
from hb_irt.bayes.estimation import eap_estimate, map_estimate
# Any mix of item models works, since each just contributes a scalar
# loglik(value, theta) — MCQ, graded, and continuous responses combine freely.
# mcq_model, grm_model, and crm_model are model instances built as in the sections above.
responses = [(mcq_model, 1), (grm_model, 8), (crm_model, 80.0)]
posterior = eap_estimate(responses, prior_mu=0.0, prior_sigma=1.0)
# posterior.mu, posterior.variance, posterior.sem, posterior.credible_interval(0.95)
theta_map = map_estimate(responses, prior_mu=0.0, prior_sigma=1.0)
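The idea behind EAP can be sketched without the library: evaluate prior times likelihood on a grid of theta values, then take the weighted mean and variance. This sketch deliberately swaps in a plain uniform grid for the Gauss-Hermite quadrature the library lists, and uses only 3PL items; make_3pl_loglik and eap_on_grid are hypothetical names. Because each response is just a (loglik, value) pair, any item type with a loglik(value, theta) callable would slot in the same way.

```python
import math


def make_3pl_loglik(a, b, c):
    """Return loglik(value, theta) for one 3PL item (value is 0 or 1)."""
    def loglik(value, theta):
        p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
        return math.log(p if value == 1 else 1.0 - p)
    return loglik


def eap_on_grid(responses, prior_mu=0.0, prior_sigma=1.0, n_points=121):
    """responses: list of (loglik, value). Returns (mean, variance, mode)."""
    half_width = 6.0 * prior_sigma
    grid = [prior_mu - half_width + 2.0 * half_width * i / (n_points - 1)
            for i in range(n_points)]
    weights = []
    for theta in grid:
        # Log prior up to an additive constant, plus the summed log-likelihood.
        log_post = -0.5 * ((theta - prior_mu) / prior_sigma) ** 2
        log_post += sum(loglik(value, theta) for loglik, value in responses)
        weights.append(math.exp(log_post))
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, grid)) / total
    variance = sum(w * (t - mean) ** 2 for w, t in zip(weights, grid)) / total
    mode = grid[max(range(n_points), key=weights.__getitem__)]  # MAP at grid resolution
    return mean, variance, mode


bank = [make_3pl_loglik(1.2, b, 0.2) for b in (-1.0, -0.3, 0.4, 1.0, 1.6)]
mean, variance, mode = eap_on_grid(list(zip(bank, [1, 1, 0, 1, 0])))
print(f"EAP = {mean:.2f}, sd = {math.sqrt(variance):.2f}, MAP ~ {mode:.1f}")
```

EAP is the posterior mean and MAP the posterior mode; they coincide only when the posterior is symmetric, which a guessing parameter usually prevents.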
Sequential updating across test stages
from hb_irt.types import Posterior
from hb_irt.bayes.sequential import sequential_update, sequential_update_all
prior = Posterior(mu=0.0, variance=1.0)
# Each stage's responses are a list of (model, value) pairs, as passed to eap_estimate.
stage_1_posterior = sequential_update(prior, stage_1_responses)
stage_2_posterior = sequential_update(stage_1_posterior, stage_2_responses)
# or in one call, given an ordered list of each stage's responses:
history = sequential_update_all(prior, [stage_1_responses, stage_2_responses])
Posterior variance is guaranteed to never increase across stages (assuming non-degenerate item information), so estimates only get more precise as a candidate answers more items.
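Why a posterior can stand in as the next prior follows directly from Bayes' rule: multiplying the belief by each stage's likelihood in turn gives the same result as multiplying by all likelihoods at once. The sketch below checks this on a grid, carrying the full grid of weights between stages. That is an assumption of the sketch; the library passes a (mu, variance) summary instead, so its sequential and batch results may differ slightly. All function names are illustrative.

```python
import math


def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def likelihood(theta, items, values):
    """Joint likelihood of one stage's 0/1 responses at a given theta."""
    out = 1.0
    for (a, b, c), v in zip(items, values):
        p = p_3pl(theta, a, b, c)
        out *= p if v == 1 else 1.0 - p
    return out


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def update(belief, grid, items, values):
    """Multiply the current belief by one stage's likelihood and renormalize."""
    return normalize([w * likelihood(t, items, values) for w, t in zip(belief, grid)])


def mean_var(belief, grid):
    mean = sum(w * t for w, t in zip(belief, grid))
    var = sum(w * (t - mean) ** 2 for w, t in zip(belief, grid))
    return mean, var


grid = [-6.0 + 0.1 * i for i in range(121)]
prior = normalize([math.exp(-0.5 * t * t) for t in grid])  # N(0, 1) on the grid
items = [(1.2, b, 0.2) for b in (-1.0, -0.3, 0.4, 1.0, 1.6)]
stage_1, stage_2 = [1, 1, 0, 1, 0], [1, 1, 1, 0, 1]

after_1 = update(prior, grid, items, stage_1)
after_2 = update(after_1, grid, items, stage_2)                # stage 1 posterior as prior
batch = update(prior, grid, items + items, stage_1 + stage_2)  # everything in one pass

print("sequential:", mean_var(after_2, grid))
print("batch:     ", mean_var(batch, grid))
```

The two printed results agree up to floating-point rounding, which is the property that makes stage-by-stage scoring possible without revisiting earlier responses.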
Item calibration (MMLE/EM)
Fit 3PL item parameters from a batch of raw response data:
import numpy as np
from hb_irt.calibration import calibrate_3pl
responses = np.array(...) # shape (n_examinees, n_items), binary 0/1
result = calibrate_3pl(responses, item_ids=["q1", "q2", "q3"], fixed_c=0.2)
result.items # tuple of Item, with fitted discrimination/difficulty (and fixed guessing)
result.converged # bool
result.n_iterations # int
Pass fixed_c=<value> when calibrating with fewer than ~500 responses per item; otherwise omit it to freely estimate a guessing parameter per item.
Difficulty mapping by cognitive level
from hb_irt.bloom import difficulty_anchor, shrink_difficulty
difficulty_anchor("L4") # -> 1.2 (an "Analysis"-level item's typical difficulty)
# Pull a noisy raw calibration estimate toward its level's typical difficulty,
# weighted by how confident each estimate is.
shrunk_b = shrink_difficulty(raw_difficulty=1.9, raw_variance=0.3, level="L4", sigma_b=0.4)
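One common way to implement this kind of shrinkage is a precision-weighted average of the raw estimate and the level's anchor, as in a normal-normal model. The sketch below assumes that form, with sigma_b read as the prior standard deviation around the anchor; hb-irt's actual formula may differ, and shrink_toward_anchor is a hypothetical name.

```python
def shrink_toward_anchor(raw: float, raw_variance: float, anchor: float, sigma_b: float) -> float:
    """Precision-weighted average of a raw estimate and a prior anchor."""
    w_raw = 1.0 / raw_variance    # precision of the noisy calibration estimate
    w_anchor = 1.0 / sigma_b**2   # precision of the level's prior (assumed normal)
    return (w_raw * raw + w_anchor * anchor) / (w_raw + w_anchor)


# Same numbers as the call above, using the L4 anchor of 1.2.
shrunk = shrink_toward_anchor(1.9, 0.3, anchor=1.2, sigma_b=0.4)
print(round(shrunk, 3))  # 1.443 under this assumed formula
```

Here the anchor carries about twice the precision of the raw estimate (1/0.16 versus 1/0.3), so the result lands closer to 1.2 than to 1.9.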
Score rescaling and topic aggregation
from hb_irt.scoring import aggregate_levels, rescale_0_100, build_subskill_score
# Combine estimates from several cognitive levels into one topic-level posterior
level_posterior = aggregate_levels(
    level_thetas={"L1": 0.4, "L2": 0.6, "L3": 0.5},
    level_variances={"L1": 0.05, "L2": 0.08, "L3": 0.06},
)
score, margin, ci_lower, ci_upper = rescale_0_100(level_posterior)
subskill_score = build_subskill_score(
    subskill_id="python_debugging",
    posterior=level_posterior,
    items_administered=42,
    modules_completed=4,
    level_thetas={"L1": 0.4, "L2": 0.6, "L3": 0.5},
)
# subskill_score.score_0_100, .margin_error_95, .ci_lower_95, .ci_upper_95, ...
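The arithmetic of precision weighting can be shown on the same numbers. The aggregation below is the standard inverse-variance weighted mean. The rescaling is only an assumption: it maps the [-4, 4] logit range linearly onto 0-100 and uses a 1.96 standard-error margin, which may not be how rescale_0_100 works internally. Both function names are hypothetical.

```python
import math


def aggregate_precision_weighted(thetas: dict[str, float],
                                 variances: dict[str, float]) -> tuple[float, float]:
    """Inverse-variance weighted mean and the variance of that mean."""
    precisions = {k: 1.0 / variances[k] for k in thetas}
    total = sum(precisions.values())
    mu = sum(precisions[k] * thetas[k] for k in thetas) / total
    return mu, 1.0 / total


def rescale_linear(mu: float, variance: float, lo: float = -4.0, hi: float = 4.0, z: float = 1.96):
    """Assumed linear map from [lo, hi] to [0, 100], clipped to the scale."""
    scale = 100.0 / (hi - lo)
    score = min(100.0, max(0.0, (mu - lo) * scale))
    margin = z * math.sqrt(variance) * scale
    return score, margin, max(0.0, score - margin), min(100.0, score + margin)


mu, variance = aggregate_precision_weighted(
    {"L1": 0.4, "L2": 0.6, "L3": 0.5},
    {"L1": 0.05, "L2": 0.08, "L3": 0.06},
)
score, margin, ci_lower, ci_upper = rescale_linear(mu, variance)
print(f"theta = {mu:.3f}, variance = {variance:.4f}")
print(f"{score:.1f} ± {margin:.1f}  (95% CI {ci_lower:.1f} to {ci_upper:.1f})")
```

The combined variance (about 0.020) is smaller than any single level's variance, which is the point of pooling: the most precise level (L1) pulls the mean hardest, and every level adds precision.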
Adaptive testing (MSAT): module bank, selection, stopping
from hb_irt.types import Posterior
from hb_irt.msat.module_bank import ModuleBank
from hb_irt.msat.selection import select_next_module
from hb_irt.msat.stopping import StoppingConfig, evaluate_stopping
bank = ModuleBank(modules=(easy_module, medium_module, hard_module, challenge_module))
current_posterior = Posterior(mu=0.2, variance=0.6)
administered = ["easy_1"]
next_module = select_next_module(bank, current_posterior, administered_ids=administered)
decision = evaluate_stopping(
    posterior=current_posterior,
    previous_posterior=prior_posterior, # or None on the first module
    n_modules=2,
    n_items=15,
    config=StoppingConfig(), # sigma_min=0.3, max_modules=8, min_items=20, delta_saturation=0.01
)
if decision.should_stop:
    print("stopping:", decision.reasons) # e.g. ("precision_threshold",)
ModuleBank.available(administered_ids) returns modules not yet given to a
candidate. select_next_module picks the module that maximizes expected
information gain at the candidate's current ability estimate, with an
exposure-control bonus that favors less-used modules.
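The selection and stopping logic described above can be sketched in a few lines. Everything here is an assumption made for illustration: modules are plain lists of 3PL parameters, the exposure bonus is a linear term, the thresholds interact in the way shown, and the reason strings other than "precision_threshold" are invented. pick_module, StopRule, and stop_reasons are hypothetical names, not hb-irt's API.

```python
import math
from dataclasses import dataclass


def info_3pl(theta, a, b, c):
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def pick_module(modules, theta, administered, exposure, bonus=0.1):
    """modules: {module_id: [(a, b, c), ...]}; exposure: {module_id: usage rate in [0, 1]}."""
    best_id, best_score = None, -math.inf
    for module_id, items in modules.items():
        if module_id in administered:
            continue  # never repeat a module for the same candidate
        score = sum(info_3pl(theta, *item) for item in items)
        score += bonus * (1.0 - exposure.get(module_id, 0.0))  # favor rarely used modules
        if score > best_score:
            best_id, best_score = module_id, score
    return best_id


@dataclass(frozen=True)
class StopRule:
    sigma_min: float = 0.3
    max_modules: int = 8
    min_items: int = 20
    delta_saturation: float = 0.01


def stop_reasons(sem, n_modules, n_items, delta_mu, rule=StopRule()):
    """Return a tuple of reasons to stop; an empty tuple means continue."""
    reasons = []
    enough_items = n_items >= rule.min_items
    if enough_items and sem <= rule.sigma_min:
        reasons.append("precision_threshold")
    if n_modules >= rule.max_modules:
        reasons.append("max_modules")        # hypothetical reason name
    if enough_items and delta_mu is not None and abs(delta_mu) < rule.delta_saturation:
        reasons.append("saturation")         # hypothetical reason name
    return tuple(reasons)


modules = {
    "easy_1": [(1.2, -1.5, 0.2)] * 5,
    "medium_1": [(1.2, 0.0, 0.2)] * 5,
    "hard_1": [(1.2, 2.0, 0.2)] * 5,
}
chosen = pick_module(modules, theta=0.2, administered={"easy_1"},
                     exposure={"medium_1": 0.5, "hard_1": 0.1})
print("next module:", chosen)
print("stop?", stop_reasons(sem=0.25, n_modules=3, n_items=24, delta_mu=0.2))
```

For a candidate near theta = 0.2, the medium module is far more informative than the hard one, so the small exposure bonus does not change the choice; the bonus matters mainly when two modules are nearly tied on information.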
Package layout
| Module | Provides |
|---|---|
| hb_irt.types | Core data types: Item, Response, Posterior, TestModule, SubskillScore |
| hb_irt.models.threepl | 3PL model for multiple-choice items |
| hb_irt.models.grm | Graded Response Model for 0-10 scored answers |
| hb_irt.models.crm | Continuous Response Model for 0-100 scored answers |
| hb_irt.bayes.estimation | EAP and MAP ability estimation |
| hb_irt.bayes.sequential | Sequential Bayesian updating across test stages |
| hb_irt.information | Fisher test information and standard error of measurement |
| hb_irt.calibration | MMLE/EM calibration of 3PL item parameters |
| hb_irt.bloom | Cognitive-level difficulty anchors and shrinkage |
| hb_irt.scoring | 0-100 rescaling and precision-weighted level aggregation |
| hb_irt.msat | Adaptive module bank, selection, and stopping rules |
Import directly from the submodule you need, e.g.
from hb_irt.models.threepl import ThreePLModel.
Development
For contribution guidelines, architecture notes, and project conventions, see CLAUDE.md.
uv sync
uv run pytest
License
MIT — see LICENSE.