Bayesian updating and Item Response Theory (MCQ 3PL, graded/continuous QA) for candidate skill assessment
This project has been archived.
The maintainers of this project have marked this project as archived. No new releases are expected.
hb-irt
Bayesian ability estimation and Item Response Theory (IRT) for candidate skill assessment. Implements the MCQ 3PL model, graded/continuous QA response models, sequential Bayesian updating across test modules, and MSAT (Multi-Stage Adaptive Testing) module selection and stopping rules.
This README is written for both human developers and AI coding agents working in this repository. If you are an agent picking up a task here, read Guidance for AI agents before making changes.
Source of truth for the underlying math: Candidate Skill Assessment Model Specification v2.pdf. Every non-obvious formula in this codebase cites a
section/equation number from that spec in its docstring.
Scope
See CLAUDE.md for the authoritative scope statement. Summary:
In scope: Bayesian updating (EAP/MAP, sequential updating across modules), IRT for MCQs (3PL), IRT for QA scored 0-10 (GRM) or 0-100 (CRM), and everything those require to function (Fisher information, Bloom-level difficulty mapping, item calibration, sub-skill score rescaling, MSAT module selection/stopping).
Out of scope: DAG-based hierarchical skill aggregation (spec §6) and anything downstream of it. This package's output boundary is the sub-skill score — it never computes a skill-level (DAG-aggregated) score.
Requirements
- Python >= 3.12
- uv as the package manager; do not use bare `pip`/`venv` in this repo.
Install
uv sync
This creates .venv/ and installs numpy/scipy plus the pytest/pytest-cov/coverage dev group.
Test
uv run pytest
- Coverage is enforced at 90%+ via `--cov-fail-under=90` in `pyproject.toml` (current actual coverage is 100%). A `term-missing` report prints uncovered lines directly in the terminal; `coverage.xml` is also written.
- Run a single file: `uv run pytest tests/models/test_threepl.py`
- Run with verbose test names: `uv run pytest -v`
Add a dependency
uv add <package> # runtime dependency
uv add --dev <package> # dev-only dependency (linters, test tools, etc.)
Architecture
    src/hb_irt/
        types.py            Item, Response, Posterior, TestModule, SubskillScore
        quadrature.py       Gauss-Hermite quadrature (EAP integration primitive)
        models/
            base.py         ItemModel interface: loglik(value, theta), info(theta)
            threepl.py      3PL model for MCQ items
            grm.py          Graded Response Model for 0-10 ordinal QA
            crm.py          Continuous Response Model for 0-100 QA
        bayes/
            estimation.py   EAP / MAP ability estimation
            sequential.py   Sequential Bayesian updating across modules
        information.py      Fisher test information, standard error of measurement
        calibration.py      MMLE/EM calibration of 3PL item parameters
        bloom.py            Bloom level (L1-L6) -> difficulty anchor + shrinkage
        scoring.py          0-100 rescaling, precision-weighted level aggregation
        msat/
            module_bank.py  TestModule repository, queryable by type/history
            selection.py    Information-maximizing module selection (EIG)
            stopping.py     Stopping rules (precision / max modules / saturation)
There is no re-exporting package __init__.py — import directly from the
submodule that defines what you need, e.g. from hb_irt.models.threepl import ThreePLModel, not from hb_irt import ThreePLModel. This keeps import paths
traceable to a single file and avoids a growing "god module."
Tests mirror the source layout 1:1 under tests/ (e.g.
src/hb_irt/models/grm.py <-> tests/models/test_grm.py).
Module map: what maps to what in the spec
| Module | Spec section | Purpose |
|---|---|---|
| `types.py` | Table 5, Table 7, eq 1 | Core data model: item params, responses, posteriors, modules, sub-skill scores |
| `quadrature.py` | Appendix A.1 | Gauss-Hermite nodes/weights and posterior-moment computation used by every Bayesian update |
| `models/threepl.py` | §3.1, eq 4; §A.2 | 3PL probability + Fisher information for MCQ items |
| `models/grm.py` | not in spec (see docstring) | Samejima GRM for 0-10 graded QA responses |
| `models/crm.py` | not in spec (see docstring) | Samejima continuous response model for 0-100 QA responses |
| `bayes/estimation.py` | §4.1-4.2, eq 7 | EAP (quadrature) and MAP (optimization) ability estimates |
| `bayes/sequential.py` | §4.4, eq 10 | Chains module posteriors: posterior_{t-1} becomes prior_t |
| `information.py` | eq 2, A.2 | Additive test information, SEM = 1/sqrt(I(theta)) |
| `calibration.py` | §3.2, eq 5 | MMLE via EM (Bock & Aitkin) to fit 3PL item parameters from response data |
| `bloom.py` | §3.3, Table 6, eq 6 | Bloom level difficulty anchors + empirical-Bayes shrinkage |
| `scoring.py` | §5.1-5.3, eq 11-15 | 0-100 rescaling (50 + 10*theta), margin of error, per-level precision-weighted aggregation |
| `msat/module_bank.py` | §2.1-2.2, Table 1-2 | Module repository, target ability ranges by type |
| `msat/selection.py` | §2.3-2.4, eq 2-3 | Information-maximizing module selection with exposure control |
| `msat/stopping.py` | §2.5, Table 3 | Precision / max-modules / min-items / saturation stopping rules |
Core concepts
- Ability scale: candidate ability (theta, `θ`) is represented on the logit scale internally everywhere (typically in `[-4, 4]`). It is rescaled to a 0-100 "score" only at the reporting boundary, via `scoring.rescale_0_100` (`score = 50 + 10*theta`, `margin = 19.6 * SE`). Never do this rescaling inline elsewhere; always call into `scoring.py`.
- Posterior: `types.Posterior(mu, variance)` is the universal representation of a belief about theta. It exposes `.sem` (standard error) and `.credible_interval(level)`.
- ItemModel interface: every item model (3PL, GRM, CRM) implements two methods only: `loglik(value, theta)` and `info(theta)`. This is deliberately minimal: it's exactly what Bayesian estimation (`bayes/estimation.py`) and MSAT selection (`msat/selection.py`) need, and nothing else. `value` means different things per model (0/1 for 3PL, an integer category for GRM, a raw 0-100 score for CRM). The interface doesn't care, since it never inspects `value` itself.
- Gauss-Hermite quadrature: `quadrature.quadrature_grid(mu, sigma, n_points)` returns `(theta_nodes, weights)` approximating `N(mu, sigma^2)`. `quadrature.posterior_moments(log_likelihood, theta, prior_weight)` combines a log-likelihood evaluated at those nodes with the prior weights to get a normalized posterior mean/variance. This pair of functions is the single numerical engine behind `eap_estimate`, `expected_information_gain`, and (indirectly) `calibrate_3pl`. If you need EAP-style integration anywhere else in this codebase, reuse these; don't reimplement quadrature.
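To make the quadrature idea concrete, here is a minimal sketch of how a Gauss-Hermite grid and a posterior-moment step can be built on NumPy. The names `gh_grid` and `gh_posterior_moments` are hypothetical stand-ins for illustration only; they are not the package's functions, and the real `quadrature.py` may differ in detail.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss


def gh_grid(mu, sigma, n_points=21):
    """Hypothetical helper: nodes and weights approximating N(mu, sigma^2)."""
    # hermgauss returns nodes/weights for the weight function exp(-x^2).
    x, w = hermgauss(n_points)
    theta = mu + np.sqrt(2.0) * sigma * x  # change of variables to N(mu, sigma^2)
    weights = w / np.sqrt(np.pi)           # rescaled so the weights sum to 1
    return theta, weights


def gh_posterior_moments(log_likelihood, theta, prior_weight):
    """Hypothetical helper: normalized posterior mean and variance on the grid."""
    # Subtract the maximum before exponentiating to avoid underflow.
    unnorm = prior_weight * np.exp(log_likelihood - np.max(log_likelihood))
    post = unnorm / unnorm.sum()
    mean = float(np.sum(post * theta))
    variance = float(np.sum(post * (theta - mean) ** 2))
    return mean, variance


theta, weights = gh_grid(mu=0.0, sigma=1.0)
# A flat (all-zero) log-likelihood carries no information, so the prior is returned.
flat_mean, flat_var = gh_posterior_moments(np.zeros_like(theta), theta, weights)
```

With a flat log-likelihood the result reproduces the prior mean and variance, which is a convenient sanity check for any grid of this kind.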
Usage examples
MCQ item (3PL) — probability, log-likelihood, information
from hb_irt.types import Item
from hb_irt.models.threepl import ThreePLModel
item = ThreePLModel(Item(item_id="q1", a=1.2, b=0.3, c=0.2))
item.probability(theta=0.3) # -> 0.6 (== (1+c)/2 at theta == b, per Table 5)
item.loglik(value=1, theta=0.3)
item.info(theta=0.3) # Fisher information at theta=0.3
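For readers who want the arithmetic behind those calls, the following standalone sketch uses the textbook 3PL form and its Fisher information. It assumes no scaling constant (no D = 1.7 factor); the function names are illustrative and are not part of `hb_irt`.

```python
import math


def three_pl_probability(theta, a, b, c):
    """P(correct) = c + (1 - c) * logistic(a * (theta - b))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def three_pl_information(theta, a, b, c):
    """Fisher information: a^2 * ((P - c) / (1 - c))^2 * (1 - P) / P."""
    p = three_pl_probability(theta, a, b, c)
    return a**2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p


# Same parameters as the item above, evaluated at theta == b.
p_at_b = three_pl_probability(theta=0.3, a=1.2, b=0.3, c=0.2)    # (1 + c) / 2
info_at_b = three_pl_information(theta=0.3, a=1.2, b=0.3, c=0.2)
```

At `theta == b` the logistic term is exactly 1/2, which is why the probability reduces to `(1 + c) / 2`.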
Graded QA response (0-10 scale) — GRM
from hb_irt.models.grm import GRMItem, GRMModel
# 10 boundaries -> 11 ordered categories (raw scores 0..10)
item = GRMModel(GRMItem(item_id="qa1", a=1.0, boundaries=(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7)))
item.category_probabilities(theta=0.5) # -> array of 11 probabilities, sums to 1
item.loglik(value=7, theta=0.5) # value is the observed 0-10 category
item.info(theta=0.5)
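The category probabilities in a Samejima-style graded model are differences of adjacent cumulative logistic curves. The sketch below illustrates that construction with a hypothetical helper; it assumes a single shared discrimination `a` and strictly increasing boundaries, and is not the package's implementation.

```python
import math


def grm_category_probabilities(theta, a, boundaries):
    """Return P(score == k) for k = 0..len(boundaries)."""
    # Cumulative P(score >= k): 1 for k = 0, logistic curves for k = 1..K, 0 beyond K.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in boundaries]
    cumulative.append(0.0)
    # Each category probability is the gap between two adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(boundaries) + 1)]


probs = grm_category_probabilities(
    theta=0.5, a=1.0, boundaries=(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7)
)
```

Ten boundaries give eleven probabilities, one per raw score from 0 to 10, and they sum to 1 by construction.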
Continuous QA response (0-100 scale) — CRM
from hb_irt.models.crm import CRMItem, CRMModel
item = CRMModel(CRMItem(item_id="qa2", a=1.0, b=0.0, max_score=100.0))
item.loglik(value=72.0, theta=0.4)
item.info(theta=0.4) # constant == a^2 for this model (see crm.py docstring)
EAP / MAP ability estimation from mixed item types
from hb_irt.bayes.estimation import eap_estimate, map_estimate
# Any mix of ThreePLModel / GRMModel / CRMModel instances works — each just
# contributes a scalar loglik(value, theta).
responses = [(mcq_model, 1), (grm_model, 8), (crm_model, 80.0)]
posterior = eap_estimate(responses, prior_mu=0.0, prior_sigma=1.0)
# posterior.mu, posterior.variance, posterior.sem, posterior.credible_interval(0.95)
theta_map = map_estimate(responses, prior_mu=0.0, prior_sigma=1.0)
Sequential updating across test modules
from hb_irt.types import Posterior
from hb_irt.bayes.sequential import sequential_update, sequential_update_all
prior = Posterior(mu=0.0, variance=1.0)
module_1_posterior = sequential_update(prior, module_1_responses)
module_2_posterior = sequential_update(module_1_posterior, module_2_responses)
# or in one call, given an ordered list of modules' responses:
history = sequential_update_all(prior, [module_1_responses, module_2_responses])
Posterior variance is guaranteed to be non-increasing across calls (assuming
non-degenerate item information) — this is asserted by
tests/bayes/test_sequential.py.
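The chaining idea (each module's posterior becomes the next module's prior) can be sketched without the package. To stay dependency-free, this example swaps in an evenly spaced grid instead of Gauss-Hermite nodes and uses a logistic item with no guessing parameter. All names and item parameters are hypothetical, and each step summarizes the posterior as a normal `(mu, variance)` pair.

```python
import math


def log_lik_logistic(value, theta, a, b):
    """Log-likelihood of a binary response under a logistic item (no guessing)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return math.log(p) if value == 1 else math.log(1.0 - p)


def grid_update(prior_mu, prior_var, responses, n_points=201, half_width=6.0):
    """One Bayesian update on an evenly spaced grid; returns (mu, variance)."""
    sigma = math.sqrt(prior_var)
    step = 2.0 * half_width * sigma / (n_points - 1)
    thetas = [prior_mu - half_width * sigma + i * step for i in range(n_points)]
    # Unnormalized log posterior = log normal prior + summed item log-likelihoods.
    log_post = [
        -0.5 * ((t - prior_mu) / sigma) ** 2
        + sum(log_lik_logistic(v, t, a, b) for (a, b, v) in responses)
        for t in thetas
    ]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    mu = sum(w * t for w, t in zip(weights, thetas)) / total
    var = sum(w * (t - mu) ** 2 for w, t in zip(weights, thetas)) / total
    return mu, var


# Each response is (a, b, observed 0/1); the values are made up for illustration.
module_1 = [(1.2, -0.5, 1), (1.0, 0.0, 1), (1.5, 0.5, 0)]
module_2 = [(1.3, 0.2, 1), (0.9, 0.8, 0), (1.1, 0.4, 1)]

state = (0.0, 1.0)  # prior mean and variance
history = [state]
for module in (module_1, module_2):
    state = grid_update(state[0], state[1], module)  # posterior becomes next prior
    history.append(state)
```

In this sketch the variance in `history` shrinks from one entry to the next, mirroring the non-increasing behavior described above.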
Item calibration (MMLE/EM)
import numpy as np
from hb_irt.calibration import calibrate_3pl
responses = np.array(...) # shape (n_examinees, n_items), binary 0/1
result = calibrate_3pl(responses, item_ids=["q1", "q2", "q3"], fixed_c=0.2)
result.items # tuple[Item, ...] with fitted a, b, (fixed) c
result.converged # bool
result.n_iterations # int
Use fixed_c=<value> when calibrating with fewer than ~500 responses per item
(spec §3.2); otherwise omit it to freely estimate guessing per item.
Bloom level difficulty mapping + shrinkage
from hb_irt.bloom import difficulty_anchor, shrink_difficulty
difficulty_anchor("L4") # -> 1.2 (Analysis, Table 6)
# Pull a noisy raw calibration estimate toward its Bloom-level anchor,
# precision-weighted by how confident each estimate is (spec eq 6).
shrunk_b = shrink_difficulty(raw_difficulty=1.9, raw_variance=0.3, level="L4", sigma_b=0.4)
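Precision-weighted shrinkage of this kind is usually an inverse-variance average of the raw estimate and the anchor. The sketch below assumes that standard form; whether it matches spec eq 6 term for term is an assumption, and `shrink_toward_anchor` is a hypothetical name, not the package's function.

```python
def shrink_toward_anchor(raw_difficulty, raw_variance, anchor, sigma_b):
    """Inverse-variance weighted average of a raw estimate and a prior anchor."""
    precision_raw = 1.0 / raw_variance      # confidence in the calibration estimate
    precision_anchor = 1.0 / sigma_b**2     # confidence in the Bloom-level anchor
    weighted_sum = precision_raw * raw_difficulty + precision_anchor * anchor
    return weighted_sum / (precision_raw + precision_anchor)


# Same inputs as the call above, with the L4 anchor of 1.2 passed explicitly.
shrunk = shrink_toward_anchor(raw_difficulty=1.9, raw_variance=0.3, anchor=1.2, sigma_b=0.4)
```

Because the anchor here is more precise than the raw estimate (variance 0.16 versus 0.3), the result lands closer to 1.2 than to 1.9.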
Score rescaling and level aggregation
from hb_irt.types import Posterior
from hb_irt.scoring import aggregate_levels, rescale_0_100, build_subskill_score
# Combine per-Bloom-level EAP estimates into one sub-skill posterior (eq 14-15)
level_posterior = aggregate_levels(
level_thetas={"L1": 0.4, "L2": 0.6, "L3": 0.5},
level_variances={"L1": 0.05, "L2": 0.08, "L3": 0.06},
)
score, margin, ci_lower, ci_upper = rescale_0_100(level_posterior)
subskill_score = build_subskill_score(
subskill_id="sk_python_debugging",
posterior=level_posterior,
items_administered=42,
modules_completed=4,
level_thetas={"L1": 0.4, "L2": 0.6, "L3": 0.5},
)
# subskill_score.score_0_100, .margin_error_95, .ci_lower_95, .ci_upper_95, ...
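The arithmetic behind aggregation and rescaling can be sketched in a few lines. This assumes that "precision-weighted" means inverse-variance weighting (an assumption about spec eq 14-15) and uses the `score = 50 + 10*theta`, `margin = 19.6 * SE` mapping stated under Core concepts. The helper names are hypothetical.

```python
import math


def aggregate_precision_weighted(level_thetas, level_variances):
    """Inverse-variance weighted mean of per-level thetas, plus pooled variance."""
    precisions = {lvl: 1.0 / level_variances[lvl] for lvl in level_thetas}
    total_precision = sum(precisions.values())
    theta = sum(precisions[lvl] * level_thetas[lvl] for lvl in level_thetas) / total_precision
    return theta, 1.0 / total_precision


def rescale(theta, variance):
    """Map theta to the 0-100 reporting scale with a 95% margin of error."""
    score = 50.0 + 10.0 * theta
    margin = 19.6 * math.sqrt(variance)  # 1.96 * 10 * SE
    return score, margin, score - margin, score + margin


theta_agg, var_agg = aggregate_precision_weighted(
    {"L1": 0.4, "L2": 0.6, "L3": 0.5},
    {"L1": 0.05, "L2": 0.08, "L3": 0.06},
)
score, margin, ci_lower, ci_upper = rescale(theta_agg, var_agg)
```

The pooled variance is smaller than any single level's variance, which is what makes the aggregated score tighter than its inputs.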
MSAT: module bank, selection, stopping
from hb_irt.types import Posterior, TestModule
from hb_irt.msat.module_bank import ModuleBank
from hb_irt.msat.selection import select_next_module
from hb_irt.msat.stopping import StoppingConfig, evaluate_stopping
bank = ModuleBank(modules=(easy_module, medium_module, hard_module, challenge_module))
current_posterior = Posterior(mu=0.2, variance=0.6)
administered = ["easy_1"]
next_module = select_next_module(bank, current_posterior, administered_ids=administered)
decision = evaluate_stopping(
posterior=current_posterior,
previous_posterior=prior_posterior, # or None on the first module
n_modules=2,
n_items=15,
config=StoppingConfig(), # sigma_min=0.3, max_modules=8, min_items=20, delta_saturation=0.01
)
if decision.should_stop:
print("stopping:", decision.reasons) # e.g. ("precision_threshold",)
ModuleBank.available(administered_ids) implements the spec's B \ H set
(available minus already-administered). select_next_module picks the
module maximizing S(Q) = I_Q(mu_curr) + alpha * exp(-N_Q / beta) (spec
Algorithm §2.3), where N_Q is TestModule.n_exposures (exposure control).
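The selection score above is simple enough to evaluate by hand. The sketch below computes `S(Q)` for three made-up candidates; the module IDs, information values, exposure counts, and the `alpha` and `beta` settings are all hypothetical and chosen only to show the exposure bonus changing the ranking.

```python
import math


def selection_score(module_info, n_exposures, alpha, beta):
    """S(Q) = I_Q(mu_curr) + alpha * exp(-N_Q / beta)."""
    return module_info + alpha * math.exp(-n_exposures / beta)


# module_id -> (information at the current posterior mean, exposure count)
candidates = {
    "medium_1": (2.4, 120),
    "hard_1": (2.1, 5),
    "challenge_1": (1.0, 0),
}
alpha, beta = 0.5, 50.0

scores = {
    module_id: selection_score(info, n_exp, alpha, beta)
    for module_id, (info, n_exp) in candidates.items()
}
best = max(scores, key=scores.get)
```

Here `medium_1` has the highest raw information, but its heavy exposure leaves it with almost no bonus, so the rarely used `hard_1` comes out ahead.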
Conventions
- Frozen dataclasses everywhere for value types (`Item`, `Response`, `Posterior`, `TestModule`, `SubskillScore`, `GRMItem`, `CRMItem`). Don't make these mutable; callers should construct a new instance rather than mutate.
- Validation happens in `__post_init__`. Raise `ValueError` with a message that includes the invalid value, e.g. `f"discrimination a must be > 0, got {self.a}"`. Follow this pattern for new item/data types.
- 21-40 point Gauss-Hermite quadrature (spec Appendix A.1) for all EAP-style integration; `quadrature.DEFAULT_N_POINTS = 21` is the default everywhere and should stay the default unless a caller has a specific reason to change it.
- No global re-exports. Each new module should be imported by its own path. Do not add symbols to `hb_irt/__init__.py` beyond the version string.
- Docstrings cite the spec. Any formula-bearing function's docstring should reference the section/equation number it implements (e.g. `(spec eq 14-15)`). If the spec doesn't define the formula directly, the docstring should explicitly say "not part of the spec" and cite the model used (see `models/grm.py`, `models/crm.py` for the pattern). This is the case for the QA graded/continuous models, since the spec covers only binary MCQ items.
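As an illustration of the frozen-dataclass and validation conventions, here is a minimal sketch with a hypothetical `ExampleItem` type. It is not one of the package's classes; it only shows the pattern of validating in `__post_init__` and building a new instance instead of mutating.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ExampleItem:
    """Hypothetical value type following the conventions above."""

    item_id: str
    a: float

    def __post_init__(self):
        # Include the offending value in the message, as the convention asks.
        if self.a <= 0:
            raise ValueError(f"discrimination a must be > 0, got {self.a}")


item = ExampleItem(item_id="q1", a=1.2)

# Frozen instances reject attribute assignment...
try:
    item.a = 2.0
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True

# ...so "changing" a field means constructing a new instance.
updated = replace(item, a=1.5)
```

`dataclasses.replace` re-runs `__post_init__`, so the new instance is validated the same way as the original.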
Guidance for AI agents
If you are an agent extending this repository, read this section fully before writing code.
Ground rules
- Respect the scope boundary. Anything that requires a skill-level score (DAG propagation, weighted sub-skill→skill contribution, non-compensatory gates, skill-level CIs; spec §6) is explicitly out of scope. If a task description asks you to compute or import anything downstream of a `SubskillScore`, stop and flag it rather than implementing it, because it belongs in a different package. See CLAUDE.md.
- Every formula needs a citation. Before implementing new math, find the corresponding section/equation in the spec PDF (or, if it's a QA-specific extension the spec doesn't cover, use a citable IRT technique, e.g. Samejima's GRM/CRM, and say so explicitly in the docstring, the way `models/grm.py` and `models/crm.py` do). Do not invent ad hoc formulas without documenting the reasoning.
- Maintain 90-100% test coverage. `pyproject.toml` enforces `--cov-fail-under=90`; the project currently sits at 100%. Any new module needs tests covering: the happy path, at least one boundary/edge case (extreme theta, boundary parameter values like `c` near 0 or 1, category index 0 and the max category, empty/degenerate inputs), and every `raise ValueError` / `raise KeyError` branch. Run `uv run pytest` before declaring a task done; it will fail the whole suite if coverage drops below 90%.
- Don't reimplement the quadrature primitive. If new code needs EAP-style integration over a normal posterior, call `quadrature.quadrature_grid` / `quadrature.posterior_moments`; don't hand-roll another Gauss-Hermite implementation.
- Keep the `ItemModel` interface minimal. New item/response models (should another response type ever be needed) must implement only `loglik(value, theta)` and `info(theta)` to compose with `bayes/estimation.py` and `msat/selection.py` unmodified. Don't widen the interface unless every existing model can implement the new method too.
A known pytest gotcha in this repo
Two names in this codebase intentionally collide with pytest's default collection prefixes:
- `types.TestModule` is a dataclass, not a test class. It sets `__test__ = False` in its class body so pytest skips it. If you add another class whose domain name happens to start with `Test`, do the same.
- `information.test_information` is a function, not a test. If you import it into a test file, alias the import (see `tests/test_information.py`: `from hb_irt.information import test_information as calc_test_information`); otherwise pytest tries to collect and call it as a zero-argument test function and errors out.
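The `__test__ = False` opt-out looks like this in practice. The class below is a simplified stand-in with made-up fields, not the package's actual `TestModule` definition.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TestModule:
    """Domain type whose name starts with 'Test' but is not a test class."""

    # Unannotated class attribute: pytest reads it and skips collection,
    # and the dataclass machinery does not treat it as a field.
    __test__ = False

    module_id: str
    n_exposures: int = 0


module = TestModule(module_id="easy_1")
field_names = [f.name for f in fields(TestModule)]
```

Because `__test__` has no type annotation, it stays a plain class attribute and does not appear in the generated `__init__`.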
Where to make changes
- Adding a new IRT response model → new file in `models/`, implementing `ItemModel`; mirror test file in `tests/models/`.
- Adding a new Bayesian estimation technique → `bayes/estimation.py` or a new file in `bayes/`, reusing `quadrature.py`.
- Adding a new stopping/selection rule → `msat/stopping.py` / `msat/selection.py`; extend `StoppingConfig` / `StoppingDecision` rather than inventing a parallel structure.
- Anything touching score presentation/rescaling → `scoring.py` only; don't inline `50 + 10*theta` elsewhere.
Before finishing a task
Run:
uv run pytest # full suite + coverage gate
If coverage fails, the terminal output (term-missing) tells you exactly
which lines are uncovered — add a test for that branch rather than lowering
the threshold. Do not lower --cov-fail-under in pyproject.toml to make a
task pass.