
Bayesian updating and Item Response Theory (MCQ 3PL, graded/continuous QA) for candidate skill assessment

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.


hb-irt

Bayesian ability estimation and Item Response Theory (IRT) for candidate skill assessment. Implements the MCQ 3PL model, graded/continuous QA response models, sequential Bayesian updating across test modules, and MSAT (Multi-Stage Adaptive Testing) module selection and stopping rules.

This README is written for both human developers and AI coding agents working in this repository. If you are an agent picking up a task here, read the "Guidance for AI agents" section before making changes.

Source of truth for the underlying math: Candidate Skill Assessment Model Specification v2.pdf. Every non-obvious formula in this codebase cites a section/equation number from that spec in its docstring.


Scope

See CLAUDE.md for the authoritative scope statement. Summary:

In scope: Bayesian updating (EAP/MAP, sequential updating across modules), IRT for MCQs (3PL), IRT for QA scored 0-10 (GRM) or 0-100 (CRM), and everything those require to function (Fisher information, Bloom-level difficulty mapping, item calibration, sub-skill score rescaling, MSAT module selection/stopping).

Out of scope: DAG-based hierarchical skill aggregation (spec §6) and anything downstream of it. This package's output boundary is the sub-skill score — it never computes a skill-level (DAG-aggregated) score.


Requirements

  • Python >= 3.12
  • uv as the package manager — do not use bare pip/venv in this repo.

Install

uv sync

This creates .venv/ and installs numpy/scipy plus the pytest/pytest-cov/coverage dev group.

Test

uv run pytest
  • Coverage is enforced at 90%+ via --cov-fail-under=90 in pyproject.toml (current actual coverage is 100%). A term-missing report prints uncovered lines directly in the terminal; coverage.xml is also written.
  • Run a single file: uv run pytest tests/models/test_threepl.py
  • Run with verbose test names: uv run pytest -v

Add a dependency

uv add <package>          # runtime dependency
uv add --dev <package>    # dev-only dependency (linters, test tools, etc.)

Architecture

src/hb_irt/
  types.py            Item, Response, Posterior, TestModule, SubskillScore
  quadrature.py       Gauss-Hermite quadrature (EAP integration primitive)
  models/
    base.py           ItemModel interface: loglik(value, theta), info(theta)
    threepl.py        3PL model for MCQ items
    grm.py            Graded Response Model for 0-10 ordinal QA
    crm.py            Continuous Response Model for 0-100 QA
  bayes/
    estimation.py     EAP / MAP ability estimation
    sequential.py     Sequential Bayesian updating across modules
  information.py      Fisher test information, standard error of measurement
  calibration.py      MMLE/EM calibration of 3PL item parameters
  bloom.py            Bloom level (L1-L6) -> difficulty anchor + shrinkage
  scoring.py          0-100 rescaling, precision-weighted level aggregation
  msat/
    module_bank.py    TestModule repository, queryable by type/history
    selection.py      Information-maximizing module selection (EIG)
    stopping.py       Stopping rules (precision / max modules / saturation)

There is no re-exporting package __init__.py — import directly from the submodule that defines what you need, e.g. from hb_irt.models.threepl import ThreePLModel, not from hb_irt import ThreePLModel. This keeps import paths traceable to a single file and avoids a growing "god module."

Tests mirror the source layout 1:1 under tests/ (e.g. src/hb_irt/models/grm.py <-> tests/models/test_grm.py).

Module map: what maps to what in the spec

| Module | Spec section | Purpose |
|---|---|---|
| types.py | Table 5, Table 7, eq 1 | Core data model: item params, responses, posteriors, modules, sub-skill scores |
| quadrature.py | Appendix A.1 | Gauss-Hermite nodes/weights and posterior-moment computation used by every Bayesian update |
| models/threepl.py | §3.1, eq 4; §A.2 | 3PL probability + Fisher information for MCQ items |
| models/grm.py | not in spec (see docstring) | Samejima GRM for 0-10 graded QA responses |
| models/crm.py | not in spec (see docstring) | Samejima continuous response model for 0-100 QA responses |
| bayes/estimation.py | §4.1-4.2, eq 7 | EAP (quadrature) and MAP (optimization) ability estimates |
| bayes/sequential.py | §4.4, eq 10 | Chains module posteriors: posterior_{t-1} becomes prior_t |
| information.py | eq 2, A.2 | Additive test information, SEM = 1/sqrt(I(theta)) |
| calibration.py | §3.2, eq 5 | MMLE via EM (Bock & Aitkin) to fit 3PL item parameters from response data |
| bloom.py | §3.3, Table 6, eq 6 | Bloom level difficulty anchors + empirical-Bayes shrinkage |
| scoring.py | §5.1-5.3, eq 11-15 | 0-100 rescaling (50 + 10*theta), margin of error, per-level precision-weighted aggregation |
| msat/module_bank.py | §2.1-2.2, Table 1-2 | Module repository, target ability ranges by type |
| msat/selection.py | §2.3-2.4, eq 2-3 | Information-maximizing module selection with exposure control |
| msat/stopping.py | §2.5, Table 3 | Precision / max-modules / min-items / saturation stopping rules |

Core concepts

  • Ability scale: candidate ability (theta, θ) is represented on the logit scale internally everywhere (typically in [-4, 4]). It is rescaled to a 0-100 "score" only at the reporting boundary, via scoring.rescale_0_100 (score = 50 + 10*theta, margin = 19.6 * SE). Never do this rescaling inline elsewhere — always call into scoring.py.
  • Posterior: types.Posterior(mu, variance) is the universal representation of a belief about theta. It exposes .sem (standard error) and .credible_interval(level).
  • ItemModel interface: every item model (3PL, GRM, CRM) implements two methods only — loglik(value, theta) and info(theta). This is deliberately minimal: it's exactly what Bayesian estimation (bayes/estimation.py) and MSAT selection (msat/selection.py) need, and nothing else. value means different things per model (0/1 for 3PL, an integer category for GRM, a raw 0-100 score for CRM) — the interface doesn't care, since it never inspects value itself.
  • Gauss-Hermite quadrature: quadrature.quadrature_grid(mu, sigma, n_points) returns (theta_nodes, weights) approximating N(mu, sigma^2). quadrature.posterior_moments(log_likelihood, theta, prior_weight) combines a log-likelihood evaluated at those nodes with the prior weights to get a normalized posterior mean/variance. This pair of functions is the single numerical engine behind eap_estimate, expected_information_gain, and (indirectly) calibrate_3pl. If you need EAP-style integration anywhere else in this codebase, reuse these — don't reimplement quadrature.
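To make the quadrature bullet concrete, here is a minimal sketch of how a Gauss-Hermite grid for N(mu, sigma^2) and the posterior-moment step could look. It reuses the function names from the bullet for readability, but the bodies are an illustration built on NumPy's hermgauss, not the contents of quadrature.py; the argument handling and return types are assumptions.

```python
import numpy as np


def quadrature_grid(mu, sigma, n_points=21):
    """Nodes and weights approximating N(mu, sigma^2) (illustrative sketch)."""
    # hermgauss integrates against exp(-x^2); a change of variables
    # theta = mu + sqrt(2) * sigma * x turns that into a normal density.
    x, w = np.polynomial.hermite.hermgauss(n_points)
    theta = mu + np.sqrt(2.0) * sigma * x
    weights = w / np.sqrt(np.pi)  # rescaled so the weights sum to 1
    return theta, weights


def posterior_moments(log_likelihood, theta, prior_weight):
    """Normalized posterior mean and variance on the quadrature nodes."""
    # Subtract the maximum before exponentiating for numerical stability.
    unnorm = prior_weight * np.exp(log_likelihood - np.max(log_likelihood))
    post = unnorm / unnorm.sum()
    mean = float(np.sum(post * theta))
    var = float(np.sum(post * (theta - mean) ** 2))
    return mean, var
```

With a flat log-likelihood (all zeros) the posterior moments reduce to the prior's own mean and variance, which is a convenient sanity check for any change to this kind of code.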

Usage examples

MCQ item (3PL) — probability, log-likelihood, information

from hb_irt.types import Item
from hb_irt.models.threepl import ThreePLModel

item = ThreePLModel(Item(item_id="q1", a=1.2, b=0.3, c=0.2))
item.probability(theta=0.3)   # -> 0.6  (== (1+c)/2 at theta == b, per Table 5)
item.loglik(value=1, theta=0.3)
item.info(theta=0.3)          # Fisher information at theta=0.3
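For readers who want to see the arithmetic behind these calls, the following standalone sketch writes out the standard 3PL response function and its Fisher information. It assumes the plain logistic metric (no D = 1.7 scaling constant); the function names are hypothetical and this is not the code in threepl.py.

```python
import math


def three_pl_probability(theta, a, b, c):
    """P(correct) = c + (1 - c) * logistic(a * (theta - b))."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


def three_pl_loglik(value, theta, a, b, c):
    """Log-likelihood of a binary response (1 = correct, 0 = incorrect)."""
    p = three_pl_probability(theta, a, b, c)
    return math.log(p) if value == 1 else math.log(1.0 - p)


def three_pl_info(theta, a, b, c):
    """Fisher information: a^2 * ((1 - P) / P) * ((P - c) / (1 - c))^2."""
    p = three_pl_probability(theta, a, b, c)
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2
```

At theta == b the logistic term is 0.5, so the probability is c + (1 - c)/2 = (1 + c)/2, which gives 0.6 for c = 0.2 as in the example above.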

Graded QA response (0-10 scale) — GRM

from hb_irt.models.grm import GRMItem, GRMModel

# 10 boundaries -> 11 ordered categories (raw scores 0..10)
item = GRMModel(GRMItem(item_id="qa1", a=1.0, boundaries=(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7)))
item.category_probabilities(theta=0.5)   # -> array of 11 probabilities, sums to 1
item.loglik(value=7, theta=0.5)          # value is the observed 0-10 category
item.info(theta=0.5)
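The category probabilities of a Samejima-style graded model are differences of adjacent cumulative curves. The sketch below shows that construction in isolation, assuming logistic cumulative curves with a shared discrimination a; the helper name is hypothetical and grm.py may organize this differently.

```python
import math


def grm_category_probabilities(theta, a, boundaries):
    """Probabilities of categories 0..K given K ordered boundaries."""
    # Cumulative curves P(X >= k) for k = 1..K, padded with
    # P(X >= 0) = 1 on the left and P(X >= K + 1) = 0 on the right.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in boundaries]
    cumulative.append(0.0)
    # P(X = k) = P(X >= k) - P(X >= k + 1)
    return [cumulative[k] - cumulative[k + 1] for k in range(len(boundaries) + 1)]


probs = grm_category_probabilities(
    theta=0.5, a=1.0, boundaries=(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7)
)
```

Because the sum telescopes, the probabilities always add up to 1, and strictly increasing boundaries keep every category probability positive.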

Continuous QA response (0-100 scale) — CRM

from hb_irt.models.crm import CRMItem, CRMModel

item = CRMModel(CRMItem(item_id="qa2", a=1.0, b=0.0, max_score=100.0))
item.loglik(value=72.0, theta=0.4)
item.info(theta=0.4)   # constant == a^2 for this model (see crm.py docstring)
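One way to see why the information can be constant is a simplified normal form of the continuous model: if the logit-transformed score z is assumed to follow N(theta - b, 1/a^2), the Fisher information about theta is a^2 everywhere. The sketch below encodes only that assumption; the exact parameterization in crm.py may differ, and the names and the eps clamp are hypothetical.

```python
import math


def crm_loglik(value, theta, a, b, max_score=100.0, eps=1e-6):
    """Log-density of the logit-transformed score under z ~ N(theta - b, 1/a^2)."""
    # Clamp the proportion away from 0 and 1 so the logit stays finite.
    x = min(max(value / max_score, eps), 1.0 - eps)
    z = math.log(x / (1.0 - x))
    resid = a * (z - (theta - b))
    # The Jacobian of the raw-score transform is omitted: it does not depend
    # on theta, so it affects neither estimation nor information.
    return math.log(a) - 0.5 * math.log(2.0 * math.pi) - 0.5 * resid**2


def crm_info(theta, a):
    """Negative second derivative of the log-density in theta: a^2 for all theta."""
    return a**2
```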

EAP / MAP ability estimation from mixed item types

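from hb_irt.types import Item
from hb_irt.models.threepl import ThreePLModel
from hb_irt.models.grm import GRMItem, GRMModel
from hb_irt.models.crm import CRMItem, CRMModel
from hb_irt.bayes.estimation import eap_estimate, map_estimate

# Item models constructed as in the sections above.
mcq_model = ThreePLModel(Item(item_id="q1", a=1.2, b=0.3, c=0.2))
grm_model = GRMModel(GRMItem(item_id="qa1", a=1.0, boundaries=(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7)))
crm_model = CRMModel(CRMItem(item_id="qa2", a=1.0, b=0.0, max_score=100.0))

# Any mix of ThreePLModel / GRMModel / CRMModel instances works; each just
# contributes a scalar loglik(value, theta).
responses = [(mcq_model, 1), (grm_model, 8), (crm_model, 80.0)]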

posterior = eap_estimate(responses, prior_mu=0.0, prior_sigma=1.0)
# posterior.mu, posterior.variance, posterior.sem, posterior.credible_interval(0.95)

theta_map = map_estimate(responses, prior_mu=0.0, prior_sigma=1.0)
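As a rough picture of what a MAP estimate involves, the sketch below maximizes the log-posterior (normal log-prior plus summed item log-likelihoods) with a golden-section search. This is a stand-in optimizer chosen to keep the example dependency-free; the optimizer actually used by map_estimate is not stated here. The TwoPL class, the search bounds, and the function names are hypothetical, and the search assumes a single-peaked posterior, which 3PL items do not strictly guarantee.

```python
import math


class TwoPL:
    """Toy item model exposing the same loglik(value, theta) shape."""

    def __init__(self, a, b):
        self.a, self.b = a, b

    def loglik(self, value, theta):
        p = 1.0 / (1.0 + math.exp(-self.a * (theta - self.b)))
        return math.log(p) if value == 1 else math.log(1.0 - p)


def log_posterior(theta, responses, prior_mu, prior_sigma):
    """Unnormalized log-posterior: normal log-prior + sum of log-likelihoods."""
    log_prior = -0.5 * ((theta - prior_mu) / prior_sigma) ** 2
    return log_prior + sum(model.loglik(value, theta) for model, value in responses)


def map_sketch(responses, prior_mu=0.0, prior_sigma=1.0, lo=-4.0, hi=4.0, tol=1e-6):
    """Golden-section search for the posterior mode on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        left = hi - inv_phi * (hi - lo)
        right = lo + inv_phi * (hi - lo)
        f_left = log_posterior(left, responses, prior_mu, prior_sigma)
        f_right = log_posterior(right, responses, prior_mu, prior_sigma)
        if f_left > f_right:
            hi = right  # the maximum lies in [lo, right]
        else:
            lo = left   # the maximum lies in [left, hi]
    return 0.5 * (lo + hi)
```

With no responses the mode is the prior mean; a single correct answer on an item centered at the prior mean pulls the mode upward, and an incorrect answer pulls it downward.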

Sequential updating across test modules

from hb_irt.types import Posterior
from hb_irt.bayes.sequential import sequential_update, sequential_update_all

prior = Posterior(mu=0.0, variance=1.0)
module_1_posterior = sequential_update(prior, module_1_responses)
module_2_posterior = sequential_update(module_1_posterior, module_2_responses)

# or in one call, given an ordered list of modules' responses:
history = sequential_update_all(prior, [module_1_responses, module_2_responses])

Posterior variance is guaranteed to be non-increasing across calls (assuming non-degenerate item information) — this is asserted by tests/bayes/test_sequential.py.
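The chaining idea can be sketched without the package: summarize each module's posterior by its mean and variance, then use that normal summary as the next module's prior. The example below uses a plain uniform grid instead of Gauss-Hermite quadrature to stay dependency-free, and toy 2PL items; all names and grid settings are hypothetical.

```python
import math


def two_pl_loglik(a, b):
    """Return a loglik(value, theta) callable for a toy 2PL item."""
    def loglik(value, theta):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        return math.log(p) if value == 1 else math.log(1.0 - p)
    return loglik


def grid_update(prior_mu, prior_var, responses, n_points=401, half_width=6.0):
    """Posterior mean/variance for a normal prior, on a uniform grid."""
    sd = math.sqrt(prior_var)
    thetas = [prior_mu + sd * half_width * (2.0 * i / (n_points - 1) - 1.0)
              for i in range(n_points)]
    log_post = [-0.5 * ((t - prior_mu) / sd) ** 2
                + sum(ll(v, t) for ll, v in responses)
                for t in thetas]
    peak = max(log_post)  # subtract the peak for numerical stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, thetas)) / total
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, thetas)) / total
    return mean, var


def chain_modules(prior_mu, prior_var, modules):
    """Posterior of module t-1 becomes the prior of module t."""
    history = []
    mu, var = prior_mu, prior_var
    for module_responses in modules:
        mu, var = grid_update(mu, var, module_responses)
        history.append((mu, var))
    return history


item = two_pl_loglik(a=1.2, b=0.0)
history = chain_modules(0.0, 1.0, [[(item, 1), (item, 0)], [(item, 1)]])
```

In this toy run the variance shrinks after each module and the mean moves up after the second module's correct answer.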

Item calibration (MMLE/EM)

import numpy as np
from hb_irt.calibration import calibrate_3pl

responses = np.array(...)  # shape (n_examinees, n_items), binary 0/1
result = calibrate_3pl(responses, item_ids=["q1", "q2", "q3"], fixed_c=0.2)
result.items          # tuple[Item, ...] with fitted a, b, (fixed) c
result.converged       # bool
result.n_iterations    # int

Use fixed_c=<value> when calibrating with fewer than ~500 responses per item (spec §3.2); otherwise omit it to freely estimate guessing per item.
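For orientation, the E-step of a Bock & Aitkin style EM can be sketched as follows: for each examinee, compute posterior weights over the quadrature nodes, then accumulate the expected number of examinees at each node and the expected number of correct answers per item and node. The M-step (refitting a, b and optionally c from these counts) is omitted. The function name, argument layout, and the simple normal-weighted grid in the usage are assumptions, not the internals of calibrate_3pl.

```python
import math


def e_step(responses, items, nodes, weights):
    """Expected counts for one EM iteration.

    responses: rows of 0/1 per examinee; items: (a, b, c) tuples.
    Returns (n_k, r_jk): expected examinees per node, and expected
    correct responses per item and node.
    """
    n_nodes = len(nodes)
    n_k = [0.0] * n_nodes
    r_jk = [[0.0] * n_nodes for _ in items]
    for row in responses:
        # Prior weight times the likelihood of this response pattern per node.
        joint = []
        for theta, w in zip(nodes, weights):
            like = w
            for x, (a, b, c) in zip(row, items):
                p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
                like *= p if x == 1 else 1.0 - p
            joint.append(like)
        total = sum(joint)
        for k in range(n_nodes):
            post = joint[k] / total  # posterior weight of node k
            n_k[k] += post
            for j, x in enumerate(row):
                if x == 1:
                    r_jk[j][k] += post
    return n_k, r_jk


nodes = [-2.0, -1.0, 0.0, 1.0, 2.0]
raw = [math.exp(-0.5 * t * t) for t in nodes]
weights = [r / sum(raw) for r in raw]
items = [(1.0, -0.5, 0.2), (1.5, 0.5, 0.2)]
data = [[1, 1], [1, 0], [0, 0], [1, 1]]
n_k, r_jk = e_step(data, items, nodes, weights)
```

Two invariants are useful when testing this kind of code: the node counts sum to the number of examinees, and each item's expected-correct counts sum to the observed number of correct answers for that item.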

Bloom level difficulty mapping + shrinkage

from hb_irt.bloom import difficulty_anchor, shrink_difficulty

difficulty_anchor("L4")  # -> 1.2  (Analysis, Table 6)

# Pull a noisy raw calibration estimate toward its Bloom-level anchor,
# precision-weighted by how confident each estimate is (spec eq 6).
shrunk_b = shrink_difficulty(raw_difficulty=1.9, raw_variance=0.3, level="L4", sigma_b=0.4)
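A precision-weighted shrinkage of this kind is commonly written as a weighted average of the raw estimate and the anchor, with weights 1/raw_variance and 1/sigma_b^2. The sketch below assumes that form; whether spec eq 6 matches it term for term is not confirmed here, and the function name is hypothetical. The anchor value 1.2 is the L4 anchor quoted above.

```python
def shrink_toward_anchor(raw_difficulty, raw_variance, anchor, sigma_b):
    """Precision-weighted average of a raw estimate and its level anchor."""
    raw_precision = 1.0 / raw_variance      # confidence in the calibration
    anchor_precision = 1.0 / sigma_b**2     # confidence in the Bloom anchor
    return (
        raw_precision * raw_difficulty + anchor_precision * anchor
    ) / (raw_precision + anchor_precision)


shrunk = shrink_toward_anchor(raw_difficulty=1.9, raw_variance=0.3, anchor=1.2, sigma_b=0.4)
```

Under this assumed formula the noisy estimate 1.9 is pulled most of the way toward the anchor (to roughly 1.44), because the anchor's precision (6.25) exceeds the raw estimate's (about 3.33).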

Score rescaling and level aggregation

from hb_irt.types import Posterior
from hb_irt.scoring import aggregate_levels, rescale_0_100, build_subskill_score

# Combine per-Bloom-level EAP estimates into one sub-skill posterior (eq 14-15)
level_posterior = aggregate_levels(
    level_thetas={"L1": 0.4, "L2": 0.6, "L3": 0.5},
    level_variances={"L1": 0.05, "L2": 0.08, "L3": 0.06},
)

score, margin, ci_lower, ci_upper = rescale_0_100(level_posterior)

subskill_score = build_subskill_score(
    subskill_id="sk_python_debugging",
    posterior=level_posterior,
    items_administered=42,
    modules_completed=4,
    level_thetas={"L1": 0.4, "L2": 0.6, "L3": 0.5},
)
# subskill_score.score_0_100, .margin_error_95, .ci_lower_95, .ci_upper_95, ...
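The arithmetic behind these calls can be sketched in a few lines. The rescaling formulas (score = 50 + 10*theta, margin = 19.6 * SE, i.e. 1.96 standard errors on the 10x scale) come from the Core concepts section; the inverse-variance form of the level aggregation is an assumption about what "precision-weighted" means in eq 14-15. Function names are hypothetical, and any clamping to [0, 100] that scoring.py may apply is not modeled.

```python
import math


def precision_weighted_aggregate(level_thetas, level_variances):
    """Inverse-variance weighted mean and the variance of that mean."""
    precisions = {lvl: 1.0 / level_variances[lvl] for lvl in level_thetas}
    total_precision = sum(precisions.values())
    theta = sum(precisions[lvl] * level_thetas[lvl] for lvl in level_thetas) / total_precision
    return theta, 1.0 / total_precision


def rescale(theta, variance):
    """Map a logit-scale estimate to (score, margin, ci_lower, ci_upper)."""
    score = 50.0 + 10.0 * theta
    margin = 19.6 * math.sqrt(variance)  # 1.96 * 10 * SE
    return score, margin, score - margin, score + margin


theta, variance = precision_weighted_aggregate(
    {"L1": 0.4, "L2": 0.6, "L3": 0.5},
    {"L1": 0.05, "L2": 0.08, "L3": 0.06},
)
score, margin, ci_lower, ci_upper = rescale(theta, variance)
```

Under these assumptions the combined theta lands between the level estimates (about 0.485, a score near 54.8), and the combined variance is smaller than any single level's variance.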

MSAT: module bank, selection, stopping

from hb_irt.types import Posterior, TestModule
from hb_irt.msat.module_bank import ModuleBank
from hb_irt.msat.selection import select_next_module
from hb_irt.msat.stopping import StoppingConfig, evaluate_stopping

bank = ModuleBank(modules=(easy_module, medium_module, hard_module, challenge_module))

current_posterior = Posterior(mu=0.2, variance=0.6)
administered = ["easy_1"]

next_module = select_next_module(bank, current_posterior, administered_ids=administered)

decision = evaluate_stopping(
    posterior=current_posterior,
    previous_posterior=prior_posterior,   # or None on the first module
    n_modules=2,
    n_items=15,
    config=StoppingConfig(),  # sigma_min=0.3, max_modules=8, min_items=20, delta_saturation=0.01
)
if decision.should_stop:
    print("stopping:", decision.reasons)  # e.g. ("precision_threshold",)

ModuleBank.available(administered_ids) implements the spec's B \ H set (available minus already-administered). select_next_module picks the module maximizing S(Q) = I_Q(mu_curr) + alpha * exp(-N_Q / beta) (spec Algorithm §2.3), where N_Q is TestModule.n_exposures (exposure control).
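The selection score above is simple enough to sketch directly. The example below filters out administered modules (the B \ H step) and picks the maximizer of S(Q). The ModuleSketch dataclass, the precomputed information_at_mu field, and the alpha/beta defaults are all hypothetical; the real TestModule and select_next_module signatures may differ.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleSketch:
    module_id: str
    information_at_mu: float  # I_Q(mu_curr), assumed precomputed
    n_exposures: int          # N_Q, used for exposure control


def selection_score(module, alpha=0.5, beta=100.0):
    """S(Q) = I_Q(mu_curr) + alpha * exp(-N_Q / beta)."""
    return module.information_at_mu + alpha * math.exp(-module.n_exposures / beta)


def select_next(modules, administered_ids, alpha=0.5, beta=100.0):
    """Pick the highest-scoring module not yet administered, or None."""
    available = [m for m in modules if m.module_id not in administered_ids]
    if not available:
        return None
    return max(available, key=lambda m: selection_score(m, alpha, beta))


bank = [
    ModuleSketch("easy_1", information_at_mu=3.0, n_exposures=10),
    ModuleSketch("medium_1", information_at_mu=2.0, n_exposures=500),
    ModuleSketch("medium_2", information_at_mu=2.0, n_exposures=5),
]
choice = select_next(bank, administered_ids=["easy_1"])
```

Here the two medium modules are equally informative, so the exposure bonus decides in favor of the rarely used one.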


Conventions

  • Frozen dataclasses everywhere for value types (Item, Response, Posterior, TestModule, SubskillScore, GRMItem, CRMItem). Don't make these mutable — callers should construct a new instance rather than mutate.
  • Validation happens in __post_init__. Raise ValueError with a message that includes the invalid value, e.g. f"discrimination a must be > 0, got {self.a}". Follow this pattern for new item/data types.
  • 21-40 point Gauss-Hermite quadrature (spec Appendix A.1) for all EAP-style integration; quadrature.DEFAULT_N_POINTS = 21 is the default everywhere and should stay the default unless a caller has a specific reason to change it.
  • No global re-exports. Each new module should be imported by its own path. Do not add symbols to hb_irt/__init__.py beyond the version string.
  • Docstrings cite the spec. Any formula-bearing function's docstring should reference the section/equation number it implements, e.g. (spec eq 14-15). If the spec doesn't define the formula directly, say "not part of the spec" explicitly and cite the model used. This is the case for the graded/continuous QA models, because the spec covers only binary MCQ items; see models/grm.py and models/crm.py for the pattern.
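The first two conventions combine naturally. The sketch below shows a frozen dataclass that validates in __post_init__ and is "changed" by constructing a new instance. ExampleItem and its fields are hypothetical; only the error-message pattern is taken from the convention above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExampleItem:
    item_id: str
    a: float  # discrimination
    b: float  # difficulty

    def __post_init__(self):
        # Include the offending value in the message, per the convention.
        if self.a <= 0:
            raise ValueError(f"discrimination a must be > 0, got {self.a}")


item = ExampleItem(item_id="q1", a=1.2, b=0.3)
# Frozen instances cannot be mutated; build a modified copy instead.
recalibrated = replace(item, b=0.5)
```

dataclasses.replace re-runs __post_init__, so a copy is validated the same way as a fresh instance.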

Guidance for AI agents

If you are an agent extending this repository, read this section fully before writing code.

Ground rules

  1. Respect the scope boundary. Anything that requires a skill-level score (DAG propagation, weighted sub-skill→skill contribution, non-compensatory gates, skill-level CIs — spec §6) is explicitly out of scope. If a task description asks you to compute or import anything downstream of a SubskillScore, stop and flag it rather than implementing it — it belongs in a different package. See CLAUDE.md.
  2. Every formula needs a citation. Before implementing new math, find the corresponding section/equation in the spec PDF (or, if it's a QA-specific extension the spec doesn't cover, use a citable IRT technique — e.g. Samejima's GRM/CRM — and say so explicitly in the docstring, the way models/grm.py and models/crm.py do). Do not invent ad hoc formulas without documenting the reasoning.
  3. Maintain 90-100% test coverage. pyproject.toml enforces --cov-fail-under=90; the project currently sits at 100%. Any new module needs tests covering: the happy path, at least one boundary/edge case (extreme theta, boundary parameter values like c near 0 or 1, category index 0 and the max category, empty/degenerate inputs), and every raise ValueError/raise KeyError branch. Run uv run pytest before declaring a task done — it will fail the whole suite if coverage drops below 90%.
  4. Don't reimplement the quadrature primitive. If new code needs EAP-style integration over a normal posterior, call quadrature.quadrature_grid/quadrature.posterior_moments, don't hand-roll another Gauss-Hermite implementation.
  5. Keep the ItemModel interface minimal. New item/response models (should another response type ever be needed) must implement only loglik(value, theta) and info(theta) to compose with bayes/estimation.py and msat/selection.py unmodified. Don't widen the interface unless every existing model can implement the new method too.
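Rule 5 can be illustrated with a structural interface. The sketch below expresses the two-method contract as a typing.Protocol and shows a toy model satisfying it. Whether models/base.py uses a Protocol, an abstract base class, or something else is not stated in this README, and RaschSketch and total_information are hypothetical names.

```python
import math
from typing import Protocol, runtime_checkable


@runtime_checkable
class ItemModel(Protocol):
    def loglik(self, value, theta: float) -> float: ...
    def info(self, theta: float) -> float: ...


class RaschSketch:
    """Toy one-parameter model that satisfies the two-method contract."""

    def __init__(self, b: float):
        self.b = b

    def _p(self, theta: float) -> float:
        return 1.0 / (1.0 + math.exp(-(theta - self.b)))

    def loglik(self, value, theta: float) -> float:
        p = self._p(theta)
        return math.log(p) if value == 1 else math.log(1.0 - p)

    def info(self, theta: float) -> float:
        p = self._p(theta)
        return p * (1.0 - p)


def total_information(models, theta: float) -> float:
    """Consumers only ever call info(theta), so any conforming model composes."""
    return sum(m.info(theta) for m in models)
```

Because consumers depend only on loglik and info, a new model of this shape can be mixed with existing ones without touching the estimation or selection code.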

A known pytest gotcha in this repo

Two domain names in this codebase collide with pytest's default collection prefixes (Test for classes, test for functions):

  • types.TestModule is a dataclass, not a test class. It sets __test__ = False in its class body so pytest skips it. If you add another class whose domain name happens to start with Test, do the same.
  • information.test_information is a function, not a test. If you import it into a test file, alias the import (see tests/test_information.py: from hb_irt.information import test_information as calc_test_information). Otherwise pytest collects the imported name as a test function, treats its parameters as fixture requests, and errors out.
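The class-level opt-out looks like this in practice. The sketch assumes a frozen dataclass with illustrative fields (module_id is hypothetical; n_exposures is mentioned in the MSAT section); the real TestModule has its own field set.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TestModule:
    # No type annotation, so the dataclass machinery treats this as a plain
    # class attribute rather than a field; pytest reads it during collection
    # and skips the class.
    __test__ = False

    module_id: str
    n_exposures: int = 0


field_names = [f.name for f in fields(TestModule)]
```

Keeping __test__ un-annotated matters: an annotated version would become a dataclass field and show up in the constructor.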

Where to make changes

  • Adding a new IRT response model → new file in models/, implementing ItemModel; mirror test file in tests/models/.
  • Adding a new Bayesian estimation technique → bayes/estimation.py or a new file in bayes/, reusing quadrature.py.
  • Adding a new stopping/selection rule → msat/stopping.py / msat/selection.py; extend StoppingConfig/StoppingDecision rather than inventing a parallel structure.
  • Anything touching score presentation/rescaling → scoring.py only; don't inline 50 + 10*theta elsewhere.

Before finishing a task

Run:

uv run pytest              # full suite + coverage gate

If coverage fails, the terminal output (term-missing) tells you exactly which lines are uncovered — add a test for that branch rather than lowering the threshold. Do not lower --cov-fail-under in pyproject.toml to make a task pass.

