
Bayesian updating and Item Response Theory (MCQ 3PL, graded/continuous QA) for candidate skill assessment

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.


hb-irt

Bayesian ability estimation and Item Response Theory (IRT) for skill assessment


hb-irt scores candidate responses — multiple-choice questions and open-ended questions graded on a 0-10 or 0-100 scale — into a single latent ability estimate with a calibrated uncertainty range, and supports adaptive, multi-stage testing on top of it.


Features

Item models          3PL (multiple-choice), GRM (0-10 graded), CRM (0-100 continuous)
Ability estimation   EAP (Gauss-Hermite quadrature) and MAP, with posterior variance and 95% credible intervals
Sequential updating  Each test stage's posterior becomes the next stage's prior, with no re-scoring from scratch
Item calibration     Marginal Maximum Likelihood (MMLE/EM) fit from raw response data
Score aggregation    Precision-weighted combination of per-topic estimates into a 0-100 score ± margin of error
Adaptive testing     Information-maximizing module selection with exposure control, and configurable stopping rules

Responses from any mix of item types combine into a single ability estimate:

Response type              Model   value represents
Multiple-choice            3PL     0 / 1
Open-ended, scored 0-10    GRM     integer category 0-10
Open-ended, scored 0-100   CRM     continuous score 0-100

Installation

pip install hb-irt

or, with uv:

uv add hb-irt

Requires Python 3.12+. Depends on numpy and scipy.

Quickstart

from hb_irt.types import Item
from hb_irt.models.threepl import ThreePLModel
from hb_irt.bayes.estimation import eap_estimate
from hb_irt.bayes.sequential import sequential_update
from hb_irt.scoring import build_subskill_score

# Define a small item bank (discrimination, difficulty, guessing)
items = [Item(item_id=f"q{i}", a=1.2, b=b, c=0.2)
         for i, b in enumerate([-1.0, -0.3, 0.4, 1.0, 1.6])]
models = [ThreePLModel(item) for item in items]

# Stage 1: estimate ability from a prior N(0, 1)
posterior = eap_estimate(list(zip(models, [1, 1, 0, 1, 0])), prior_mu=0.0, prior_sigma=1.0)

# Stage 2: the previous posterior becomes the new prior
posterior = sequential_update(posterior, list(zip(models, [1, 1, 1, 0, 1])))

# Rescale to a 0-100 score with a 95% margin of error
score = build_subskill_score(
    "python_debugging", posterior, items_administered=10, modules_completed=2
)
print(f"{score.score_0_100:.1f} ± {score.margin_error_95:.1f}")

Concepts

  • Ability (θ) is represented on a logit scale internally, typically in the range [-4, 4]. Use hb_irt.scoring.rescale_0_100 to convert a posterior to a 0-100 score with a margin of error whenever you need to display it.
  • Posterior(mu, variance) is the shared representation of a belief about a candidate's ability throughout the library. It exposes .sem (standard error) and .credible_interval(level).
  • Item models (3PL, GRM, CRM) share a minimal common interface: loglik(value, theta) and info(theta) — which is what lets responses of every type combine into a single ability estimate.
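
To make the Posterior(mu, variance) idea concrete, here is a minimal stand-alone sketch. It is not hb-irt's own class: the name PosteriorSketch is hypothetical, and the interval assumes the posterior is summarized by a normal distribution, which the library may or may not do internally.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass(frozen=True)
class PosteriorSketch:
    """Hypothetical stand-in for a (mu, variance) belief about ability."""

    mu: float
    variance: float

    @property
    def sem(self) -> float:
        # The standard error is the posterior standard deviation.
        return sqrt(self.variance)

    def credible_interval(self, level: float = 0.95) -> tuple[float, float]:
        # Central interval under a normal approximation (an assumption here).
        z = NormalDist().inv_cdf(0.5 + level / 2.0)
        return (self.mu - z * self.sem, self.mu + z * self.sem)


belief = PosteriorSketch(mu=0.5, variance=0.25)
low, high = belief.credible_interval(0.95)
```

With mu=0.5 and variance=0.25 the standard error is 0.5, so the 95% interval spans roughly 0.5 ± 0.98 on the logit scale.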

Usage guide

Multiple-choice items (3PL)

from hb_irt.types import Item
from hb_irt.models.threepl import ThreePLModel

item = ThreePLModel(Item(item_id="q1", a=1.2, b=0.3, c=0.2))
item.probability(theta=0.3)   # probability of a correct response at theta=0.3
item.loglik(value=1, theta=0.3)
item.info(theta=0.3)          # Fisher information at theta=0.3
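
For reference, the textbook 3PL formulas behind probability and info can be written in a few lines of plain Python. This is a sketch of the standard model, not hb-irt's source: the function names are hypothetical, and it assumes a logistic link with no extra scaling constant (some implementations multiply the slope by 1.7).

```python
from math import exp


def three_pl_probability(theta: float, a: float, b: float, c: float) -> float:
    """P(correct | theta) = c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))


def three_pl_information(theta: float, a: float, b: float, c: float) -> float:
    """Fisher information: a^2 * (Q / P) * ((P - c) / (1 - c))^2, with Q = 1 - P."""
    p = three_pl_probability(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


# Same parameters as the item above, evaluated where theta equals the difficulty.
p_at_b = three_pl_probability(theta=0.3, a=1.2, b=0.3, c=0.2)
info_at_b = three_pl_information(theta=0.3, a=1.2, b=0.3, c=0.2)
```

When theta equals the difficulty b, the probability sits halfway between the guessing floor and 1, which is 0.6 here, and the information works out to 0.24.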

Open-ended answers scored 0-10 (Graded Response Model)

from hb_irt.models.grm import GRMItem, GRMModel

# 10 boundaries define 11 ordered categories (scores 0..10)
item = GRMModel(GRMItem(item_id="qa1", a=1.0, boundaries=(-2, -1, 0, 1, 2, 3, 4, 5, 6, 7)))
item.category_probabilities(theta=0.5)   # probability of each of the 11 scores
item.loglik(value=7, theta=0.5)          # value is the observed 0-10 score
item.info(theta=0.5)
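
The way 10 boundaries produce 11 category probabilities can be sketched with the standard graded response model: each boundary defines a cumulative curve P(score >= k), and a category's probability is the difference between adjacent curves. The function name below is hypothetical, and hb-irt's parameterization may differ in detail.

```python
from math import exp


def grm_category_probabilities(theta, a, boundaries):
    """Return P(score = k) for k = 0..len(boundaries) under a logistic GRM."""
    # Cumulative curves P(score >= k) for k = 1..K, one per boundary.
    cumulative = [1.0 / (1.0 + exp(-a * (theta - b))) for b in boundaries]
    # Pad with P(score >= 0) = 1 and P(score >= K + 1) = 0.
    padded = [1.0] + cumulative + [0.0]
    # Each category's probability is the drop between adjacent curves.
    return [padded[k] - padded[k + 1] for k in range(len(padded) - 1)]


probs = grm_category_probabilities(0.5, 1.0, (-2, -1, 0, 1, 2, 3, 4, 5, 6, 7))
most_likely_score = max(range(len(probs)), key=probs.__getitem__)
```

The boundaries must be strictly increasing for every category probability to stay positive. With the boundaries above and theta=0.5, the most likely score is 3, the category whose two boundaries (0 and 1) bracket theta.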

Open-ended answers scored 0-100 (Continuous Response Model)

from hb_irt.models.crm import CRMItem, CRMModel

item = CRMModel(CRMItem(item_id="qa2", a=1.0, b=0.0, max_score=100.0))
item.loglik(value=72.0, theta=0.4)
item.info(theta=0.4)

Ability estimation (EAP / MAP) across mixed item types

from hb_irt.bayes.estimation import eap_estimate, map_estimate

# Any mix of item models works, since each just contributes a scalar
# loglik(value, theta): MCQ, graded, and continuous responses combine freely.
# mcq_model, grm_model, and crm_model are a ThreePLModel, GRMModel, and
# CRMModel constructed as in the sections above.
responses = [(mcq_model, 1), (grm_model, 8), (crm_model, 80.0)]

posterior = eap_estimate(responses, prior_mu=0.0, prior_sigma=1.0)
# posterior.mu, posterior.variance, posterior.sem, posterior.credible_interval(0.95)

theta_map = map_estimate(responses, prior_mu=0.0, prior_sigma=1.0)
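
The idea behind EAP is to average theta over the posterior, which is the prior times the likelihood of all responses. The sketch below is an illustration only: it uses a plain uniform grid rather than the Gauss-Hermite quadrature hb-irt lists among its features, and eap_on_grid, loglik_3pl, and make_item are hypothetical names.

```python
from math import exp, log


def loglik_3pl(value, theta, a, b, c):
    # Log-likelihood of one binary response under the 3PL model.
    p = c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))
    return log(p) if value == 1 else log(1.0 - p)


def eap_on_grid(responses, prior_mu=0.0, prior_sigma=1.0, n_points=241, half_width=6.0):
    """responses: list of (loglik_fn, value). Returns (posterior mean, variance)."""
    step = 2.0 * half_width * prior_sigma / (n_points - 1)
    grid = [prior_mu - half_width * prior_sigma + i * step for i in range(n_points)]
    log_post = []
    for theta in grid:
        log_prior = -0.5 * ((theta - prior_mu) / prior_sigma) ** 2
        log_post.append(log_prior + sum(fn(value, theta) for fn, value in responses))
    peak = max(log_post)  # subtract the peak before exponentiating, for stability
    weights = [exp(lp - peak) for lp in log_post]
    total = sum(weights)
    mean = sum(w * t for w, t in zip(weights, grid)) / total
    variance = sum(w * (t - mean) ** 2 for w, t in zip(weights, grid)) / total
    return mean, variance


def make_item(a, b, c):
    # Bind item parameters so the result has the shape loglik(value, theta).
    return lambda value, theta: loglik_3pl(value, theta, a, b, c)


bank = [make_item(1.2, b, 0.2) for b in (-1.0, -0.3, 0.4, 1.0, 1.6)]
mean, variance = eap_on_grid(list(zip(bank, [1, 1, 0, 1, 0])))
prior_only = eap_on_grid([])  # no responses: the posterior is just the prior
```

Because every item only contributes a log-likelihood term to the sum, the same loop would accept graded or continuous items without modification.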

Sequential updating across test stages

from hb_irt.types import Posterior
from hb_irt.bayes.sequential import sequential_update, sequential_update_all

# stage_1_responses and stage_2_responses are lists of (model, value) pairs,
# in the same form passed to eap_estimate.
prior = Posterior(mu=0.0, variance=1.0)
stage_1_posterior = sequential_update(prior, stage_1_responses)
stage_2_posterior = sequential_update(stage_1_posterior, stage_2_responses)

# or in one call, given an ordered list of each stage's responses:
history = sequential_update_all(prior, [stage_1_responses, stage_2_responses])

Posterior variance is guaranteed to never increase across stages (assuming non-degenerate item information), so estimates only get more precise as a candidate answers more items.
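
One way to see why precision accumulates is the Gaussian (information-form) approximation, in which posterior precision equals prior precision plus the test information of the items just answered. The sketch below assumes that approximation and uses 2PL information for brevity. It is not necessarily how hb-irt computes its updates, and the function names are hypothetical.

```python
from math import exp


def item_information_2pl(theta, a, b):
    # Fisher information of a 2PL item: a^2 * P * (1 - P).
    p = 1.0 / (1.0 + exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def gaussian_stage_update(prior_variance, theta_hat, items):
    """Posterior precision = prior precision + test information at theta_hat."""
    test_information = sum(item_information_2pl(theta_hat, a, b) for a, b in items)
    return 1.0 / (1.0 / prior_variance + test_information)


stage_items = [(1.2, -1.0), (1.2, 0.0), (1.2, 1.0)]  # illustrative (a, b) pairs
variances = [1.0]  # start from the N(0, 1) prior
for _ in range(3):  # three stages, each reusing the same illustrative items
    variances.append(gaussian_stage_update(variances[-1], theta_hat=0.0, items=stage_items))
```

Since item information is strictly positive, each stage adds precision, and the variance sequence is strictly decreasing under this approximation.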

Item calibration (MMLE/EM)

Fit 3PL item parameters from a batch of raw response data:

import numpy as np
from hb_irt.calibration import calibrate_3pl

responses = np.array(...)  # shape (n_examinees, n_items), binary 0/1
result = calibrate_3pl(responses, item_ids=["q1", "q2", "q3"], fixed_c=0.2)
result.items          # tuple of Item, with fitted discrimination/difficulty (and fixed guessing)
result.converged      # bool
result.n_iterations   # int

Pass fixed_c=<value> when calibrating with fewer than ~500 responses per item; otherwise omit it to freely estimate a guessing parameter per item.

Difficulty mapping by cognitive level

from hb_irt.bloom import difficulty_anchor, shrink_difficulty

difficulty_anchor("L4")  # -> 1.2  (an "Analysis"-level item's typical difficulty)

# Pull a noisy raw calibration estimate toward its level's typical difficulty,
# weighted by how confident each estimate is.
shrunk_b = shrink_difficulty(raw_difficulty=1.9, raw_variance=0.3, level="L4", sigma_b=0.4)
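
A common way to implement this kind of shrinkage is a precision-weighted average of the raw estimate and the anchor, as in a normal-normal model. The sketch below assumes that form. hb-irt's exact formula is not shown in this README, and shrink_toward_anchor is a hypothetical name.

```python
def shrink_toward_anchor(raw_difficulty, raw_variance, anchor, sigma_b):
    """Precision-weighted average of a raw estimate and its level anchor."""
    raw_precision = 1.0 / raw_variance
    anchor_precision = 1.0 / (sigma_b ** 2)
    # The noisier the raw estimate, the less weight it receives.
    weight_on_raw = raw_precision / (raw_precision + anchor_precision)
    return weight_on_raw * raw_difficulty + (1.0 - weight_on_raw) * anchor


# Same inputs as the call above, with the "L4" anchor of 1.2.
shrunk = shrink_toward_anchor(raw_difficulty=1.9, raw_variance=0.3, anchor=1.2, sigma_b=0.4)
```

Under these assumptions the raw estimate gets about 35% of the weight, so the result lands near 1.44, closer to the anchor than to the raw value.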

Score rescaling and topic aggregation

from hb_irt.scoring import aggregate_levels, rescale_0_100, build_subskill_score

# Combine estimates from several cognitive levels into one topic-level posterior
level_posterior = aggregate_levels(
    level_thetas={"L1": 0.4, "L2": 0.6, "L3": 0.5},
    level_variances={"L1": 0.05, "L2": 0.08, "L3": 0.06},
)

score, margin, ci_lower, ci_upper = rescale_0_100(level_posterior)

subskill_score = build_subskill_score(
    subskill_id="python_debugging",
    posterior=level_posterior,
    items_administered=42,
    modules_completed=4,
    level_thetas={"L1": 0.4, "L2": 0.6, "L3": 0.5},
)
# subskill_score.score_0_100, .margin_error_95, .ci_lower_95, .ci_upper_95, ...
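
Precision-weighted aggregation treats each level's estimate as an independent measurement of the same ability and weights it by the inverse of its variance. The sketch below shows that arithmetic, followed by a linear map from [-4, 4] to 0-100. The linear map is an assumption made for illustration (the library's actual rescaling may differ), and both function names are hypothetical.

```python
from math import sqrt


def precision_weighted(thetas, variances):
    """Combine per-level estimates into one (mu, variance) pair."""
    precisions = {k: 1.0 / variances[k] for k in thetas}
    total = sum(precisions.values())
    mu = sum(precisions[k] * thetas[k] for k in thetas) / total
    return mu, 1.0 / total


def linear_rescale(mu, variance, theta_min=-4.0, theta_max=4.0):
    """Assumed linear map of theta to 0-100, with a 95% margin of error."""
    scale = 100.0 / (theta_max - theta_min)
    score = min(100.0, max(0.0, (mu - theta_min) * scale))
    margin = 1.96 * sqrt(variance) * scale
    return score, margin


mu, variance = precision_weighted(
    {"L1": 0.4, "L2": 0.6, "L3": 0.5},
    {"L1": 0.05, "L2": 0.08, "L3": 0.06},
)
score, margin = linear_rescale(mu, variance)
```

With the inputs above, the combined estimate is about 0.485 with variance about 0.020, smaller than any single level's variance. Under the assumed linear map that becomes roughly 56.1 ± 3.5.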

Adaptive testing (MSAT): module bank, selection, stopping

from hb_irt.types import Posterior
from hb_irt.msat.module_bank import ModuleBank
from hb_irt.msat.selection import select_next_module
from hb_irt.msat.stopping import StoppingConfig, evaluate_stopping

# easy_module ... challenge_module are assumed to be TestModule instances
# (see hb_irt.types) built elsewhere.
bank = ModuleBank(modules=(easy_module, medium_module, hard_module, challenge_module))

current_posterior = Posterior(mu=0.2, variance=0.6)
administered = ["easy_1"]

next_module = select_next_module(bank, current_posterior, administered_ids=administered)

decision = evaluate_stopping(
    posterior=current_posterior,
    previous_posterior=prior_posterior,   # or None on the first module
    n_modules=2,
    n_items=15,
    config=StoppingConfig(),  # sigma_min=0.3, max_modules=8, min_items=20, delta_saturation=0.01
)
if decision.should_stop:
    print("stopping:", decision.reasons)  # e.g. ("precision_threshold",)
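
How the four StoppingConfig parameters interact is not spelled out above, so the following is only one plausible reading, written as a stand-alone sketch. The semantics are assumptions: precision and saturation checks apply only once min_items is reached, and saturation is measured as the drop in posterior variance. The reason strings other than "precision_threshold" are hypothetical, as are the names StoppingSketch and stopping_reasons.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class StoppingSketch:
    # Defaults copied from the StoppingConfig comment above.
    sigma_min: float = 0.3
    max_modules: int = 8
    min_items: int = 20
    delta_saturation: float = 0.01


def stopping_reasons(variance, previous_variance, n_modules, n_items, config=StoppingSketch()):
    """Return a tuple of reasons to stop; an empty tuple means keep testing."""
    reasons = []
    if n_modules >= config.max_modules:
        reasons.append("max_modules")  # hypothetical reason name
    if n_items >= config.min_items:
        if sqrt(variance) <= config.sigma_min:
            reasons.append("precision_threshold")
        gain = None if previous_variance is None else previous_variance - variance
        if gain is not None and gain < config.delta_saturation:
            reasons.append("saturation")  # hypothetical reason name
    return tuple(reasons)


keep_going = stopping_reasons(0.6, None, n_modules=2, n_items=15)
precise_enough = stopping_reasons(0.08, 0.2, n_modules=3, n_items=24)
```

Under this reading, the example call above with n_items=15 would not stop on precision, because it has not yet reached min_items=20.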

ModuleBank.available(administered_ids) returns modules not yet given to a candidate. select_next_module picks the module that maximizes expected information gain at the candidate's current ability estimate, with an exposure-control bonus that favors less-used modules.
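
The selection rule described above can be sketched as "information at the current estimate plus a bonus for low exposure". The version below is an illustration under stated assumptions: it scores modules with 2PL Fisher information, adds a bonus that is linear in (1 - exposure rate), and uses hypothetical names (pick_module, module_information, bonus_weight). hb-irt's actual scoring may differ.

```python
from math import exp


def module_information(theta, items):
    """Sum of 2PL item information a^2 * P * (1 - P) over a module's items."""
    total = 0.0
    for a, b in items:
        p = 1.0 / (1.0 + exp(-a * (theta - b)))
        total += a * a * p * (1.0 - p)
    return total


def pick_module(modules, exposure, theta, administered_ids, bonus_weight=0.1):
    """modules: {id: [(a, b), ...]}; exposure: {id: usage rate in [0, 1]}."""
    candidates = [m for m in modules if m not in administered_ids]
    if not candidates:
        return None

    def score(module_id):
        bonus = bonus_weight * (1.0 - exposure.get(module_id, 0.0))
        return module_information(theta, modules[module_id]) + bonus

    return max(candidates, key=score)


modules = {
    "easy_1": [(1.2, -1.5), (1.2, -1.0)],
    "medium_1": [(1.2, 0.0), (1.2, 0.4)],
    "hard_1": [(1.2, 1.5), (1.2, 2.0)],
}
exposure = {"easy_1": 0.5, "medium_1": 0.5, "hard_1": 0.5}
chosen = pick_module(modules, exposure, theta=0.2, administered_ids=["easy_1"])
```

With equal exposure, the module whose difficulties sit closest to theta=0.2 wins, which is medium_1 here. Raising bonus_weight shifts the choice toward less-used modules at the cost of some measurement precision.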

Package layout

Module                   Provides
hb_irt.types             Core data types: Item, Response, Posterior, TestModule, SubskillScore
hb_irt.models.threepl    3PL model for multiple-choice items
hb_irt.models.grm        Graded Response Model for 0-10 scored answers
hb_irt.models.crm        Continuous Response Model for 0-100 scored answers
hb_irt.bayes.estimation  EAP and MAP ability estimation
hb_irt.bayes.sequential  Sequential Bayesian updating across test stages
hb_irt.information       Fisher test information and standard error of measurement
hb_irt.calibration       MMLE/EM calibration of 3PL item parameters
hb_irt.bloom             Cognitive-level difficulty anchors and shrinkage
hb_irt.scoring           0-100 rescaling and precision-weighted level aggregation
hb_irt.msat              Adaptive module bank, selection, and stopping rules

Import directly from the submodule you need, e.g. from hb_irt.models.threepl import ThreePLModel.

Development

For contribution guidelines, architecture notes, and project conventions, see CLAUDE.md.

uv sync
uv run pytest

License

MIT — see LICENSE.
