Human-AI hybrid inter-rater reliability tool for cognitive error taxonomy classification in MCQ distractors.
ConfusionMapper
A Python tool for classifying wrong answer choices in multiple-choice questions by the type of cognitive error they represent, and for computing how well a human researcher and an AI rater agree on those labels.
What it does
In educational research, you cannot use qualitative labels in a statistical analysis unless two independent raters can produce roughly the same labels on the same items. The usual way to check this is Cohen's kappa.
ConfusionMapper handles the full workflow. You go through a set of distractors one by one and pick a label for each. If you have an OpenAI API key, an AI rater labels the same distractors in parallel. At the end, the program computes Cohen's kappa, draws a 4x4 confusion matrix of where the two raters agreed and disagreed, and saves everything to JSON.
It was built as the reliability check for a preregistered RCT on cognitive error feedback in government junior high schools (OSF: https://doi.org/10.17605/OSF.IO/YU6P5). Data collection in that study could only start once kappa cleared 0.70.
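The kappa statistic behind that gate compares the observed agreement p_o with the agreement expected by chance p_e, as kappa = (p_o - p_e) / (1 - p_e). A minimal stdlib sketch of the formula follows. The function name `cohens_kappa_sketch` and the sample labels are illustrative assumptions, not ConfusionMapper's own implementation or data.

```python
from collections import Counter

def cohens_kappa_sketch(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal counts, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    p_e = sum(count_a[c] * count_b[c] for c in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is 0/0.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)

# Made-up labels for illustration only.
human = ["RF", "PK", "CF", "INT", "CF", "PK", "RF", "INT"]
ai    = ["RF", "PK", "CF", "CF",  "CF", "PK", "RF", "INT"]
kappa = cohens_kappa_sketch(human, ai)
print(round(kappa, 4))
```

Here the raters agree on 7 of 8 items (p_o = 0.875) and the marginals give p_e = 0.25, so kappa comes out at 0.625 / 0.75, roughly 0.83.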
The four error types
The taxonomy is called the Confusion Fingerprint Index (CFI). It splits wrong answers into four categories:
| Code | Name | What it looks like |
|---|---|---|
| RF | Recall Failure | No memory trace at all; the answer is essentially random |
| PK | Partial Knowledge | Almost right; the direction is correct but the model is incomplete |
| CF | Confabulation | A coherent misconception held with confidence |
| INT | Interference | A correct answer pulled from the wrong topic |
Install
Once a versioned release is on PyPI:
pip install confusion-mapper
From source (until the first PyPI release, this is the recommended path):
git clone https://github.com/Manik-Maurya/Confusion-Mapper.git
cd Confusion-Mapper
pip install -e .
The only runtime dependency is openai>=1.0.0. tkinter ships with Python on most systems; on a minimal Debian or Ubuntu image install it with sudo apt-get install python3-tk. If tkinter is missing, ConfusionMapper drops to console output instead of the GUI.
Requires Python 3.9 or higher.
For the test extras:
pip install -e ".[dev]"
Run
Without an API key (you label everything by hand):
python confusion_mapper.py
With the AI rater on:
# macOS / Linux
export OPENAI_API_KEY="your-key-here"
python confusion_mapper.py
# Windows (Command Prompt)
set OPENAI_API_KEY=your-key-here
python confusion_mapper.py
For each distractor, type 1 for RF, 2 for PK, 3 for CF, or 4 for INT. When you finish, the dashboard opens and the session writes itself to JSON.
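The key-to-label mapping in that prompt amounts to a small lookup table. The sketch below shows one way to express it; `KEY_TO_CODE` and `parse_choice` are hypothetical names chosen for illustration and may not match what confusion_mapper.py does internally.

```python
# Hypothetical lookup from the typed key to the CFI code.
KEY_TO_CODE = {"1": "RF", "2": "PK", "3": "CF", "4": "INT"}

def parse_choice(raw):
    """Return the CFI code for a typed key, or None if the key is not 1-4."""
    return KEY_TO_CODE.get(raw.strip())

print(parse_choice("3"))
print(parse_choice(" 9 "))
```

Returning None for an unrecognised key lets the calling loop re-prompt instead of recording a bad label.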
Case study
A fully reproducible worked example lives in case_study/. It runs the entire
pipeline (nominal kappa, weighted kappa under linear and quadratic schemes, BCa
bootstrap 95% CI, confusion matrix, per-category stats) on the bundled 30-item
paired-label set with a fixed seed, then writes a publication-ready bundle to
case_study/results/ (JSON summary, CSV matrix, CSV per-type stats, full bootstrap
distribution, and a Markdown report). Regenerate with one command:
python case_study/run_case_study.py
Headline result on the bundled data: nominal kappa = 0.8653, BCa 95% CI = (0.6856, 1.0000), pre-registration gate PASSES at the 0.70 threshold.
Headless example
A non-interactive demo runs the three core functions on 30 pre-labelled distractors in under a second. No API key, no display:
python examples/demo.py
It prints kappa, the full confusion matrix, per-category agreement, and a PASS / HOLD verdict against the 0.70 gate. Swap in your own paired labels by replacing sample_data/example_labels.csv.
Tests
pip install -e ".[dev]"
pytest tests/ -v
The suite has 55 tests and runs in under a second. It covers the kappa formula (with hand-verified worked examples), weighted kappa (nominal / linear / quadratic), bootstrap confidence intervals (percentile + BCa, seedable), custom-taxonomy loading, the confusion matrix, and the mathematical invariants that tie everything together. It doesn't need an API key or a display.
Advanced features
All three features below are stdlib-only and tested.
Weighted kappa for ordinal taxonomies. Pass weights="linear" or
weights="quadratic" to penalise far-apart disagreements more than adjacent ones:
from confusion_mapper import compute_cohens_kappa
r = compute_cohens_kappa(human, ai, weights="linear")
print(r["kappa"], r["weights"])
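To make the weighting schemes concrete, here is a rough stdlib sketch of weighted kappa in its disagreement-weight form: kappa_w = 1 - sum(w_ij * O_ij) / sum(w_ij * E_ij), where O is the observed count matrix and E the counts expected from the marginals. The function name, the category ordering, and the sample labels are assumptions for illustration; the library's internals may differ.

```python
def weighted_kappa_sketch(rater_a, rater_b, categories, weights="linear"):
    """Weighted Cohen's kappa; categories must be listed in their ordinal order."""
    if weights not in ("nominal", "linear", "quadratic"):
        raise ValueError("weights must be 'nominal', 'linear' or 'quadratic'")
    k, n = len(categories), len(rater_a)
    index = {code: i for i, code in enumerate(categories)}
    # Observed counts: rows are rater_a labels, columns are rater_b labels.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, growing with distance.
        if weights == "nominal":
            return 0.0 if i == j else 1.0
        distance = abs(i - j) / (k - 1)
        return distance if weights == "linear" else distance ** 2

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed_cost = sum(weight(i, j) * observed[i][j] for i, j in cells)
    expected_cost = sum(weight(i, j) * row_totals[i] * col_totals[j] / n for i, j in cells)
    if expected_cost == 0:
        return float("nan")  # no chance disagreement is possible; kappa undefined
    return 1 - observed_cost / expected_cost

# Made-up labels; the RF < PK < CF < INT ordering is assumed for the example.
order = ["RF", "PK", "CF", "INT"]
human = ["RF", "PK", "CF", "INT", "CF", "PK", "RF", "INT"]
ai    = ["RF", "PK", "CF", "CF",  "CF", "PK", "RF", "INT"]
for scheme in ("nominal", "linear", "quadratic"):
    print(scheme, round(weighted_kappa_sketch(human, ai, order, scheme), 4))
```

With the nominal scheme this reduces to ordinary Cohen's kappa. The single disagreement in the sample is between adjacent categories, so the linear and quadratic schemes penalise it less and report a higher kappa.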
Bootstrap 95% confidence interval (BCa or percentile). Cohen's kappa is a point
estimate; this gives you uncertainty around it. The seed makes the interval
bit-identically reproducible:
from confusion_mapper import bootstrap_kappa_ci
ci = bootstrap_kappa_ci(human, ai, n_resamples=10000, method="bca", seed=42)
print(f"kappa = {ci['point_estimate']} (95% CI {ci['ci_lower']} to {ci['ci_upper']})")
Custom taxonomy via JSON. Swap the default CFI categories for any 2-or-more
category nominal scheme. A working example sits at
sample_data/custom_taxonomy.json:
from confusion_mapper import load_taxonomy_from_json, compute_cohens_kappa
codes, tax = load_taxonomy_from_json("sample_data/custom_taxonomy.json")
r = compute_cohens_kappa(human, ai, categories=codes)
Why a 4x4 confusion matrix instead of just kappa
The single kappa number tells you how much you and the AI agree overall. It does not tell you which category boundary is causing the disagreement. CF vs INT is the hardest distinction in the CFI taxonomy (a confident wrong belief looks a lot like a correct answer applied to the wrong topic), and the [CF, INT] cell of the confusion matrix is where most disagreement tends to land. Looking at the full 4x4 matrix lets you see exactly that, and rewrite the AI prompt or your own rubric until the cell stops glowing.
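One way to find the noisiest boundary is to tabulate the paired labels and look for the largest off-diagonal cell. The sketch below does this with plain lists; the function names and the sample labels are illustrative assumptions, not part of the ConfusionMapper API.

```python
CATEGORIES = ["RF", "PK", "CF", "INT"]

def confusion_matrix_sketch(rater_a, rater_b, categories=CATEGORIES):
    """Count matrix: rows are rater_a labels, columns are rater_b labels."""
    index = {code: i for i, code in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in categories]
    for a, b in zip(rater_a, rater_b):
        matrix[index[a]][index[b]] += 1
    return matrix

def largest_disagreement(matrix, categories=CATEGORIES):
    """Return (row_code, col_code, count) for the biggest off-diagonal cell."""
    best = None
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and (best is None or count > best[2]):
                best = (categories[i], categories[j], count)
    return best

# Made-up labels for illustration only.
human = ["RF", "PK", "CF", "CF", "INT", "CF", "PK", "INT"]
ai    = ["RF", "PK", "INT", "CF", "INT", "INT", "PK", "CF"]
matrix = confusion_matrix_sketch(human, ai)
for code, row in zip(CATEGORIES, matrix):
    print(f"{code:>3}", row)
print(largest_disagreement(matrix))
```

In this sample the human labelled two items CF that the AI labelled INT, so the [CF, INT] cell is the one to revisit in the rubric or the prompt.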
A note on ethics
The AI rater is a starting point, not a substitute for a second human. For published research you should still run kappa between two trained human raters on the same calibration set. Treat the AI labels as a way to surface boundaries that need rubric work, not as a replacement for human judgment.
How to cite
Archived release on Zenodo: https://doi.org/10.5281/zenodo.20807432
APA:
Maurya, M. (2026). ConfusionMapper: A Python Tool for AI-Assisted Distractor Classification and Inter-Rater Reliability Computation in Cognitive Error Taxonomy Research (Version 1.0.0) [Software]. Zenodo. https://doi.org/10.5281/zenodo.20807432
BibTeX:
@software{maurya2026confusionmapper,
author = {Maurya, Manik},
title = {ConfusionMapper: A Python Tool for AI-Assisted Distractor Classification
and Inter-Rater Reliability Computation in Cognitive Error Taxonomy Research},
year = {2026},
version = {1.0.0},
doi = {10.5281/zenodo.20807432},
url = {https://doi.org/10.5281/zenodo.20807432}
}
Contributing
Bug reports, feature requests, and pull requests are welcome. See CONTRIBUTING.md and CODE_OF_CONDUCT.md.
Acknowledgements
Initial development was completed as part of Stanford Code in Place 2026. The tool is used in the Confusion Fingerprint Index research programme at the Department of Cognitive Science, IIT Kanpur.
License
MIT License © 2026 Manik Maurya. See LICENSE for details.