Human-AI hybrid inter-rater reliability tool for cognitive error taxonomy classification in MCQ distractors.

These details have not been verified by PyPI

Project description

ConfusionMapper

A Python tool for classifying wrong answer choices in multiple-choice questions by the type of cognitive error they represent, and for computing how well a human researcher and an AI rater agree on those labels.

What it does

In educational research, qualitative labels are only credible in a statistical analysis if two independent raters produce roughly the same labels on the same items. The usual way to check this is Cohen's kappa, which measures agreement beyond what chance alone would produce.
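
To make the statistic concrete, here is a minimal stdlib sketch of unweighted Cohen's kappa, computed as (observed agreement - chance agreement) / (1 - chance agreement). The function name `cohens_kappa` and the eight sample labels are illustrative assumptions, not ConfusionMapper's own code or data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences (hypothetical helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items both raters labelled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal proportions, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for illustration only.
human = ["RF", "PK", "CF", "INT", "CF", "PK", "RF", "INT"]
ai = ["RF", "PK", "CF", "CF", "CF", "PK", "RF", "INT"]
print(round(cohens_kappa(human, ai), 4))  # 0.8333
```

Here the raters agree on 7 of 8 items (0.875) and chance agreement is 0.25, so kappa is 0.625 / 0.75.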

ConfusionMapper handles the full workflow. You go through a set of distractors one by one and pick a label for each. If you have an OpenAI API key, an AI rater labels the same distractors in parallel. At the end, the program computes Cohen's kappa, draws a 4x4 confusion matrix of where the two raters agreed and disagreed, and saves everything to JSON.

It was built as the reliability check for a preregistered RCT on cognitive error feedback in government junior high schools (OSF: https://doi.org/10.17605/OSF.IO/YU6P5). Data collection in that study could only start once kappa cleared 0.70.

The four error types

The taxonomy is called the Confusion Fingerprint Index (CFI). It splits wrong answers into four categories:

Code	Name	What it looks like
RF	Recall Failure	No memory trace at all; the answer is essentially random
PK	Partial Knowledge	Almost right; the direction is correct but the model is incomplete
CF	Confabulation	A coherent misconception held with confidence
INT	Interference	A correct answer pulled from the wrong topic

Install

From PyPI:

pip install confusion-mapper

From source (for development or the latest unreleased changes):

git clone https://github.com/Manik-Maurya/Confusion-Mapper.git
cd Confusion-Mapper
pip install -e .

The only runtime dependency is openai>=1.0.0. tkinter ships with Python on most systems; on a minimal Debian or Ubuntu image install it with sudo apt-get install python3-tk. If tkinter is missing, ConfusionMapper drops to console output instead of the GUI.
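
To check in advance which mode to expect, a quick probe like the one below can help. It is only a sketch: `gui_available` is a hypothetical helper, and how ConfusionMapper itself detects tkinter is not described here. Note that a successful import does not guarantee a display is attached.

```python
def gui_available():
    """Return True if tkinter can be imported on this interpreter (hypothetical helper)."""
    try:
        import tkinter  # noqa: F401  (imported only to test availability)
    except ImportError:
        return False
    return True


mode = "GUI dashboard" if gui_available() else "console output"
print(f"tkinter importable: {gui_available()} -> expect {mode}")
```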

Requires Python 3.9 or higher.

For the test extras:

pip install -e ".[dev]"

Run

Without an API key (you label everything by hand):

python confusion_mapper.py

With the AI rater on:

# macOS / Linux
export OPENAI_API_KEY="your-key-here"
python confusion_mapper.py

# Windows (Command Prompt)
set OPENAI_API_KEY=your-key-here
python confusion_mapper.py

For each distractor, type 1 for RF, 2 for PK, 3 for CF, or 4 for INT. When you finish, the dashboard opens and the session writes itself to JSON.
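
The key-to-label mapping described above can be pictured as a small lookup table. This is a sketch under assumptions: `KEY_TO_CODE` and `parse_choice` are hypothetical names, and the tool's actual input handling may differ.

```python
# Keys as described in the text: 1 = RF, 2 = PK, 3 = CF, 4 = INT.
KEY_TO_CODE = {"1": "RF", "2": "PK", "3": "CF", "4": "INT"}


def parse_choice(raw):
    """Map a typed key to a CFI code, or return None for anything else."""
    return KEY_TO_CODE.get(raw.strip())


print(parse_choice("3"))   # CF
print(parse_choice(" 4 "))  # INT
print(parse_choice("x"))   # None
```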

Case study

A fully reproducible worked example lives in case_study/. It runs the entire pipeline (nominal kappa, weighted kappa under linear and quadratic schemes, BCa bootstrap 95% CI, confusion matrix, per-category stats) on the bundled 30-item paired-label set with a fixed seed, then writes a publication-ready bundle to case_study/results/ (JSON summary, CSV matrix, CSV per-type stats, full bootstrap distribution, and a Markdown report). Regenerate with one command:

python case_study/run_case_study.py

Headline result on the bundled data: nominal kappa = 0.8653, BCa 95% CI = (0.6856, 1.0000), pre-registration gate PASSES at the 0.70 threshold.

Headless example

A non-interactive demo runs the three core functions on 30 pre-labelled distractors in under a second. No API key, no display:

python examples/demo.py

It prints kappa, the full confusion matrix, per-category agreement, and a PASS / HOLD verdict against the 0.70 gate. Swap in your own paired labels by replacing sample_data/example_labels.csv.
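
The per-category agreement and the PASS / HOLD verdict can be pictured roughly as follows. This is a sketch only: both function names are hypothetical, the per-category definition (the share of items the human placed in a category that the AI also placed there) is an assumption, and so is treating a kappa of exactly 0.70 as a pass.

```python
def per_category_agreement(human, ai, categories):
    """For each code, the share of the human's items in that code where the AI matched."""
    result = {}
    for code in categories:
        positions = [i for i, label in enumerate(human) if label == code]
        # None marks a category the human never used, so no rate can be computed.
        result[code] = (
            sum(ai[i] == code for i in positions) / len(positions) if positions else None
        )
    return result


def gate_verdict(kappa, threshold=0.70):
    """PASS when kappa reaches the threshold, otherwise HOLD (boundary handling assumed)."""
    return "PASS" if kappa >= threshold else "HOLD"


# Made-up labels for illustration only.
human = ["RF", "PK", "CF", "INT", "CF", "PK", "RF", "INT"]
ai = ["RF", "PK", "CF", "CF", "CF", "PK", "RF", "INT"]
print(per_category_agreement(human, ai, ["RF", "PK", "CF", "INT"]))
print(gate_verdict(0.8653))  # PASS
```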

Tests

pip install -e ".[dev]"
pytest tests/ -v

The suite has 55 tests and runs in under a second. It covers the kappa formula (with hand-verified worked examples), weighted kappa (nominal / linear / quadratic), bootstrap confidence intervals (percentile + BCa, seedable), custom-taxonomy loading, the confusion matrix, and the mathematical invariants that tie everything together. It doesn't need an API key or a display.

Advanced features

All three are stdlib-only and tested.

Weighted kappa for ordinal taxonomies. Pass weights="linear" or weights="quadratic" to penalise far-apart disagreements more than adjacent ones:

from confusion_mapper import compute_cohens_kappa
r = compute_cohens_kappa(human, ai, weights="linear")
print(r["kappa"], r["weights"])
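
For intuition about what the weighting schemes do, the sketch below computes weighted kappa as 1 minus the ratio of weighted observed disagreement to weighted chance disagreement. It is not the package's implementation: `weighted_kappa` is a hypothetical function, and the position of each code in `categories` is assumed to define its ordinal rank. The CFI codes are placed in an arbitrary order purely to show the arithmetic.

```python
def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Weighted Cohen's kappa; list order in `categories` is taken as the ordinal scale."""
    if weights not in ("nominal", "linear", "quadratic"):
        raise ValueError("weights must be 'nominal', 'linear' or 'quadratic'")
    k, n = len(categories), len(rater_a)
    index = {code: i for i, code in enumerate(categories)}
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def penalty(i, j):
        # Disagreement weight: 0 on the diagonal, growing with distance between ranks.
        if weights == "nominal":
            return float(i != j)
        distance = abs(i - j) / (k - 1)
        return distance if weights == "linear" else distance * distance

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed_cost = sum(penalty(i, j) * observed[i][j] for i, j in cells)
    expected_cost = sum(penalty(i, j) * row[i] * col[j] / n for i, j in cells)
    if expected_cost == 0:
        return 1.0  # every label identical; undefined, 1.0 by convention in this sketch
    return 1.0 - observed_cost / expected_cost


# Made-up labels for illustration only.
order = ["RF", "PK", "CF", "INT"]
human = ["RF", "PK", "CF", "INT", "CF", "PK", "RF", "INT"]
ai = ["RF", "PK", "CF", "CF", "CF", "PK", "RF", "INT"]
for scheme in ("nominal", "linear", "quadratic"):
    print(scheme, round(weighted_kappa(human, ai, order, scheme), 4))
# nominal 0.8333
# linear 0.8947
# quadratic 0.9444
```

The single disagreement here is between adjacent ranks, so the weighted schemes penalise it less than the nominal scheme does, and kappa rises accordingly.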

Bootstrap 95% confidence interval (BCa or percentile). Cohen's kappa is a point estimate; this gives you uncertainty around it. The seed makes the interval bit-identically reproducible:

from confusion_mapper import bootstrap_kappa_ci
ci = bootstrap_kappa_ci(human, ai, n_resamples=10000, method="bca", seed=42)
print(f"kappa = {ci['point_estimate']} (95% CI {ci['ci_lower']} to {ci['ci_upper']})")
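
To show what resampling does, here is a sketch of the simpler percentile bootstrap. It is not BCa, which additionally corrects for bias and skew, and it is not the package's code: `kappa` and `percentile_bootstrap_ci` are hypothetical helpers. Items are resampled as (human, AI) pairs with replacement, and a seeded `random.Random` keeps the interval reproducible.

```python
import random
from collections import Counter


def kappa(a, b):
    """Unweighted Cohen's kappa (returns 1.0 in the degenerate single-label case)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


def percentile_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap interval for kappa, resampling paired items."""
    rng = random.Random(seed)  # private generator, so global random state is untouched
    n = len(a)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(kappa([a[i] for i in idx], [b[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Made-up labels for illustration only.
human = ["RF", "PK", "CF", "INT", "CF", "PK", "RF", "INT"]
ai = ["RF", "PK", "CF", "CF", "CF", "PK", "RF", "INT"]
low, high = percentile_bootstrap_ci(human, ai)
print(f"kappa = {kappa(human, ai):.4f}, 95% percentile CI = ({low:.4f}, {high:.4f})")
```

With only eight items the interval will be wide; that is the point of reporting it alongside the point estimate.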

Custom taxonomy via JSON. Swap the default CFI categories for any 2-or-more category nominal scheme. A working example sits at sample_data/custom_taxonomy.json:

from confusion_mapper import load_taxonomy_from_json, compute_cohens_kappa
codes, tax = load_taxonomy_from_json("sample_data/custom_taxonomy.json")
r = compute_cohens_kappa(human, ai, categories=codes)

Why a 4x4 confusion matrix instead of just kappa

The single kappa number tells you how much you and the AI agree overall. It does not tell you which category boundary is causing the disagreement. CF vs INT is the hardest distinction in the CFI taxonomy (a confident wrong belief looks a lot like a correct answer applied to the wrong topic), and the [CF, INT] cell of the confusion matrix is where most disagreement tends to land. Looking at the full 4x4 matrix lets you see exactly that, and rewrite the AI prompt or your own rubric until the cell stops glowing.
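
As a rough illustration of reading the matrix, the sketch below tallies a confusion matrix and reports the largest off-diagonal cell. The helper names and sample labels are hypothetical, and the layout (rows for the human rater, columns for the AI rater) is an assumption rather than the tool's documented orientation.

```python
def confusion_matrix(human, ai, categories):
    """Count label pairs; rows are the human's label, columns the AI's (assumed layout)."""
    index = {code: i for i, code in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in categories]
    for h, a in zip(human, ai):
        matrix[index[h]][index[a]] += 1
    return matrix


def worst_boundary(matrix, categories):
    """Return (human_code, ai_code, count) for the largest off-diagonal cell, or None."""
    best = None
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and count > 0 and (best is None or count > best[2]):
                best = (categories[i], categories[j], count)
    return best


# Made-up labels for illustration only.
codes = ["RF", "PK", "CF", "INT"]
human = ["RF", "PK", "CF", "CF", "CF", "INT", "INT", "PK"]
ai = ["RF", "PK", "CF", "INT", "INT", "INT", "CF", "PK"]
matrix = confusion_matrix(human, ai, codes)
for code, row in zip(codes, matrix):
    print(f"{code:>3}", row)
print(worst_boundary(matrix, codes))  # ('CF', 'INT', 2)
```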

A note on ethics

The AI rater is a starting point, not a substitute for a second human. For published research you should still run kappa between two trained human raters on the same calibration set. Treat the AI labels as a way to surface boundaries that need rubric work, not as a replacement for human judgment.

How to cite

Archived release on Zenodo: https://doi.org/10.5281/zenodo.20807432

APA:

Maurya, M. (2026). ConfusionMapper: A Python Tool for AI-Assisted Distractor Classification and Inter-Rater Reliability Computation in Cognitive Error Taxonomy Research (Version 1.0.0) [Software]. Zenodo. https://doi.org/10.5281/zenodo.20807432

BibTeX:

@software{maurya2026confusionmapper,
  author    = {Maurya, Manik},
  title     = {ConfusionMapper: A Python Tool for AI-Assisted Distractor Classification
               and Inter-Rater Reliability Computation in Cognitive Error Taxonomy Research},
  year      = {2026},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.20807432},
  url       = {https://doi.org/10.5281/zenodo.20807432}
}

Contributing

Bug reports, feature requests, and pull requests are welcome. See CONTRIBUTING.md and CODE_OF_CONDUCT.md.

Acknowledgements

Initial development was completed as part of Stanford Code in Place 2026. The tool is used in the Confusion Fingerprint Index research programme at the Department of Cognitive Science, IIT Kanpur.

License

Release history

This version

1.0.0

Jun 25, 2026

Download files

Source Distribution

confusion_mapper-1.0.0.tar.gz (42.3 kB view details)

Uploaded Jun 25, 2026 Source

Built Distribution

confusion_mapper-1.0.0-py3-none-any.whl (22.4 kB view details)

Uploaded Jun 25, 2026 Python 3

File details

Details for the file confusion_mapper-1.0.0.tar.gz.

File metadata

Download URL: confusion_mapper-1.0.0.tar.gz
Upload date: Jun 25, 2026
Size: 42.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for confusion_mapper-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`97ef7a9b95526261b83da93f6e2de19dcdc606ce0545b3a06b6a8e09868a6762`
MD5	`e4b5261a0d3147728e7b438558eba92f`
BLAKE2b-256	`16dec930635aeeb0b87161ab74d350a5c0d8009b507962989b08f8c335590bbc`
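
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of` is a hypothetical helper, and the local path is an assumption about where the file was saved.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest copied from the table above.
EXPECTED = "97ef7a9b95526261b83da93f6e2de19dcdc606ce0545b3a06b6a8e09868a6762"
# Assumed local path; adjust to wherever the archive was saved.
PATH = "confusion_mapper-1.0.0.tar.gz"

if os.path.exists(PATH):
    print("match" if sha256_of(PATH) == EXPECTED else "MISMATCH")
else:
    print(f"{PATH} not found; download it first")
```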

Provenance

The following attestation bundles were made for confusion_mapper-1.0.0.tar.gz:

Publisher: publish.yml on Manik-Maurya/Confusion-Mapper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: confusion_mapper-1.0.0.tar.gz
- Subject digest: 97ef7a9b95526261b83da93f6e2de19dcdc606ce0545b3a06b6a8e09868a6762
- Sigstore transparency entry: 1947571514
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: Manik-Maurya/Confusion-Mapper@d9ad72e151084f959a6c164ca660f59ea028ed49
- Branch / Tag: refs/tags/v2.5.0
- Owner: https://github.com/Manik-Maurya
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d9ad72e151084f959a6c164ca660f59ea028ed49
- Trigger Event: release

File details

Details for the file confusion_mapper-1.0.0-py3-none-any.whl.

File metadata

Download URL: confusion_mapper-1.0.0-py3-none-any.whl
Upload date: Jun 25, 2026
Size: 22.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for confusion_mapper-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5d1ec672aba3fccea74344c831fd3ffb5061c59ada4144c75315d9b0a28e192f`
MD5	`09ed3d93e8e50838a829a7e172720f7d`
BLAKE2b-256	`b015633ec80ce110e6fd22f4e3ad0eedaa1a8851dbfc7a22f33e1de979ea0f4b`

Provenance

The following attestation bundles were made for confusion_mapper-1.0.0-py3-none-any.whl:

Publisher: publish.yml on Manik-Maurya/Confusion-Mapper

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: confusion_mapper-1.0.0-py3-none-any.whl
- Subject digest: 5d1ec672aba3fccea74344c831fd3ffb5061c59ada4144c75315d9b0a28e192f
- Sigstore transparency entry: 1947571620
- Sigstore integration time: Jun 25, 2026
Source repository:
- Permalink: Manik-Maurya/Confusion-Mapper@d9ad72e151084f959a6c164ca660f59ea028ed49
- Branch / Tag: refs/tags/v2.5.0
- Owner: https://github.com/Manik-Maurya
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d9ad72e151084f959a6c164ca660f59ea028ed49
- Trigger Event: release
