Confusion matrices with uncertainty quantification, experiment aggregation and significance testing.


Probabilistic Confusion Matrices

prob_conf_mat is a Python package for performing statistical inference with confusion matrices. It quantifies the amount of uncertainty present, aggregates semantically related experiments into experiment groups, and compares experiments against each other for significance.

Installation

Installation from PyPI can be done using pip:

pip install prob_conf_mat

Or, if you're using uv, simply run:

uv add prob_conf_mat

The project currently depends on the following packages:

Dependency tree
prob_conf_mat
├── jaxtyping v0.3.2
├── matplotlib v3.10.3
├── numpy v2.3.0
├── scipy v1.15.3
├── seaborn v0.13.2
│   └── pandas v2.3.0
└── tabulate v0.9.0

Development Environment

This project was developed using uv. To install the development environment, first clone this GitHub repo:

git clone https://github.com/ioverho/prob_conf_mat.git

And then run the uv sync --dev command:

uv sync --dev

The development dependencies should automatically install into the .venv folder.

Documentation

For more information about the package, motivation, how-to guides and implementation, please see the documentation website. We try to use Daniele Procida's structure for Python documentation.

The documentation is broadly divided into 4 sections:

  1. Getting Started: a collection of small tutorials to help new users get started
  2. How To: more expansive guides on how to achieve specific things
  3. Reference: in-depth information about how to interface with the library
  4. Explanation: explanations about why things are the way they are
|             | Learning        | Coding        |
| ----------- | --------------- | ------------- |
| Practical   | Getting Started | How-To Guides |
| Theoretical | Explanation     | Reference     |

Quick Start

In-depth tutorials taking you through all the basic steps are available on the documentation site. For the impatient, here's a standard use case.

First define a study, and set some sensible hyperparameters for the simulated confusion matrices.

from prob_conf_mat import Study

study = Study(
    seed=0,
    num_samples=10000,
    ci_probability=0.95,
)

Then add an experiment and confusion matrix to the study:

study.add_experiment(
  experiment_name="model_1/fold_0",
  confusion_matrix=[
    [13, 0, 0],
    [0, 10, 6],
    [0,  0, 9],
  ],
  confusion_prior=0,
  prevalence_prior=1,
)

Finally, add some metrics to the study:

study.add_metric("acc")

We are now ready to start generating summary statistics about this experiment. For example:

study.report_metric_summaries(
  metric="acc",
  table_fmt="github"
)
| Group   | Experiment | Observed | Median | Mode   | 95.0% HDI        | MU     | Skew    | Kurt   |
| ------- | ---------- | -------- | ------ | ------ | ---------------- | ------ | ------- | ------ |
| model_1 | fold_0     | 0.8421   | 0.8499 | 0.8673 | [0.7307, 0.9464] | 0.2157 | -0.5627 | 0.2720 |

So while this experiment achieves an observed accuracy of 84.21%, a more reasonable estimate (given the size of the test set) would be 84.99%. There is a 95% probability that the true accuracy lies between 73.07% and 94.64%.
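Conceptually, the uncertainty comes from treating each row of the confusion matrix as a Dirichlet posterior over class-conditional probabilities, and recomputing the metric on many synthetic matrices sampled from it. The following numpy-only sketch illustrates the idea; it is not the package's actual implementation, it uses an equal-tailed interval rather than an HDI, and it weights rows by observed prevalence rather than using the prevalence prior:

```python
import numpy as np

rng = np.random.default_rng(0)

confusion_matrix = np.array([
    [13, 0, 0],
    [0, 10, 6],
    [0, 0, 9],
])

num_samples = 10_000
# numpy's Dirichlet requires strictly positive parameters, so a tiny
# epsilon stands in for the confusion_prior=0 setting used above
prior = 1e-4

# Each row gets a Dirichlet posterior over its class-conditional
# probabilities; sample many synthetic confusion matrices from it
sampled_rows = np.stack(
    [rng.dirichlet(row + prior, size=num_samples) for row in confusion_matrix],
    axis=1,
)  # shape: (num_samples, num_classes, num_classes)

# Accuracy of each synthetic matrix, weighting rows by observed prevalence
prevalence = confusion_matrix.sum(axis=1) / confusion_matrix.sum()
accuracy_samples = (sampled_rows.diagonal(axis1=1, axis2=2) * prevalence).sum(axis=1)

print(np.median(accuracy_samples))                    # posterior median
print(np.quantile(accuracy_samples, [0.025, 0.975]))  # equal-tailed 95% interval
```

Running this produces a median and interval in the same ballpark as the table above, with the remaining differences coming from the simplifications just mentioned.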

Visually that looks something like:

fig = study.plot_metric_summaries(metric="acc")
Metric distribution

Now let's add a confusion matrix for the same model, but estimated using a different fold. We want to know what the average performance is for that model across the different folds:

study.add_experiment(
  experiment_name="model_1/fold_1",
  confusion_matrix=[
      [12, 1, 0],
      [1, 8, 7],
      [0, 2, 7],
  ],
  confusion_prior=0,
  prevalence_prior=1,
)

We can equip each metric with an inter-experiment aggregation method, and then request summary statistics about the aggregate performance of all experiments in the 'model_1' group:

study.add_metric(
    metric="acc",
    aggregation="beta",
)

fig = study.plot_forest_plot(metric="acc")
Forest plot

Note that the estimated aggregate accuracy has much less uncertainty (a smaller HDI/MU).
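The variance reduction from aggregation can be illustrated with the conflation of Beta distributions (Hill, 2011), which the 'beta' aggregation method's name suggests. This is a conceptual sketch, not the package's implementation, and the per-fold Beta parameters below are made-up stand-ins for the folds' accuracy posteriors:

```python
import numpy as np

rng = np.random.default_rng(0)

def beta_moment_match(samples):
    """Fit Beta(a, b) to samples by matching their mean and variance."""
    m, v = samples.mean(), samples.var()
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# Made-up stand-ins for two folds' accuracy posteriors
fold_0 = rng.beta(32, 7, size=10_000)
fold_1 = rng.beta(28, 12, size=10_000)

a0, b0 = beta_moment_match(fold_0)
a1, b1 = beta_moment_match(fold_1)

# The conflation (normalized product) of Beta densities is again a Beta
a_agg, b_agg = a0 + a1 - 1.0, b0 + b1 - 1.0
aggregate = rng.beta(a_agg, b_agg, size=10_000)

# The aggregate posterior is tighter than either individual posterior
print(fold_0.std(), fold_1.std(), aggregate.std())
```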

These experiments seem pretty different. But is this difference significant? Let's assume that, for this example, a difference needs to be at least 0.05 to be considered significant. In that case, we can quickly request the probability of their difference:

fig = study.plot_pairwise_comparison(
    metric="acc",
    experiment_a="model_1/fold_0",
    experiment_b="model_1/fold_1",
    min_sig_diff=0.05,
)
Comparison plot

There's about an 82% probability that the difference is in fact significant. While likely, there isn't quite enough data to be sure.
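The quantity behind that figure can be sketched with plain numpy: sample an accuracy posterior for each fold, take the pairwise differences, and measure how much posterior mass lies beyond the significance threshold. This is a rough illustration under the same simplifying assumptions as before (per-row Dirichlet posteriors with a tiny epsilon prior, observed-prevalence weighting), not the package's exact computation, so the number it prints will only approximate the 82% above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_accuracy(confusion_matrix, num_samples=10_000, prior=1e-4):
    """Sample accuracies from a per-row Dirichlet posterior (conceptual sketch)."""
    cm = np.asarray(confusion_matrix)
    prevalence = cm.sum(axis=1) / cm.sum()
    rows = np.stack(
        [rng.dirichlet(row + prior, size=num_samples) for row in cm],
        axis=1,
    )
    return (rows.diagonal(axis1=1, axis2=2) * prevalence).sum(axis=1)

acc_a = sample_accuracy([[13, 0, 0], [0, 10, 6], [0, 0, 9]])  # fold_0
acc_b = sample_accuracy([[12, 1, 0], [1, 8, 7], [0, 2, 7]])   # fold_1

min_sig_diff = 0.05
diff = acc_a - acc_b

# Fraction of posterior mass where the difference exceeds the threshold
p_significant = np.mean(np.abs(diff) > min_sig_diff)
print(p_significant)
```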

Development

This project was developed using the following (amazing) tools:

  1. Package management: uv
  2. Linting: ruff
  3. Static Type-Checking: pyright
  4. Documentation: mkdocs
  5. CI: pre-commit

Most of the common development commands are included in ./Makefile. If make is installed, you can immediately run the following commands:

Usage:
  make <target>

Utility
  help             Display this help
  hello-world      Tests uv and make

Environment
  install          Install default dependencies
  install-dev      Install dev dependencies
  upgrade          Upgrade installed dependencies
  export           Export uv to requirements.txt file

Testing, Linting, Typing & Formatting
  test             Runs all tests
  coverage         Checks test coverage
  lint             Run linting
  type             Run static typechecking
  commit           Run pre-commit checks

Documentation
  mkdocs           Update the docs
  mkdocs-serve     Serve documentation site

Credits

The following are some packages and libraries which served as inspiration for aspects of this project: arviz, bayestestR, BERTopic, jaxtyping, mici, python-ci, statsmodels.

A lot of the approaches and methods used in this project come from published works. Some especially important works include:

  1. Goutte, C., & Gaussier, E. (2005). A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European conference on information retrieval (pp. 345-359). Berlin, Heidelberg: Springer Berlin Heidelberg.
  2. Tötsch, N., & Hoffmann, D. (2021). Classifier uncertainty: evidence, potential impact, and probabilistic treatment. PeerJ Computer Science, 7, e398.
  3. Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573.
  4. Makowski, D., Ben-Shachar, M. S., Chen, S. A., & Lüdecke, D. (2019). Indices of effect existence and significance in the Bayesian framework. Frontiers in psychology, 10, 2767.
  5. Hill, T. (2011). Conflations of probability distributions. Transactions of the American Mathematical Society, 363(6), 3351-3372.
  6. Chandler, J., Cumpston, M., Li, T., Page, M. J., & Welch, V. J. H. W. (2019). Cochrane handbook for systematic reviews of interventions. Hoboken: Wiley, 4.

Citation

@software{ioverho_prob_conf_mat,
    author = {Verhoeven, Ivo},
    license = {MIT},
    title = {{prob\_conf\_mat}},
    url = {https://github.com/ioverho/prob_conf_mat}
}
