Confusion matrices with uncertainty quantification, experiment aggregation and significance testing.
prob_conf_mat is a Python package for performing statistical inference with confusion matrices. It quantifies the amount of uncertainty present, aggregates semantically related experiments into experiment groups, and compares experiments against each other for significance.
Installation
The package can be installed from PyPI using pip:
pip install prob_conf_mat
Or, if you're using uv, simply run:
uv add prob_conf_mat
The project currently depends on the following packages:
Dependency tree
prob-conf-mat
├── jaxtyping
├── matplotlib
├── numpy
├── scipy
└── tabulate
Additionally, pandas is an optional dependency for some reporting functions.
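If you are unsure whether the optional dependency is available in your environment, a quick check along the following lines can help. This is a minimal sketch using only the standard library; the `has_pandas` variable is a hypothetical name and not part of prob_conf_mat.

```python
import importlib.util

# find_spec returns None when a top-level module cannot be located,
# so this checks availability without actually importing pandas.
has_pandas = importlib.util.find_spec("pandas") is not None

if has_pandas:
    print("pandas found: pandas-based reporting functions should be usable")
else:
    print("pandas not found: install it if you need pandas-based reports")
```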
Development Environment
This project was developed using uv. To install the development environment, first clone the GitHub repository:
git clone https://github.com/ioverho/prob_conf_mat.git
And then run the uv sync --dev command:
uv sync --dev
The development dependencies should automatically install into the .venv folder.
Documentation
For more information about the package, motivation, how-to guides and implementation, please see the documentation website. We try to use Daniele Procida's structure for Python documentation.
The documentation is broadly divided into 4 sections:
- Getting Started: a collection of small tutorials to help new users get started
- How To: more expansive guides on how to achieve specific things
- Reference: in-depth information about how to interface with the library
- Explanation: explanations about why things are the way they are
| | Learning | Coding |
|---|---|---|
| Practical | Getting Started | How-To Guides |
| Theoretical | Explanation | Reference |
Quick Start
In-depth tutorials taking you through all the basic steps are available on the documentation site. For the impatient, here's a standard use case.
First define a study, and set some sensible hyperparameters for the simulated confusion matrices.
from prob_conf_mat import Study

study = Study(
    seed=0,
    num_samples=10000,
    ci_probability=0.95,
)
Then add an experiment and confusion matrix to the study:
study.add_experiment(
    experiment_name="model_1/fold_0",
    confusion_matrix=[
        [13, 0, 0],
        [0, 10, 6],
        [0, 0, 9],
    ],
    confusion_prior=0,
    prevalence_prior=1,
)
Finally, add some metrics to the study:
study.add_metric("acc")
We are now ready to start generating summary statistics about this experiment. For example:
study.report_metric_summaries(
    metric="acc",
    table_fmt="github",
)
| Group | Experiment | Observed | Median | Mode | 95.0% HDI | MU | Skew | Kurt |
|---|---|---|---|---|---|---|---|---|
| model_1 | fold_0 | 0.8421 | 0.8499 | 0.8673 | [0.7307, 0.9464] | 0.2157 | -0.5627 | 0.2720 |
So while this experiment achieves an observed accuracy of 84.21%, a more reasonable estimate (given the size of the test set) would be 84.99%. There is a 95% probability that the true accuracy lies between 73.07% and 94.64%.
Visually that looks something like:
fig = study.plot_metric_summaries(metric="acc")
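The numbers in the summary table can be related back to the confusion matrix by hand. The sketch below uses plain NumPy rather than prob_conf_mat: it computes the observed accuracy (diagonal over total, 32/38) and then approximates a posterior over accuracy by sampling synthetic confusion matrices from a single Dirichlet distribution. The flat pseudo-count `prior` and the `hdi` helper are assumptions made for illustration and are not the package's sampling scheme or API, so the simulated values will not reproduce the table exactly. In the table, MU corresponds to the width of the HDI (0.9464 - 0.7307 = 0.2157).

```python
import numpy as np

confusion_matrix = np.array([[13, 0, 0], [0, 10, 6], [0, 0, 9]])

# Observed accuracy: correct predictions (the diagonal) over all samples.
observed_acc = np.trace(confusion_matrix) / confusion_matrix.sum()
print(f"observed accuracy: {observed_acc:.4f}")  # 32 / 38 = 0.8421

rng = np.random.default_rng(0)
num_samples = 10_000
prior = 1.0  # assumption: a flat pseudo-count on every cell

# Sample cell probabilities from a Dirichlet posterior over all 9 cells.
alpha = confusion_matrix.flatten() + prior
sampled = rng.dirichlet(alpha, size=num_samples).reshape(num_samples, 3, 3)

# The accuracy of each synthetic matrix is the mass on its diagonal.
acc_samples = np.trace(sampled, axis1=1, axis2=2)


def hdi(samples, prob=0.95):
    """Narrowest interval containing `prob` of the samples (hypothetical helper)."""
    ordered = np.sort(samples)
    n = len(ordered)
    window = int(np.ceil(prob * n))
    # Width of every candidate interval spanning `window` consecutive samples.
    widths = ordered[window - 1:] - ordered[: n - window + 1]
    start = int(np.argmin(widths))
    return ordered[start], ordered[start + window - 1]


lower, upper = hdi(acc_samples)
metric_uncertainty = upper - lower  # the HDI width
print(f"median: {np.median(acc_samples):.4f}")
print(f"95% HDI: [{lower:.4f}, {upper:.4f}], width: {metric_uncertainty:.4f}")
```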
Now let's add a confusion matrix for the same model, but estimated using a different fold. We want to know what the average performance is for that model across the different folds:
study.add_experiment(
    experiment_name="model_1/fold_1",
    confusion_matrix=[
        [12, 1, 0],
        [1, 8, 7],
        [0, 2, 7],
    ],
    confusion_prior=0,
    prevalence_prior=1,
)
We can equip each metric with an inter-experiment aggregation method, and we can then request summary statistics about the aggregate performance of the experiments using 'model_1':
study.add_metric(
    metric="acc",
    aggregation="beta",
)
fig = study.plot_forest_plot(metric="acc")
Note that the estimated aggregate accuracy has much less uncertainty (a smaller HDI/MU) than either individual experiment.
These experiments seem pretty different. But is this difference significant? Let's assume that, for this example, a difference needs to be at least 0.05 to be considered significant. In that case, we can quickly request the probability that the difference between the two experiments exceeds this threshold:
fig = study.plot_pairwise_comparison(
    metric="acc",
    experiment_a="model_1/fold_0",
    experiment_b="model_1/fold_1",
    min_sig_diff=0.05,
)
There's about an 82% probability that the difference is in fact significant. While likely, there isn't quite enough data to be sure.
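To make the idea behind such a probability concrete, here is a minimal NumPy sketch. It assumes we already have posterior samples of the accuracy for each experiment; as a stand-in, it draws them from Beta distributions built from the correct and incorrect counts of each fold (32 of 38 and 27 of 38 correct). These Beta posteriors and all variable names are illustrative assumptions, not the package's internals, so the resulting probability need not match the figure quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 10_000

# Stand-in posteriors (assumption): Beta(correct + 1, incorrect + 1) per fold.
samples_a = rng.beta(32 + 1, 6 + 1, size=num_samples)   # model_1/fold_0
samples_b = rng.beta(27 + 1, 11 + 1, size=num_samples)  # model_1/fold_1

min_sig_diff = 0.05
diff = samples_a - samples_b

# Share of samples whose difference falls outside [-min_sig_diff, min_sig_diff].
p_significant = float(np.mean(np.abs(diff) > min_sig_diff))
# Split by direction: a meaningfully better than b, or the reverse.
p_a_better = float(np.mean(diff > min_sig_diff))
p_b_better = float(np.mean(diff < -min_sig_diff))

print(f"P(|diff| > {min_sig_diff}): {p_significant:.3f}")
print(f"P(a better): {p_a_better:.3f}, P(b better): {p_b_better:.3f}")
```

Splitting the probability by direction is useful because a large share of mass outside the threshold only says the experiments differ, not which of the two is ahead.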
Development
This project was developed using the following (amazing) tools:
- Package management: uv
- Linting: ruff
- Static Type-Checking: pyright
- Documentation: mkdocs
- CI: pre-commit
Most of the common development commands are included in ./Makefile. If make is installed, you can immediately run the following commands:
Usage:
  make <target>

Utility
  help          Display this help
  hello-world   Tests uv and make

Environment
  install       Install default dependencies
  install-dev   Install dev dependencies
  upgrade       Upgrade installed dependencies
  export        Export uv to requirements.txt file

Testing, Linting, Typing & Formatting
  test          Runs all tests
  coverage      Checks test coverage
  lint          Run linting
  type          Run static typechecking
  commit        Run pre-commit checks

Documentation
  mkdocs        Update the docs
  mkdocs-serve  Serve documentation site
Credits
The following packages and libraries served as inspiration for aspects of this project: arviz, bayestestR, BERTopic, jaxtyping, mici, python-ci, statsmodels.
A lot of the approaches and methods used in this project come from published works. Some especially important works include:
- Goutte, C., & Gaussier, E. (2005). A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European conference on information retrieval (pp. 345-359). Berlin, Heidelberg: Springer Berlin Heidelberg.
- Tötsch, N., & Hoffmann, D. (2021). Classifier uncertainty: evidence, potential impact, and probabilistic treatment. PeerJ Computer Science, 7, e398.
- Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573.
- Makowski, D., Ben-Shachar, M. S., Chen, S. A., & Lüdecke, D. (2019). Indices of effect existence and significance in the Bayesian framework. Frontiers in psychology, 10, 2767.
- Hill, T. (2011). Conflations of probability distributions. Transactions of the American Mathematical Society, 363(6), 3351-3372.
- Chandler, J., Cumpston, M., Li, T., Page, M. J., & Welch, V. J. H. W. (2019). Cochrane handbook for systematic reviews of interventions. Hoboken: Wiley, 4.
Citation
@software{ioverho_prob_conf_mat,
    author = {Verhoeven, Ivo},
    license = {MIT},
    title = {{prob\_conf\_mat}},
    url = {https://github.com/ioverho/prob_conf_mat}
}