Generate diverse score-target datasets with the same AUROC value, à la Anscombe's Quartet
roctet
[!TIP] Looking for a quickstart? Check out the demo notebook
The famous Anscombe's Quartet dataset (and its modern cousin, the Datasaurus Dozen) features different datasets with shared summary statistics and regression lines.
It serves as a cautionary illustration of the importance of EDA. The excellent R package {quartets} further catalogues such datasets which share superficial similarities while masking different fundamentals.
roctet provides a similar ability: it generates numerous datasets, each consisting of a predictive score and a binary target, that all share the same AUROC but vary substantially in ROC curve shape, precision, recall, and other model evaluation metrics.
Returned datasets may be useful for teaching purposes or testing the relationship between different model evaluation metrics.
Methodology
roctet creates ROC curves with a fixed AUC using either of two parameterizations: Beta or Piecewise Linear.
Whichever parameterization produces the ROC curve, the same procedure then maps the curve back to simulated score / target pairs.
Beta
The ROC curve is simulated with the CDF of the Beta(a,b) distribution. While there is no theoretical link between Beta distributions and AUROC, the Beta CDF has many convenient properties for simulating AUROC:
- it is continuous and monotonic, and concave when a <= 1 <= b
- it has a domain and range between 0 and 1
- it has a closed-form AUC, b/(a+b), determined solely by the ratio of parameters r = b/a
- it can deliver a range of shapes, controlled by the sum b+a
For a given AUROC and control (b+a), roctet solves for the parameters of the Beta distribution and treats the resulting CDF as the ROC curve. This creates a slightly atypical ROC shape at the extremes, but is sufficient for a toy example.
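The parameter solve described above can be sketched with the standard library alone. The area under a CDF on [0, 1] equals one minus the distribution's mean, so the Beta(a,b) CDF has AUC b/(a+b), and fixing the AUROC together with the sum a+b pins down both parameters. The function names below (`solve_beta_params`, `empirical_beta_roc`, `trapezoid_auc`) are hypothetical and not part of roctet's API, and the ROC curve is approximated here by the empirical CDF of random draws rather than by an exact Beta CDF.

```python
import random
from bisect import bisect_right


def solve_beta_params(auroc, total):
    """Solve a + b = total and b / (a + b) = auroc for (a, b)."""
    if not 0.0 < auroc < 1.0:
        raise ValueError("auroc must lie strictly between 0 and 1")
    if total <= 0.0:
        raise ValueError("total (a + b) must be positive")
    b = auroc * total
    a = total - b
    return a, b


def empirical_beta_roc(a, b, n_draws=50_000, n_grid=101, seed=0):
    """Approximate the Beta(a, b) CDF on an even grid and treat it as an ROC curve."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n_draws))
    fpr = [i / (n_grid - 1) for i in range(n_grid)]
    # Empirical CDF: share of draws at or below each grid point.
    tpr = [bisect_right(draws, x) / n_draws for x in fpr]
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoid rule."""
    return sum(
        (fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2.0
        for i in range(len(fpr) - 1)
    )


# Assumed example values: target AUROC 0.67 and control a + b = 2.
a, b = solve_beta_params(auroc=0.67, total=2.0)
fpr, tpr = empirical_beta_roc(a, b)
print(f"a={a:.2f}, b={b:.2f}, approximate AUC={trapezoid_auc(fpr, tpr):.3f}")
```

Changing `total` while holding `auroc` fixed keeps the area the same but reshapes the curve, which is the role the text assigns to the control b+a.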
Piecewise Linear
The ROC curve is simulated as a piecewise linear function with a single breakpoint (kink) at (x,y). That is, the ROC curve is defined by:
- tpr = (y/x) * fpr for fpr < x
- tpr = y + ((1-y)/(1-x)) * (fpr - x) for fpr >= x
This creates an ROC curve with an atypical "sharp bend" but, once again, is sufficient for a toy example.
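As a minimal sketch of this parameterization, assume the second segment runs from the breakpoint (x,y) to the corner (1,1). The area under the two segments is then (1 + y - x)/2, so for a chosen x the breakpoint height that hits a target AUROC is y = 2*AUROC - 1 + x. The function names below are hypothetical and not taken from roctet.

```python
def knot_y_for_auroc(auroc, x):
    """Return the breakpoint height y that gives the two-segment ROC the target AUC.

    The area under (0,0) -> (x,y) -> (1,1) is (1 + y - x) / 2.
    """
    y = 2.0 * auroc - 1.0 + x
    if not (0.0 < x < 1.0 and 0.0 < y <= 1.0):
        raise ValueError("no valid breakpoint for this auroc / x combination")
    return y


def piecewise_tpr(fpr, x, y):
    """Evaluate the two-segment ROC curve at a false positive rate."""
    if fpr < x:
        return (y / x) * fpr
    # Second segment runs from the breakpoint (x, y) to the corner (1, 1).
    return y + ((1.0 - y) / (1.0 - x)) * (fpr - x)


def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoid rule."""
    return sum(
        (fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2.0
        for i in range(len(fpr) - 1)
    )


# Two different breakpoints that share the same target AUROC of 0.67.
grid = [i / 100 for i in range(101)]
for x in (0.1, 0.56):
    y = knot_y_for_auroc(0.67, x)
    tpr = [piecewise_tpr(f, x, y) for f in grid]
    print(f"breakpoint=({x}, {y:.2f})  AUC={trapezoid_auc(grid, tpr):.3f}")
```

The two breakpoints give the same area but very different true positive rates at low false positive rates, which is the kind of variation the package is meant to expose.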
Score Derivation
Given a ROC curve, scores are derived in three steps:
- Simulate a set of (fpr,tpr) points on the ROC curve
- Calculate the implied number of True Positives and True Negatives in each score "band"
- Randomly generate scores and assign target values within each bin
How closely the realized AUC matches the target is controlled by the sample size and the number of bins used.
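The three steps above can be sketched as follows. This is one illustrative reading of the procedure, not roctet's implementation: the function names, the equal-width score bands on [0, 1], the uniform scores within each band, and the rounding of counts to integers are all assumptions. The example ROC curve is a two-segment curve with its breakpoint at (0.3, 0.64), which has AUC 0.67.

```python
import random


def example_roc(n_bands=20, x=0.3, y=0.64):
    """Step 1: sample (fpr, tpr) points on a two-segment ROC curve."""
    fpr = [i / n_bands for i in range(n_bands + 1)]
    tpr = [
        (y / x) * f if f < x else y + (1.0 - y) / (1.0 - x) * (f - x)
        for f in fpr
    ]
    return fpr, tpr


def simulate_scores(fpr, tpr, n_pos, n_neg, seed=0):
    """Steps 2 and 3: turn ROC points into (score, target) rows."""
    rng = random.Random(seed)
    n_bands = len(fpr) - 1
    rows = []
    for k in range(n_bands):
        # Step 2: the rise in tpr / fpr across a band implies how many
        # positives / negatives fall inside that band.
        pos_k = round((tpr[k + 1] - tpr[k]) * n_pos)
        neg_k = round((fpr[k + 1] - fpr[k]) * n_neg)
        # Band 0 sits next to (0, 0) on the ROC curve, so it holds the
        # highest scores; later bands hold progressively lower scores.
        hi = 1.0 - k / n_bands
        lo = 1.0 - (k + 1) / n_bands
        # Step 3: draw scores uniformly within the band and attach targets.
        rows += [(rng.uniform(lo, hi), 1) for _ in range(pos_k)]
        rows += [(rng.uniform(lo, hi), 0) for _ in range(neg_k)]
    return rows


def rank_auc(rows):
    """AUROC from the rank sum of positives (assumes no tied scores)."""
    ordered = sorted(rows)
    n_pos = sum(target for _, target in ordered)
    n_neg = len(ordered) - n_pos
    rank_sum = sum(i + 1 for i, (_, target) in enumerate(ordered) if target == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


fpr, tpr = example_roc()
rows = simulate_scores(fpr, tpr, n_pos=5000, n_neg=5000)
print(f"rows={len(rows)}, realized AUROC={rank_auc(rows):.3f}")
```

In this sketch, a larger sample reduces rounding and sampling noise, and more bands let the banded data follow a curved ROC more closely.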
Usage
To get started, jump in to generate some datasets:
from roctet import calc_roctet
dfs = calc_roctet(auroc=0.67, n_sets=10)
dfs[0].glimpse()
Installation
Install from GitHub:
python -m pip install "git+https://github.com/emilyriederer/roctet.git"
Install a specific release tag:
python -m pip install "git+https://github.com/emilyriederer/roctet.git@v0.1.0"
Developer / editable install:
git clone https://github.com/emilyriederer/roctet.git
cd roctet
uv sync
uv pip install -e .