
Generate diverse score-target datasets with the same AUROC value, à la Anscombe's Quartet


roctet

Stability: Experimental

Tip: Looking for a quickstart? Check out the demo notebook.

The famous Anscombe's Quartet dataset (and its modern cousin, the Datasaurus Dozen) features different datasets with shared summary statistics and regression lines. It serves as a cautionary illustration of the importance of exploratory data analysis (EDA). The excellent R package {quartets} further catalogues such datasets, which share superficial similarities while masking different fundamentals.

roctet provides a similar ability for model evaluation: it generates numerous datasets, each consisting of a predictive score and a binary target, which all have the same AUROC but vary substantially in ROC curve shape, precision, recall, and other model evaluation metrics.

Returned datasets may be useful for teaching purposes or testing the relationship between different model evaluation metrics.

Methodology

roctet creates ROC curves with a fixed AUC using either of two parameterizations: Beta or Piecewise Linear.

Whichever parameterization creates the ROC curve, the same methodology is then used to map back from the curve to simulated score / target pairs.

Beta

The ROC curve is simulated with the CDF of the Beta(a,b) distribution. While there is no theoretical link between Beta distributions and AUROC, the Beta CDF has several convenient properties for simulating an ROC curve:

  • it is continuous and monotonic, and it is concave whenever a <= 1 <= b
  • its domain and range are both [0, 1]
  • it has a closed-form AUC, b/(a+b) = r/(1+r), characterized solely by the ratio of parameters r = b/a
  • it can deliver a range of shapes controlled with b+a

For a given AUROC and control (b+a), roctet solves for the parameters of the Beta distribution and treats the resulting CDF as the ROC curve. This creates a slightly atypical ROC shape at the extremes, but is sufficient for a toy example.
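
The solve step can be sketched from one identity: the area under any CDF on [0, 1] equals one minus the mean, and the mean of Beta(a,b) is a/(a+b), so the area is b/(a+b). The snippet below is a minimal illustration under that identity, not roctet's actual code; `beta_params` and `monte_carlo_auc` are hypothetical helper names, and the check uses only the standard library.

```python
import random


def beta_params(auroc: float, total: float) -> tuple[float, float]:
    """Return (a, b) with a + b == total and area under the Beta(a, b) CDF == auroc.

    Hypothetical helper for illustration; not part of roctet's API.
    """
    if not 0.0 < auroc < 1.0:
        raise ValueError("auroc must be strictly between 0 and 1")
    if total <= 0.0:
        raise ValueError("total must be positive")
    # Area under the CDF on [0, 1] is 1 - mean = b / (a + b).
    b = auroc * total
    a = total - b
    return a, b


def monte_carlo_auc(a: float, b: float, n: int = 100_000, seed: int = 0) -> float:
    """Estimate the area under the Beta(a, b) CDF as 1 - (sample mean)."""
    rng = random.Random(seed)
    return 1.0 - sum(rng.betavariate(a, b) for _ in range(n)) / n


a, b = beta_params(auroc=0.67, total=2.0)
print(f"a={a:.3f}, b={b:.3f}, closed-form AUC={b / (a + b):.3f}")
print(f"Monte Carlo AUC estimate: {monte_carlo_auc(a, b):.3f}")
```

Holding the AUROC fixed and varying `total` changes the curve's shape without changing its area, which is the role the text gives to b+a.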

Piecewise Linear

The ROC curve is simulated as a piecewise linear function with a single breakpoint at (x,y). That is, the ROC curve runs from (0,0) through (x,y) to (1,1) and is defined by:

  • tpr = (y/x) * fpr for fpr < x
  • tpr = y + ((1-y)/(1-x)) * (fpr - x) for fpr >= x

This creates an ROC curve with an atypical "sharp bend" but, once again, is sufficient for a toy example.
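
As a rough sketch of how a breakpoint can be chosen for a target AUROC: a two-segment curve through (0,0), (x,y), and (1,1) encloses a triangle plus a trapezoid, so its area is x*y/2 + (1-x)*(1+y)/2 = (1 + y - x)/2. Fixing x therefore gives y = 2*AUROC - 1 + x. The helper names below (`knot_for_auroc`, `piecewise_roc`, `piecewise_auc`) are hypothetical and only illustrate the geometry; they are not roctet's API.

```python
def knot_for_auroc(auroc: float, x: float) -> float:
    """Return y so the curve (0,0) -> (x,y) -> (1,1) has area `auroc`."""
    y = 2.0 * auroc - 1.0 + x
    # The breakpoint must lie on or above the diagonal and inside the unit square.
    if not (0.0 < x < 1.0 and x <= y <= 1.0):
        raise ValueError("no valid breakpoint for this auroc / x combination")
    return y


def piecewise_roc(fpr: float, x: float, y: float) -> float:
    """Evaluate the two-segment ROC curve with breakpoint (x, y) at `fpr`."""
    if fpr < x:
        return (y / x) * fpr
    # Second segment joins (x, y) to (1, 1), so its slope is (1 - y) / (1 - x).
    return y + ((1.0 - y) / (1.0 - x)) * (fpr - x)


def piecewise_auc(x: float, y: float) -> float:
    """Area under the curve: a triangle plus a trapezoid."""
    return 0.5 * x * y + 0.5 * (1.0 - x) * (y + 1.0)


x = 0.2
y = knot_for_auroc(auroc=0.67, x=x)
print(f"breakpoint=({x}, {y:.2f}), AUC={piecewise_auc(x, y):.2f}")
```

Different choices of x give visibly different curves with the same area, as long as x stays at or below 2 - 2*AUROC so that y does not exceed 1.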

Score Derivation

Given an ROC curve, scores are derived in three steps:

  • Simulate a set of (fpr,tpr) points on the ROC curve
  • Calculate the implied number of positive and negative targets in each score bin
  • Randomly generate scores and assign target values within each bin

How closely the realized AUROC matches the requested value is controlled by the sample size and the number of bins used.
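
The three steps can be sketched with the standard library alone. This is an illustrative reconstruction under stated assumptions, not roctet's implementation: bins are taken on an even FPR grid, scores are drawn uniformly within each bin, and `roc`, `simulate_scores`, and `empirical_auroc` are hypothetical names. The example curve is a two-segment ROC with breakpoint (0.2, 0.54), whose area is 0.67.

```python
import random
from bisect import bisect_left, bisect_right


def roc(fpr: float) -> float:
    """Example ROC curve: two segments with breakpoint (0.2, 0.54), area 0.67."""
    if fpr < 0.2:
        return (0.54 / 0.2) * fpr
    return 0.54 + ((1.0 - 0.54) / (1.0 - 0.2)) * (fpr - 0.2)


def simulate_scores(roc_fn, n_pos, n_neg, n_bins, seed=0):
    """Return (score, target) pairs whose empirical ROC follows roc_fn at the bin edges."""
    rng = random.Random(seed)
    rows = []
    for k in range(n_bins):
        # Step 1: points on the ROC curve at the edges of bin k.
        f_lo, f_hi = k / n_bins, (k + 1) / n_bins
        # Step 2: implied counts; differencing rounded cumulative counts keeps totals exact.
        neg = round(n_neg * f_hi) - round(n_neg * f_lo)
        pos = round(n_pos * roc_fn(f_hi)) - round(n_pos * roc_fn(f_lo))
        # Step 3: the lowest-FPR bin gets the highest scores.
        lo, hi = 1.0 - f_hi, 1.0 - f_lo
        rows += [(rng.uniform(lo, hi), 1) for _ in range(pos)]
        rows += [(rng.uniform(lo, hi), 0) for _ in range(neg)]
    return rows


def empirical_auroc(rows):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, t in rows if t == 1]
    neg = sorted(s for s, t in rows if t == 0)
    wins = 0.0
    for s in pos:
        below = bisect_left(neg, s)
        ties = bisect_right(neg, s) - below
        wins += below + 0.5 * ties
    return wins / (len(pos) * len(neg))


rows = simulate_scores(roc, n_pos=2000, n_neg=2000, n_bins=20, seed=1)
print(f"rows={len(rows)}, empirical AUROC={empirical_auroc(rows):.3f}")
```

Within a bin, positives and negatives share the same score range, so the empirical curve is close to a straight line between bin edges. More bins track a curved ROC more closely, and larger samples reduce rounding and sampling noise.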

Usage

To get started, jump in to generate some datasets:

from roctet import calc_roctet

# Generate 10 datasets that each target an AUROC of 0.67
dfs = calc_roctet(auroc=0.67, n_sets=10)

# Preview the first dataset
dfs[0].glimpse()

Installation

Install from GitHub:

python -m pip install "git+https://github.com/emilyriederer/roctet.git"

Install a specific release tag:

python -m pip install "git+https://github.com/emilyriederer/roctet.git@v0.1.0"

Developer / editable install:

git clone https://github.com/emilyriederer/roctet.git
cd roctet
uv sync
uv pip install -e .
