A package to perform hypothesis testing and compute confidence intervals using bootstrapping.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Bstrap: A Python Package for confidence intervals and hypothesis testing using bootstrapping.

You are an amazing machine learning researcher.

You invented a new super cool method.

You are not sure that it is significantly better than your baseline.

You don't have 3000 GPUs to rerun your experiment and check it out.

Then, what you want to do is bootstrap your results!

The bstrap package allows you to compare two methods and claim that one is better than the other.

Installation

pip install bstrap

That's all you need, really.

Still, you may want to read the instructions and check out the examples to make sure you get it right.

Features

Bootstrapping is a simple method to compute statistics over your custom metrics, using only one run of the method for each sample in your evaluation set. It is very versatile and can be used with essentially any metric.

Bootstrapping for computing confidence intervals
Bootstrapping for hypothesis testing (claim that one method is better than another for a given metric)
Supports metrics that can be computed sample-wise and metrics that cannot.
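
To make the idea concrete, here is a minimal sketch of a percentile bootstrap confidence interval in plain Python. It is not bstrap's implementation: the function name `percentile_bootstrap_ci`, its parameters, and the loss values are assumptions made up for illustration.

```python
import random
import statistics


def percentile_bootstrap_ci(samples, metric, nbr_runs=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) for metric(samples).

    Hypothetical helper for illustration only, not part of bstrap.
    """
    rng = random.Random(seed)
    n = len(samples)
    estimates = []
    for _ in range(nbr_runs):
        # Draw n samples with replacement and recompute the metric.
        resample = rng.choices(samples, k=n)
        estimates.append(metric(resample))
    estimates.sort()
    # Read the bounds off the sorted bootstrap estimates (percentile method).
    lower = estimates[int((alpha / 2) * nbr_runs)]
    upper = estimates[min(nbr_runs - 1, int((1 - alpha / 2) * nbr_runs))]
    return metric(samples), lower, upper


# Made-up per-sample losses from a single evaluation run.
losses = [0.42, 0.35, 0.51, 0.38, 0.47, 0.40, 0.44, 0.36, 0.49, 0.41]
point, lower, upper = percentile_bootstrap_ci(losses, statistics.mean)
print(point, lower, upper)
```

Only one evaluation run is needed: the variability comes from resampling the evaluation set, not from rerunning the method.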

Keep in mind: non-overlapping confidence intervals mean that there is a statistically significant difference. Overlapping confidence intervals do not mean that there is no statistically significant difference. To check this case, you need to run the bootstrap hypothesis test and look at the p-value.
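
The caveat above can be illustrated with a small paired bootstrap sketch in plain Python. This is one simple variant of a bootstrap test and is not necessarily how bstrap computes its p-value: the helper `paired_bootstrap_p_value`, the data, and the assumption that lower metric values are better are all made up for illustration. The per-sample losses vary widely, so separate confidence intervals for the two methods would overlap heavily, yet method 2 is slightly better on every single sample.

```python
import random
import statistics


def paired_bootstrap_p_value(data1, data2, metric, nbr_runs=1000, seed=0):
    """Fraction of paired resamples in which method 2 fails to beat method 1.

    Hypothetical helper for illustration only. Lower metric values are
    assumed to be better (e.g. a loss).
    """
    assert len(data1) == len(data2)
    rng = random.Random(seed)
    n = len(data1)
    reversed_count = 0
    for _ in range(nbr_runs):
        # Use the same resampled indices for both methods (paired resampling).
        idx = [rng.randrange(n) for _ in range(n)]
        m1 = metric([data1[i] for i in idx])
        m2 = metric([data2[i] for i in idx])
        if m2 >= m1:
            reversed_count += 1
    return reversed_count / nbr_runs


# Made-up losses: large spread across samples, small but consistent gain.
loss_method1 = [0.10, 0.90, 0.30, 0.70, 0.50, 0.20, 0.80, 0.40, 0.60, 0.55]
gains = [0.02, 0.03, 0.01, 0.02, 0.03, 0.01, 0.02, 0.03, 0.01, 0.02]
loss_method2 = [a - g for a, g in zip(loss_method1, gains)]

p_value = paired_bootstrap_p_value(loss_method1, loss_method2, statistics.mean)
print(p_value)
```

Because both methods are evaluated on the same resampled indices, the shared per-sample variation cancels out, which is why a paired test can detect a difference that the separate intervals hide.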

Instructions

You will need to implement your metric and provide the data sample-wise as a single pandas dataframe for each method. That's about it. Is your metric more complex than simply averaging results over samples? For example, metrics such as AUC or mAP cannot be computed sample-wise. In that case, provide your predictions and ground truths sample-wise instead; bootstrapping works with that too.
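
As a rough sketch of why sample-wise predictions and ground truths are enough, the plain-Python snippet below resamples whole rows and recomputes F1 on the resampled set. It does not use bstrap: the `f1_score` helper and the rows are assumptions made up for illustration.

```python
import random


def f1_score(rows):
    """F1 over a list of (ground truth, prediction) boolean pairs.

    Hypothetical helper for illustration only.
    """
    tp = sum(1 for gt, pred in rows if gt and pred)
    fp = sum(1 for gt, pred in rows if not gt and pred)
    fn = sum(1 for gt, pred in rows if gt and not pred)
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


# One row per evaluation sample: (ground truth, prediction).
rows = [(True, True), (True, False), (False, False), (True, True),
        (False, True), (False, False), (True, True), (False, False)]

# Resample rows with replacement; each ground truth stays with its prediction.
rng = random.Random(0)
resampled_rows = rng.choices(rows, k=len(rows))

# F1 has no per-sample value, so it is recomputed on each resampled set.
print(f1_score(rows), f1_score(resampled_rows))
```

The metric function only ever sees a set of rows and returns one scalar, which is the same contract the instructions below ask for.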

To use this code, you need to:

Implement your own metric: it should take a single pandas dataframe of data as input and return a scalar value.
Load your data.
Reformat data to a single pandas dataframe per method with standardized column names, and one sample per row.
Check that your estimates (confidence interval and p-value) are stable over several runs of the bootstrapping method. If the estimates are not stable, increase nbr_runs.
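
The stability check in the last step can be sketched as follows, in plain Python and without bstrap. The helpers `bootstrap_lower_bound` and `spread_over_repeats`, the seeds, and the data are assumptions made up for illustration. The idea is to repeat the whole bootstrap a few times and look at how much an estimate (here, the lower confidence bound of the mean) moves between repeats.

```python
import random
import statistics


def bootstrap_lower_bound(samples, nbr_runs, seed, alpha=0.05):
    """Lower percentile bound of the bootstrapped mean (illustrative helper)."""
    rng = random.Random(seed)
    n = len(samples)
    estimates = sorted(
        statistics.mean(rng.choices(samples, k=n)) for _ in range(nbr_runs)
    )
    return estimates[int((alpha / 2) * nbr_runs)]


def spread_over_repeats(samples, nbr_runs, repeats=5):
    """Max minus min of the lower bound over several independent repeats."""
    bounds = [bootstrap_lower_bound(samples, nbr_runs, seed)
              for seed in range(repeats)]
    return max(bounds) - min(bounds)


# Made-up per-sample losses.
losses = [0.42, 0.35, 0.51, 0.38, 0.47, 0.40, 0.44, 0.36, 0.49, 0.41]

# Compare how much the estimate moves for a small and a larger nbr_runs.
for nbr_runs in (50, 2000):
    print(nbr_runs, spread_over_repeats(losses, nbr_runs))
```

If the spread is large relative to the precision you want to report, that is a sign to increase nbr_runs.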

Enjoy!

Usage

You can find example dataframes under src/bstrap/example_dataframes.

Example 1: Mean metric

import pandas as pd
import numpy as np
from bstrap import bootstrap, boostrapping_CI

# 1. implement metric
metric = np.mean

# 2. load data
df = pd.read_csv("example_dataframes/example_dataframe_mean.csv")

# 3. reformat data to a single pandas dataframe per method
data_method1 = df["loss_method_1"]
data_method2 = df["loss_method_2"]

# 4. compare method 1 and 2
stats_method1, stats_method2, p_value = bootstrap(metric, data_method1, data_method2, nbr_runs=1000)
print(stats_method1)
print(stats_method2)
print(p_value)

# compute CI and mean for each method separately
stats_method1 = boostrapping_CI(metric, data_method1, nbr_runs=1000)
stats_method2 = boostrapping_CI(metric, data_method2, nbr_runs=1000)
print(stats_method1)
print(stats_method2)

Example 2: F1 score

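import sys
import numpy as np
import pandas as pd
from bstrap import bootstrap, boostrapping_CI

# 1. implement metric
def compute_f1(data):
    # cast ground truths and predictions to booleans so they can be combined logically
    val_target = data["gt"].astype('bool')
    val_predict = data["predictions"].astype('bool')
    tp = np.count_nonzero(val_target & val_predict)   # true positives
    fp = np.count_nonzero(~val_target & val_predict)  # false positives
    fn = np.count_nonzero(val_target & ~val_predict)  # false negatives
    # F1 = tp / (tp + (fp + fn) / 2); epsilon avoids a division by zero
    return tp * 1. / (tp + 0.5 * (fp + fn) + sys.float_info.epsilon)
metric = compute_f1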

# 2. load data
df = pd.read_csv("example_dataframes/example_dataframe_f1.csv")

# 3. reformat data to a single pandas dataframe per method with standardized column names
data_method1 = df[["gt", "method1"]]
data_method1 = data_method1.rename(columns={"method1": "predictions"})
data_method2 = df[["gt", "method2"]]
data_method2 = data_method2.rename(columns={"method2": "predictions"})

# 4. compare method 1 and 2 (same code as example 1)
stats_method1, stats_method2, p_value = bootstrap(metric, data_method1, data_method2, nbr_runs=1000)
print(stats_method1)
print(stats_method2)
print(p_value)

# compute CI and mean for each method separately (same code as example 1)
stats_method1 = boostrapping_CI(metric, data_method1, nbr_runs=1000)
stats_method2 = boostrapping_CI(metric, data_method2, nbr_runs=1000)
print(stats_method1)
print(stats_method2)

Example 3: AUC

import pandas as pd
from sklearn.metrics import auc, roc_curve
from bstrap import bootstrap, boostrapping_CI

# 1. implement metric
def compute_auc(data):
    fpr, tpr, thresholds = roc_curve(data["gt"], data["predictions"], pos_label=1)
    return auc(fpr, tpr)
metric = compute_auc

# 2. load data
df = pd.read_csv("example_dataframes/example_dataframe_auc.csv")

# 3. reformat data to a single pandas dataframe per method with standardized column names
data_method1 = df[["gt", "method1"]]
data_method1 = data_method1.rename(columns={"method1": "predictions"})
data_method2 = df[["gt", "method2"]]
data_method2 = data_method2.rename(columns={"method2": "predictions"})

# 4. compare method 1 and 2 (same code as example 1)
stats_method1, stats_method2, p_value = bootstrap(metric, data_method1, data_method2, nbr_runs=1000)
print(stats_method1)
print(stats_method2)
print(p_value)

# compute CI and mean for each method separately (same code as example 1)
stats_method1 = boostrapping_CI(metric, data_method1, nbr_runs=1000)
stats_method2 = boostrapping_CI(metric, data_method2, nbr_runs=1000)
print(stats_method1)
print(stats_method2)

Example 4: Multiclass: mean Average Precision (mAP)

import pandas as pd
from sklearn.metrics import average_precision_score
from bstrap import bootstrap, boostrapping_CI

# 1. implement metric
def compute_mAP(data):
    gt = data[[column for column in data.columns if 'gt' in column]]
    predictions = data[[column for column in data.columns if 'pred' in column]]
    return average_precision_score(gt, predictions, average='weighted')
metric = compute_mAP

# 2. load data
gt = pd.read_csv("example_dataframes/example_dataframe_mAP_gt.csv")
predictions_method1 = pd.read_csv("example_dataframes/example_dataframe_mAP_predictions_method1.csv")
predictions_method2 = pd.read_csv("example_dataframes/example_dataframe_mAP_predictions_method2.csv")

# 3. reformat data to a single pandas dataframe per method with standardized column names
gt = gt.rename(columns=dict([(column, 'gt_' + column) for column in gt.columns]))
predictions_method1 = predictions_method1.rename(
    columns=dict([(column, 'pred_' + column) for column in predictions_method1.columns]))
predictions_method2 = predictions_method2.rename(
    columns=dict([(column, 'pred_' + column) for column in predictions_method2.columns]))
data_method1 = pd.concat([gt, predictions_method1], axis=1)
data_method2 = pd.concat([gt, predictions_method2], axis=1)

# 4. compare method 1 and 2 (same code as example 1)
stats_method1, stats_method2, p_value = bootstrap(metric, data_method1, data_method2, nbr_runs=100)
print(stats_method1)
print(stats_method2)
print(p_value)

# compute CI and mean for each method separately (same code as example 1)
stats_method1 = boostrapping_CI(metric, data_method1, nbr_runs=100)
stats_method2 = boostrapping_CI(metric, data_method2, nbr_runs=100)
print(stats_method1)
print(stats_method2)

Reference:
Efron, B. and Tibshirani, R.J., 1994. An introduction to the bootstrap. CRC press.

Release history

This version

0.0.9

Aug 24, 2021

0.0.8

Aug 21, 2021

0.0.7

Aug 21, 2021

0.0.6

Aug 21, 2021

0.0.5

Aug 21, 2021

0.0.4

Aug 21, 2021

Download files

Source Distribution

bstrap-0.0.9.tar.gz (5.7 kB view details)

Uploaded Aug 24, 2021 Source

Built Distribution

bstrap-0.0.9-py3-none-any.whl (6.6 kB view details)

Uploaded Aug 24, 2021 Python 3

File details

Details for the file bstrap-0.0.9.tar.gz.

File metadata

Download URL: bstrap-0.0.9.tar.gz
Upload date: Aug 24, 2021
Size: 5.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.1 CPython/3.9.1

File hashes

Hashes for bstrap-0.0.9.tar.gz
Algorithm	Hash digest
SHA256	`82e56aef03e9f4af1ea446a85a5b28e1dc1a061e7e1f43292c2a1ed1fc8aad0e`
MD5	`17011932b447a7f729934e89bd4c0f35`
BLAKE2b-256	`161ba975216a2aff690ccc8e72d83f4a5271508f1c7a8c6024d59018a3ce8e24`

File details

Details for the file bstrap-0.0.9-py3-none-any.whl.

File metadata

Download URL: bstrap-0.0.9-py3-none-any.whl
Upload date: Aug 24, 2021
Size: 6.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.1 CPython/3.9.1

File hashes

Hashes for bstrap-0.0.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d4ae68ab4f0dd902d328cbeea6e0a9c07bd4ee142cd044e731d165fc93b92a75`
MD5	`7303b4026fbbad560b99f8cf4d4b2999`
BLAKE2b-256	`0a4056e6735a2c9509bafcdceead24fabc77a1f37befdc1b4d65fdf66a87b86f`
