Bayesian A/B testing with simple probabilities.

Bayesian A/B testing

bayesian_testing is a small package for quick evaluation of A/B (or A/B/C/...) tests using a Bayesian approach.

Implemented tests:

  • BinaryDataTest
    • Input data - binary data ([0, 1, 0, ...])
    • Designed for conversion-like data A/B testing.
  • NormalDataTest
    • Input data - normal data with unknown variance
    • Designed for normal data A/B testing.
  • DeltaLognormalDataTest
    • Input data - lognormal data with zeros
    • Designed for revenue-like data A/B testing.
  • DeltaNormalDataTest
    • Input data - normal data with zeros
    • Designed for profit-like data A/B testing.
  • DiscreteDataTest
    • Input data - categorical data with numerical categories
    • Designed for discrete data A/B testing (e.g. dice rolls, star ratings, 1-10 ratings, etc.).
  • PoissonDataTest
    • Input data - non-negative integers ([1, 0, 3, ...])
    • Designed for Poisson data A/B testing.
  • ExponentialDataTest
    • Input data - exponential data (non-negative real numbers)
    • Designed for exponential data A/B testing (e.g. session/waiting time, time between events, etc.).

Implemented evaluation metrics:

  • Posterior Mean
    • Expected value from the posterior distribution for a given variant.
  • Credible Interval
    • Quantile-based (i.e. empirical) credible intervals computed from simulations of the posterior distributions.
    • Interval probability (interval_alpha) can be set during the evaluation (default value is 95%).
  • Probability of Being Best
    • Probability that a given variant is best among all variants.
    • By default, "best" means the greatest value of the tested measure. Pass min_is_best=True to the evaluation method to treat the smallest value as best instead (useful when a lower measure is better).
  • Expected Loss
    • "Risk" of choosing a particular variant over the other variants in the test.
    • Measured in the same units as the tested measure (e.g. positive rate or average value).

Credible Interval, Probability of Being Best and Expected Loss are calculated using simulations from posterior distributions (considering given data).
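How these simulation-based metrics arise can be illustrated with a minimal, library-independent sketch. It assumes Beta posteriors for two conversion-rate variants; the counts, the Beta(1/2, 1/2) prior and the helper name simulate_metrics are illustrative assumptions, and this is not the package's own implementation.

```python
import random

def simulate_metrics(posteriors, sim_count=20000, seed=None,
                     min_is_best=False, interval_alpha=0.95):
    """Monte Carlo estimates from Beta posteriors given as {name: (a, b)}."""
    rng = random.Random(seed)
    names = list(posteriors)
    # One row per simulation, one posterior draw per variant.
    draws = [[rng.betavariate(*posteriors[n]) for n in names]
             for _ in range(sim_count)]
    pick = min if min_is_best else max
    sign = -1.0 if min_is_best else 1.0
    wins = {n: 0 for n in names}
    loss = {n: 0.0 for n in names}
    for row in draws:
        best = pick(row)
        wins[names[row.index(best)]] += 1
        for n, value in zip(names, row):
            # Loss: how far this variant's draw falls short of the best draw.
            loss[n] += sign * (best - value)
    tail = (1 - interval_alpha) / 2
    result = {}
    for i, n in enumerate(names):
        col = sorted(row[i] for row in draws)
        result[n] = {
            "prob_being_best": wins[n] / sim_count,
            "expected_loss": loss[n] / sim_count,
            # Empirical quantiles of the simulated draws.
            "credible_interval": (col[int(tail * (sim_count - 1))],
                                  col[int((1 - tail) * (sim_count - 1))]),
        }
    return result

# Illustrative posteriors: prior (0.5, 0.5) plus (positives, negatives).
metrics = simulate_metrics({"A": (80.5, 1420.5), "B": (80.5, 1120.5)}, seed=52)
```

Because every metric is an average over the same set of draws, the probabilities of being best sum to one across variants, and the expected loss is never negative.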

Installation

bayesian_testing can be installed using pip:

pip install bayesian_testing

Alternatively, you can clone the repository and use poetry manually:

cd bayesian_testing
pip install poetry
poetry install
poetry shell

Basic Usage

The primary features are classes:

  • BinaryDataTest
  • NormalDataTest
  • DeltaLognormalDataTest
  • DeltaNormalDataTest
  • DiscreteDataTest
  • PoissonDataTest
  • ExponentialDataTest

All test classes support two methods to insert the data:

  • add_variant_data - Adding raw data for a variant as a list of observations (or numpy 1-D array).
  • add_variant_data_agg - Adding aggregated variant data (this can be practical for large data sets, as the aggregation can already be done at the database level).

Both methods for adding data allow specification of prior distributions (see details in the respective docstrings). The default prior setup should be sufficient for most cases (e.g. cases with unknown priors or large amounts of data).

To get the results of the test, simply call the method evaluate.

Probability of being best, expected loss and credible intervals are approximated using simulations, so the evaluate method can return slightly different values on different runs. To stabilize the results, set the sim_count parameter of evaluate to a higher value (the default is 20K), or use the seed parameter to fix them completely.
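The effect of the simulation count and the seed can be sketched without the package. The example below uses only Python's standard library and two illustrative Beta posteriors; the function name prob_b_beats_a and all numbers are assumptions for demonstration, not part of bayesian_testing.

```python
import random
import statistics

def prob_b_beats_a(sim_count, seed=None):
    """Estimate P(rate_B > rate_A) from two illustrative Beta posteriors."""
    rng = random.Random(seed)
    wins = sum(
        rng.betavariate(80.5, 1120.5) > rng.betavariate(80.5, 1420.5)
        for _ in range(sim_count)
    )
    return wins / sim_count

# A fixed seed makes the estimate exactly repeatable.
first = prob_b_beats_a(2000, seed=1)
second = prob_b_beats_a(2000, seed=1)

# Spread of the estimate across 20 different seeds for two simulation sizes.
spread_small = statistics.pstdev([prob_b_beats_a(200, seed=s) for s in range(20)])
spread_large = statistics.pstdev([prob_b_beats_a(8000, seed=s) for s in range(20)])
```

The Monte Carlo error shrinks roughly with the square root of the number of simulations, so a larger simulation count gives a tighter spread at the cost of a longer run time.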

BinaryDataTest

Class for a Bayesian A/B test for binary-like data (e.g. conversions, successes, etc.).

Example:

import numpy as np
from bayesian_testing.experiments import BinaryDataTest

# generating some random data
rng = np.random.default_rng(52)
# random 1x1500 array of 0/1 data with 5.2% probability for 1:
data_a = rng.binomial(n=1, p=0.052, size=1500)
# random 1x1200 array of 0/1 data with 6.7% probability for 1:
data_b = rng.binomial(n=1, p=0.067, size=1200)

# initialize a test:
test = BinaryDataTest()

# add variant using raw data (arrays of zeros and ones):
test.add_variant_data("A", data_a)
test.add_variant_data("B", data_b)
# priors can be specified like this (default for this test is a=b=1/2):
# test.add_variant_data("B", data_b, a_prior=1, b_prior=20)

# add variant using aggregated data (same as raw data with 950 zeros and 50 ones):
test.add_variant_data_agg("C", totals=1000, positives=50)

# evaluate test:
results = test.evaluate()
results
# print(pd.DataFrame(results).set_index('variant').T.to_markdown(tablefmt="grid"))
+-------------------+-----------+-------------+-------------+
|                   | A         | B           | C           |
+===================+===========+=============+=============+
| totals            | 1500      | 1200        | 1000        |
+-------------------+-----------+-------------+-------------+
| positives         | 80        | 80          | 50          |
+-------------------+-----------+-------------+-------------+
| positive_rate     | 0.05333   | 0.06667     | 0.05        |
+-------------------+-----------+-------------+-------------+
| posterior_mean    | 0.05363   | 0.06703     | 0.05045     |
+-------------------+-----------+-------------+-------------+
| credible_interval | [0.04284, | [0.0535309, | [0.0379814, |
|                   | 0.065501] | 0.0816476]  | 0.0648625]  |
+-------------------+-----------+-------------+-------------+
| prob_being_best   | 0.06485   | 0.89295     | 0.0422      |
+-------------------+-----------+-------------+-------------+
| expected_loss     | 0.0139248 | 0.0004693   | 0.0170767   |
+-------------------+-----------+-------------+-------------+
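The posterior_mean row can be cross-checked by hand. The sketch below assumes the standard conjugate Beta-Binomial update with the default a = b = 1/2 prior mentioned in the example comments; the function name beta_posterior_mean is hypothetical and not part of the package.

```python
def beta_posterior_mean(totals, positives, a_prior=0.5, b_prior=0.5):
    """Mean of Beta(a_prior + positives, b_prior + totals - positives)."""
    return (a_prior + positives) / (a_prior + b_prior + totals)

# (name, totals, positives) as shown in the results table.
variants = [("A", 1500, 80), ("B", 1200, 80), ("C", 1000, 50)]
posterior_means = {
    name: round(beta_posterior_mean(totals, positives), 5)
    for name, totals, positives in variants
}
print(posterior_means)
```

Under this assumption the computed values agree with the posterior_mean row of the table (0.05363, 0.06703 and 0.05045).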

NormalDataTest

Class for a Bayesian A/B test for normal data.

Example:

import numpy as np
from bayesian_testing.experiments import NormalDataTest

# generating some random data
rng = np.random.default_rng(21)
data_a = rng.normal(7.2, 2, 1000)
data_b = rng.normal(7.1, 2, 800)
data_c = rng.normal(7.0, 4, 500)

# initialize a test:
test = NormalDataTest()

# add variant using raw data:
test.add_variant_data("A", data_a)
test.add_variant_data("B", data_b)
# test.add_variant_data("C", data_c)

# add variant using aggregated data:
test.add_variant_data_agg("C", len(data_c), sum(data_c), sum(np.square(data_c)))

# evaluate test:
results = test.evaluate(sim_count=20000, seed=52, min_is_best=False, interval_alpha=0.99)
results
# print(pd.DataFrame(results).set_index('variant').T.to_markdown(tablefmt="grid"))
+-------------------+-------------+-------------+-------------+
|                   | A           | B           | C           |
+===================+=============+=============+=============+
| totals            | 1000        | 800         | 500         |
+-------------------+-------------+-------------+-------------+
| sum_values        | 7294.67901  | 5685.86168  | 3736.91581  |
+-------------------+-------------+-------------+-------------+
| avg_values        | 7.29468     | 7.10733     | 7.47383     |
+-------------------+-------------+-------------+-------------+
| posterior_mean    | 7.29462     | 7.10725     | 7.4737      |
+-------------------+-------------+-------------+-------------+
| credible_interval | [7.1359436, | [6.9324733, | [7.0240102, |
|                   | 7.4528369]  | 7.2779293]  | 7.9379341]  |
+-------------------+-------------+-------------+-------------+
| prob_being_best   | 0.1707      | 0.00125     | 0.82805     |
+-------------------+-------------+-------------+-------------+
| expected_loss     | 0.1968735   | 0.385112    | 0.0169998   |
+-------------------+-------------+-------------+-------------+

DeltaLognormalDataTest

Class for a Bayesian A/B test for delta-lognormal data (log-normal with zeros). Delta-lognormal data is a typical case of revenue-per-session data, where many sessions have 0 revenue and the non-zero values are positive with a possibly log-normal distribution. To handle this data, the calculation combines a binary Bayes model for zero vs non-zero "conversions" with a log-normal model for the non-zero values.

Example:

import numpy as np
from bayesian_testing.experiments import DeltaLognormalDataTest

test = DeltaLognormalDataTest()

data_a = [7.1, 0.3, 5.9, 0, 1.3, 0.3, 0, 1.2, 0, 3.6, 0, 1.5,
          2.2, 0, 4.9, 0, 0, 1.1, 0, 0, 7.1, 0, 6.9, 0]
data_b = [4.0, 0, 3.3, 19.3, 18.5, 0, 0, 0, 12.9, 0, 0, 0, 10.2,
          0, 0, 23.1, 0, 3.7, 0, 0, 11.3, 10.0, 0, 18.3, 12.1]

# adding variant using raw data:
test.add_variant_data("A", data_a)
# test.add_variant_data("B", data_b)

# alternatively, a variant can also be added using aggregated data
# (looks more complicated, but it can be quite handy for large data sets):
test.add_variant_data_agg(
    name="B",
    totals=len(data_b),
    positives=sum(x > 0 for x in data_b),
    sum_values=sum(data_b),
    sum_logs=sum([np.log(x) for x in data_b if x > 0]),
    sum_logs_2=sum([np.square(np.log(x)) for x in data_b if x > 0])
)

# evaluate test:
results = test.evaluate(seed=21)
results
# print(pd.DataFrame(results).set_index('variant').T.to_markdown(tablefmt="grid"))
+---------------------+-------------+-------------+
|                     | A           | B           |
+=====================+=============+=============+
| totals              | 24          | 25          |
+---------------------+-------------+-------------+
| positives           | 13          | 12          |
+---------------------+-------------+-------------+
| sum_values          | 43.4        | 146.7       |
+---------------------+-------------+-------------+
| avg_values          | 1.80833     | 5.868       |
+---------------------+-------------+-------------+
| avg_positive_values | 3.33846     | 12.225      |
+---------------------+-------------+-------------+
| posterior_mean      | 2.09766     | 6.19017     |
+---------------------+-------------+-------------+
| credible_interval   | [0.9884509, | [3.3746212, |
|                     | 6.9054963]  | 11.7349253] |
+---------------------+-------------+-------------+
| prob_being_best     | 0.04815     | 0.95185     |
+---------------------+-------------+-------------+
| expected_loss       | 4.0941101   | 0.1588627   |
+---------------------+-------------+-------------+

Note: Alternatively, DeltaNormalDataTest can be used when the non-zero values are not necessarily positive.
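The two-part structure of delta-lognormal data can be made concrete with a short sketch. It builds the same aggregates that the example passes to add_variant_data_agg and combines them into a simple plug-in point estimate, P(non-zero) times the log-normal mean exp(mu + sigma^2 / 2). The helper names are hypothetical, and the plug-in estimate is a simplification for illustration, not the posterior mean reported by the package.

```python
import math

def delta_lognormal_aggregates(data):
    """Aggregate raw values into totals, positives, sums and log sums."""
    logs = [math.log(x) for x in data if x > 0]
    return {
        "totals": len(data),
        "positives": len(logs),
        "sum_values": sum(data),
        "sum_logs": sum(logs),
        "sum_logs_2": sum(v * v for v in logs),
    }

def plug_in_mean(agg):
    """Point estimate: share of non-zero values times the log-normal mean."""
    k = agg["positives"]
    p_nonzero = k / agg["totals"]
    mu = agg["sum_logs"] / k
    sigma2 = (agg["sum_logs_2"] - k * mu ** 2) / (k - 1)
    return p_nonzero * math.exp(mu + sigma2 / 2)

data_b = [4.0, 0, 3.3, 19.3, 18.5, 0, 0, 0, 12.9, 0, 0, 0, 10.2,
          0, 0, 23.1, 0, 3.7, 0, 0, 11.3, 10.0, 0, 18.3, 12.1]
agg_b = delta_lognormal_aggregates(data_b)
estimate_b = plug_in_mean(agg_b)
```

The zero and non-zero parts are estimated separately and only multiplied at the end, which mirrors the combination of a binary model and a log-normal model described above.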

DiscreteDataTest

Class for a Bayesian A/B test for discrete data with a finite number of numerical categories (states), each representing some value. This test can be used, for instance, for dice roll data (when looking for the "best" of multiple dice) or rating data (e.g. 1-5 stars or a 1-10 scale).

Example:

from bayesian_testing.experiments import DiscreteDataTest

# dice rolls data for 3 dice - A, B, C
data_a = [2, 5, 1, 4, 6, 2, 2, 6, 3, 2, 6, 3, 4, 6, 3, 1, 6, 3, 5, 6]
data_b = [1, 2, 2, 2, 2, 3, 2, 3, 4, 2]
data_c = [1, 3, 6, 5, 4]

# initialize a test with all possible states (i.e. numerical categories):
test = DiscreteDataTest(states=[1, 2, 3, 4, 5, 6])

# add variant using raw data:
test.add_variant_data("A", data_a)
test.add_variant_data("B", data_b)
test.add_variant_data("C", data_c)

# add variant using aggregated data:
# test.add_variant_data_agg("C", [1, 0, 1, 1, 1, 1]) # equivalent to rolls in data_c

# evaluate test:
results = test.evaluate(sim_count=20000, seed=52, min_is_best=False, interval_alpha=0.95)
results
# print(pd.DataFrame(results).set_index('variant').T.to_markdown(tablefmt="grid"))
+-------------------+------------------+------------------+------------------+
|                   | A                | B                | C                |
+===================+==================+==================+==================+
| concentration     | {1: 2.0, 2: 4.0, | {1: 1.0, 2: 6.0, | {1: 1.0, 2: 0.0, |
|                   | 3: 4.0, 4: 2.0,  | 3: 2.0, 4: 1.0,  | 3: 1.0, 4: 1.0,  |
|                   | 5: 2.0, 6: 6.0}  | 5: 0.0, 6: 0.0}  | 5: 1.0, 6: 1.0}  |
+-------------------+------------------+------------------+------------------+
| average_value     | 3.8              | 2.3              | 3.8              |
+-------------------+------------------+------------------+------------------+
| posterior_mean    | 3.73077          | 2.75             | 3.63636          |
+-------------------+------------------+------------------+------------------+
| credible_interval | [3.0710797,      | [2.1791584,      | [2.6556465,      |
|                   | 4.3888021]       | 3.4589178]       | 4.5784839]       |
+-------------------+------------------+------------------+------------------+
| prob_being_best   | 0.54685          | 0.008            | 0.44515          |
+-------------------+------------------+------------------+------------------+
| expected_loss     | 0.199953         | 1.1826766        | 0.2870247        |
+-------------------+------------------+------------------+------------------+
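The gap between average_value and posterior_mean in the table can be explored with a small sketch. It assumes a Dirichlet prior with concentration 1 for every state, which is an assumption chosen because it reproduces the printed values, not something confirmed by the package documentation here; the function name is hypothetical.

```python
from collections import Counter

def dirichlet_posterior_mean(states, data, prior=1.0):
    """Expected state value under a Dirichlet(prior + counts) posterior."""
    counts = Counter(data)
    concentration = [prior + counts[s] for s in states]
    total = sum(concentration)
    return sum(s * c for s, c in zip(states, concentration)) / total

states = [1, 2, 3, 4, 5, 6]
rolls = {
    "A": [2, 5, 1, 4, 6, 2, 2, 6, 3, 2, 6, 3, 4, 6, 3, 1, 6, 3, 5, 6],
    "B": [1, 2, 2, 2, 2, 3, 2, 3, 4, 2],
    "C": [1, 3, 6, 5, 4],
}
posterior_means = {
    name: round(dirichlet_posterior_mean(states, data), 5)
    for name, data in rolls.items()
}
print(posterior_means)
```

With few observations the prior pulls each estimate toward the uniform-die mean of 3.5, which is why variant B moves from an observed 2.3 to 2.75.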

PoissonDataTest

Class for a Bayesian A/B test for Poisson data.

Example:

from bayesian_testing.experiments import PoissonDataTest

# goals conceded, so fewer is better
psg_goals_against = [0, 2, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 3, 1, 0]
city_goals_against = [0, 0, 3, 2, 0, 1, 0, 3, 0, 1, 1, 0, 1, 2]
bayern_goals_against = [1, 0, 0, 1, 1, 2, 1, 0, 2, 0, 0, 2, 2, 1, 0]

# initialize a test:
test = PoissonDataTest()

# add variant using raw data:
test.add_variant_data('psg', psg_goals_against)

# example with specific priors
# ("b_prior" as an effective sample size, and "a_prior/b_prior" as a prior mean):
test.add_variant_data('city', city_goals_against, a_prior=3, b_prior=1)
# test.add_variant_data('bayern', bayern_goals_against)

# add variant using aggregated data:
test.add_variant_data_agg("bayern", len(bayern_goals_against), sum(bayern_goals_against))

# evaluate test (since fewer goals is better, we explicitly set the min_is_best to True)
results = test.evaluate(sim_count=20000, seed=52, min_is_best=True)
results
# print(pd.DataFrame(results).set_index('variant').T.to_markdown(tablefmt="grid"))
+-------------------+-------------+-------------+------------+
|                   | psg         | city        | bayern     |
+===================+=============+=============+============+
| totals            | 15          | 14          | 15         |
+-------------------+-------------+-------------+------------+
| sum_values        | 9           | 14          | 13         |
+-------------------+-------------+-------------+------------+
| observed_average  | 0.6         | 1.0         | 0.86667    |
+-------------------+-------------+-------------+------------+
| posterior_mean    | 0.60265     | 1.13333     | 0.86755    |
+-------------------+-------------+-------------+------------+
| credible_interval | [0.2800848, | [0.6562029, | [0.465913, |
|                   | 1.0570327]  | 1.7265045]  | 1.3964389] |
+-------------------+-------------+-------------+------------+
| prob_being_best   | 0.78175     | 0.0344      | 0.18385    |
+-------------------+-------------+-------------+------------+
| expected_loss     | 0.0369998   | 0.5620553   | 0.3003345  |
+-------------------+-------------+-------------+------------+

Note: Since we set min_is_best=True (because received goals are "bad"), probability and loss favor variants with lower posterior means.
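The role of the priors in the 'city' variant can be checked by hand. The sketch assumes the usual conjugate Gamma-Poisson update, which is consistent with the example's description of b_prior as an effective sample size and a_prior/b_prior as a prior mean; the function name is hypothetical.

```python
def gamma_poisson_posterior_mean(totals, sum_values, a_prior, b_prior):
    """Mean of a Gamma(a_prior + sum_values, b_prior + totals) posterior on the rate."""
    return (a_prior + sum_values) / (b_prior + totals)

# 'city': 14 matches, 14 goals against, prior mean 3 with weight of 1 match.
city_mean = gamma_poisson_posterior_mean(totals=14, sum_values=14, a_prior=3, b_prior=1)
print(round(city_mean, 5))  # 1.13333
```

The prior acts like one extra match with three goals conceded, which moves the estimate from the observed average of 1.0 to the 1.13333 shown in the table.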

ExponentialDataTest

Class for a Bayesian A/B test for exponential data.

Example:

import numpy as np
from bayesian_testing.experiments import ExponentialDataTest

# waiting times for 3 different variants, each with many observations,
# generated using exponential distributions with defined scales (expected values);
# no seed is set here, so each run produces different data and results
waiting_times_a = np.random.exponential(scale=10, size=200)
waiting_times_b = np.random.exponential(scale=11, size=210)
waiting_times_c = np.random.exponential(scale=11, size=220)

# initialize a test:
test = ExponentialDataTest()
# adding variants using the observation data:
test.add_variant_data('A', waiting_times_a)
test.add_variant_data('B', waiting_times_b)
test.add_variant_data('C', waiting_times_c)

# alternatively, add variants using aggregated data:
# test.add_variant_data_agg('A', len(waiting_times_a), sum(waiting_times_a))

# evaluate test (since a lower waiting time is better, we set the min_is_best to True)
results = test.evaluate(sim_count=20000, min_is_best=True)
results
# print(pd.DataFrame(results).set_index('variant').T.to_markdown(tablefmt="grid"))
+-------------------+-------------+-------------+-------------+
|                   | A           | B           | C           |
+===================+=============+=============+=============+
| totals            | 200         | 210         | 220         |
+-------------------+-------------+-------------+-------------+
| sum_values        | 1827.81709  | 2217.46016  | 2160.73134  |
+-------------------+-------------+-------------+-------------+
| observed_average  | 9.13909     | 10.55933    | 9.82151     |
+-------------------+-------------+-------------+-------------+
| posterior_mean    | 9.13502     | 10.55478    | 9.8175      |
+-------------------+-------------+-------------+-------------+
| credible_interval | [7.994178,  | [9.2543372, | [8.6184821, |
|                   | 10.5410967] | 12.1527256] | 11.2566538] |
+-------------------+-------------+-------------+-------------+
| prob_being_best   | 0.7456      | 0.0405      | 0.2139      |
+-------------------+-------------+-------------+-------------+
| expected_loss     | 0.1428729   | 1.5674747   | 0.8230728   |
+-------------------+-------------+-------------+-------------+
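How min_is_best changes the probability of being best can be sketched with the aggregates from the table. The example assumes a Gamma posterior on the exponential rate and uses a_prior = b_prior = 0.1 purely as illustrative prior values; neither the helper name nor these priors are confirmed package defaults.

```python
import random

def prob_smallest_mean(aggregates, sim_count=20000, seed=None,
                       a_prior=0.1, b_prior=0.1):
    """Estimate P(variant has the smallest mean) from {name: (totals, sum_values)}.

    a_prior and b_prior are illustrative values, not confirmed package defaults.
    """
    rng = random.Random(seed)
    names = list(aggregates)
    wins = dict.fromkeys(names, 0)
    for _ in range(sim_count):
        means = []
        for name in names:
            totals, sum_values = aggregates[name]
            # Rate ~ Gamma(shape=a + totals, rate=b + sum); gammavariate takes a scale.
            rate = rng.gammavariate(a_prior + totals, 1.0 / (b_prior + sum_values))
            means.append(1.0 / rate)
        # With "min is best", the winner of each simulation is the smallest mean.
        wins[names[means.index(min(means))]] += 1
    return {name: wins[name] / sim_count for name in names}

probs = prob_smallest_mean(
    {"A": (200, 1827.81709), "B": (210, 2217.46016), "C": (220, 2160.73134)},
    seed=52,
)
```

Only the count and the sum of the observations enter the calculation, which is why the aggregated form of the input needs just those two values per variant.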

Development

To set up a development environment, use Poetry and pre-commit:

pip install poetry
poetry install
poetry run pre-commit install

To be implemented

Additional metrics:

  • Potential Value Remaining
