
A/B test framework: power analysis, frequentist & Bayesian tests, CUPED, sequential testing, novelty detection


abtest-framework

A production-grade A/B testing library for data scientists. Covers the full experiment lifecycle — from power analysis through Bayesian decision-making — with a Streamlit dashboard for interactive analysis.

Features

Module | What it does
--- | ---
power | Sample size & MDE calculations for proportions and means
frequentist | z-test, Welch's t-test, CUPED variance reduction, Mann-Whitney U test, Bonferroni and Benjamini-Hochberg (BH) corrections
bayesian | Beta-Binomial & Normal-Normal Bayesian tests with expected loss
sequential | SPRT, always-valid p-values, novelty/primacy effect detection, sample ratio mismatch (SRM) check
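The frequentist module lists two multiple-testing corrections. The sketch below shows how they differ. It is not the library's code: `bonferroni` and `benjamini_hochberg` are hypothetical helpers that return one reject/keep flag per p-value. Bonferroni controls the family-wise error rate. BH is the standard step-up procedure that controls the false discovery rate.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up procedure (controls the false discovery rate).

    Sort p-values ascending, find the largest rank k with p_(k) <= k/m * alpha,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank  # keep the largest passing rank (step-up)
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))          # [True, False, False, True]
print(benjamini_hochberg(pvals))  # [True, True, True, True]
```

With four metrics, Bonferroni requires every p-value to clear 0.0125. BH compares the k-th smallest p-value against k/4 × 0.05, so it rejects more hypotheses at the cost of controlling a weaker error rate.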

Quick Start

from abtest.power import sample_size_proportion
from abtest.frequentist import proportion_test, cuped_test
from abtest.bayesian import bayesian_proportion_test
from abtest.sequential import SPRTMonitor, detect_novelty_effect, check_srm

# 1. Pre-experiment: how many users do I need?
result = sample_size_proportion(baseline_rate=0.10, mde=0.02, alpha=0.05, power=0.80)
print(result)
# Per-group n: 3,841  |  Total n: 7,682

# 2. Sanity check before analysis
srm = check_srm(actual_ctrl=3841, actual_trt=3841)
print(srm["message"])  # ✅ No SRM detected

# 3. Frequentist test
result = proportion_test(388, 3841, 469, 3841, alpha=0.05)
print(result)
# Lift: +20.9%  |  p=0.0033  |  ✅ Significant

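# 4. CUPED variance reduction (secondary metric)
# Supply your own data: per-user in-experiment revenue for each arm, plus the
# same users' pre-experiment revenue as the covariate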
cuped_result = cuped_test(ctrl_revenue, trt_revenue, ctrl_pre_revenue, trt_pre_revenue)
print(f"Variance reduced by {cuped_result.variance_reduction_pct}%")

# 5. Bayesian decision
bayes = bayesian_proportion_test(388, 3841, 469, 3841, loss_threshold=0.001)
print(f"P(treatment better): {bayes.prob_treatment_better:.1%}")
print(f"Decision: {bayes.decision}")
# P(treatment better): 99.8%
# Decision: ✅ Ship Treatment

# 6. Sequential monitoring (stop early when evidence is strong)
# data_stream: your own iterable of (control, treatment) observation pairs
monitor = SPRTMonitor(p0=0.10, p1=0.12, alpha=0.05, beta=0.20)
for ctrl_obs, trt_obs in data_stream:
    result = monitor.update(ctrl_obs, trt_obs)
    if result.decision != "continue":
        print(result)
        break

# 7. Check for novelty effects
# ctrl_timeseries / trt_timeseries: your own per-period metric series for each arm
novelty = detect_novelty_effect(ctrl_timeseries, trt_timeseries, n_windows=4)
print(novelty)
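The numbers printed in steps 1 and 3 can be cross-checked with textbook formulas. This is a sketch, not the library's implementation. `n_per_group` and `two_proportion_z_test` are hypothetical helpers built only on the standard library. The sketch assumes that `mde` is an absolute difference (0.10 → 0.12) and that step 3 uses a pooled-variance two-sided z-test. Under those assumptions it reproduces the 3,841 per-group sample size, the +20.9% lift, and p=0.0033 shown above.

```python
from math import ceil, erfc, sqrt
from statistics import NormalDist


def n_per_group(baseline_rate, mde, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion z-test (mde is absolute)."""
    p1, p2 = baseline_rate, baseline_rate + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil((numerator / mde) ** 2)


def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Pooled two-proportion z-test; returns (relative lift, z, two-sided p)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return p_t / p_c - 1, z, p_value


print(n_per_group(0.10, 0.02))  # 3841
lift, z, p = two_proportion_z_test(388, 3841, 469, 3841)
print(f"Lift: {lift:+.1%}  |  p={p:.4f}")  # Lift: +20.9%  |  p=0.0033
```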

Installation

# From PyPI
pip install abtest-framework

# From source
git clone https://github.com/yourusername/abtest-framework
cd abtest-framework
pip install -e ".[app]"

Streamlit Dashboard

streamlit run app.py

The dashboard has four pages:

  • Power Analysis — interactive sample size calculator with power curves
  • Significance Tests — proportion z-test, CUPED, multiple testing correction
  • Bayesian Analysis — posterior visualisation and expected loss decisions
  • Sequential & Novelty — SPRT monitoring chart and novelty effect detector

Project Structure

abtest-framework/
├── abtest/
│   ├── __init__.py       # Clean public API
│   ├── power.py          # Sample size & MDE calculations
│   ├── frequentist.py    # z-test, t-test, CUPED, BH correction
│   ├── bayesian.py       # Beta-Binomial & Normal-Normal Bayesian tests
│   └── sequential.py     # SPRT, always-valid p-values, novelty detection
├── tests/
│   └── test_abtest.py    # 23 unit tests (pytest)
├── examples/
│   └── full_experiment_walkthrough.py
├── app.py                # Streamlit dashboard
├── pyproject.toml        # Package config (PyPI-ready)
└── README.md

Key Concepts Implemented

CUPED (Controlled-experiment Using Pre-Experiment Data)

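Variance reduction technique from Microsoft Research (Deng et al., 2013). It regresses out a pre-experiment covariate X (typically the same metric measured before the experiment) from the metric Y: Y_cuped = Y − θ(X − mean(X)), with θ = cov(X, Y) / var(X). Because X is unaffected by the treatment, the effect estimate stays unbiased. The variance shrinks by a factor of (1 − ρ²), where ρ is the correlation between X and Y, so the same power is reached with fewer users.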
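A minimal sketch of the CUPED adjustment on synthetic data is shown below. `cuped_adjust` is a hypothetical helper, not the library's `cuped_test`. It assumes θ is estimated once on both arms pooled and that the same θ is then applied to each arm, which is the usual practice. The data are simulated so that pre-period revenue predicts in-experiment revenue with ρ² ≈ 0.91.

```python
import random
from statistics import mean, variance


def cuped_adjust(y_ctrl, x_ctrl, y_trt, x_trt):
    """Return CUPED-adjusted metrics for both arms plus theta.

    theta = cov(X, Y) / var(X) is estimated on both arms pooled, and the same
    theta is applied to each arm: Y_adj = Y - theta * (X - mean(X)).
    """
    ys, xs = y_ctrl + y_trt, x_ctrl + x_trt
    mx, my = mean(xs), mean(ys)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    theta = cov_xy / variance(xs)

    def adjust(y_vals, x_vals):
        return [y - theta * (x - mx) for y, x in zip(y_vals, x_vals)]

    return adjust(y_ctrl, x_ctrl), adjust(y_trt, x_trt), theta


# Synthetic data: pre-period revenue X predicts in-experiment revenue Y.
rng = random.Random(42)
x_c = [rng.gauss(100, 20) for _ in range(1000)]
x_t = [rng.gauss(100, 20) for _ in range(1000)]
y_c = [0.8 * x + rng.gauss(0, 5) for x in x_c]
y_t = [0.8 * x + rng.gauss(0, 5) + 2.0 for x in x_t]  # true effect = +2.0

adj_c, adj_t, theta = cuped_adjust(y_c, x_c, y_t, x_t)
reduction = 1 - variance(adj_c) / variance(y_c)  # measured on the control arm
print(f"theta={theta:.2f}, variance reduced by {reduction:.0%}")
print(f"effect estimate: {mean(adj_t) - mean(adj_c):.2f}")
```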

Expected Loss (Bayesian Decision Rule)

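Instead of a binary significant/not-significant decision, compute:

  • E[loss | choose treatment] = E[max(θ_C − θ_T, 0)] under the posterior, where θ_C and θ_T are the control and treatment metric values: the average amount lost if treatment turns out to be worse
  • Ship when this expected loss falls below a business-defined threshold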
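A Monte Carlo sketch of this rule for conversion rates follows. It is not the library's `bayesian_proportion_test`. `beta_binomial_expected_loss` is a hypothetical helper, and the sketch assumes uniform Beta(1, 1) priors on both arms. The "keep collecting data" fallback is also an assumption of this sketch. It reuses the Quick Start counts and threshold.

```python
import random


def beta_binomial_expected_loss(conv_c, n_c, conv_t, n_t, draws=20_000, seed=0):
    """Monte Carlo P(treatment better) and expected loss under Beta(1, 1) priors.

    Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
    Expected loss of shipping treatment = E[max(p_c - p_t, 0)].
    """
    rng = random.Random(seed)
    wins = 0
    loss_treatment = loss_control = 0.0
    for _ in range(draws):
        p_c = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        p_t = rng.betavariate(1 + conv_t, 1 + n_t - conv_t)
        wins += p_t > p_c
        loss_treatment += max(p_c - p_t, 0.0)  # shortfall if we ship treatment
        loss_control += max(p_t - p_c, 0.0)    # shortfall if we keep control
    return wins / draws, loss_treatment / draws, loss_control / draws


prob, loss_t, loss_c = beta_binomial_expected_loss(388, 3841, 469, 3841)
threshold = 0.001  # same value as loss_threshold in the Quick Start
decision = "ship treatment" if loss_t < threshold else "keep collecting data"
print(f"P(treatment better)={prob:.1%}, E[loss|treatment]={loss_t:.6f}: {decision}")
```

The expected loss weighs how likely treatment is to be worse by how much worse it would be. A small, unlikely downside can therefore pass the threshold even when a p-value would be borderline.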

SPRT (Sequential Probability Ratio Test)

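Wald's sequential test updates a log-likelihood ratio after each observation and stops as soon as it crosses an upper or lower boundary set by α and β. This lets you stop experiments early when evidence is strong, without the false positive inflation caused by repeatedly running a fixed-sample test.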
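The sketch below shows the textbook Bernoulli SPRT, assuming a single stream of 0/1 conversions tested against fixed rates. It is deliberately simpler than the library's `SPRTMonitor`, which takes paired control and treatment observations and whose internal statistic is not documented here. `sprt_bernoulli` is a hypothetical helper. Its parameters mirror the Quick Start (p0=0.10, p1=0.12, alpha=0.05, beta=0.20), and it uses Wald's boundaries log((1 − β)/α) and log(β/(1 − α)).

```python
import math
import random


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: p = p0 vs H1: p = p1 on a stream of 0/1 outcomes.

    Returns (decision, observations used, log-likelihood ratio).
    """
    upper = math.log((1 - beta) / alpha)  # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))  # crossing it accepts H0
    step_success = math.log(p1 / p0)
    step_failure = math.log((1 - p1) / (1 - p0))
    llr, n = 0.0, 0
    for n, converted in enumerate(observations, start=1):
        llr += step_success if converted else step_failure
        if llr >= upper:
            return "accept H1", n, llr
        if llr <= lower:
            return "accept H0", n, llr
    return "continue", n, llr


# Extreme streams show where the boundaries sit.
print(sprt_bernoulli([1] * 100, p0=0.10, p1=0.12)[:2])  # ('accept H1', 16)
print(sprt_bernoulli([0] * 200, p0=0.10, p1=0.12)[:2])  # ('accept H0', 70)

# A simulated stream with a true rate of 0.12.
rng = random.Random(7)
stream = (rng.random() < 0.12 for _ in range(50_000))
print(sprt_bernoulli(stream, p0=0.10, p1=0.12)[:2])
```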

Novelty Effect Detection

Fits a linear regression of per-window treatment lift on time. It flags experiments where the lift decays significantly, which suggests users were reacting to novelty rather than to real value.
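A minimal sketch of this check follows. It is not the library's `detect_novelty_effect`. `novelty_slope_test` is a hypothetical helper that takes per-window conversion rates for each arm and fits ordinary least squares of relative lift on window index. To stay within the standard library, the caller supplies the critical value `t_crit` instead of computing a p-value. For 4 windows, one-sided α = 0.05 with 2 degrees of freedom gives about 2.92.

```python
import math


def novelty_slope_test(ctrl_rates, trt_rates, t_crit):
    """Fit OLS of per-window relative lift on window index.

    Returns (slope, t statistic, flagged). Flags decay when t <= -t_crit,
    where t_crit is a one-sided t critical value with (windows - 2) df.
    """
    lifts = [t / c - 1 for c, t in zip(ctrl_rates, trt_rates)]
    n = len(lifts)
    if n < 3:
        raise ValueError("need at least 3 windows to estimate a slope and its SE")
    xs = list(range(n))
    mx, my = sum(xs) / n, sum(lifts) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, lifts)) / sxx
    intercept = my - slope * mx
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, lifts))
    se = math.sqrt(ssr / (n - 2) / sxx)  # standard error of the slope
    if se == 0:  # perfect fit: the sign of the slope decides
        t_stat = -math.inf if slope < 0 else (math.inf if slope > 0 else 0.0)
    else:
        t_stat = slope / se
    return slope, t_stat, t_stat <= -t_crit


decaying = novelty_slope_test([0.10] * 4, [0.13, 0.12, 0.112, 0.105], t_crit=2.92)
stable = novelty_slope_test([0.10] * 4, [0.12, 0.121, 0.119, 0.12], t_crit=2.92)
print(decaying[2], stable[2])  # True False
```

In the first example, the lift falls from +30% to +5% over four windows, so the check flags it. In the second, the lift wobbles around +20% and passes.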

Sample Ratio Mismatch (SRM)

Chi-square goodness-of-fit test that checks whether the observed user split matches the intended randomisation ratio. A significant SRM usually points to a bug in assignment or logging, and the results should not be trusted until the cause is found.
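For two groups, the test has one degree of freedom, so the p-value has a closed form in the standard library. The sketch below is not the library's `check_srm`. `srm_p_value` is a hypothetical helper, and `expected_ctrl_share` is an assumed parameter for the intended control fraction.

```python
import math


def srm_p_value(n_ctrl, n_trt, expected_ctrl_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-group split (1 df)."""
    total = n_ctrl + n_trt
    expected = (total * expected_ctrl_share, total * (1 - expected_ctrl_share))
    chi2 = sum((obs - exp) ** 2 / exp for obs, exp in zip((n_ctrl, n_trt), expected))
    # With 1 degree of freedom, chi2 is a squared standard normal, so
    # P(X > chi2) = 2 * (1 - Phi(sqrt(chi2))) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


print(srm_p_value(3841, 3841))         # 1.0
print(srm_p_value(5000, 5300) < 0.01)  # True
```

A 5,000 / 5,300 split looks close to 50/50. However, with 10,300 users, chi² ≈ 8.7 and p ≈ 0.003, which is strong evidence that the split is not 50/50.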

Testing

pytest tests/ -v
# 23 passed in 1.55s

Publishing to PyPI

pip install build twine
python -m build
twine upload dist/*

References

  • Deng, A., et al. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. WSDM.
  • Johari, R., et al. (2017). Peeking at A/B Tests: Why it matters, and what to do about it. KDD.
  • Wald, A. (1945). Sequential Tests of Statistical Hypotheses. Annals of Mathematical Statistics.
  • Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments. Cambridge.

License

MIT



Download files


Source Distribution

abtest_framework-0.1.0.tar.gz (5.3 kB view details)

Uploaded Source

Built Distribution


abtest_framework-0.1.0-py3-none-any.whl (4.8 kB view details)

Uploaded Python 3

File details

Details for the file abtest_framework-0.1.0.tar.gz.

File metadata

  • Download URL: abtest_framework-0.1.0.tar.gz
  • Upload date:
  • Size: 5.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.11

File hashes

Hashes for abtest_framework-0.1.0.tar.gz
Algorithm Hash digest
SHA256 ca9911afd877909b0a4c70837f5d06c7a142726b485874bff379fbed301516d4
MD5 7d13d97f2dc1daae81a28f405163b098
BLAKE2b-256 8e4d28b5f4b42f7eaa4d42f9df0cf1c5cb334e8f2eb1bc83b9d491b30e2b7771


File details

Details for the file abtest_framework-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for abtest_framework-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 cc4aafeeaf438e8ce45ddf0c5930e3cd9bf859d8e729a3f56527088648d3eb10
MD5 f01d7247f5c427381240f8537e1f8a33
BLAKE2b-256 b8d51d388d0b44e1db65ccfa36e3fcc7f9a199afe630c1104f4a1e4b272a146a

