A/B test framework: power analysis, frequentist & Bayesian tests, CUPED, sequential testing, novelty detection
Project description
abtest-framework
A production-grade A/B testing library for data scientists. Covers the full experiment lifecycle — from power analysis through Bayesian decision-making — with a Streamlit dashboard for interactive analysis.
Features
| Module | What it does |
|---|---|
power |
Sample size & MDE calculations for proportions and means |
frequentist |
z-test, Welch's t-test, CUPED variance reduction, Mann-Whitney, Bonferroni/BH correction |
bayesian |
Beta-Binomial & Normal-Normal Bayesian tests with expected loss |
sequential |
SPRT, always-valid p-values, novelty/primacy effect detection, SRM check |
Quick Start
from abtest.power import sample_size_proportion
from abtest.frequentist import proportion_test, cuped_test
from abtest.bayesian import bayesian_proportion_test
from abtest.sequential import SPRTMonitor, detect_novelty_effect, check_srm
# 1. Pre-experiment: how many users do I need?
result = sample_size_proportion(baseline_rate=0.10, mde=0.02, alpha=0.05, power=0.80)
print(result)
# Per-group n: 3,841 | Total n: 7,682
# 2. Sanity check before analysis
srm = check_srm(actual_ctrl=3841, actual_trt=3841)
print(srm["message"]) # ✅ No SRM detected
# 3. Frequentist test
result = proportion_test(388, 3841, 469, 3841, alpha=0.05)
print(result)
# Lift: +20.9% | p=0.0033 | ✅ Significant
# 4. CUPED variance reduction (secondary metric)
cuped_result = cuped_test(ctrl_revenue, trt_revenue, ctrl_pre_revenue, trt_pre_revenue)
print(f"Variance reduced by {cuped_result.variance_reduction_pct}%")
# 5. Bayesian decision
bayes = bayesian_proportion_test(388, 3841, 469, 3841, loss_threshold=0.001)
print(f"P(treatment better): {bayes.prob_treatment_better:.1%}")
print(f"Decision: {bayes.decision}")
# P(treatment better): 99.8%
# Decision: ✅ Ship Treatment
# 6. Sequential monitoring (stop early when evidence is strong)
monitor = SPRTMonitor(p0=0.10, p1=0.12, alpha=0.05, beta=0.20)
for ctrl_obs, trt_obs in data_stream:
result = monitor.update(ctrl_obs, trt_obs)
if result.decision != "continue":
print(result)
break
# 7. Check for novelty effects
novelty = detect_novelty_effect(ctrl_timeseries, trt_timeseries, n_windows=4)
print(novelty)
Installation
# From PyPI (once published)
pip install abtest-framework
# From source
git clone https://github.com/yourusername/abtest-framework
cd abtest-framework
pip install -e ".[app]"
Streamlit Dashboard
streamlit run app.py
The dashboard has four pages:
- Power Analysis — interactive sample size calculator with power curves
- Significance Tests — proportion z-test, CUPED, multiple testing correction
- Bayesian Analysis — posterior visualisation and expected loss decisions
- Sequential & Novelty — SPRT monitoring chart and novelty effect detector
Project Structure
abtest-framework/
├── abtest/
│ ├── __init__.py # Clean public API
│ ├── power.py # Sample size & MDE calculations
│ ├── frequentist.py # z-test, t-test, CUPED, BH correction
│ ├── bayesian.py # Beta-Binomial & Normal-Normal Bayesian tests
│ └── sequential.py # SPRT, always-valid p-values, novelty detection
├── tests/
│ └── test_abtest.py # 23 unit tests (pytest)
├── examples/
│ └── full_experiment_walkthrough.py
├── app.py # Streamlit dashboard
├── pyproject.toml # Package config (PyPI-ready)
└── README.md
Key Concepts Implemented
CUPED (Controlled-experiment Using Pre-Experiment Data)
Variance reduction technique from Microsoft Research (Deng et al., 2013). Regresses out a pre-experiment covariate to reduce metric noise, giving equivalent power with fewer users.
Expected Loss (Bayesian Decision Rule)
Instead of a binary significant/not-significant decision, compute:
E[loss | choose treatment]= average amount lost if treatment turns out worse- Ship when expected loss falls below a business-defined threshold
SPRT (Sequential Probability Ratio Test)
Wald's sequential test that lets you stop experiments early when evidence accumulates — without inflating the false positive rate the way repeated classical testing does.
Novelty Effect Detection
Fits a linear regression on per-window treatment lift over time. Flags experiments where lift significantly decays — indicating users were reacting to novelty, not real value.
Sample Ratio Mismatch (SRM)
Chi-square test that checks whether the actual user split matches the intended randomisation ratio. A significant SRM means the experiment has a bug and results are invalid.
Testing
pytest tests/ -v
# 23 passed in 1.55s
Publishing to PyPI
pip install build twine
python -m build
twine upload dist/*
References
- Deng, A., et al. (2013). Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data. WSDM.
- Johari, R., et al. (2017). Peeking at A/B Tests: Why it matters, and what to do about it. KDD.
- Wald, A. (1945). Sequential Tests of Statistical Hypotheses. Annals of Mathematical Statistics.
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments. Cambridge.
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file abtest_framework-0.1.0.tar.gz.
File metadata
- Download URL: abtest_framework-0.1.0.tar.gz
- Upload date:
- Size: 5.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ca9911afd877909b0a4c70837f5d06c7a142726b485874bff379fbed301516d4
|
|
| MD5 |
7d13d97f2dc1daae81a28f405163b098
|
|
| BLAKE2b-256 |
8e4d28b5f4b42f7eaa4d42f9df0cf1c5cb334e8f2eb1bc83b9d491b30e2b7771
|
File details
Details for the file abtest_framework-0.1.0-py3-none-any.whl.
File metadata
- Download URL: abtest_framework-0.1.0-py3-none-any.whl
- Upload date:
- Size: 4.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
cc4aafeeaf438e8ce45ddf0c5930e3cd9bf859d8e729a3f56527088648d3eb10
|
|
| MD5 |
f01d7247f5c427381240f8537e1f8a33
|
|
| BLAKE2b-256 |
b8d51d388d0b44e1db65ccfa36e3fcc7f9a199afe630c1104f4a1e4b272a146a
|