Statistical validity auditor for A/B tests — because significant != trustworthy.
abaudit
Statistical Validity Auditor for A/B Tests
A significant p-value answers the wrong question.
abaudit asks: given that the result is significant, how likely is it to actually be real?
Why abaudit?
Every A/B testing tool tells you whether your result is significant.
None of them tell you whether to trust it.
A p-value is P(data at least this extreme | no effect): the probability of seeing a result at least as extreme as yours if there is no effect.
What you actually want is P(true effect | significant result) — the Positive Predictive Value (PPV).
These are not the same thing. With a low prior, multiple metrics tested, and a few interim peeks, a p = 0.03 result might only have a 20% chance of being real. Standard tools report it as significant and move on. abaudit doesn't.
The math comes from Ioannidis (2005):
$$\text{PPV} = \frac{(1-\beta) \cdot f}{(1-\beta) \cdot f + \alpha \cdot (1-f)}$$
Where $f$ is your prior probability the effect exists, $1-\beta$ is power, and $\alpha$ is your significance threshold. This is Bayes' rule applied to hypothesis testing — and it's what p-values completely ignore.
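To make the formula concrete, here is a minimal sketch that evaluates it directly. The function name `ppv` and its argument names are chosen for illustration only and are not necessarily how abaudit implements the calculation internally.

```python
def ppv(prior_f: float, power: float, alpha: float) -> float:
    """Positive predictive value: P(true effect | significant result)."""
    true_positives = power * prior_f            # (1 - beta) * f
    false_positives = alpha * (1.0 - prior_f)   # alpha * (1 - f)
    return true_positives / (true_positives + false_positives)


# Well-powered test with a moderate prior.
print(round(ppv(prior_f=0.2, power=0.99, alpha=0.05), 2))   # 0.83

# Low prior, modest power, and an alpha inflated by peeking.
print(round(ppv(prior_f=0.05, power=0.5, alpha=0.226), 2))  # about 0.10
```

The second call shows how a result can pass a nominal significance threshold and still be more likely false than true.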
Quickstart
```shell
pip install abaudit
```
```python
import numpy as np
import abaudit as ab

rng = np.random.default_rng(42)
ctrl = rng.normal(0.0, 1.0, 500)
trt = rng.normal(0.3, 1.0, 500)

result = ab.audit(
    ctrl, trt,
    prior_f = 0.2,
    metrics = ['conversion', 'revenue', 'session_time'],
    n_peeks = 5,
)
result.summary()
```
```text
abaudit — Experiment Validity Report
┌──────────────────────────────────┬─────────────────┬────────┐
│ Check                            │ Result          │ Status │
├──────────────────────────────────┼─────────────────┼────────┤
│ p-value (primary)                │ 0.0000          │ ✅     │
│ p-value (Bonferroni corrected)   │ 0.0001          │ ✅     │
│ PPV — prob. effect is real       │ 0.83            │ ✅     │
│ Statistical power                │ 0.99            │ ✅     │
│ Sample Ratio Mismatch            │ p = 1.000       │ ✅     │
│ Metrics tested                   │ 3               │ ⚠️     │
│ Optional stopping (peeks)        │ eff. α = 0.226  │ ⚠️     │
│ Effect size (Cohen's d)          │ 0.271           │ ✅     │
└──────────────────────────────────┴─────────────────┴────────┘
Bias score: [███░░░░░░░░░░░░░░░░░] 0.15 / 1.0 🟢 Low concern

⚠️ Warnings:
  • 3 metrics tested — Bonferroni-corrected p = 0.0001 (raw p = 0.0000).
  • Optional stopping risk: p-value checked 5 times. Effective α ≈ 0.226 (nominal: 0.05).

💡 Recommendations:
  • Use sequential testing (SPRT) or an alpha-spending function
    when interim looks are necessary.
```
```python
# Save a shareable HTML report
ab.generate_report(result, path="audit_report.html")

# Pre-experiment: is this worth running?
plan = ab.design_summary(effect_size=0.3, prior_f=0.2)
plan.summary()

# During-experiment: health checks
ab.check_srm(n_control=4850, n_treatment=5150)
ab.check_optional_stopping([0.12, 0.08, 0.04, 0.06, 0.03])
```
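The effective α of 0.226 reported for 5 peeks is consistent with the simple approximation 1 − (1 − α)^k, which treats each of the k looks as an independent test. The sketch below reproduces that number; the helper name `effective_alpha` is hypothetical, and whether abaudit computes the value exactly this way is an assumption.

```python
def effective_alpha(alpha: float, n_peeks: int) -> float:
    """Chance of at least one false positive across n_peeks looks,
    assuming (as a simplification) that the looks are independent."""
    return 1.0 - (1.0 - alpha) ** n_peeks


for k in (1, 3, 5, 10):
    print(k, round(effective_alpha(0.05, k), 3))
# 5 looks at a nominal alpha of 0.05 give 0.226
```

Because interim looks at accumulating data are correlated, the independence assumption tends to overstate the inflation somewhat, so this is best read as a conservative warning level rather than an exact error rate.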
What abaudit checks
| Module | Check | Answers |
|---|---|---|
| `validity` | PPV (Ioannidis 2005) | Given the significant result, what's the probability it's real? |
| `validity` | Multiple metric correction | You tested 3 things — what's the Bonferroni-corrected p? |
| `validity` | Effect size plausibility | Is the reported effect suspiciously large (winner's curse)? |
| `validity` | Statistical power | Was the study large enough to detect the effect reliably? |
| `runtime` | Sample Ratio Mismatch | Was traffic split as intended? |
| `runtime` | Optional stopping | Was the p-value checked multiple times during collection? |
| `runtime` | Novelty effect | Did the effect fade after the initial novelty wore off? |
| `design` | PPV-aware power analysis | How large does n need to be so results are actually trustworthy? |
| `report` | HTML report | Self-contained report for sharing with stakeholders |
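As an illustration of the Sample Ratio Mismatch check, here is a minimal sketch using a chi-square goodness-of-fit test with one degree of freedom. It uses only the standard library; the function name `srm_p_value` and the `expected_control_share` parameter are hypothetical, and abaudit's own `check_srm` may use a different test or return type.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_control_share: float = 0.5) -> float:
    """p-value for H0: traffic was split as intended (chi-square, 1 df)."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total * (1.0 - expected_control_share)
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


print(round(srm_p_value(4850, 5150), 4))  # 0.0027
print(srm_p_value(500, 500))              # 1.0
```

A 4850 / 5150 split looks close to even, yet with 10,000 users it is very unlikely under a true 50/50 assignment, which is why this check is worth running before reading any metric.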
What abaudit gives you that standard tools don't
| Standard tool | abaudit |
|---|---|
| Reports p-value | Reports p-value and PPV |
| Ignores your prior | Uses Ioannidis PPV framework |
| Ignores multiple metrics | Applies Bonferroni correction automatically |
| Ignores peeking | Diagnoses optional stopping and inflated α |
| Ignores traffic split | Runs Sample Ratio Mismatch test |
| No composite score | Bias score 0–1 with breakdown |
| No HTML output | Self-contained shareable report |
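The Bonferroni correction mentioned in the table is simple enough to sketch in a few lines. The helper name `bonferroni` is hypothetical and only illustrates the standard adjustment, not abaudit's internal code.

```python
def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_tests)


raw_p = 0.03
adjusted = bonferroni(raw_p, n_tests=3)
print(adjusted < 0.05)  # False: significant alone, not after testing 3 metrics
```

Bonferroni is conservative when metrics are correlated, which is the price of its simplicity.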
Full API
```python
import abaudit as ab

# ── Post-experiment audit ─────────────────────────────────────
result = ab.audit(
    control        = ctrl,                       # array-like, control group
    treatment      = trt,                        # array-like, treatment group
    prior_f        = 0.2,                        # prior probability effect is real
    alpha          = 0.05,                       # significance threshold
    metrics        = ['conversion', 'revenue'],  # all metrics tested
    primary        = 'conversion',               # the one being reported
    n_peeks        = 3,                          # number of interim looks
    expected_split = 0.5,                        # intended traffic split
)
result.summary()     # traffic-light table
result.ppv           # float: prob. effect is real
result.bias_score    # float 0–1: composite red flags
result.flags         # list[str]: warnings
ab.generate_report(result, "report.html")

# ── Pre-experiment planning ───────────────────────────────────
plan = ab.design_summary(
    effect_size  = 0.3,   # expected Cohen's d
    prior_f      = 0.2,   # prior probability
    target_power = 0.80,
    target_ppv   = 0.80,
)
plan.summary()
plan.n_recommended   # n per group to achieve both targets

ab.power_analysis(effect_size=0.3)
ab.ppv_given_design(effect_size=0.3, n_per_group=176, prior_f=0.2)
ab.minimum_trustworthy_n(effect_size=0.3, prior_f=0.2, target_ppv=0.80)

# ── During-experiment checks ──────────────────────────────────
ab.check_srm(n_control=4850, n_treatment=5150)
ab.check_optional_stopping(p_value_history=[0.12, 0.08, 0.04])
ab.check_novelty_effect(
    early_control, early_treatment,
    late_control, late_treatment,
)
```
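The idea behind PPV-aware planning can be sketched by solving the PPV formula for power and then converting that power into a sample size. The code below is an illustrative approximation under stated assumptions: it uses the normal approximation for a two-sided, two-sample comparison, and the function names `required_power_for_ppv` and `n_per_group` are hypothetical rather than part of abaudit's API.

```python
import math
from statistics import NormalDist


def required_power_for_ppv(prior_f: float, target_ppv: float,
                           alpha: float = 0.05) -> float:
    """Smallest power for which PPV >= target_ppv (PPV formula solved for power)."""
    return (target_ppv * alpha * (1.0 - prior_f)) / (prior_f * (1.0 - target_ppv))


def n_per_group(effect_size: float, power: float, alpha: float = 0.05) -> int:
    """Per-group n for a two-sided two-sample test (normal approximation)."""
    if not 0.0 < power < 1.0:
        raise ValueError("power must be strictly between 0 and 1")
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


# Power needed so that a significant result is trustworthy at the target PPV.
power_needed = max(0.80, required_power_for_ppv(prior_f=0.2, target_ppv=0.80))
print(round(power_needed, 2))                            # 0.8
print(n_per_group(effect_size=0.3, power=power_needed))  # 175
```

If `required_power_for_ppv` returns a value of 1 or more, the target PPV cannot be reached at that prior and α by adding sample size alone. A t-based calculation would give a slightly larger n than the normal approximation, in the neighborhood of the 176 used in the API example above.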
Demo notebook
See examples/demo.ipynb for a complete end-to-end walkthrough:
a realistic e-commerce A/B test from experiment design to HTML audit report,
with visualizations of PPV vs. prior, peeking inflation, and the bias score breakdown.
Statistical foundation
- Ioannidis, J.P.A. (2005). Why Most Published Research Findings Are False. PLOS Medicine 2(8): e124.
- Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False-positive psychology. Psychological Science 22(11).
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments. Cambridge University Press.
- Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate. JRSS-B 57(1).
Development
```shell
git clone https://github.com/aldair-ai/abaudit.git
cd abaudit
pip install -e ".[dev]"
pytest tests/ -v
```
| Phase | Module | Tests | Status |
|---|---|---|---|
| 0 | Scaffold + `_stats.py` | 27 | ✅ Complete |
| 1 | `validity.py` — core audit | 42 | ✅ Complete |
| 2 | `design.py` — pre-experiment | 35 | ✅ Complete |
| 3 | `runtime.py` — health checks | 35 | ✅ Complete |
| 4 | `report.py` — HTML reports | 11 | ✅ Complete |
Total: 184 tests · 99% coverage · Python 3.9 – 3.12