
Statistical validity auditor for A/B tests — because significant != trustworthy.

abaudit

Statistical Validity Auditor for A/B Tests

A significant p-value answers the wrong question.
abaudit asks: given that the result is significant, how likely is it to actually be real?


Why abaudit?

Every A/B testing tool tells you whether your result is significant.
None of them tell you whether to trust it.

A p-value is P(data at least this extreme | no effect): the probability of seeing a result as extreme as yours, or more so, if there's no effect.
What you actually want is P(true effect | significant result) — the Positive Predictive Value (PPV).

These are not the same thing. With a low prior, multiple metrics tested, and a few interim peeks, a p = 0.03 result might only have a 20% chance of being real. Standard tools report it as significant and move on. abaudit doesn't.

The math comes from Ioannidis (2005):

$$\text{PPV} = \frac{(1-\beta) \cdot f}{(1-\beta) \cdot f + \alpha \cdot (1-f)}$$

Where $f$ is the prior probability that the effect exists, $1-\beta$ is power, and $\alpha$ is the significance threshold. This is Bayes' rule applied to hypothesis testing: the p-value alone says nothing about the prior or the power, and PPV depends on both.
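
To make the formula concrete, here is a minimal standalone sketch that evaluates it directly. The function name `ppv` and its signature are illustrative assumptions for this example, not abaudit's API.

```python
def ppv(prior_f: float, power: float, alpha: float = 0.05) -> float:
    """Positive predictive value: P(true effect | significant result)."""
    true_positives = power * prior_f            # (1 - beta) * f
    false_positives = alpha * (1.0 - prior_f)   # alpha * (1 - f)
    return true_positives / (true_positives + false_positives)


# A well-powered test of an idea with a 20% prior.
print(round(ppv(prior_f=0.2, power=0.99), 2))   # 0.83

# A long-shot idea (5% prior) at the conventional 80% power.
print(round(ppv(prior_f=0.05, power=0.80), 2))  # 0.46
```

Both results are significant at the same threshold, yet the second is more likely a false positive than a real effect. The only inputs that changed are the prior and the power.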


Quickstart

pip install abaudit

import numpy as np
import abaudit as ab

rng  = np.random.default_rng(42)
ctrl = rng.normal(0.0, 1.0, 500)
trt  = rng.normal(0.3, 1.0, 500)

result = ab.audit(
    ctrl, trt,
    prior_f = 0.2,
    metrics = ['conversion', 'revenue', 'session_time'],
    n_peeks = 5,
)

result.summary()

         abaudit — Experiment Validity Report
┌──────────────────────────────────┬─────────────────┬────────┐
│ Check                            │ Result          │ Status │
├──────────────────────────────────┼─────────────────┼────────┤
│ p-value (primary)                │ 0.0000          │ ✅     │
│ p-value (Bonferroni corrected)   │ 0.0001          │ ✅     │
│ PPV — prob. effect is real       │ 0.83            │ ✅     │
│ Statistical power                │ 0.99            │ ✅     │
│ Sample Ratio Mismatch            │ p = 1.000       │ ✅     │
│ Metrics tested                   │ 3               │ ⚠️     │
│ Optional stopping (peeks)        │ eff. α = 0.226  │ ⚠️     │
│ Effect size (Cohen's d)          │ 0.271           │ ✅     │
└──────────────────────────────────┴─────────────────┴────────┘

Bias score: [███░░░░░░░░░░░░░░░░░] 0.15 / 1.0  🟢 Low concern

⚠️  Warnings:
   • 3 metrics tested — Bonferroni-corrected p = 0.0001 (raw p = 0.0000).
   • Optional stopping risk: p-value checked 5 times. Effective α ≈ 0.226 (nominal: 0.05).

💡 Recommendations:
   • Use sequential testing (SPRT) or an alpha-spending function
     when interim looks are necessary.

# Save a shareable HTML report
ab.generate_report(result, path="audit_report.html")

# Pre-experiment: is this worth running?
plan = ab.design_summary(effect_size=0.3, prior_f=0.2)
plan.summary()

# During-experiment: health checks
ab.check_srm(n_control=4850, n_treatment=5150)
ab.check_optional_stopping([0.12, 0.08, 0.04, 0.06, 0.03])
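
The two warnings in the report above can be approximated by hand. The sketch below uses the textbook Bonferroni correction and treats each interim look as an independent test. Both helper names are hypothetical, and the assumption that abaudit computes its figures this way is mine; the peeking formula does reproduce the 0.226 shown in the report.

```python
def bonferroni(p_value: float, n_metrics: int) -> float:
    """Bonferroni-adjusted p-value: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_metrics)


def effective_alpha(alpha: float, n_peeks: int) -> float:
    """Chance of at least one false positive across n_peeks looks,
    assuming (pessimistically) that the looks are independent."""
    return 1.0 - (1.0 - alpha) ** n_peeks


print(round(effective_alpha(0.05, 5), 3))   # 0.226
print(round(bonferroni(0.03, 3), 2))        # 0.09
```

Real interim looks reuse the same accumulating data, so they are positively correlated and the true inflation is typically smaller than this independence figure. It is best read as a warning signal rather than an exact error rate.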

What abaudit checks

| Module | Check | Answers |
|---|---|---|
| validity | PPV (Ioannidis 2005) | Given the significant result, what's the probability it's real? |
| validity | Multiple metric correction | You tested 3 things; what's the Bonferroni-corrected p? |
| validity | Effect size plausibility | Is the reported effect suspiciously large (winner's curse)? |
| validity | Statistical power | Was the study large enough to detect the effect reliably? |
| runtime | Sample Ratio Mismatch | Was traffic split as intended? |
| runtime | Optional stopping | Was the p-value checked multiple times during collection? |
| runtime | Novelty effect | Did the effect fade after the initial novelty wore off? |
| design | PPV-aware power analysis | How large does n need to be so results are actually trustworthy? |
| report | HTML report | Self-contained report for sharing with stakeholders |
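
As an illustration of one runtime check, a Sample Ratio Mismatch test is commonly run as a chi-square goodness-of-fit test on the group sizes. The sketch below is a standard-library version of that idea. The function `srm_p_value` is hypothetical, and reading `expected_split` as the control group's intended share is an assumption; abaudit's own `check_srm` may differ.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_split: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom).

    expected_split is taken to be the intended share of traffic
    assigned to control (an assumption for this sketch).
    """
    total = n_control + n_treatment
    exp_control = total * expected_split
    exp_treatment = total * (1.0 - expected_split)
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


print(f"{srm_p_value(4850, 5150):.4f}")   # 0.0027
print(srm_p_value(500, 500))              # 1.0
```

A 4850 / 5150 split looks close to even, but at this sample size it is very unlikely under a true 50/50 assignment, which usually points to a bug in bucketing or logging rather than chance.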

What abaudit gives you that standard tools don't

| Standard tool | abaudit |
|---|---|
| Reports p-value | Reports p-value and PPV |
| Ignores your prior | Uses Ioannidis PPV framework |
| Ignores multiple metrics | Applies Bonferroni correction automatically |
| Ignores peeking | Diagnoses optional stopping and inflated α |
| Ignores traffic split | Runs Sample Ratio Mismatch test |
| No composite score | Bias score 0–1 with breakdown |
| No HTML output | Self-contained shareable report |

Full API

import abaudit as ab

# ── Post-experiment audit ─────────────────────────────────────
result = ab.audit(
    control        = ctrl,          # array-like, control group
    treatment      = trt,           # array-like, treatment group
    prior_f        = 0.2,           # prior probability effect is real
    alpha          = 0.05,          # significance threshold
    metrics        = ['conversion', 'revenue'],  # all metrics tested
    primary        = 'conversion',  # the one being reported
    n_peeks        = 3,             # number of interim looks
    expected_split = 0.5,           # intended traffic split
)
result.summary()                    # traffic-light table
result.ppv                          # float: prob. effect is real
result.bias_score                   # float 0–1: composite red flags
result.flags                        # list[str]: warnings
ab.generate_report(result, "report.html")

# ── Pre-experiment planning ───────────────────────────────────
plan = ab.design_summary(
    effect_size  = 0.3,             # expected Cohen's d
    prior_f      = 0.2,             # prior probability
    target_power = 0.80,
    target_ppv   = 0.80,
)
plan.summary()
plan.n_recommended                  # n per group to achieve both targets

ab.power_analysis(effect_size=0.3)
ab.ppv_given_design(effect_size=0.3, n_per_group=176, prior_f=0.2)
ab.minimum_trustworthy_n(effect_size=0.3, prior_f=0.2, target_ppv=0.80)

# ── During-experiment checks ──────────────────────────────────
ab.check_srm(n_control=4850, n_treatment=5150)
ab.check_optional_stopping(p_value_history=[0.12, 0.08, 0.04])
ab.check_novelty_effect(
    early_control, early_treatment,
    late_control,  late_treatment,
)
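
The planning functions above tie sample size to both power and PPV. As a rough sketch of the underlying arithmetic, the PPV formula can be solved for the power needed to hit a target PPV, and that power converted to a per-group n with the usual normal approximation for a two-sided, two-sample test. Both helper names are hypothetical, and nothing here is taken from abaudit's implementation.

```python
import math
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def power_needed_for_ppv(prior_f: float, target_ppv: float,
                         alpha: float = 0.05) -> float:
    """Solve the PPV formula for power.

    A value above 1 means the target PPV is unreachable at this
    prior and alpha, no matter how large the sample.
    """
    return target_ppv * alpha * (1.0 - prior_f) / (prior_f * (1.0 - target_ppv))


def n_for_power(effect_size: float, power: float = 0.80,
                alpha: float = 0.05) -> int:
    """Per-group n for a two-sided two-sample test (normal approximation)."""
    z_alpha = _Z.inv_cdf(1.0 - alpha / 2.0)
    z_beta = _Z.inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_beta) / effect_size) ** 2)


required_power = power_needed_for_ppv(prior_f=0.2, target_ppv=0.80)
print(round(required_power, 3))        # 0.8
print(n_for_power(effect_size=0.3))    # 175
```

With a 20% prior and α = 0.05, a target PPV of 0.80 happens to require exactly 80% power. The normal approximation tends to land slightly below a t-distribution-based calculation, so treat the resulting n as a lower bound.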

Demo notebook

See examples/demo.ipynb for a complete end-to-end walkthrough: a realistic e-commerce A/B test from experiment design to HTML audit report, with visualizations of PPV vs. prior, peeking inflation, and the bias score breakdown.


Statistical foundation

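  • Ioannidis, J.P.A. (2005). Why Most Published Research Findings Are False. PLOS Medicine 2(8): e124.
  • Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science 22(11): 1359–1366.
  • Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
  • Benjamini, Y. & Hochberg, Y. (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B 57(1): 289–300.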

Development

git clone https://github.com/aldair-ai/abaudit.git
cd abaudit
pip install -e ".[dev]"
pytest tests/ -v

| Phase | Module | Tests | Status |
|---|---|---|---|
| 0 | Scaffold + _stats.py | 27 | ✅ Complete |
| 1 | validity.py (core audit) | 42 | ✅ Complete |
| 2 | design.py (pre-experiment) | 35 | ✅ Complete |
| 3 | runtime.py (health checks) | 35 | ✅ Complete |
| 4 | report.py (HTML reports) | 11 | ✅ Complete |

Total: 184 tests · 99% coverage · Python 3.9 – 3.12


License

MIT © Edwin Aldair Espinoza Zegarra
