
Bayesian decision engine for A/B testing


argonx


argonx is a decision-support system for A/B experiments. It handles Bayesian inference, multi-metric risk management, hierarchical segment-aware analysis, and sequential stopping — and surfaces a complete evidential picture so the right decision is obvious.

Most testing frameworks answer "is there an effect?" argonx answers "what should you do about it, and how much do you lose if you're wrong?"


Install

pip install git+https://github.com/souro26/bayesian-a-b-testing.git
# or, for local development
git clone https://github.com/souro26/bayesian-a-b-testing.git
cd bayesian-a-b-testing
pip install -e .

Quick Example

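from argonx import Experiment

# df is assumed to be a DataFrame (e.g. pandas) that already holds the
# 'variant', 'revenue' and 'page_load_ms' columns used below.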

experiment = Experiment(
    data=df,
    variant_col='variant',
    primary_metric='revenue',
    guardrails=['page_load_ms'],
    lower_is_better={'page_load_ms': True},
    model='lognormal',
    guardrail_models={'page_load_ms': 'gaussian'},
    control='control',
)

result = experiment.run()
result.summary()
result.plot()

For ratio metrics, pass a callable directly — no class system needed:

experiment = Experiment(
    data=df,
    variant_col='variant',
    primary_metric=lambda df: df['clicks'] / df['impressions'],
    model='lognormal',
    control='control',
)
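
As a rough illustration of what such a callable produces, the sketch below applies the same lambda to a small hand-made DataFrame. The data values are invented, and how argonx consumes the returned values internally is not shown here; the only assumption is that the callable receives the DataFrame and returns one value per row.

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({
    "variant": ["control", "control", "variant_b", "variant_b"],
    "clicks": [2, 5, 3, 8],
    "impressions": [10, 20, 10, 20],
})

# The same callable as passed to primary_metric above.
ctr = lambda df: df["clicks"] / df["impressions"]

per_row = ctr(df)                                     # one ratio per row
per_variant = per_row.groupby(df["variant"]).mean()   # mean ratio per variant
print(per_variant)
```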

For segment-aware hierarchical inference, add one argument:

experiment = Experiment(
    data=df,
    variant_col='variant',
    segment_col='device_type',          # triggers hierarchical model automatically
    primary_metric='revenue',
    model='lognormal',
    control='control',
)

result = experiment.run()
result.summary()           # aggregate, population-level
result.segment_summary()   # per-segment decisions + cross-segment conflict detection

What It Computes

A p-value tells you the probability of seeing data at least this extreme if the null hypothesis is true. It does not tell you what to do. argonx computes the quantities that actually drive decisions:

Metric | What it answers
P(variant is best) | Which variant has the highest posterior probability of being the true winner, computed via simultaneous argmax across all N variants, not pairwise comparison
Expected loss | How much you lose on average if you ship the wrong variant, integrated over the full posterior, not a point estimate
CVaR | Expected loss in the worst-case tail; catches cases where the average loss looks fine but catastrophic outcomes are possible
ROPE | Is the effect large enough to matter in practice? An effect can be statistically real and business-irrelevant. ROPE separates these
HDI | The actual probability interval, not a frequentist confidence interval. The lift is inside this range with 95% posterior probability
Joint probability | P(all business conditions satisfied simultaneously), not per-metric checks that miss correlations
Composite score | Weighted multi-metric business impact, computed draw-by-draw from posteriors
Guardrail conflict | When the primary metric improves and a guardrail degrades, the framework surfaces the conflict clearly rather than resolving it arbitrarily
Sequential stopping | Evidence-based stopping signal. Stop when expected loss drops below threshold, not when a fixed sample size is hit
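
To make the first rows of the table concrete, here is a minimal NumPy sketch of how P(best), expected loss, CVaR and an HDI can be estimated from posterior draws. It is not argonx's implementation: the function names are hypothetical, and the draws are simulated with rng.normal as a stand-in for samples from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in posterior draws of the metric for three variants (rows = draws).
# In practice these would come from the fitted model, not from rng.normal.
names = ["control", "variant_a", "variant_b"]
draws = np.column_stack([
    rng.normal(1.00, 0.02, 20_000),
    rng.normal(1.01, 0.02, 20_000),
    rng.normal(1.04, 0.02, 20_000),
])

def prob_best(draws):
    """Share of draws in which each variant has the highest value."""
    winners = np.argmax(draws, axis=1)
    return np.bincount(winners, minlength=draws.shape[1]) / draws.shape[0]

def expected_loss(draws, idx):
    """Mean shortfall of variant idx relative to the best variant in each draw."""
    return float(np.mean(draws.max(axis=1) - draws[:, idx]))

def cvar(draws, idx, alpha=0.95):
    """Mean loss over the draws at or above the alpha-quantile of loss."""
    loss = draws.max(axis=1) - draws[:, idx]
    cutoff = np.quantile(loss, alpha)
    return float(loss[loss >= cutoff].mean())

def hdi(x, mass=0.95):
    """Narrowest interval that contains `mass` of the draws."""
    x = np.sort(x)
    k = int(np.ceil(mass * len(x)))          # points per candidate interval
    widths = x[k - 1:] - x[:len(x) - k + 1]  # width of every k-point window
    i = int(np.argmin(widths))
    return float(x[i]), float(x[i + k - 1])

p = prob_best(draws)
best = int(np.argmax(p))
lift = draws[:, best] / draws[:, 0] - 1      # relative lift vs control
print(dict(zip(names, p.round(3))))
print(expected_loss(draws, best), cvar(draws, best), hdi(lift))
```

Because the argmax is taken across all columns within each draw, the probabilities sum to one over the variants, which pairwise comparisons against control do not guarantee.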

Why not just use a t-test?

A t-test answers one question: is the observed difference unlikely under the null? It cannot tell you:

  • How much you lose if you ship and you're wrong
  • Whether the effect is large enough to change user behaviour
  • What to do when conversion improves but latency degrades
  • Whether it's safe to stop the experiment early
  • How thin-segment estimates should borrow strength from larger segments

argonx answers all of these. The decision engine is the project — the models are plumbing.
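
The second and third bullets can be sketched directly on posterior draws. The example below is an illustration under stated assumptions, not argonx's code: the lift draws are simulated, and the ROPE band, the business conditions and the weights are all hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Stand-in posterior draws of relative lift (variant vs control).
# Correlated on purpose: larger conversion gains come with slower pages.
cov = [[0.0004, 0.0003], [0.0003, 0.0009]]
conv_lift, latency_lift = rng.multivariate_normal([0.04, 0.03], cov, n).T

# ROPE: share of draws where the effect is too small to matter.
rope = (-0.01, 0.01)                      # hypothetical band
p_in_rope = np.mean((conv_lift > rope[0]) & (conv_lift < rope[1]))
p_practical = np.mean(conv_lift >= rope[1])

# Joint probability: both conditions must hold in the same draw.
cond_conv = conv_lift > 0.02              # hypothetical condition
cond_lat = latency_lift < 0.05            # hypothetical condition
p_joint = np.mean(cond_conv & cond_lat)
p_product = cond_conv.mean() * cond_lat.mean()   # ignores the correlation

# Composite score: weighted sum evaluated draw by draw.
weights = {"conversion": 1.0, "latency": -0.5}   # hypothetical weights
score = weights["conversion"] * conv_lift + weights["latency"] * latency_lift
p_score_positive = np.mean(score > 0)

print(p_in_rope, p_practical)
print(p_joint, p_product)
print(p_score_positive)
```

With correlated metrics the joint probability differs from the product of the per-metric probabilities, which is why checking each metric separately can mislead.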

Why not just use PyMC directly?

PyMC gives you posteriors and stops there. It has no concept of which variant to ship, what your business risk tolerance is, or whether your full policy is satisfied simultaneously. argonx adds that decision layer on top of PyMC: it is neither a thin wrapper nor a replacement.


Models

Model | Use case | Data type
binary | Conversion rate, click-through, churn | 0/1 outcomes
lognormal | Revenue, order value, session duration | Right-skewed positive continuous
gaussian | Latency, load time, scores | Symmetric continuous
studentt | Same as gaussian but robust to outliers | Symmetric continuous with heavy tails
poisson | Events per user, purchases per session | Count data

Every model has a flat and a hierarchical variant. Flat is selected by default. Hierarchical is selected automatically when segment_col is provided, with no additional configuration required.
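
The practical effect of a hierarchical model is partial pooling: thin segments are pulled toward the overall estimate, large segments barely move. The sketch below shows that effect with closed-form normal-normal shrinkage and fixed variances. This is a deliberate simplification and an assumption on our part; it is not the model argonx fits, and the segment numbers, sigma and tau are invented.

```python
import numpy as np

# Hypothetical per-segment observed lifts and sample sizes.
segments = ["ios", "android", "tablet"]
observed = np.array([0.06, 0.00, 0.15])
n_users = np.array([40_000, 35_000, 300])

sigma = 1.0   # assumed per-user noise standard deviation
tau = 0.03    # assumed between-segment standard deviation

precision_data = n_users / sigma**2   # information carried by each segment
precision_prior = 1.0 / tau**2        # information carried by the shared prior

# Precision-weighted overall mean across segments.
mu = np.sum(precision_data * observed) / np.sum(precision_data)

# Weight on the segment's own data; the remainder goes to the overall mean.
weight = precision_data / (precision_data + precision_prior)
shrunk = weight * observed + (1 - weight) * mu

for seg, obs, w, est in zip(segments, observed, weight, shrunk):
    print(f"{seg:8s} observed={obs:.3f} weight={w:.3f} pooled={est:.3f}")
```

The tablet segment, with only 300 users, keeps little weight on its own noisy estimate, while the two large segments stay close to what was observed.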

Guardrail metrics can use a different model than the primary metric:

experiment = Experiment(
    ...
    model='binary',                              # primary: conversion rate
    guardrail_models={'page_load_ms': 'lognormal'},  # guardrail: load time
)

What result.summary() Looks Like

============================================================
EXPERIMENT RESULTS
============================================================

PRIMARY METRIC
----------------------------------------
Best Variant: variant_b
Expected lift:    +4.3% (95% HDI: +1.0% to +7.0%)
P(best) across all variants: 0.971

RISK
----------------------------------------
Expected loss if wrong:          0.0009
CVaR (95th percentile loss):     0.0021
Risk level:                      low

PRACTICAL SIGNIFICANCE (ROPE)
----------------------------------------
Effect is OUTSIDE ROPE — practically meaningful.
P(practical effect): 0.941

GUARDRAILS
----------------------------------------
  page_load_ms              [FAIL]  variant=variant_b  P(degraded)=0.912  threshold=0.100

GUARDRAIL CONFLICTS DETECTED
----------------------------------------
Strong evidence for variant_b on primary metric.
Guardrail violation on page_load_ms with 91.2% probability.
Framework cannot resolve this tradeoff. Human review required.

============================================================
DECISION
----------------------------------------
State:          conflict
Recommendation: REVIEW REQUIRED
Confidence:     low

Reasoning:
  - P(best) exceeds strong threshold
  - Expected loss below configured maximum
  - Guardrail violation: page_load_ms cannot be resolved automatically
============================================================

The framework does not make the decision. It makes the right decision obvious.
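
The guardrail line and the conflict state in the output above can be approximated with a few lines of NumPy. This is a hedged sketch rather than the library's logic: prob_degraded and decide are hypothetical helpers, the "ship" and "inconclusive" labels and the strong and max_loss defaults are invented, and only the "conflict" state and the 0.100 guardrail threshold are taken from the sample output.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in posterior draws of mean page load time (ms) per variant.
control = rng.normal(800, 10, 20_000)
variant_b = rng.normal(815, 10, 20_000)

def prob_degraded(control, variant, lower_is_better=True):
    """Share of draws in which the variant is worse than control."""
    worse = variant > control if lower_is_better else variant < control
    return float(np.mean(worse))

def decide(p_best, exp_loss, p_degraded, *,
           strong=0.95, max_loss=0.01, guardrail_threshold=0.10):
    """Map the evidence to a state; a failed guardrail is never overridden."""
    primary_ok = p_best >= strong and exp_loss <= max_loss
    guardrail_ok = p_degraded <= guardrail_threshold
    if primary_ok and guardrail_ok:
        return "ship"
    if primary_ok:
        return "conflict"      # strong primary evidence, guardrail violated
    return "inconclusive"

p_deg = prob_degraded(control, variant_b, lower_is_better=True)
print(p_deg, decide(p_best=0.971, exp_loss=0.0009, p_degraded=p_deg))
```

The point of returning "conflict" instead of picking a side is the same as in the report: the tradeoff between the primary metric and the guardrail is left to a human.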


Sequential Stopping

from argonx.sequential import StoppingChecker

checker = StoppingChecker(
    loss_threshold=0.01,
    prob_best_min=0.95,
    min_sample_size=1000,
)

# Called at each checkpoint as data accumulates. `result` is the output of
# experiment.run(); n_counts holds the per-variant user counts so far.
status = checker.update(
    samples=result.samples,
    variant_names=['control', 'variant_b'],
    control='control',
    n_users_per_variant=n_counts,
)

print(status.safe_to_stop)
print(status.users_needed)   # approximate additional users if not safe to stop

checker.plot_trajectory()    # evidence accumulation over time

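Bayesian posterior quantities keep their interpretation at every checkpoint, so the stopping rule can be evaluated as often as data arrives. Repeatedly peeking at a fixed-sample frequentist test inflates its false positive rate, whereas an expected-loss rule is defined directly on the current posterior and has no fixed-sample significance level to invalidate. argonx tells you when evidence is strong enough, not when a predetermined sample size is reached.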
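
A stopping rule of this kind can be sketched as a single function over posterior draws. The version below is an assumption about how the three StoppingChecker parameters might combine, not the library's code: safe_to_stop is a hypothetical helper, and the checkpoints are simulated by shrinking the posterior standard deviation as 1/sqrt(n).

```python
import numpy as np

def safe_to_stop(draws, n_per_variant, *, loss_threshold=0.01,
                 prob_best_min=0.95, min_sample_size=1000):
    """True when the leader is probable enough, cheap to be wrong about,
    and every variant has reached the minimum sample size."""
    winners = np.argmax(draws, axis=1)
    p_best = np.bincount(winners, minlength=draws.shape[1]) / draws.shape[0]
    leader = int(np.argmax(p_best))
    loss = float(np.mean(draws.max(axis=1) - draws[:, leader]))
    enough_data = min(n_per_variant) >= min_sample_size
    return bool(enough_data
                and p_best[leader] >= prob_best_min
                and loss <= loss_threshold)

rng = np.random.default_rng(3)
results = {}
for n in (500, 2_000, 8_000, 32_000):
    sd = 1.0 / np.sqrt(n)   # simulated posterior sd, shrinking with more data
    draws = np.column_stack([
        rng.normal(0.10, sd, 10_000),   # control
        rng.normal(0.13, sd, 10_000),   # variant_b
    ])
    results[n] = safe_to_stop(draws, [n, n])

print(results)
```

The minimum sample size acts as a floor so that an early, lucky posterior cannot trigger a stop on its own.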


Examples

Five real-world worked examples in examples/:

Notebook | Scenario | Key feature
01_ecommerce_checkout.ipynb | Checkout redesign | Guardrail conflict: conversion vs. load time
02_saas_revenue_sequential.ipynb | SaaS pricing page | Sequential stopping fires early
03_clinical_trial.ipynb | Drug dosage protocol | StudentT vs Gaussian: outlier robustness
04_gaming_matchmaking.ipynb | Matchmaking algorithm | 3-way multivariant, simultaneous argmax
05_mobile_personalisation.ipynb | Fintech personalisation | Hierarchical: iOS wins, Android neutral, thin tablet segment

Running Tests

# Fast — unit tests only, no MCMC (~60 seconds)
pytest tests/unit/

# Statistical property verification — no MCMC
pytest tests/math/

# Full suite including MCMC integration tests (slow)
pytest tests/

The test suite has three tiers matching the CI pipeline. Unit tests run on every push. Math tests run on every PR. Integration tests run on merge to main.


Contributing

Bug reports and PRs are welcome. Before opening a PR:

  • Run pytest tests/unit/ tests/math/ and confirm everything passes
  • For changes to the decision engine, add a test to tests/math/test_decision_sims.py that verifies the statistical property you're changing
  • For new model variants, add corresponding tests to tests/integration/test_models.py

Open an issue first for anything beyond bug fixes — architectural changes to the decision engine or new model types are worth discussing before implementation.


License

MIT

