Bayesian decision engine for A/B testing
argonx
argonx is a Bayesian decision engine for A/B experiments.
It handles inference, multi-metric risk management, hierarchical segment-aware analysis, and sequential stopping. Feed it your data, tell it what matters, and it surfaces everything you need to make the right call.
Install
pip install argonx
# development install
git clone https://github.com/souro26/bayesian-a-b-testing.git
cd bayesian-a-b-testing
pip install -e .
Quick Start
from argonx import Experiment
experiment = Experiment(
    data=df,
    variant_col='variant',
    primary_metric='revenue',
    guardrails=['page_load_ms'],
    lower_is_better={'page_load_ms': True},
    model='lognormal',
    guardrail_models={'page_load_ms': 'gaussian'},
    control='control',
)
result = experiment.run()
result.summary()
result.plot()
Ratio metrics via callable, no extra classes needed:
experiment = Experiment(
    data=df,
    variant_col='variant',
    primary_metric=lambda df: df['clicks'] / df['impressions'],
    model='lognormal',
    control='control',
)
Segment-aware hierarchical inference, one extra argument:
experiment = Experiment(
    data=df,
    variant_col='variant',
    segment_col='device_type',
    primary_metric='revenue',
    model='lognormal',
    control='control',
)
result = experiment.run()
result.summary() # aggregate, population-level
result.segment_summary() # per-segment decisions, cross-segment conflict detection
What It Computes
Most testing frameworks answer "is there an effect?" argonx answers "what should you do about it, and how much do you lose if you get it wrong?"
A p-value tells you whether the observed difference is unlikely under the null. It does not tell you which variant to ship. argonx computes the quantities that actually drive that decision:
| Metric | What it answers |
|---|---|
| P(variant is best) | Posterior probability of being the true winner, computed via simultaneous argmax across all N variants. Not pairwise. |
| Expected loss | Average loss if you ship the wrong variant, integrated over the full posterior. Not a point estimate. |
| CVaR | Expected loss in the worst-case tail. Catches cases where average loss looks fine but tail outcomes are catastrophic. |
| ROPE | Is the effect large enough to matter in practice? A statistically real effect can still be business-irrelevant. ROPE separates these. |
| HDI | Highest density interval: the narrowest interval holding 95% of the posterior mass. The lift falls inside this range with 95% posterior probability. |
| Joint probability | P(all business conditions satisfied simultaneously), not independent per-metric checks that miss correlations. |
| Composite score | Weighted multi-metric business impact, computed draw-by-draw from posteriors, not from means. |
| Guardrail conflict | When the primary metric improves and a guardrail degrades, the framework surfaces the conflict and stops there. No arbitrary resolution. |
| Sequential stopping | Stop when expected loss drops below your threshold, not when a fixed sample size is reached. |
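To make the first few rows of the table concrete, here is a minimal sketch of how P(best), expected loss, CVaR, and an HDI can be computed from posterior draws. It is an illustration only, not argonx's implementation: the function names, the Beta posteriors, and the conversion counts are all assumptions made up for this example, and it uses nothing beyond the Python standard library.

```python
import math
import random

rng = random.Random(0)
N_DRAWS = 20000

# Hypothetical posteriors: Beta(1 + conversions, 1 + non-conversions) per variant.
counts = {"control": (100, 1000), "variant_b": (120, 1000), "variant_c": (110, 1000)}
draws = {
    name: [rng.betavariate(1 + conv, 1 + users - conv) for _ in range(N_DRAWS)]
    for name, (conv, users) in counts.items()
}
names = list(draws)

def prob_best(draws):
    """P(variant is best): share of draws in which each variant is the joint argmax."""
    wins = {name: 0 for name in draws}
    for i in range(N_DRAWS):
        wins[max(draws, key=lambda name: draws[name][i])] += 1
    return {name: wins[name] / N_DRAWS for name in draws}

def loss_draws(draws, chosen):
    """Per-draw shortfall of `chosen` against the best variant in that draw."""
    return [max(draws[n][i] for n in draws) - draws[chosen][i] for i in range(N_DRAWS)]

def cvar(losses, alpha=0.95):
    """Mean of the worst (1 - alpha) share of losses."""
    ordered = sorted(losses)
    k = max(1, round((1 - alpha) * len(ordered)))
    return sum(ordered[-k:]) / k

def hdi(values, mass=0.95):
    """Narrowest interval containing `mass` of the draws."""
    ordered = sorted(values)
    window = max(1, math.ceil(mass * len(ordered)))
    start = min(
        range(len(ordered) - window + 1),
        key=lambda i: ordered[i + window - 1] - ordered[i],
    )
    return ordered[start], ordered[start + window - 1]

p_best = prob_best(draws)
losses = loss_draws(draws, "variant_b")
exp_loss = sum(losses) / len(losses)
tail_loss = cvar(losses)

# Relative lift of variant_b over control, computed draw by draw.
lift = [b / c - 1 for b, c in zip(draws["variant_b"], draws["control"])]
hdi_low, hdi_high = hdi(lift)

print(p_best, exp_loss, tail_loss, (hdi_low, hdi_high))
```

Every quantity here is computed per draw and only then summarised, which is why the table stresses "not a point estimate" and "not from means".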
What result.summary() Looks Like
============================================================
EXPERIMENT RESULTS
============================================================
PRIMARY METRIC
----------------------------------------
Best Variant: variant_b
Expected lift: +4.3% (95% HDI: +1.0% to +7.0%)
P(best) across all variants: 0.971
RISK
----------------------------------------
Expected loss if wrong: 0.0009
CVaR (95th percentile loss): 0.0021
Risk level: low
PRACTICAL SIGNIFICANCE (ROPE)
----------------------------------------
Effect is OUTSIDE ROPE -- practically meaningful.
P(practical effect): 0.941
GUARDRAILS
----------------------------------------
page_load_ms [FAIL] variant=variant_b P(degraded)=0.912 threshold=0.100
GUARDRAIL CONFLICTS DETECTED
----------------------------------------
Strong evidence for variant_b on primary metric.
Guardrail violation on page_load_ms with 91.2% probability.
Framework cannot resolve this tradeoff. Human review required.
============================================================
DECISION
----------------------------------------
State: conflict
Recommendation: REVIEW REQUIRED
Confidence: low
Reasoning:
- P(best) exceeds strong threshold
- Expected loss below configured maximum
- Guardrail violation: page_load_ms cannot be resolved automatically
============================================================
The framework does not make the decision. It makes the right decision obvious.
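The reasoning lines in the sample output suggest a rule of roughly the following shape. This is a hedged sketch, not the library's decision engine: the `Evidence` class, the `decide` function, and the `ship` and `continue` state names are hypothetical, and the default thresholds are borrowed from values that appear elsewhere in this README.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    prob_best: float
    expected_loss: float
    # guardrail name -> posterior probability that the guardrail degraded
    guardrail_p_degraded: dict = field(default_factory=dict)

def decide(evidence, prob_best_min=0.95, loss_max=0.01, guardrail_max=0.10):
    """Return (state, reasons). Only 'conflict' is a state name taken from the README."""
    reasons = []
    strong = evidence.prob_best >= prob_best_min
    safe = evidence.expected_loss <= loss_max
    violated = sorted(
        name for name, p in evidence.guardrail_p_degraded.items() if p > guardrail_max
    )
    if strong:
        reasons.append("P(best) exceeds strong threshold")
    if safe:
        reasons.append("Expected loss below configured maximum")
    for name in violated:
        reasons.append(f"Guardrail violation: {name} cannot be resolved automatically")

    if strong and safe and violated:
        # Primary metric says ship, guardrail says no: surface it, do not resolve it.
        return "conflict", reasons
    if strong and safe:
        return "ship", reasons
    return "continue", reasons

state, reasons = decide(Evidence(0.971, 0.0009, {"page_load_ms": 0.912}))
print(state)
for reason in reasons:
    print("-", reason)
```

The point of the sketch is the first branch: when the primary metric and a guardrail disagree, the rule reports the conflict instead of trading one metric off against the other.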
Models
| Model | Use case | Data type |
|---|---|---|
| `binary` | Conversion rate, click-through, churn | 0/1 outcomes |
| `lognormal` | Revenue, order value, session duration | Right-skewed positive continuous |
| `gaussian` | Latency, load time, scores | Symmetric continuous |
| `studentt` | Same as gaussian, robust to outliers | Symmetric continuous with heavy tails |
| `poisson` | Events per user, purchases per session | Count data |
Every model has a flat and a hierarchical variant. Flat is the default; the hierarchical variant is selected automatically when segment_col is set. Partial pooling handles thin segments by borrowing strength from larger ones without collapsing real differences between segments.
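The intuition behind partial pooling can be shown with a simple shrinkage estimate. This is a simplified illustration under stated assumptions, not what the hierarchical models do internally: a real hierarchical model learns the pooling strength from the data, whereas the `prior_strength` pseudo-count below is a hypothetical fixed constant, and the segment numbers are invented.

```python
def partial_pool(segments, prior_strength=50.0):
    """Shrink each segment mean toward the global mean.

    segments: dict of name -> (n_users, observed_mean)
    prior_strength: hypothetical pseudo-count; larger values pool more strongly.
    """
    total_n = sum(n for n, _ in segments.values())
    global_mean = sum(n * mean for n, mean in segments.values()) / total_n
    pooled = {}
    for name, (n, mean) in segments.items():
        weight = n / (n + prior_strength)  # close to 1 for large segments
        pooled[name] = weight * mean + (1 - weight) * global_mean
    return pooled

# Invented data: two large segments and one thin, noisy one.
segments = {
    "desktop": (5000, 0.10),
    "mobile": (4000, 0.12),
    "tablet": (20, 0.30),
}
pooled = partial_pool(segments)
for name, value in pooled.items():
    print(f"{name}: observed={segments[name][1]:.3f} pooled={value:.3f}")
```

The thin `tablet` segment is pulled strongly toward the global mean, while the large segments barely move, so a genuine gap between `desktop` and `mobile` survives.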
Guardrail metrics can use a different model than the primary:
experiment = Experiment(
    ...
    model='binary',
    guardrail_models={'page_load_ms': 'lognormal'},
)
Sequential Stopping
from argonx.sequential import StoppingChecker
checker = StoppingChecker(
    loss_threshold=0.01,
    prob_best_min=0.95,
    min_sample_size=1000,
)
status = checker.update(
    samples=result.samples,
    variant_names=['control', 'variant_b'],
    control='control',
    n_users_per_variant=n_counts,
)
print(status.safe_to_stop)
print(status.users_needed) # estimated additional users needed if not safe
checker.plot_trajectory() # P(best) and expected loss over time
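Peeking at a fixed-horizon frequentist test inflates its false positive rate. Expected-loss stopping is built for repeated looks: the rule bounds how much you stand to lose by stopping, at whatever point you check. argonx stops when evidence is strong enough, and tells you how far you are from that threshold when it is not.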
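The stopping rule itself can be sketched on simulated data with a Beta-Binomial model. This does not use `StoppingChecker` and is not its implementation: the true conversion rates, the batch size, the threshold values, and the `posterior_check` helper are all assumptions chosen for illustration.

```python
import random

rng = random.Random(7)

TRUE_RATES = {"control": 0.10, "variant_b": 0.13}  # simulated ground truth
LOSS_THRESHOLD = 0.001   # on the absolute conversion-rate scale
PROB_BEST_MIN = 0.95
MIN_SAMPLE_SIZE = 1000
USERS_PER_LOOK = 500

def posterior_check(conversions, users, n_draws=4000):
    """Beta(1, 1) prior plus binomial data: leader, its P(best), its expected loss."""
    names = list(users)
    draws = {
        name: [
            rng.betavariate(1 + conversions[name], 1 + users[name] - conversions[name])
            for _ in range(n_draws)
        ]
        for name in names
    }
    wins = {name: 0 for name in names}
    for i in range(n_draws):
        wins[max(names, key=lambda name: draws[name][i])] += 1
    leader = max(names, key=lambda name: wins[name])
    loss = sum(
        max(draws[name][i] for name in names) - draws[leader][i]
        for i in range(n_draws)
    ) / n_draws
    return leader, wins[leader] / n_draws, loss

conversions = {name: 0 for name in TRUE_RATES}
users = {name: 0 for name in TRUE_RATES}
history = []
stopped_at = None

for look in range(1, 41):
    # Collect another batch of simulated users for every variant.
    for name, rate in TRUE_RATES.items():
        users[name] += USERS_PER_LOOK
        conversions[name] += sum(rng.random() < rate for _ in range(USERS_PER_LOOK))
    leader, p_best, loss = posterior_check(conversions, users)
    history.append((look, leader, p_best, loss))
    enough_data = min(users.values()) >= MIN_SAMPLE_SIZE
    if enough_data and p_best >= PROB_BEST_MIN and loss <= LOSS_THRESHOLD:
        stopped_at = look
        break

print("stopped at look:", stopped_at)
print("last check:", history[-1])
```

The loop stops on the first look where all three conditions hold, which mirrors the `loss_threshold`, `prob_best_min`, and `min_sample_size` arguments shown above.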
Examples
Five worked examples across different industries and model types in examples/:
| Notebook | Scenario | Key feature |
|---|---|---|
| `01_ecommerce_checkout.ipynb` | Checkout redesign | Guardrail conflict: conversion up, load time up |
| `02_saas_revenue_sequential.ipynb` | SaaS pricing page | Sequential stopping fires at week 2 of 4 |
| `03_clinical_trial.ipynb` | Drug dosage protocol | StudentT vs Gaussian on data with outliers |
| `04_gaming_matchmaking.ipynb` | Matchmaking algorithm | 3-way experiment, simultaneous argmax |
| `05_mobile_personalisation.ipynb` | Fintech personalisation | Hierarchical: segment conflict, thin-segment pooling |
Running Tests
# unit tests only, no MCMC, fast
pytest tests/unit/
# statistical property verification, no MCMC
pytest tests/math/
# full suite including MCMC integration tests
pytest tests/
The three tiers match the CI pipeline: unit tests run on every push, math tests on every PR, and integration tests on merge to main.
Contributing
Open an issue before submitting anything beyond a bug fix. PRs are welcome.
Before opening a PR, run pytest tests/unit/ tests/math/ and confirm everything passes. For decision engine changes, add a test to tests/math/test_decision_sims.py that verifies the statistical property. For new model variants, add tests to tests/integration/test_models.py.
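As a rough idea of what "verifies the statistical property" can mean, here is a pytest-style sketch that checks three properties of expected loss on simulated draws. It is not taken from the repository's test suite: the `expected_loss` helper is a hypothetical stand-in for whatever function the change touches, and the Gaussian draws are invented.

```python
import random

def expected_loss(chosen, other):
    """Hypothetical stand-in: mean shortfall of `chosen` against `other`, draw by draw."""
    return sum(max(o - c, 0.0) for c, o in zip(chosen, other)) / len(chosen)

def test_expected_loss_properties():
    rng = random.Random(0)
    better = [rng.gauss(1.0, 0.1) for _ in range(5000)]
    worse = [rng.gauss(0.5, 0.1) for _ in range(5000)]

    loss_better = expected_loss(better, worse)
    loss_worse = expected_loss(worse, better)

    # Property 1: expected loss is never negative.
    assert loss_better >= 0.0
    assert loss_worse >= 0.0

    # Property 2: shipping the clearly better variant costs less.
    assert loss_better < loss_worse

    # Property 3: the two losses differ by exactly the mean gap,
    # because max(d, 0) - max(-d, 0) == d for every draw.
    mean_gap = sum(b - w for b, w in zip(better, worse)) / len(better)
    assert abs((loss_worse - loss_better) - mean_gap) < 1e-9
```

Tests of this kind need no MCMC, which fits the description of the tests/math/ tier above.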
License
MIT