Bayesian decision engine for A/B testing

argonx

argonx is a decision-support system for A/B experiments. It handles Bayesian inference, multi-metric risk management, hierarchical segment-aware analysis, and sequential stopping — and surfaces a complete evidential picture so the right decision is obvious.

Most testing frameworks answer "is there an effect?" argonx answers "what should you do about it, and how much do you lose if you're wrong?"

Install

pip install git+https://github.com/souro26/bayesian-a-b-testing.git

# or, for local development
git clone https://github.com/souro26/bayesian-a-b-testing.git
cd bayesian-a-b-testing
pip install -e .

Quick Example

from argonx import Experiment

# df: your experiment data as a DataFrame holding the variant column and the metric columns
experiment = Experiment(
    data=df,
    variant_col='variant',
    primary_metric='revenue',
    guardrails=['page_load_ms'],
    lower_is_better={'page_load_ms': True},
    model='lognormal',
    guardrail_models={'page_load_ms': 'gaussian'},
    control='control',
)

result = experiment.run()
result.summary()
result.plot()

For ratio metrics, pass a callable directly — no class system needed:

experiment = Experiment(
    data=df,
    variant_col='variant',
    primary_metric=lambda df: df['clicks'] / df['impressions'],
    model='lognormal',
    control='control',
)

For segment-aware hierarchical inference, add one argument:

experiment = Experiment(
    data=df,
    variant_col='variant',
    segment_col='device_type',          # triggers hierarchical model automatically
    primary_metric='revenue',
    model='lognormal',
    control='control',
)

result = experiment.run()
result.summary()           # aggregate, population-level
result.segment_summary()   # per-segment decisions + cross-segment conflict detection

What It Computes

A p-value tells you the probability of seeing data at least this extreme if the null hypothesis is true. It does not tell you what to do. argonx computes the quantities that actually drive decisions:

Metric	What it answers
P(variant is best)	Which variant has the highest posterior probability of being the true winner — computed via simultaneous argmax across all N variants, not pairwise comparison
Expected loss	How much you lose on average if you ship the wrong variant — integrated over the full posterior, not a point estimate
CVaR	Expected loss in the worst-case tail — catches cases where the average loss looks fine but catastrophic outcomes are possible
ROPE	Is the effect large enough to matter in practice? An effect can be statistically real and business-irrelevant. ROPE separates these
HDI	The highest density interval: a credible interval, not a frequentist confidence interval. The lift lies inside this range with 95% posterior probability
Joint probability	P(all business conditions satisfied simultaneously) — not per-metric checks that miss correlations
Composite score	Weighted multi-metric business impact, computed draw-by-draw from posteriors
Guardrail conflict	When the primary metric improves and a guardrail degrades, the framework surfaces the conflict clearly rather than resolving it arbitrarily
Sequential stopping	Evidence-based stopping signal. Stop when expected loss drops below threshold — not when a fixed sample size is hit
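
The first three rows of the table can be illustrated from raw posterior draws. The following is a minimal sketch, not argonx's implementation: it assumes a Beta-Binomial conversion model, made-up counts for three variants, and hypothetical helper names (`prob_best`, `loss_draws`, `cvar`). It uses only the Python standard library.

```python
import random

random.seed(0)
N_DRAWS = 20_000

# Hypothetical data: (conversions, users) per variant.
observed = {
    "control": (100, 1000),
    "variant_b": (125, 1000),
    "variant_c": (110, 1000),
}

# A Beta(1, 1) prior with a binomial likelihood gives a Beta posterior per rate.
draws = {
    name: [random.betavariate(1 + conv, 1 + n - conv) for _ in range(N_DRAWS)]
    for name, (conv, n) in observed.items()
}


def prob_best(draws):
    """Share of draws in which each variant has the highest rate of all variants."""
    n_draws = len(next(iter(draws.values())))
    wins = dict.fromkeys(draws, 0)
    for i in range(n_draws):
        wins[max(draws, key=lambda name: draws[name][i])] += 1
    return {name: count / n_draws for name, count in wins.items()}


def loss_draws(draws, chosen):
    """Per-draw shortfall of `chosen` relative to the best variant in that draw."""
    n_draws = len(draws[chosen])
    return [
        max(d[i] for d in draws.values()) - draws[chosen][i]
        for i in range(n_draws)
    ]


def cvar(losses, level=0.95):
    """Mean of the losses at or above the `level` quantile (the worst tail)."""
    tail = sorted(losses)[int(level * len(losses)):]
    return sum(tail) / len(tail)


p_best = prob_best(draws)
losses_b = loss_draws(draws, "variant_b")
expected_loss_b = sum(losses_b) / len(losses_b)
cvar_b = cvar(losses_b)

print(p_best)
print(f"expected loss: {expected_loss_b:.5f}, CVaR(95%): {cvar_b:.5f}")
```

Because the winner is picked across all variants within each draw, the P(best) values sum to one, which pairwise comparisons against control do not guarantee. CVaR is never smaller than the expected loss, since it averages only the worst tail of the same loss draws.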

Why not just use a t-test?

A t-test answers one question: is the observed difference unlikely under the null? It cannot tell you:

How much you lose if you ship and you're wrong
Whether the effect is large enough to change user behaviour
What to do when conversion improves but latency degrades
Whether it's safe to stop the experiment early
How thin-segment estimates should borrow strength from larger segments

argonx answers all of these. The decision engine is the project — the models are plumbing.
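
As an illustration of the practical-significance question in the list above, here is a small sketch of how an HDI and a ROPE probability can be read off posterior draws. The draws are simulated from a normal distribution for convenience, the ROPE bounds of plus or minus one percentage point are an arbitrary assumption, and the function names are hypothetical rather than part of argonx.

```python
import random

random.seed(1)

# Hypothetical posterior draws of relative lift (variant vs. control).
lift = [random.gauss(0.043, 0.015) for _ in range(20_000)]


def hdi(draws, mass=0.95):
    """Narrowest interval that spans roughly `mass` of the draws."""
    s = sorted(draws)
    k = int(mass * len(s))  # index offset between the interval's two ends
    widths = [(s[i + k] - s[i], i) for i in range(len(s) - k)]
    _, start = min(widths)
    return s[start], s[start + k]


def prob_outside_rope(draws, rope=(-0.01, 0.01)):
    """Posterior probability that the effect is too large to call negligible."""
    low, high = rope
    return sum(1 for d in draws if d < low or d > high) / len(draws)


hdi_low, hdi_high = hdi(lift)
p_practical = prob_outside_rope(lift)

print(f"95% HDI: {hdi_low:+.3f} to {hdi_high:+.3f}")
print(f"P(outside ROPE): {p_practical:.3f}")
```

A t-test has no counterpart to the second number: an effect can be clearly non-zero while most of its posterior mass still sits inside the region you have declared irrelevant.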

Why not just use PyMC directly?

PyMC gives you posteriors and stops there. It has no concept of which variant to ship, what your business risk tolerance is, or whether your full policy is satisfied simultaneously. argonx is a genuine layer on top of PyMC — not a wrapper, not a replacement.
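
To make "satisfied simultaneously" concrete, the sketch below compares a joint probability computed draw by draw with the product of per-metric probabilities. Everything in it is assumed for illustration: the two posterior quantities are simulated as correlated normals, the correlation of 0.8 and the tolerance of 0.01 are arbitrary, and none of the names come from argonx or PyMC.

```python
import math
import random

random.seed(3)
N_DRAWS = 20_000
RHO = 0.8  # assumed correlation between the two posterior quantities

revenue_lift = []
latency_increase = []
for _ in range(N_DRAWS):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    revenue_lift.append(0.02 + 0.02 * z1)
    # Shares the z1 component, so large revenue lifts come with larger latency.
    latency_increase.append(0.02 * (RHO * z1 + math.sqrt(1 - RHO**2) * z2))

# Hypothetical policy: revenue must improve AND latency may rise by at most 0.01.
ok_revenue = [r > 0.0 for r in revenue_lift]
ok_latency = [l < 0.01 for l in latency_increase]

p_revenue = sum(ok_revenue) / N_DRAWS
p_latency = sum(ok_latency) / N_DRAWS
# Joint probability: both conditions must hold within the same draw.
p_joint = sum(a and b for a, b in zip(ok_revenue, ok_latency)) / N_DRAWS

print(f"P(revenue ok) = {p_revenue:.3f}, P(latency ok) = {p_latency:.3f}")
print(f"product of marginals = {p_revenue * p_latency:.3f}, joint = {p_joint:.3f}")
```

With this kind of correlation the joint probability falls below the product of the two marginals, so checking each metric separately would overstate how often the whole policy holds.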

Models

Model	Use case	Data type
`binary`	Conversion rate, click-through, churn	0/1 outcomes
`lognormal`	Revenue, order value, session duration	Right-skewed positive continuous
`gaussian`	Latency, load time, scores	Symmetric continuous
`studentt`	Same as gaussian but robust to outliers	Symmetric continuous with heavy tails
`poisson`	Events per user, purchases per session	Count data

Every model has a flat and a hierarchical variant. The flat variant is the default; the hierarchical variant is selected automatically when segment_col is provided, with no additional configuration.
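
The intuition behind the hierarchical variant is partial pooling: thin segments are pulled toward the population-level estimate more strongly than large ones. The sketch below shows that effect with the closed-form normal-normal shrinkage weight. It is a simplification under stated assumptions, not argonx's model: the two variances are fixed by hand (a real hierarchical model infers them), the segment lifts and sizes are invented, and a sample-size-weighted mean stands in for the population-level mean.

```python
# Hypothetical per-segment observed lifts and sample sizes.
segments = {
    "ios": (0.060, 8000),
    "android": (0.005, 9000),
    "tablet": (0.150, 120),
}

SIGMA2 = 1.0   # assumed within-segment outcome variance
TAU2 = 0.0004  # assumed variance of true lifts across segments

total_n = sum(n for _, n in segments.values())
grand_mean = sum(lift * n for lift, n in segments.values()) / total_n


def shrink(lift, n):
    """Return (pooled estimate, weight placed on the segment's own data)."""
    weight = TAU2 / (TAU2 + SIGMA2 / n)
    return weight * lift + (1 - weight) * grand_mean, weight


pooled = {name: shrink(lift, n) for name, (lift, n) in segments.items()}

for name, (estimate, weight) in pooled.items():
    raw = segments[name][0]
    print(f"{name:8s} raw={raw:+.3f} pooled={estimate:+.3f} weight={weight:.2f}")
```

The tablet segment keeps only a small share of its own noisy estimate, while the large iOS segment is barely moved.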

Guardrail metrics can use a different model than the primary metric:

experiment = Experiment(
    # ... other arguments as in the Quick Example
    model='binary',                                  # primary: conversion rate
    guardrail_models={'page_load_ms': 'lognormal'},  # guardrail: load time
)

What `result.summary()` Looks Like

============================================================
EXPERIMENT RESULTS
============================================================

PRIMARY METRIC
----------------------------------------
Best Variant: variant_b
Expected lift:    +4.3% (95% HDI: +1.0% to +7.0%)
P(best) across all variants: 0.971

RISK
----------------------------------------
Expected loss if wrong:          0.0009
CVaR (95th percentile loss):     0.0021
Risk level:                      low

PRACTICAL SIGNIFICANCE (ROPE)
----------------------------------------
Effect is OUTSIDE ROPE — practically meaningful.
P(practical effect): 0.941

GUARDRAILS
----------------------------------------
  page_load_ms              [FAIL]  variant=variant_b  P(degraded)=0.912  threshold=0.100

GUARDRAIL CONFLICTS DETECTED
----------------------------------------
Strong evidence for variant_b on primary metric.
Guardrail violation on page_load_ms with 91.2% probability.
Framework cannot resolve this tradeoff. Human review required.

============================================================
DECISION
----------------------------------------
State:          conflict
Recommendation: REVIEW REQUIRED
Confidence:     low

Reasoning:
  - P(best) exceeds strong threshold
  - Expected loss below configured maximum
  - Guardrail violation: page_load_ms cannot be resolved automatically
============================================================

The framework does not make the decision. It makes the right decision obvious.
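
One way to picture how the sample output arrives at a conflict state is the rule sketched below. It is an illustration only: the `Evidence` class, the `decision_state` function, the "ship" and "inconclusive" labels, and the 0.95 and 0.01 thresholds are assumptions, not argonx's API or defaults. The input numbers and the 0.100 guardrail threshold are taken from the output shown above.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    p_best: float
    expected_loss: float
    guardrail_p_degraded: dict  # guardrail name -> P(degraded)


def decision_state(ev, p_best_strong=0.95, max_loss=0.01, guardrail_threshold=0.10):
    """Classify the evidence without resolving primary/guardrail tradeoffs."""
    primary_ok = ev.p_best >= p_best_strong and ev.expected_loss <= max_loss
    violated = sorted(
        name for name, p in ev.guardrail_p_degraded.items() if p > guardrail_threshold
    )
    if primary_ok and violated:
        return "conflict", violated  # strong primary evidence, guardrail at risk
    if primary_ok:
        return "ship", violated
    return "inconclusive", violated


evidence = Evidence(
    p_best=0.971,
    expected_loss=0.0009,
    guardrail_p_degraded={"page_load_ms": 0.912},
)
state, violated = decision_state(evidence)
print(state, violated)
```

The rule deliberately stops at naming the conflict. Weighing conversion against load time is left to the reader of the report.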

Sequential Stopping

from argonx.sequential import StoppingChecker

checker = StoppingChecker(
    loss_threshold=0.01,
    prob_best_min=0.95,
    min_sample_size=1000,
)

# called at each checkpoint as data accumulates
status = checker.update(
    samples=result.samples,
    variant_names=['control', 'variant_b'],
    control='control',
    n_users_per_variant=n_counts,
)

print(status.safe_to_stop)
print(status.users_needed)   # approximate additional users if not safe to stop

checker.plot_trajectory()    # evidence accumulation over time

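Posterior quantities can be read at any checkpoint: Bayesian inference conditions on the data observed, not on the stopping rule, so looking early does not invalidate them the way repeated peeking invalidates a fixed-sample frequentist test. Expected-loss stopping bounds the expected cost of a wrong decision; it does not by itself guarantee a frequentist false positive rate. argonx tells you when evidence is strong enough, not when a predetermined sample size is reached.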
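
The stopping rule described above can be sketched as a loop over checkpoints. This is not the StoppingChecker implementation: it assumes a Beta-Binomial model, simulated traffic with made-up true rates, a loss threshold expressed in absolute conversion-rate units, and a single candidate variant. The three thresholds mirror the parameter names in the snippet above, but their values here are arbitrary.

```python
import random

random.seed(2)

TRUE_RATES = {"control": 0.10, "variant_b": 0.13}  # hypothetical, unknown in practice
LOSS_THRESHOLD = 0.001
PROB_BEST_MIN = 0.95
MIN_SAMPLE_SIZE = 1000
BATCH = 500      # new users per variant between checkpoints
N_DRAWS = 4000


def check(counts):
    """Return (P(variant_b beats control), expected loss of shipping variant_b)."""
    post = {
        name: [random.betavariate(1 + c, 1 + n - c) for _ in range(N_DRAWS)]
        for name, (c, n) in counts.items()
    }
    diffs = [a - b for a, b in zip(post["control"], post["variant_b"])]
    p_best = sum(d < 0 for d in diffs) / N_DRAWS
    exp_loss = sum(max(d, 0.0) for d in diffs) / N_DRAWS
    return p_best, exp_loss


counts = {name: (0, 0) for name in TRUE_RATES}
history = []
stopped_at = None

for checkpoint in range(1, 41):
    # Simulate one more batch of users for every variant.
    for name, rate in TRUE_RATES.items():
        c, n = counts[name]
        c += sum(random.random() < rate for _ in range(BATCH))
        counts[name] = (c, n + BATCH)

    n_per_variant = checkpoint * BATCH
    p_best, exp_loss = check(counts)
    history.append((n_per_variant, p_best, exp_loss))

    if (
        n_per_variant >= MIN_SAMPLE_SIZE
        and p_best >= PROB_BEST_MIN
        and exp_loss <= LOSS_THRESHOLD
    ):
        stopped_at = n_per_variant
        break

print(f"stopped at {stopped_at} users per variant")
```

The `history` list plays the role of the evidence trajectory: each entry records the sample size, P(best), and expected loss at one checkpoint.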

Examples

Five real-world worked examples in examples/:

Notebook	Scenario	Key feature
`01_ecommerce_checkout.ipynb`	Checkout redesign	Guardrail conflict: conversion vs. load time
`02_saas_revenue_sequential.ipynb`	SaaS pricing page	Sequential stopping fires early
`03_clinical_trial.ipynb`	Drug dosage protocol	StudentT vs Gaussian: outlier robustness
`04_gaming_matchmaking.ipynb`	Matchmaking algorithm	3-way multivariant, simultaneous argmax
`05_mobile_personalisation.ipynb`	Fintech personalisation	Hierarchical: iOS wins, Android neutral, thin tablet segment

Running Tests

# Fast — unit tests only, no MCMC (~60 seconds)
pytest tests/unit/

# Statistical property verification — no MCMC
pytest tests/math/

# Full suite including MCMC integration tests (slow)
pytest tests/

The test suite has three tiers matching the CI pipeline. Unit tests run on every push. Math tests run on every PR. Integration tests run on merge to main.

Contributing

Bug reports and PRs are welcome. Before opening a PR:

Run pytest tests/unit/ tests/math/ and confirm everything passes
For changes to the decision engine, add a test to tests/math/test_decision_sims.py that verifies the statistical property you're changing
For new model variants, add corresponding tests to tests/integration/test_models.py

Open an issue first for anything beyond bug fixes — architectural changes to the decision engine or new model types are worth discussing before implementation.

License

MIT

Release history

Version	Release date
0.1.5	May 20, 2026
0.1.4	May 16, 2026
0.1.3	May 16, 2026
0.1.2	May 12, 2026
0.1.1	May 5, 2026
0.1.0 (this version)	May 5, 2026
