Pytest plugin for Holm-Bonferroni correction of randomized tests
pytest-familywise
A pytest plugin for running multiple randomized tests while controlling the
family-wise error rate (FWER) via the Holm-Bonferroni step-down procedure.
Motivation
A test suite that contains several independent statistical tests will, under the null hypothesis, produce at least one false positive with probability greater than the nominal level $\alpha$. For $m$ independent tests each at level $\alpha$, the FWER is $1 - (1-\alpha)^m$. Holm-Bonferroni corrects for this without being as conservative as a plain Bonferroni adjustment.
The complication is that Holm-Bonferroni must process p-values from smallest to largest: the threshold for rank k depends on the total count m, and whether rank k is rejected depends on every smaller p-value before it. No test can therefore be decided in isolation. This plugin defers pass/fail decisions: every test runs to completion first, p-values are collected, and then the procedure is applied once to the full set.
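As a quick illustration of how fast the uncorrected error rate grows, the formula can be evaluated directly. This is a minimal sketch; the helper name `fwer` is made up for the example and is not part of the plugin.

```python
def fwer(alpha: float, m: int) -> float:
    """Chance of at least one false positive among m independent level-alpha tests."""
    return 1.0 - (1.0 - alpha) ** m


for m in (1, 3, 10, 20):
    print(f"m={m:2d}  FWER={fwer(0.05, m):.3f}")
```

At $\alpha = 0.05$ this gives roughly 0.14 for three tests and 0.40 for ten.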
Installation and loading
Add the package as a dev dependency:
uv add --dev pytest-familywise
That is all that is needed. The package declares a pytest11 entry point:
# pyproject.toml
[project.entry-points."pytest11"]
random = "pytest_familywise"
pytest scans installed pytest11 entry points at startup and loads matching
modules automatically. The fixtures (assertNotReject, ztest_sample_size, etc.) are
defined at module level in pytest_familywise, so they become available in every
test file without any import or conftest.py change.
Quick example
import numpy as np
import scipy.stats


def test_uniform_marginals(ks_sample_size, assertNotReject):
    """Each output coordinate of our RNG should be marginally uniform."""
    n = ks_sample_size(effect_size=0.05)  # detect CDF deviation >= 5 pp
    samples = np.random.rand(n)
    result = scipy.stats.kstest(samples, "uniform")
    assertNotReject(result.pvalue)


def test_normal_mean_zero(ztest_sample_size, assertNotReject):
    """Standardised output should have mean zero."""
    n = ztest_sample_size(effect_size=0.3)  # Cohen's d = 0.3
    samples = np.random.randn(n)
    _, p = scipy.stats.ttest_1samp(samples, 0.0)
    assertNotReject(p)


def test_discrete_distribution(chisquare_sample_size, assertNotReject):
    """A categorical sampler should match its target probabilities."""
    n = chisquare_sample_size(effect_size=0.2, df=4)  # Cohen's w = 0.2
    observed = np.random.multinomial(n, [0.2] * 5)
    _, p = scipy.stats.chisquare(observed)
    assertNotReject(p)
Run with:
pytest --holm-alpha=0.05 --power=0.8
After all three tests complete, the plugin applies Holm-Bonferroni and appends a summary to the terminal output:
============ Holm-Bonferroni correction α=0.05 n=3 =============
PASSED p=0.312541 threshold=0.016667 test_rng.py::test_uniform_marginals
PASSED p=0.487302 threshold=0.025000 test_rng.py::test_normal_mean_zero
PASSED p=0.621088 threshold=0.050000 test_rng.py::test_discrete_distribution
3 passed, 0 failed after Holm-Bonferroni correction
The exit code is non-zero if any test fails the corrected threshold.
How the step-down procedure works
Given $m$ tests with p-values sorted ascending as $p_1 \le p_2 \le \cdots \le p_m$:
- At rank $k$, the threshold is $\alpha / (m - k + 1)$.
- Starting from $k = 1$, reject $H_0$ while $p_k \le \alpha / (m - k + 1)$. Each rejected test fails.
- As soon as a p-value exceeds its threshold, stop rejecting: that test and all remaining ones pass, even if a later p-value would fall below its own threshold.
This is more powerful than Bonferroni ($\alpha/m$ for all tests) because later ranks receive a relaxed threshold once earlier hypotheses have been rejected.
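The procedure can be sketched in a few lines of Python. This is an illustrative version rather than the plugin's actual code: the function name `holm_bonferroni` and the dict-based interface are assumptions made for the example. A value of `False` means $H_0$ was rejected, so the test fails.

```python
from typing import Dict


def holm_bonferroni(pvalues: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Map each test id to True (passes) or False (H0 rejected, test fails)."""
    m = len(pvalues)
    # Rank the tests by ascending p-value.
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    passed: Dict[str, bool] = {}
    rejecting = True
    for k, (test_id, p) in enumerate(ranked, start=1):
        threshold = alpha / (m - k + 1)
        if rejecting and p > threshold:
            # First p-value above its threshold: stop rejecting for good.
            rejecting = False
        passed[test_id] = not rejecting
    return passed


outcome = holm_bonferroni({"a": 0.001, "b": 0.02, "c": 0.024, "d": 0.3})
print(outcome)
```

With four tests the thresholds are 0.0125, 0.0167, 0.025 and 0.05. Test `a` is rejected, `b` exceeds its threshold and stops the procedure, and `c` passes even though 0.024 is below its own threshold of 0.025, because rejection has already stopped.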
CLI options
| Option | Default | Description |
|---|---|---|
| --holm-alpha | 0.05 | Family-wise error rate for the Holm-Bonferroni procedure |
| --power | 0.8 | Per-test power used by the sample-size fixtures |
--power is per-test, not family-wise. The sample-size fixtures use
Holm-Bonferroni corrected significance levels rather than the raw alpha. At
collection time, the plugin counts the number of assertNotReject tests (m)
and then assigns alpha / (m - k + 1) to the k-th test that requests a
sample size, in execution order. The first test receives the most stringent
threshold (alpha / m) and therefore the largest sample size; later tests
receive progressively relaxed thresholds and smaller samples. Because of this,
it is worth ordering your test suite so that more computationally expensive
tests run later, where the required sample sizes are smaller.
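The schedule of corrected levels described above can be written out as follows. This is a sketch of the arithmetic only; `corrected_alphas` is a hypothetical helper, not a function exported by the plugin.

```python
from typing import List


def corrected_alphas(alpha: float, m: int) -> List[float]:
    """Significance level handed to the k-th sample-size request, k = 1..m."""
    return [alpha / (m - k + 1) for k in range(1, m + 1)]


levels = corrected_alphas(0.05, 3)
print([round(a, 6) for a in levels])
```

For three tests at `--holm-alpha=0.05` the levels are 0.05/3, 0.05/2 and 0.05, matching the thresholds in the summary output of the quick example.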
Fixtures
assertNotReject
def test_something(assertNotReject):
    p = run_statistical_test()
    assertNotReject(p)  # registers the p-value; plugin decides pass/fail
The test passes if the null hypothesis is not rejected after Holm-Bonferroni correction (i.e. the p-value is large enough). It fails if H0 is rejected.
Calling assertNotReject(p) with a value outside [0, 1] raises ValueError.
If a test raises an exception before assertNotReject is called, it fails
normally and is excluded from the Holm-Bonferroni set.
ztest_sample_size
n = ztest_sample_size(effect_size=0.5) # two-sided (default)
n = ztest_sample_size(effect_size=0.5, two_sided=False)
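effect_size is Cohen's d. Uses the closed form:
$$n = \left\lceil \left(\frac{z_\alpha + z_\beta}{d}\right)^2 \right\rceil$$
where $z_\alpha$ is the standard normal critical value at the corrected significance level and $z_\beta$ is the quantile at the requested power. Returns n for a one-sample test. The formula has no factor of 2, so it is not a per-group size for a two-sample comparison.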
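The closed form above can be transcribed with the standard library alone. This is a sketch under stated assumptions, not the plugin's implementation: the name `ztest_n` is hypothetical, `alpha` stands for the already corrected level, and the two-sided case is assumed to use the critical value at $\alpha/2$.

```python
import math
from statistics import NormalDist


def ztest_n(effect_size: float, alpha: float, power: float = 0.8,
            two_sided: bool = True) -> int:
    """Sample size from n = ceil(((z_alpha + z_beta) / d) ** 2)."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    # Assumption: a two-sided test splits alpha across both tails.
    tail_alpha = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail_alpha)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / effect_size) ** 2)


print(ztest_n(0.5, alpha=0.05))                   # two-sided
print(ztest_n(0.5, alpha=0.05, two_sided=False))  # one-sided
```

Lowering `alpha`, as happens for tests early in the Holm-Bonferroni schedule, raises `z_alpha` and therefore the returned n.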
chisquare_sample_size
n = chisquare_sample_size(effect_size=0.3, df=4)
effect_size is Cohen's $w = \sqrt{\sum (p_i - p_{0i})^2 / p_{0i}}$; df is the degrees of
freedom (number of categories − 1 for goodness-of-fit). Solves numerically via
the non-central χ² survival function.
ks_sample_size
n = ks_sample_size(effect_size=0.1) # one-sample
n = ks_sample_size(effect_size=0.1, two_sample=True) # per-group
effect_size is the maximum absolute CDF difference $\Delta = \lVert F - G \rVert_\infty \in (0, 1]$.
Uses the DKW-inequality bound:
$$n \ge \frac{\left(\sqrt{\ln(2/\alpha)} + \sqrt{\ln(2/\beta)}\right)^2}{2\Delta^2}$$
where $\beta = 1 - \text{power}$. For two_sample=True the effective sample size for the
two-sample KS statistic is $n_1 n_2/(n_1+n_2) = n_\text{each}/2$ (equal groups), so the
returned per-group count is double the formula above.
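The bound is straightforward to evaluate. The following is a sketch only: `ks_n` is a hypothetical name, `alpha` is taken to be the already corrected level, and rounding up before doubling in the two-sample case is an assumption about a detail the text leaves open.

```python
import math


def ks_n(effect_size: float, alpha: float, power: float = 0.8,
         two_sample: bool = False) -> int:
    """Smallest integer n satisfying the DKW-based bound."""
    if not 0 < effect_size <= 1:
        raise ValueError("effect_size must lie in (0, 1]")
    beta = 1.0 - power
    numerator = (math.sqrt(math.log(2 / alpha)) + math.sqrt(math.log(2 / beta))) ** 2
    n = math.ceil(numerator / (2 * effect_size ** 2))
    # Equal groups: effective size is n_each / 2, so double the per-group count.
    return 2 * n if two_sample else n


print(ks_n(0.1, alpha=0.05))
print(ks_n(0.1, alpha=0.05, two_sample=True))
```

Because n scales with $1/\Delta^2$, halving effect_size roughly quadruples the required sample.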