Pytest plugin for Holm-Bonferroni correction of randomized tests

pytest-familywise

A pytest plugin for running multiple randomized tests while controlling the family-wise error rate (FWER) via the Holm-Bonferroni step-down procedure.

Motivation

A test suite that contains several independent statistical tests will, under the null hypothesis, produce at least one false positive with probability greater than the nominal level $\alpha$. For $m$ independent tests, each at level $\alpha$, the FWER is $1 - (1-\alpha)^m$. Holm-Bonferroni controls the FWER at $\alpha$ without being as conservative as a plain Bonferroni adjustment.
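To make the inflation concrete, the FWER formula can be evaluated for a few family sizes. This is a minimal sketch; the function name `familywise_error_rate` is illustrative and not part of the plugin.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """FWER for m independent tests, each run at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    # Print the uncorrected FWER for a few family sizes at alpha = 0.05.
    for m in (1, 3, 10, 20):
        print(f"m={m:2d}  FWER={familywise_error_rate(0.05, m):.3f}")
```

With $\alpha = 0.05$, ten independent tests already give a FWER of about 0.40.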

The complication is that Holm-Bonferroni must process p-values from smallest to largest: the threshold at rank k depends on the total count m, and rank k is only reached if every smaller p-value was rejected. The plugin therefore defers pass/fail decisions. Every test runs to completion first, p-values are collected, and then the procedure is applied once to the full set.

Installation and loading

Add the package as a dev dependency:

uv add --dev pytest-familywise

That is all that is needed. The package declares a pytest11 entry point:

# pyproject.toml
[project.entry-points."pytest11"]
random = "pytest_familywise"

pytest scans installed pytest11 entry points at startup and loads matching modules automatically. The fixtures (assertNotReject, ztest_sample_size, etc.) are defined at module level in pytest_familywise, so they become available in every test file without any import or conftest.py change.

Quick example

import numpy as np
import scipy.stats

def test_uniform_marginals(ks_sample_size, assertNotReject):
    """Each output coordinate of our RNG should be marginally uniform."""
    n = ks_sample_size(effect_size=0.05)   # detect CDF deviation >= 5 pp
    samples = np.random.rand(n)
    result = scipy.stats.kstest(samples, "uniform")
    assertNotReject(result.pvalue)


def test_normal_mean_zero(ztest_sample_size, assertNotReject):
    """Standardised output should have mean zero"""
    n = ztest_sample_size(effect_size=0.3)   # Cohen's d = 0.3
    samples = np.random.randn(n)
    _, p = scipy.stats.ttest_1samp(samples, 0.0)
    assertNotReject(p)


def test_discrete_distribution(chisquare_sample_size, assertNotReject):
    """A categorical sampler should match its target probabilities."""
    n = chisquare_sample_size(effect_size=0.2, df=4)   # Cohen's w = 0.2
    observed = np.random.multinomial(n, [0.2] * 5)
    _, p = scipy.stats.chisquare(observed)
    assertNotReject(p)

Run with:

pytest --holm-alpha=0.05 --power=0.8

After all three tests complete, the plugin applies Holm-Bonferroni and appends a summary to the terminal output:

============ Holm-Bonferroni correction  α=0.05  n=3 =============
  PASSED  p=0.312541  threshold=0.016667  test_rng.py::test_uniform_marginals
  PASSED  p=0.487302  threshold=0.025000  test_rng.py::test_normal_mean_zero
  PASSED  p=0.621088  threshold=0.050000  test_rng.py::test_discrete_distribution

  3 passed, 0 failed after Holm-Bonferroni correction

The exit code is non-zero if any test fails the corrected threshold.

How the step-down procedure works

Given $m$ tests with p-values sorted ascending as $p_1 \le p_2 \le \cdots \le p_m$:

  • At rank $k$, the threshold is $\alpha / (m - k + 1)$.
  • Starting from $k = 1$, reject $H_0$ while $p_k \le \text{threshold}$.
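  • As soon as a p-value exceeds its threshold, stop: that hypothesis and all remaining ones are not rejected, so the corresponding tests pass. Only the tests whose $H_0$ was rejected before the stop fail.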

This is more powerful than Bonferroni ($\alpha/m$ for all tests) because later ranks receive a relaxed threshold once earlier hypotheses have been rejected.
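The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the plugin's actual code; the function name `holm_reject` is an assumption. It returns, for each input p-value, whether $H_0$ is rejected, which for an assertNotReject test means the test fails.

```python
def holm_reject(pvalues: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm-Bonferroni step-down: True where H0 is rejected."""
    m = len(pvalues)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        threshold = alpha / (m - rank + 1)
        if pvalues[idx] <= threshold:
            reject[idx] = True
        else:
            # First p-value above its threshold: stop, keep the rest unrejected.
            break
    return reject


if __name__ == "__main__":
    print(holm_reject([0.01, 0.02, 0.04]))
    print(holm_reject([0.01, 0.04, 0.03]))
```

Plain Bonferroni at $\alpha/m \approx 0.0167$ would reject only the first of `[0.01, 0.02, 0.04]`, while the step-down procedure rejects all three.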

CLI options

Option        Default  Description
--holm-alpha  0.05     Family-wise error rate for the Holm-Bonferroni procedure
--power       0.8      Per-test power used by the sample-size fixtures

--power is per-test, not family-wise. The sample-size fixtures use Holm-Bonferroni corrected significance levels rather than the raw alpha. At collection time, the plugin counts the number of assertNotReject tests (m) and then assigns alpha / (m - k + 1) to the k-th test that requests a sample size, in execution order. The first test receives the most stringent threshold (alpha / m) and therefore the largest sample size; later tests receive progressively relaxed thresholds and smaller samples. Because of this, it is worth ordering your test suite so that more computationally expensive tests run later, where the required sample sizes are smaller.
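The effect of execution order on sample size can be illustrated as follows. This is a sketch under stated assumptions: `corrected_alphas` and `two_sided_z_n` are hypothetical helpers, and the two-sided z-test formula is used only as an example of a sample-size rule.

```python
import math
from statistics import NormalDist


def corrected_alphas(alpha: float, m: int) -> list[float]:
    """Level alpha / (m - k + 1) for the k-th test in execution order."""
    return [alpha / (m - k + 1) for k in range(1, m + 1)]


def two_sided_z_n(d: float, alpha: float, power: float) -> int:
    """Two-sided z-test sample size for effect size d at the given level."""
    z = NormalDist().inv_cdf
    return math.ceil(((z(1 - alpha / 2) + z(power)) / d) ** 2)


if __name__ == "__main__":
    # Three tests, family-wise alpha 0.05, per-test power 0.8, d = 0.3.
    for k, a in enumerate(corrected_alphas(0.05, 3), start=1):
        print(f"k={k}  alpha={a:.6f}  n={two_sided_z_n(0.3, a, 0.8)}")
```

The required n shrinks as k grows, which is why placing expensive tests later in the run reduces total cost.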

Fixtures

assertNotReject

def test_something(assertNotReject):
    p = run_statistical_test()
    assertNotReject(p)   # registers the p-value; plugin decides pass/fail

The test passes if the null hypothesis is not rejected after Holm-Bonferroni correction (i.e. the p-value is large enough). It fails if H0 is rejected.

Calling assertNotReject(p) with a value outside [0, 1] raises ValueError. If a test raises an exception before assertNotReject is called, it fails normally and is excluded from the Holm-Bonferroni set.
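The validate-then-record behaviour described above could look roughly like this. The class `PValueRecorder` and its `record` method are hypothetical stand-ins for illustration; the plugin's internal state is not documented here.

```python
class PValueRecorder:
    """Hypothetical stand-in for the state behind assertNotReject."""

    def __init__(self) -> None:
        self.pvalues: dict[str, float] = {}

    def record(self, test_id: str, p: float) -> None:
        # Comparisons with NaN are always False, so NaN is rejected as well.
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value must lie in [0, 1], got {p!r}")
        # Only store the value; pass/fail is decided later over the full set.
        self.pvalues[test_id] = float(p)
```

A test that raises before recording never appears in `pvalues`, so it would not count toward m in the correction.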

ztest_sample_size

n = ztest_sample_size(effect_size=0.5)               # two-sided (default)
n = ztest_sample_size(effect_size=0.5, two_sided=False)

effect_size is Cohen's d. Uses the exact closed form:

$$n = \left\lceil \left(\frac{z_\alpha + z_\beta}{d}\right)^2 \right\rceil$$

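Here $z_\alpha$ is the standard normal critical value at the corrected level (the upper $\alpha/2$ quantile for the two-sided default, the upper $\alpha$ quantile otherwise) and $z_\beta$ is the upper $\beta$ quantile with $\beta = 1 - \text{power}$. As written, the formula gives the sample size for a one-sample test, which is how the quick example uses it; a two-sample comparison with equal groups needs $2\left((z_\alpha + z_\beta)/d\right)^2$ per group.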
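The closed form above can be evaluated directly with the standard library. This is a sketch, not the fixture's implementation: the function name `ztest_n` and its `alpha` parameter are assumptions (the fixture derives its level from --holm-alpha and the test's position).

```python
import math
from statistics import NormalDist


def ztest_n(effect_size: float, alpha: float = 0.05, power: float = 0.8,
            two_sided: bool = True) -> int:
    """Evaluate ceil(((z_alpha + z_beta) / d) ** 2) for Cohen's d."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist().inv_cdf
    # Split alpha across both tails for a two-sided test.
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z(1 - tail)
    z_beta = z(power)  # upper beta quantile, beta = 1 - power
    return math.ceil(((z_alpha + z_beta) / effect_size) ** 2)


if __name__ == "__main__":
    print(ztest_n(0.5))
    print(ztest_n(0.5, two_sided=False))
```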

chisquare_sample_size

n = chisquare_sample_size(effect_size=0.3, df=4)

effect_size is Cohen's $w = \sqrt{\sum (p_i - p_{0i})^2 / p_{0i}}$; df is the degrees of freedom (number of categories − 1 for goodness-of-fit). Solves numerically via the non-central χ² survival function.
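Choosing an effect_size requires translating a concrete alternative into Cohen's $w$. The definition above can be computed as in this sketch; the function name `cohens_w` is illustrative and not part of the plugin.

```python
import math


def cohens_w(alt_probs: list[float], null_probs: list[float]) -> float:
    """Cohen's w between an alternative and a null categorical distribution."""
    if len(alt_probs) != len(null_probs):
        raise ValueError("distributions must have the same number of categories")
    return math.sqrt(
        sum((p - p0) ** 2 / p0 for p, p0 in zip(alt_probs, null_probs))
    )


if __name__ == "__main__":
    # Uniform null over five categories versus a mildly skewed alternative.
    print(cohens_w([0.3, 0.2, 0.2, 0.2, 0.1], [0.2] * 5))
```

For the skewed alternative shown, $w = \sqrt{0.1} \approx 0.316$, so effect_size=0.3 would target deviations of roughly that size.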

ks_sample_size

n = ks_sample_size(effect_size=0.1)                 # one-sample
n = ks_sample_size(effect_size=0.1, two_sample=True) # per-group

effect_size is the maximum absolute CDF difference $\Delta = \|F - G\|_\infty \in (0, 1]$. Uses the DKW-inequality bound:

$$n \ge \frac{\left(\sqrt{\ln(2/\alpha)} + \sqrt{\ln(2/\beta)}\right)^2}{2\Delta^2}$$

where $\beta = 1 - \text{power}$. For two_sample=True the effective sample size for the two-sample KS statistic is $n_1 n_2/(n_1+n_2) = n_\text{each}/2$ (equal groups), so the returned per-group count is double the formula above.
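The DKW bound and the two-sample doubling can be written out as below. This is a sketch under assumptions: the name `ks_n` and the explicit `alpha` parameter are illustrative, and rounding up after doubling is one possible reading of "double the formula above".

```python
import math


def ks_n(effect_size: float, alpha: float = 0.05, power: float = 0.8,
         two_sample: bool = False) -> int:
    """Sample size from the DKW bound for a maximum CDF gap of effect_size."""
    if not 0 < effect_size <= 1:
        raise ValueError("effect_size must lie in (0, 1]")
    beta = 1 - power
    bound = (
        math.sqrt(math.log(2 / alpha)) + math.sqrt(math.log(2 / beta))
    ) ** 2 / (2 * effect_size ** 2)
    if two_sample:
        # Equal groups: effective size is n_each / 2, so double the bound.
        bound *= 2
    return math.ceil(bound)


if __name__ == "__main__":
    print(ks_n(0.1))
    print(ks_n(0.1, two_sample=True))
```

Because the bound scales with $1/\Delta^2$, halving effect_size roughly quadruples the required sample.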
