Skip to main content

Statistical hypothesis testing for persistence diagrams and barcodes

Project description

TDA-PHANTOM

Topological data analysis - Persistent Homology Analysis via Null Testing On Manifolds (TDA-PHANTOM) is a tool for statistically analysing significance of persistence diagrams and barcodes.

This project implements hypothesis tests from:

Installation

Via PyPI:

pip install tdaphantom

Or you can clone this repository - TDA-PHANTOM and install it manually:

python setup.py install

Overview

This tool can build a Vietoris-Rips complex from either a point cloud or distance matrix. It can then be used to visualise the persistence diagram for that complex, and run various hypothesis tests for it. The results of these hypothesis tests can be analysed via a return results array, or visualised in a signifiance persistence diagram.

Example Usage

Generate data

def _make_circle(n=2000, noise=0.03, seed=1):
    rng   = np.random.default_rng(seed)
    theta = rng.uniform(0, 2 * np.pi, n)
    pts   = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    return pts + rng.normal(0, noise, pts.shape)
pc_circle = _make_circle()

Init Phantom class

phantom_circle = Phantom(pc_circle)

Calculate persistence diagram

Here we go up to homological dimension 1

phantom_circle.calculate_dgms_from_point_cloud_ripser(k=1)

Display persistence diagram

phantom_circle.display_dgms()

![Persistence diagram for a circle using phatom](Persistence diagram)

Run hypothesis test

alpha      = 0.01
correction = None
methods    = ["universal_null"]
phantom_circle.hypothesis_test(alpha, methods=methods, k=1)

Display signifiance persistence diagram

phantom_circle.display_results()

![signifiance Persistence diagram for a circle using phatom](Persistence diagram)

Basic useage

The general workflow is:

1. Initalise Phantom

Initalise with either a point cloud or distance matrix. If using a distance matrix, set is_distance_matrix=True.

from tdaphantom.tdaphantom import Phantom

# from a point cloud
phantom = Phantom(point_cloud)

# from a distance matrix
phantom = Phantom(distance_matrix, is_distance_matrix=True)

2. Generate persistence diagrams

Two backends are available — GUDHI and Ripser. Ripser is faster for Vietoris-Rips persistence. Both take k as a convenience argument to set the maximum homological dimension to compute.

This will work even if you have inputted a distance matrix

# gudhi backend
phantom.calculate_dgms_from_point_cloud(k=1)

# ripser backend
phantom.calculate_dgms_from_point_cloud_ripser(k=1)

3. Display persistence diagrams

Call display_dgms to plot the persistence diagram and/or barcode. Each homological dimension is shown in a different colour. You may also pass in a pre-made diagram dict of the form {0: dgm_0, 1: dgm_1, ...}.

phantom.display_dgms()                  # both diagram and barcode
phantom.display_dgms(plot="diagram")    # persistence diagram only
phantom.display_dgms(plot="barcode")    # barcode only

4. Run hypothesis tests

Run one or more hypothesis tests on the persistence diagram for a given dimension k. You may optionally pass per-method options as a list of dicts aligned to the methods list. Note: bottleneck methods require the point cloud or distance matrix to be available.

The below code snippet shows the default values

results = phantom.hypothesis_test(
    alpha   = 0.05,
    methods = ["universal_null", "bottleneck"],
    k = 1,
)

With custom options:

results = phantom.hypothesis_test(
    alpha   = 0.05,
    methods = ["universal_null", "bottleneck"],
    options = [
        {"correction_strategy": "Bonferroni"},   # options for universal_null
        {"max_depth": 100, "b_multiplier": 3.5}, # options for bottleneck
    ],
    k = 1,
)

5. Display results

Call display_results to visualise which bars are significant. Significant bars/points are shown in red, noise in blue. A threshold line is drawn on the signifiance persistence diagram showing the significance boundary.

phantom.display_results()                        # all methods, both plots
phantom.display_results(method="universal_null") # specific method
phantom.display_results(plot="diagram")          # signifiance persistence diagram only

6. Save and load

At any point you can save the full state of Phantom to disk and reload it later.

phantom.save("./saved/circle.phantom")
phantom = Phantom.load("./saved/circle.phantom")

Avaliable hypothesis methods

Method Alias Requires
universal_null:median universal_null diagram only
universal_null:mean diagram only
bottleneck:subsample bottleneck point cloud or distance matrix

Universal null median

Useage

phantom.hypothesis_test(alpha=0.05, methods=["universal_null"], k=1)

With options:

phantom.hypothesis_test(
    alpha   = 0.05,
    methods = ["universal_null"],
    options = [{"correction_strategy": "Bonferroni", "max_threshold": 10.0}],
    k       = 1,
)

Available options:

Option Default Description
correction_strategy None Multiple testing correction. "Bonferroni" or "BH" for Benjamini hochberg or None.
max_threshold 10 * max finite death Threshold substituted for infinite death values.
max_depth 1000 Maximum iterations for the infinite cycle threshold algorithm.

Theory

Bobrowski and Skraba (2023) show that the death/birth ratio $\pi = \text{death}/\text{birth}$ for noise bars follows a universal limiting distribution — the left-skewed Gumbel (LGumbel) — regardless of the underlying point cloud distribution or geometry.

The normalised statistic for each bar is:

\ell(p) = A \cdot \log \log \pi(p) + B

where:

\pi(p) = \frac{\text{death}}{\text{birth}}, \quad
A = 1 \text{ (Vietoris-Rips)}, \quad
B = -\gamma - A \cdot \hat{L}, \quad
\hat{L} = \text{median}(\log \log \pi) \text{ over all finite bars}, \quad
\gamma \approx 0.5772  \text{ (Euler-Mascheroni constant)},

Under the null hypothesis (noise), $\ell(p)$ follows $\text{LGumbel}(0,1)$. The p-value for each bar is the LGumbel survival function:

p_i = \exp(-\exp(\ell_i))

Significance is assessed after Bonferroni or BH correction across all bars. The threshold line on the persistence diagram is $\text{death} = \text{birth} \cdot \pi^$ where $\pi^$ is the minimum $\pi$ such that a bar would be rejected at the corrected alpha level.

Infinite bars are handled with Bobrowski and Skraba (2023) algorithm 1

τ ← τ₀
do
    D ← dgmₖ(τ)
    I ← { b : (b, d) ∈ D, d = ∞, and τ/b < π_min(α/|D|) }
    τ ← τ_new
while |I| > 0
return τ

where:

\tau_{\text{new}} = \begin{cases} \min(I) \cdot \pi_{\min}(\alpha / |D|) & I \neq \emptyset \\ \tau & I = \emptyset \end{cases}

where $\tau_0$ is the initial threshold, $\text{dgm}k(\tau)$ is the persistence diagram truncated at $\tau$, and $\pi{\min}(\alpha/|D|)$ is the minimum death/birth ratio such that a bar is significant at the Bonferroni-corrected level $\alpha/|D|$.

Universal null mean

Useage

phantom.hypothesis_test(alpha=0.05, methods=["universal_null:mean"], k=1)

Theory

Identical to universal null median, but $\hat{L}$ is estimated using the mean of $\log \log \pi$ rather than the median. The paper's exact formula uses the mean. The median is more robust to contamination from signal bars, but the mean is theoretically motivated by the paper's universality result.

Bottleneck subsampling

Useage

phantom.hypothesis_test(alpha=0.05, methods=["bottleneck"], k=1)

With options:

phantom.hypothesis_test(
    alpha   = 0.05,
    methods = ["bottleneck"],
    options = [{"max_depth": 100, "b_multiplier": 3.5}],
    k       = 1,
)

Available options:

Option Default Description
max_depth 50 Number of bootstrap subsamples N.
b_multiplier 3.5 The constant used for $O(\frac{n}{\log(n)})$. Larger values give smaller c_n and more power but looser theoretical guarantees.

Theory

Fasy et al. (2014) 4.1 derive a confidence set for the persistence diagram of the form:

\mathcal{C} = \{ Q : W_\infty(\hat{P}, P) \leq c_n \}

where $W_\infty$ is the bottleneck distance and $c_n$ is estimated by subsampling.

For each of $N$ subsamples $S^*_b$ of size $b$ drawn without replacement from $S_n$:

T_j = H(S_n, S^*_b) \quad \text{(Hausdorff distance between point sets)}
c_n = \text{quantile}(\{T_j\},\, 1 - \alpha)

By the bottleneck stability theorem, $W_\infty(\hat{P}, P) \leq H(S_n, M)$, so bars with persistence $> 2 c_n$ are significant at level $\alpha$.

The p-value for bar $i$ is the fraction of bootstrap distances exceeding half its persistence:

p_i = \frac{1}{N} \sum_j \mathbf{1}\!\left(T_j \geq \frac{\ell_i}{2}\right)

The error bound from the paper is:

P(H(S_n, M) > c_n) \leq \alpha + O\!\left(\left(\frac{b}{n}\right)^{1/4}\right)

so the bias term shrinks as $b/n \to 0$. In practice, larger $b$ gives more power at the cost of strict theoretical guarantees on the type I error. The default

b = int(3.5*(n / np.log(n)))

is a practical choice.

Bottleneck shells

Useage

TODO: not yet implemented.

# phantom.hypothesis_test(alpha=0.05, methods=["bottleneck:shells"], k=1)

Available options:

Option Default Description
max_depth 50 Number of bootstrap subsamples N.

Theory

TODO: not yet implemented.

Fasy et al. (2014) 4.3 Note: the method of shells requires that $P$ satisfies a uniform lower bound on local density within each shell. It does not apply to distributions with arbitrarily sparse regions.

Bottleneck density

Useage

TODO: not yet implemented.

# phantom.hypothesis_test(alpha=0.05, methods=["bottleneck:density"], k=1)

Available options:

Option Default Description
max_depth 50 Number of bootstrap subsamples N.
bandwidth None KDE bandwidth h. Estimated from data if not supplied.

Theory

TODO: not yet implemented.

Fasy et al. (2014) 4.4 estimates the underlying density $p$ using a kernel density estimator $\hat{p}_h$ and computes the persistence diagram of the upper level sets of $\hat{p}_h$.

Let $\hat{p}_h$ be a KDE with bandwidth $h$:

The density method is robust to outliers because the KDE 'smooths them out', unlike the distance-function methods which are sensitive to isolated points far from $M$.

Bottleneck concentration

Useage

TODO: not yet implemented.

# phantom.hypothesis_test(alpha=0.05, methods=["bottleneck:concentration"], k=1)

Available options:

Option Default Description
max_depth 50 Number of bootstrap subsamples N.
rho None Minimum local density of P on M. Estimated from data if not supplied.
d None Intrinsic dimension of M. Inferred from point cloud if not supplied.

Theory

TODO: not yet implemented. Fasy et al. (2014) 4.2

TODO

  • Add bottleneck shells
  • Add bottleneck density
  • Add bottleneck concentration
  • Add more integeration tests
  • Add more unit tests

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tdaphantom-1.0.4.tar.gz (20.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tdaphantom-1.0.4-py3-none-any.whl (19.2 kB view details)

Uploaded Python 3

File details

Details for the file tdaphantom-1.0.4.tar.gz.

File metadata

  • Download URL: tdaphantom-1.0.4.tar.gz
  • Upload date:
  • Size: 20.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for tdaphantom-1.0.4.tar.gz
Algorithm Hash digest
SHA256 002d3a35c4d6e1d52c71eefd346b603d7727ded7ed4a87ab4858908f80927db1
MD5 deb4dbe14550f9d614a5b92df2e30116
BLAKE2b-256 9142dd0adcf39c38dafb28aa25ce929d9d199d7f359ebc26e8ae4e66513c5783

See more details on using hashes here.

File details

Details for the file tdaphantom-1.0.4-py3-none-any.whl.

File metadata

  • Download URL: tdaphantom-1.0.4-py3-none-any.whl
  • Upload date:
  • Size: 19.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

Hashes for tdaphantom-1.0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 05bcd8635d13906b313489591887a881fef36bdb2a6317c340d62d175c1a5491
MD5 7705f538db6127b9cb35c50b317e3235
BLAKE2b-256 c3acb158406d785c422b70049d5358eaef38b15f7f7221d013c4215b42bf09b3

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page