Skip to main content

Minimum spanning tree based manifold approximations.

Project description

PyPI version Tests

Manifold Modelling with Minimum Spanning Trees

Dimensionality reduction (DR) algorithms typically assume the data they are given is uniformly sampled from some underlying manifold. When this is not the case, and there are observation-gaps along the manifold, these algorithms may fail to detect a single connected entity. This repository presents two manifold approximation approaches based on minimum spanning trees (MST) for non-uniform sampled data.

Noisy Minimum Spanning Tree Union

The noisy minimum spanning tree union ($n$-MST) is inspired by Pathfinder networks that, with a specific parameter selection, yield the union set of all possible MSTs in a network (see, e.g., [1], [2]). We compute noisy MSTs to detect alternative connectivity at all distance scales for distances which may have few identically weighted connections.

We add Gaussian noise ($\mu=0$) to every candidate edge. The noise parameter $n$ is specified as a fraction of the points' nearest neighbour distance and controls the Gaussian's standard deviation. This formulation makes the noise scale with the data's density to avoid adding more edges in dense regions than sparse regions, retaining a reasonably uniform manifold approximation graph.

import matplotlib.pyplot as plt
import matplotlib.collections as mc
from sklearn.datasets import make_swiss_roll
from multi_mst.noisy_mst import NoisyMST

X, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)
projector = NoisyMST(num_trees=10, noise_fraction=1.0).fit(X)

# Drawing the network
xs = projector.embedding_[:, 0]
ys = projector.embedding_[:, 1]
coo_matrix = projector.graph_.tocoo()
sources = coo_matrix.row
targets = coo_matrix.col

plt.figure(figsize=(4, 3))
plt.scatter(xs, ys, c=t, s=1, edgecolors="none", linewidth=0, cmap="viridis")
lc = mc.LineCollection(
    list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),
    linewidth=0.2,
    zorder=-1,
    alpha=0.5,
    color="k",
)
ax = plt.gca()
ax.add_collection(lc)
ax.set_aspect("equal")
plt.subplots_adjust(0, 0, 1, 1)
plt.axis("off")
plt.show()

noisy_mst

$k$-Nearest Minimum Spanning Tree

The k-nearest Minimum Spanning Tree ($k$-MST) generalises $k$-nearest neighbour networks ($k$-NN) to minimum spanning trees. It adds the $k$ shortest edges between components. Since data points start as distinct components, all $k$-NN edges are included in the kMST.

To avoid creating shortcuts in the manifold, a distance threshold $\epsilon$ can be applied. The parameter is specified as a fraction of the shortest edge between components and provides an upper distance limit for the $2$-to-$k$ alternative edges.

import matplotlib.pyplot as plt
import matplotlib.collections as mc
from sklearn.datasets import make_swiss_roll
from multi_mst.k_mst import KMST

X, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)
projector = KMST(num_neighbors=3, epsilon=2.0).fit(X)

# Drawing the network
xs = projector.embedding_[:, 0]
ys = projector.embedding_[:, 1]
coo_matrix = projector.graph_.tocoo()
sources = coo_matrix.row
targets = coo_matrix.col

plt.figure(figsize=(4, 3))
plt.scatter(xs, ys, c=t, s=1, edgecolors="none", linewidth=0, cmap="viridis")
lc = mc.LineCollection(
    list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),
    linewidth=0.2,
    zorder=-1,
    alpha=0.5,
    color="k",
)
ax = plt.gca()
ax.add_collection(lc)
ax.set_aspect("equal")
plt.subplots_adjust(0, 0, 1, 1)
plt.axis("off")
plt.show()

k_mst

Approximate $k$-MST

Computing $k$-MSTs using KDTrees can be expensive on some datasets. We provide a version of the algorithm based on Nearest Neighbour Descent for quicker approximations. We combined Boruvka's algorithm with NNDescent to find neighbours that are not already connected in the MST being build.

import matplotlib.pyplot as plt
import matplotlib.collections as mc
from sklearn.datasets import make_swiss_roll
from multi_mst.k_mst_descent import KMSTDescent

X, t = make_swiss_roll(n_samples=2000, noise=0.5, hole=True)
projector = KMSTDescent(num_neighbors=3, epsilon=2.0).fit(X)

# Draw the network
xs = projector.embedding_[:, 0]
ys = projector.embedding_[:, 1]
coo_matrix = projector.graph_.tocoo()
sources = coo_matrix.row
targets = coo_matrix.col

plt.figure(figsize=(4, 3))
plt.scatter(xs, ys, c=t, s=1, edgecolors="none", linewidth=0, cmap="viridis")
lc = mc.LineCollection(
    list(zip(zip(xs[sources], ys[sources]), zip(xs[targets], ys[targets]))),
    linewidth=0.2,
    zorder=-1,
    alpha=0.5,
    color="k",
)
ax = plt.gca()
ax.add_collection(lc)
ax.set_aspect("equal")
plt.subplots_adjust(0, 0, 1, 1)
plt.axis("off")
plt.show()

k_mst

Installation Instructions

The multi_mst package can be installed from pypi:

pip install multi_mst

Acknowledgements

Most code---including the numba KDTree, disjoint set and Boruvka MST construction implementation---is adapted from fast_hdbscan.

License

multi_mst uses the same license as fast_hdbscan: BSD (2-clause). See the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

multi_mst-0.1.3.tar.gz (135.9 kB view details)

Uploaded Source

Built Distribution

multi_mst-0.1.3-py3-none-any.whl (43.0 kB view details)

Uploaded Python 3

File details

Details for the file multi_mst-0.1.3.tar.gz.

File metadata

  • Download URL: multi_mst-0.1.3.tar.gz
  • Upload date:
  • Size: 135.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.1 CPython/3.11.10

File hashes

Hashes for multi_mst-0.1.3.tar.gz
Algorithm Hash digest
SHA256 d767827601ef62ae2edfcb731461152dc7b3d8e4abf48661037a1665baf6d78e
MD5 8786f534b97844dba0f556a656886bd2
BLAKE2b-256 3943f042db3784785928f02ad550473103fb0c4daa98a5d2400b8a7e99539a62

See more details on using hashes here.

File details

Details for the file multi_mst-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: multi_mst-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 43.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.1 CPython/3.11.10

File hashes

Hashes for multi_mst-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 a93021fe8654e3ad98cf82b05202233391c01947c639e8be8d75f45f4e8e058c
MD5 3f06aa23f11ec1617cd3f641fe3a1f6e
BLAKE2b-256 7e7b9c4213a28e658fda50ac6fde3212a0c840204d3d0bfd7a77a8c9cc7f4570

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page