Dimensionality reduction through Simplified Topological Abstraction of Data

Reason this release was yanked:

Python versions below 3.9 automatically install this old version, which does not behave as the repository describes.

Project description

pySTAD

This is a Python implementation of STAD for the exploration and visualisation of high-dimensional data. It is based on the R version.

Background

STAD is a dimensionality reduction algorithm that generates an abstract representation of high-dimensional data by giving each data point a location in a graph that preserves the distances of the original high-dimensional space. The STAD graph is built on the Minimum Spanning Tree (MST), to which new edges are added until the correlation between distances in the graph and distances in the original dataset is maximized. Additionally, STAD supports filter functions (lenses) for analysing data from new perspectives, emphasizing traits that would otherwise remain hidden.
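The idea in the paragraph above can be illustrated with a minimal sketch. It is not the pySTAD implementation: the function name `stad_sketch`, the use of hop counts as graph distance, and the evenly spaced sweep over candidate edge counts are all assumptions made for illustration, and it relies only on numpy and scipy.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform


def stad_sketch(points, n_candidates=10):
    """Illustrative only: MST plus the shortest extra edges, scored by correlation."""
    dist = squareform(pdist(points))      # dense pairwise Euclidean distances
    n = dist.shape[0]
    iu = np.triu_indices(n, k=1)          # index pairs with i < j
    pair_dist = dist[iu]

    # Edges of the minimum spanning tree, as a symmetric boolean mask.
    mst = minimum_spanning_tree(csr_matrix(dist)).toarray() > 0
    mst = mst | mst.T

    # Pairs that are not MST edges, ordered from shortest to longest distance.
    order = np.argsort(pair_dist)
    extra = [(iu[0][k], iu[1][k]) for k in order if not mst[iu[0][k], iu[1][k]]]

    best = (-np.inf, 0, None)
    # Stop one edge short of the complete graph, where all hop counts equal 1.
    for n_extra in np.linspace(0, len(extra) - 1, n_candidates, dtype=int):
        adj = mst.copy()
        for i, j in extra[:n_extra]:
            adj[i, j] = adj[j, i] = True
        # Assumed graph distance: number of hops between two vertices.
        hops = shortest_path(csr_matrix(adj.astype(float)), unweighted=True)
        corr = np.corrcoef(pair_dist, hops[iu])[0, 1]
        if corr > best[0]:
            best = (corr, int(n_extra), adj)
    return best  # (correlation, number of added edges, adjacency mask)


rng = np.random.default_rng(0)
corr, n_extra, adj = stad_sketch(rng.normal(size=(40, 3)))
print(f"best correlation {corr:.3f} with {n_extra} edges added to the MST")
```

The sketch evaluates only a handful of candidate edge counts and keeps the best one; the package itself offers other ways to search this objective.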

Topological data analysis

Topological data analysis (TDA) aims to describe the geometric structures present in data. A dataset is interpreted as a point cloud, where each point is sampled from an underlying geometric object. TDA tries to recover and describe the geometry of that object in terms of features that are invariant "under continuous deformations, such as stretching, twisting, crumpling and bending, but not tearing or gluing". Two geometries that can be deformed into each other without tearing or gluing are homeomorphic (for instance, a donut and a coffee mug). Typically, TDA describes the holes in a geometry, formalised as Betti numbers.
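As a small illustration of Betti numbers, the sketch below computes them for an undirected graph viewed as a one-dimensional object: b0 counts connected components and b1 counts independent cycles, using b1 = E - V + b0. The function name `graph_betti_numbers` is hypothetical and is not part of pySTAD; each edge is assumed to be listed once.

```python
def graph_betti_numbers(n_vertices, edges):
    """Return (b0, b1) for an undirected graph given as a list of vertex pairs."""
    parent = list(range(n_vertices))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    b0 = len({find(v) for v in range(n_vertices)})  # connected components
    b1 = len(edges) - n_vertices + b0               # independent cycles
    return b0, b1


cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
path = [(0, 1), (1, 2), (2, 3)]
print(graph_betti_numbers(4, cycle))  # (1, 1): one component, one hole
print(graph_betti_numbers(4, path))   # (1, 0): one component, no hole
```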

Like other TDA algorithms, STAD constructs a graph that describes the structure of the data. However, the output of STAD should be interpreted as a data-visualisation result rather than a topological description of the data's structure. Other TDA algorithms, such as Mapper, do produce topological results, but they rely on aggregating the data, whereas STAD encodes the original data points as vertices in a graph.

Dimensionality reduction

Compared to dimensionality reduction algorithms like t-SNE and UMAP, STAD produces a more flexible description of the data: a graph can be drawn using different layouts, and a user can interact with it. In addition, STAD's projections retain the global structure of the data. In general, the STAD graph tends to underestimate the distances between far-apart data points in the network structure. t-SNE and UMAP, on the other hand, emphasize the relation of data points to their closest neighbors over their relation to distant data points.

from Alcaide & Aerts (2020)

Installation

pySTAD can be installed with:

pip install pystad

This installs the following dependencies:

  • numpy
  • scipy
  • python-igraph
  • pandas
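To check whether these dependencies are available in the current environment, a small helper like the one below can be used. The helper name `missing_dependencies` is hypothetical; the only package-specific detail it relies on is that python-igraph is imported under the module name `igraph`.

```python
from importlib.util import find_spec

# Maps the pip package name to the module name used in `import` statements.
DEPENDENCIES = {
    "numpy": "numpy",
    "scipy": "scipy",
    "python-igraph": "igraph",
    "pandas": "pandas",
}


def missing_dependencies(deps=DEPENDENCIES):
    """Return the pip names of packages whose module cannot be found."""
    return [pkg for pkg, module in deps.items() if find_spec(module) is None]


print("Missing packages:", missing_dependencies() or "none")
```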

The example notebooks have additional dependencies:

  • matplotlib
  • networkx
  • scikit-learn
  • jupyterlab
  • ipywidgets

These can be installed with pip or conda. Enabling ipywidgets in JupyterLab takes two more steps:

  • First, install nodejs using conda:
conda install -c conda-forge nodejs
  • Then install the jupyter lab extension:
jupyter labextension install @jupyter-widgets/jupyterlab-manager

Examples

Please see the example notebooks for demonstrations of STAD and interactive exploration dashboards. The code below provides a quick-start:

import stad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse import triu
from sklearn.metrics.pairwise import euclidean_distances

# Horse dataset: load the point cloud and draw a random sample of 500 points
data = pd.read_csv('./examples/data/horse.csv', header=0)
data = data.sample(n=500)
dist = triu(euclidean_distances(data), k = 1)

plt.scatter(data.z, data.y, s=5, c=data.x)
plt.show()

## STAD without lens
network_no_lens, detail = stad.stad(dist)
stad.draw_network_matplotlib(network_no_lens, detail)
plt.show()
stad.draw_correlations_matplotlib(detail)
plt.show()

## STAD with lens
network_lens, detail = stad.stad(dist, lens_values = data['x'], lens_bins = 3)
stad.draw_network_matplotlib(network_lens, detail)
plt.show()
stad.draw_correlations_matplotlib(detail)
plt.show()
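The quick-start builds `dist` as a sparse matrix that holds only the upper triangle of the pairwise distances. If the horse dataset is not at hand, a matrix of the same form can be built from synthetic data. The sketch below is an assumption-laden stand-in: the noisy circle is made up for illustration, and scipy's `pdist` and `squareform` replace scikit-learn's `euclidean_distances` so that only numpy and scipy are needed.

```python
import numpy as np
from scipy.sparse import triu
from scipy.spatial.distance import pdist, squareform

# Synthetic stand-in for the example data: 200 points on a noisy unit circle.
rng = np.random.default_rng(42)
angles = rng.uniform(0, 2 * np.pi, size=200)
points = np.column_stack([np.cos(angles), np.sin(angles)])
points += rng.normal(scale=0.05, size=points.shape)

full = squareform(pdist(points))  # symmetric 200 x 200 matrix, zero diagonal
dist = triu(full, k=1)            # sparse; keeps entries strictly above the diagonal

print(dist.shape, dist.nnz)
```

Each pair of points is stored once, so the number of stored entries is n(n-1)/2 rather than n squared.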

Compared to the R-implementation

The R implementation supports two-dimensional filters (lenses) and uses simulated annealing to optimise the output graph. This implementation currently supports only 1D lenses. However, in addition to simulated annealing, it also supports linear and logistic sweeps.
