Dimensionality reduction through Simplified Topological Abstraction of Data
Reason this release was yanked:
python < 3.9 automatically installs this old version, which does not do what the repo shows
Project description
pySTAD
This is a python implementation of STAD for the exploration and visualisation of high-dimensional data. This implementation is based on the R version.
Background
STAD is a dimensionality reduction algorithm that generates an abstract representation of high-dimensional data by giving each data point a location in a graph that preserves the distances in the original high-dimensional space. The STAD graph is built on the Minimum Spanning Tree (MST), to which new edges are added until the correlation between distances in the graph and distances in the original dataset is maximized. Additionally, STAD supports filter functions (lenses) to analyse data from new perspectives, emphasizing traits in the data that would otherwise remain hidden.
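The idea described above can be sketched with numpy and scipy alone. This is only an illustration, not pySTAD's implementation: the function name `stad_sketch`, the parameter `max_extra_edges`, the use of hop counts as graph distance, and the plain linear sweep over the number of added edges are all assumptions made for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.spatial.distance import pdist, squareform


def stad_sketch(points, max_extra_edges=200):
    """Hypothetical helper: MST plus the k shortest remaining edges.

    Returns the k (number of added edges) that maximizes the Pearson
    correlation between hop counts in the graph and the original distances.
    """
    dist = squareform(pdist(points))        # dense pairwise Euclidean distances
    n = dist.shape[0]
    iu = np.triu_indices(n, k=1)            # index every pair exactly once

    # MST edges as a symmetric boolean mask. Note: scipy treats zero
    # distances (duplicate points) as missing edges.
    mst = minimum_spanning_tree(csr_matrix(dist)).toarray() > 0
    mst = mst | mst.T

    # Candidate edges: all non-MST pairs, shortest first.
    candidates = [(i, j) for i, j in zip(*iu) if not mst[i, j]]
    candidates.sort(key=lambda e: dist[e])

    adjacency = mst.astype(float)
    best_k, best_corr = 0, -np.inf
    for k in range(min(max_extra_edges, len(candidates)) + 1):
        if k > 0:
            i, j = candidates[k - 1]
            adjacency[i, j] = adjacency[j, i] = 1.0
        # Graph distance = number of edges on the shortest path (assumption).
        hops = shortest_path(csr_matrix(adjacency), unweighted=True, directed=False)
        corr = np.corrcoef(hops[iu], dist[iu])[0, 1]
        if corr > best_corr:
            best_k, best_corr = k, corr
    return best_k, best_corr


rng = np.random.default_rng(0)
points = rng.normal(size=(40, 2))
k, corr = stad_sketch(points, max_extra_edges=30)
print(f"best number of added edges: {k}, correlation: {corr:.3f}")
```

Recomputing all shortest paths at every step keeps the sketch short but is slow for large datasets; it is meant to show the objective, not an efficient search.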
Topological data analysis
Topological data analysis (TDA) aims to describe the geometric structures present in data. A dataset is interpreted as a point cloud, where each point is sampled from an underlying geometric object. TDA tries to recover and describe the geometry of that object in terms of features that are invariant "under continuous deformations, such as stretching, twisting, crumpling and bending, but not tearing or gluing". Two geometries that can be deformed into each other without tearing or gluing are homeomorphic (for instance, a donut and a coffee mug). Typically, TDA describes the holes in a geometry, formalised as Betti numbers.
Like other TDA algorithms, STAD constructs a graph that describes the structure of the data. However, the output of STAD should be interpreted as a data-visualisation result rather than a topological description of the data's structure. Other TDA algorithms, like Mapper, do produce topological results, but they rely on aggregating the data, whereas STAD encodes the original data points as vertices in a graph.
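For a graph, the first two Betti numbers are easy to compute: b0 is the number of connected components and b1 = E - V + b0 is the number of independent cycles. The snippet below illustrates this with scipy; `graph_betti_numbers` is a hypothetical helper written for this example and is not part of pySTAD.

```python
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def graph_betti_numbers(n_vertices, edges):
    """Return (b0, b1) for an undirected simple graph.

    `edges` is a list of (i, j) vertex pairs, each edge listed once.
    """
    rows = [i for i, _ in edges]
    cols = [j for _, j in edges]
    adjacency = csr_matrix(
        ([1] * len(edges), (rows, cols)), shape=(n_vertices, n_vertices)
    )
    b0, _ = connected_components(adjacency, directed=False)
    b1 = len(edges) - n_vertices + b0   # independent cycles
    return b0, b1


# A path on four vertices is a tree: one component, no cycles.
print(graph_betti_numbers(4, [(0, 1), (1, 2), (2, 3)]))          # (1, 0)
# Closing the path into a ring creates one cycle.
print(graph_betti_numbers(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # (1, 1)
```

A spanning tree always has b1 = 0, and every edge added to a connected graph raises b1 by one, which is why the cycles in a STAD graph come entirely from the edges added on top of the MST.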
Dimensionality reduction
Compared to dimensionality reduction algorithms like t-SNE and UMAP, STAD produces a more flexible description of the data: a graph can be drawn using different layouts, and a user can interact with it. In addition, STAD's projections retain the global structure of the data, although the STAD graph generally tends to underestimate the distances between far-apart data points in the network structure. t-SNE and UMAP, on the other hand, emphasize the relation of data points with their closest neighbors over that with distant data points.
Installation
pySTAD can be installed with:
pip install pystad
This also installs the following dependencies:
- numpy
- scipy
- python-igraph
- pandas
The example notebooks have additional dependencies:
- matplotlib
- networkx
- scikit-learn
- jupyterlab
- ipywidgets
These can be installed with pip or conda. Enabling ipywidgets in jupyter lab takes two more steps:
- First, install nodejs using conda:
conda install -c conda-forge nodejs
- Then install the jupyter lab extension:
jupyter labextension install @jupyter-widgets/jupyterlab-manager
Examples
Please see the example notebooks for demonstrations of STAD and interactive exploration dashboards. The code below provides a quick-start:
import stad
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.sparse import triu
from sklearn.metrics.pairwise import euclidean_distances
# Load the horse dataset and keep a random sample of 500 points
data = pd.read_csv('./examples/data/horse.csv', header=0)
data = data.sample(n=500)
dist = triu(euclidean_distances(data), k = 1)
plt.scatter(data.z, data.y, s=5, c=data.x)
plt.show()
## STAD without lens
network_no_lens, detail = stad.stad(dist)
stad.draw_network_matplotlib(network_no_lens, detail)
plt.show()
stad.draw_correlations_matplotlib(detail)
plt.show()
## STAD with lens
network_lens, detail = stad.stad(dist, lens_values = data['x'], lens_bins = 3)
stad.draw_network_matplotlib(network_lens, detail)
plt.show()
stad.draw_correlations_matplotlib(detail)
plt.show()
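To give an intuition for what a one-dimensional lens with `lens_bins = 3` might do, the sketch below splits the lens values into equal-width bins and marks which pairs of points fall in the same or in adjacent bins. This is an assumption about the general idea of a lens, not a description of pySTAD's internals; `adjacent_bin_mask` is a hypothetical helper, and the library's actual binning and edge handling may differ.

```python
import numpy as np


def adjacent_bin_mask(lens_values, n_bins):
    """Hypothetical helper: True where two points share a bin or sit in
    neighbouring bins of an equal-width split of the lens values."""
    values = np.asarray(lens_values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # Use interior edges only, so bin indices run from 0 to n_bins - 1.
    bins = np.digitize(values, edges[1:-1])
    return np.abs(bins[:, None] - bins[None, :]) <= 1


lens = [0.0, 1.0, 5.0, 9.0, 10.0]
mask = adjacent_bin_mask(lens, n_bins=3)
print(mask[0, 2])  # True: bins 0 and 1 are adjacent
print(mask[0, 3])  # False: bins 0 and 2 are not adjacent
```

Under this reading, an edge between points whose lens values are far apart would be discouraged even if the points are close in the original space, which is how a lens can pull apart structure that the plain distance matrix hides.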
Compared to the R-implementation
The R implementation supports two-dimensional filters (lenses) and uses simulated annealing to optimise the output graph. This implementation currently supports only 1D lenses. In addition to simulated annealing, it also supports linear and logistic sweeps.