Dimensionality reduction through Simplified Topological Abstraction of Data
Project description
pySTAD
This is a python implementation of STAD for exploration and visualisation of highdimensional data. This implementation is based on the R version.
Background
STAD is a dimensionality reduction algorithm, that generates an abstract representation of highdimensional data by giving each data point a location in a graph which preserves the distances in the original highdimensional space. The STAD graph is built upon the Minimum Spanning Tree (MST) to which new edges are added until the correlation between the graph and the original dataset is maximized. Additionally, STAD supports the inclusion of filter functions to analyse data from new perspectives, emphasizing traits in data which otherwise would remain hidden.
Topological Data analysis
Topological data analysis (TDA) aims to describe the geometric structures present in data. A dataset is interpreted as a pointcloud, where each point is sampled from an underlying geometric object. TDA tries to recover and describe the geometry of that object in terms of features that are invariant "under continuous deformations, such as stretching, twisting, crumpling and bending, but not tearing or gluing". Two geometries that can be deformed into each other without tearing or glueing are homeomorphic (for instance a donut and coffee mug). Typically, TDA describes the holes in a geometry, formalised as Betti numbers.
Like other TDA algorithms, STAD constructs a graph that describes the structure of the data. However, the output of STAD should be interpreted as a datavisualisation result, rather than a topological description of the data's structure. Other TDA algorithms, like mapper, do produce topological results. However, they rely on aggregating the data, whereas STAD encodes the original data points as vertices in a graph.
Dimensionality reduction
Compared to dimensionality reduction algorithms like, tSNE and UMAP, the STAD produces a more flexible description of the data. A graph can be drawn using different layouts and a user can interact with it. In addition, STAD's projections retain the global structure of the data. In general, the STAD graph tends to underestimate distant datapoints in the network structure. On the other hand, tSNE and UMAP emphasize the relation of datapoints with their closest neighbors over that with distant datapoints.
Installation
pySTAD can be installed with pip:
pip install pystad
or compiled from source (see development/Development.md
for instructions)
How to use pySTAD
From the commandline
pySTAD has a __main__
entrypoint which can be called using:
python m stad help
or stad help
from the commandline. These
entrypoints take a distance matrix in the form of a .csv
file and print the
resulting network as a JSON string to stdout. Some information of the network is
logged to stderr, including the number of added edges and the correlation of the
networkdistances with the original distances.
From within python
pySTAD is the most versatile when used within python. Three basic examples are shown below and the example jupyterlab notebooks can be explored on binder without installing pySTAD on your machine.
Example 1
Most basic use of pySTAD using the default options.
import stad as sd import numpy as np import networkx as nx import pandas as pd import matplotlib.pyplot as plt from scipy.spatial.distance import pdist # Load a dataset data = pd.read_csv('./data/five_circles.csv', header=0) condensed_distances = pdist(data[['x', 'y']], 'euclidean') # Show the data in 2D plt.scatter(data.x, data.y, s=5, c=data.x) plt.show() ## Compute stad network, sweep = sd.stad(condensed_distances) sd.plot.network(network, layout='kk', node_color=data['x']) plt.show() # Show the correlation trace sd.plot.sweep(condensed_distances, sweep) plt.show()
Example 2
Use a lens / filter to highlight some property of the data.
import stad as sd import numpy as np import networkx as nx import pandas as pd import matplotlib.pyplot as plt from scipy.spatial.distance import pdist # Load a dataset data = pd.read_csv('./data/five_circles.csv', header=0) condensed_distances = pdist(data[['x', 'y']], 'euclidean') # Show the dataset in 2D plt.scatter(data.x, data.y, s=5, c=data.x) plt.show() # Run stad with a lens lens = sd.Lens(data['x'].to_numpy(), n_bins=3) network, sweep = sd.stad(condensed_distances, lens=lens) # Show which edges cross filtersegment boundaries edge_color = np.where(lens.adjacent_edges[sweep.network_mask], '#f33', '#ddd') sd.plot.network(network, layout='kk', edge_color=edge_color, node_color=data['x']) plt.show() # Show the correlation trace sd.plot.sweep(condensed_distances, sweep) plt.show()
Example 3
Explore the resulting network interactively in jupyterlab.
import stad as sd import pandas as pd import numpy as np import matplotlib.pyplot as plt import matplotlib as mpl import ipywidgets as widgets from scipy.spatial.distance import pdist # Load data, compute distances, show 2d projection data = pd.read_csv('./data/horse.csv') idx = np.random.choice(data.shape[0], 500, replace=False) data = data.iloc[idx, :] dist = pdist(data, 'euclidean') plt.scatter(data.z, data.y, s=5, c=data.z) plt.show() ## Compute stad without lens network, sweep = sd.stad(dist, sweep=sd.ThresholdDistance(0.11)) w = sd.Widget() w
# show() calls only work after the frontend of the widget is instantiated. # so they have to be in a cell below the cell that outputs the widget. w.show(network, node_color=data['z'])
Compared to the Rimplementation
The R implementation supports 2dimensional filters (lenses) and uses Simulated Annealing to optimise the output graph. This implementation currently only supports 1D lenses. In addition, this implementation uses a logistic sweep on the number of edges in the network by default, but still supports optimization functions such as simulated annealing.
This implementation is optimised using Cython and OpenMP, resulting shorter computation times compared to the R implementation.
The R implementation uses a MST refinement procedure when using a lens / filter, as described in the paper. This implementation just uses the MST. The refinement procedure depends on community detection to remove edges between different groups of datapoints within the same filter segment, which is a process that requires finetuning per dataset. When communities are not detected correctly, edges between distinct groups of datapoints within a filter segment remain in the network, obscuring the patterns the filter should expose.
How to cite
@TODO create DOI for software releases
Please cite our papers when using this software:
APA:
Alcaide, D., & Aerts, J. (2020). Spanning Trees as Approximation of Data
Structures. IEEE Transactions on Visualization and Computer Graphics.
https://doi.org/10.1109/TVCG.2020.2995465
Bibtex:
@article{alcaide2020spanning,
title={Spanning Trees as Approximation of Data Structures},
author={Alcaide, Daniel and Aerts, Jan},
journal={IEEE Transactions on Visualization and Computer Graphics},
year={2020},
publisher={IEEE},
doi = {10.1109/TVCG.2020.2995465},
}
and for the STADR variant:
APA:
Alcaide, D., & Aerts, J. (2021). A visual analytic approach for the
identification of ICU patient subpopulations using ICD diagnostic codes.
PeerJ Computer Science, 7, e430.
https://doi.org/10.7717/peerjcs.430
Bibtex:
@article{alcaide2021visual,
title={A visual analytic approach for the identification of ICU patient subpopulations using ICD diagnostic codes},
author={Alcaide, Daniel and Aerts, Jan},
journal={PeerJ Computer Science},
volume={7},
pages={e430},
year={2021},
publisher={PeerJ Inc.}
doi = {10.7717/peerjcs.430}
}
Project details
Release history Release notifications  RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Filename, size  File type  Python version  Upload date  Hashes 

Filename, size pySTAD0.2.10.tar.gz (6.0 MB)  File type Source  Python version None  Upload date  Hashes View 
Filename, size pySTAD0.2.10cp39cp39win_amd64.whl (4.1 MB)  File type Wheel  Python version cp39  Upload date  Hashes View 
Hashes for pySTAD0.2.10cp39cp39win_amd64.whl
Algorithm  Hash digest  

SHA256  013f1cb2c6bbecf8222b868edcf960427eb2998370a0f4775366a50767ae115c 

MD5  efca945409b733a0012ce076ed6c9dd6 

BLAKE2256  8513f15ce7e44179a18b8bfb1f884f3f1d811ed2596f649f91c834b4edde1e1a 