A package for dimensionality reduction of probability distributions

WassersteinTSNE

This package provides the methods described in the Wasserstein t-SNE paper at www.arXiv.org/WassersteinTSNE.

To reproduce the figures in the paper, please also check the repository wassersteinTSNE-paper, which uses a previous version of this package.

Installation

You can install WassersteinTSNE via

pip install WassersteinTSNE

or clone this repository into your working directory.

Basic Usage

You may import the package in either of these ways

import WassersteinTSNE as WT
from WassersteinTSNE import TSNE

Data

The data should be provided in either of two ways:

  1. As a pd.DataFrame whose index indicates which unit each sample belongs to
  2. As an np.ndarray in which each row corresponds to a sample, together with a list of unit ids
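
As an illustration of the two layouts, here is a minimal sketch that builds the same toy data in both forms. The unit names, column names and sizes are made up for this example, and how the array and the id list are passed to the package is not shown because the source does not specify it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 units with 4 samples each and 2 features per sample.
unit_ids = np.repeat(["unit_a", "unit_b", "unit_c"], 4)
samples = rng.normal(size=(12, 2))

# Layout 1: a DataFrame whose index holds the unit id of every sample (row).
dataset = pd.DataFrame(samples, index=unit_ids, columns=["feature_0", "feature_1"])

# Layout 2: the raw array plus a parallel list of unit ids, one entry per row.
array_form = dataset.to_numpy()
id_list = dataset.index.tolist()

# Sanity check: number of samples recorded for each unit.
samples_per_unit = dataset.groupby(level=0).size()
```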

If you don't have a dataset at hand you can generate a toy dataset by running

dataset, HGM = WT.ToyDataset()

or create a random hierarchical Gaussian mixture model (HGMM)

HGM = WT.HierarchicalGaussianMixture(seed=67)
dataset = HGM.generate_data()

By default, this creates an HGMM with K=4 classes and returns a pd.DataFrame with N=100 units of M=30 samples each. If each sample has F=2 features (as in this example), you can visualize the generated HGMM with
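
To make the roles of K, N, M and F concrete, the following is a simplified sketch of a two-level Gaussian model written with NumPy only. It is not the generator used by WT.HierarchicalGaussianMixture: the function name, the way unit means and covariances are drawn, and all scale constants are assumptions chosen for illustration.

```python
import numpy as np

def sample_hierarchical_mixture(K=4, N=100, M=30, F=2, seed=0):
    """Draw N units of M samples each from a simplified two-level Gaussian model."""
    rng = np.random.default_rng(seed)
    class_means = rng.normal(scale=5.0, size=(K, F))   # one centre per class
    class_scales = rng.uniform(0.5, 2.0, size=K)       # one spread per class
    unit_class = rng.integers(0, K, size=N)            # class label of every unit

    samples = np.empty((N * M, F))
    unit_ids = np.repeat(np.arange(N), M)              # unit id of every sample row
    for u in range(N):
        k = unit_class[u]
        # Each unit gets its own Gaussian, scattered around its class parameters.
        unit_mean = class_means[k] + rng.normal(scale=1.0, size=F)
        unit_cov = (class_scales[k] * rng.uniform(0.8, 1.2)) ** 2 * np.eye(F)
        samples[u * M:(u + 1) * M] = rng.multivariate_normal(unit_mean, unit_cov, size=M)
    return samples, unit_ids, unit_class

samples, unit_ids, unit_class = sample_hierarchical_mixture()
```

With the defaults, samples has N*M = 3000 rows and F = 2 columns, and unit_ids assigns each row to one of the 100 units.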

WT.plotMixture(HGM)

A visualization of the two-dimensional HGMM

Gaussian Wasserstein t-SNE

The straightforward way to embed your hierarchical dataset is

embedding = WT.TSNE(dataset, seed=67, w=0.5)

or you can run the procedure step by step with

Gaussians = WT.Dataset2Gaussians(dataset)
GWD       = WT.GaussianWassersteinDistance(Gaussians)
embedding = WT.ComputeTSNE(GWD.matrix(w=0.5), seed=67)
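
The first of these steps summarizes every unit by a Gaussian. A minimal sketch of that idea, assuming each unit is described by its sample mean and its unbiased sample covariance, could look like this. The helper fit_unit_gaussians is hypothetical and is not the implementation of WT.Dataset2Gaussians.

```python
import numpy as np

def fit_unit_gaussians(samples, unit_ids):
    """Return {unit id: (mean vector, covariance matrix)} estimated per unit."""
    samples = np.asarray(samples, dtype=float)
    unit_ids = np.asarray(unit_ids)
    gaussians = {}
    for unit in np.unique(unit_ids):
        X = samples[unit_ids == unit]        # all samples of this unit
        mean = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)        # unbiased estimate; needs >= 2 samples
        gaussians[unit] = (mean, cov)
    return gaussians

# Tiny example: unit "a" has four samples, unit "b" has two.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0],
              [10.0, 10.0], [12.0, 10.0]])
ids = ["a", "a", "a", "a", "b", "b"]
gaussians = fit_unit_gaussians(X, ids)
```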

This is built on openTSNE, with the addition that all embeddings are returned as a pd.DataFrame. The embeddings can be visualized with

WT.embedScatter(embedding, title='DemoEmbedding')

If you have defined classes, you can pass a dictionary that maps the unit ids to their class

WT.embedScatter(embedding, labeldict=HGM.labeldict())

to color the units according to their class.

A Gaussian Wasserstein t-SNE embedding of the HGMM

By adjusting the hyperparameter w you can put emphasis on the means or covariance matrices of the units. With

D = GWD.matrix(w=0.7)

you can obtain the distance matrix for any value of w. To visualize a range of matrices you may call

WT.plotMatrices([GWD.matrix(w=w) for w in WT.naming.keys()], WT.naming.values())
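
For intuition about what w trades off, here is a sketch of a weighted distance between two Gaussians based on the closed-form 2-Wasserstein distance. It assumes that w weights the covariance term and 1 - w weights the mean term; the exact convention and scaling used by GWD.matrix are not stated here, so treat that choice as an assumption. The helper names are hypothetical.

```python
import numpy as np

def psd_sqrt(A):
    """Square root of a symmetric positive semi-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.clip(vals, 0.0, None)              # remove tiny negative round-off
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_wasserstein(mean1, cov1, mean2, cov2, w=0.5):
    """Weighted distance: (1 - w) * mean term + w * covariance term (assumed convention)."""
    mean1, mean2 = np.asarray(mean1, dtype=float), np.asarray(mean2, dtype=float)
    cov1, cov2 = np.asarray(cov1, dtype=float), np.asarray(cov2, dtype=float)
    mean_term = np.sum((mean1 - mean2) ** 2)     # squared Euclidean distance of the means
    root1 = psd_sqrt(cov1)
    cross = psd_sqrt(root1 @ cov2 @ root1)
    cov_term = max(np.trace(cov1 + cov2 - 2.0 * cross), 0.0)   # guard against round-off
    return float(np.sqrt((1.0 - w) * mean_term + w * cov_term))

mean_a, cov_a = np.zeros(2), np.eye(2)
mean_b, cov_b = np.array([3.0, 4.0]), 4.0 * np.eye(2)
d_means_only = gaussian_wasserstein(mean_a, cov_a, mean_b, cov_b, w=0.0)
d_covs_only = gaussian_wasserstein(mean_a, cov_a, mean_b, cov_b, w=1.0)
d_balanced = gaussian_wasserstein(mean_a, cov_a, mean_b, cov_b, w=0.5)
```

Under this convention, w=0 compares only the means, w=1 compares only the covariance matrices, and w=0.5 is proportional to the ordinary 2-Wasserstein distance between the two Gaussians.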

Exact Wasserstein Distances

The exact Wasserstein distances of a dataset can be computed as well. Depending on the number of units, this can take some time. For the dataset in WT.ToyDataset(), however, computing the pairwise distance matrix should take less than 8 min on a desktop computer by running

D = WT.WassersteinDistanceMatrix(dataset)

This yields the NxN distance matrix as a pd.DataFrame which can then be embedded with

embedding = WT.ComputeTSNE(D)
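
The cost comes from solving one optimal transport problem per pair of units, which for N=100 units means 4950 problems. As a rough sketch of a single such problem, assume both units have the same number of samples with uniform weights, as in the toy dataset. In that case the optimal transport plan is a one-to-one matching, which SciPy's linear_sum_assignment can find. The function names are hypothetical, and the solver used inside WT.WassersteinDistanceMatrix may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def exact_wasserstein(X, Y):
    """2-Wasserstein distance between two equally sized, uniformly weighted samples."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    if len(X) != len(Y):
        raise ValueError("this sketch only handles units with the same number of samples")
    # cost[i, j] = squared Euclidean distance between X[i] and Y[j]
    cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)     # optimal one-to-one matching
    return float(np.sqrt(cost[rows, cols].mean()))

def pairwise_matrix(units):
    """Symmetric matrix of exact distances for a list of sample arrays."""
    n = len(units)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = exact_wasserstein(units[i], units[j])
    return D

unit_a = np.array([[0.0, 0.0], [1.0, 0.0]])
unit_b = np.array([[0.0, 3.0], [1.0, 3.0]])      # unit_a shifted by 3 along the second axis
D_example = pairwise_matrix([unit_a, unit_b])
```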

A shortcut for this procedure is provided with

embedding = WT.TSNE(dataset, method='exact')

Evaluation

We implemented two methods to evaluate the distance matrix of a hierarchical dataset. For both of them it is necessary to have the ground truth available as a dict() or as a list of labels.

labels = HGM.labeldict()

kNN Accuracy

The kNN accuracy computes the kNN graph of the t-SNE embedding and labels each point by the majority vote of its k nearest neighbors. Using the true labels, the accuracy is then computed with

WT.knnAccuracy(embedding, labels)
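
The idea behind this score can be sketched in a few lines of NumPy. The version below is a leave-one-out variant with k=5 as an assumed default and ties broken in favour of the nearest neighbour's label; the value of k and the tie-breaking rule in WT.knnAccuracy are not documented here, and the function name knn_accuracy is hypothetical.

```python
from collections import Counter

import numpy as np

def knn_accuracy(points, labels, k=5):
    """Leave-one-out kNN accuracy of `labels` on the rows of `points`."""
    points = np.asarray(points, dtype=float)
    labels = list(labels)
    # Pairwise Euclidean distances; a point is never its own neighbour.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    correct = 0
    for i, row in enumerate(dists):
        neighbours = np.argsort(row)[:k]                 # indices of the k closest points
        votes = Counter(labels[j] for j in neighbours)
        predicted = votes.most_common(1)[0][0]           # ties go to the label seen first
        correct += predicted == labels[i]
    return correct / len(labels)

# Two well-separated groups of four points each.
points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11]]
point_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
accuracy = knn_accuracy(points, point_labels, k=3)
```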

Leiden clustering

A t-SNE-independent evaluation is provided by the Leiden algorithm, which runs directly on the distance matrix.

WT.LeidenClusters(D, labels)

Acknowledgements

...
