
UMAP with GPUs

Project description

GPU Parallelized Uniform Manifold Approximation and Projection (GPUMAP) is the GPU-ported version of the UMAP dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction.

At the moment only CUDA capable GPUs are supported. Due to a dependency on FAISS, only Linux (and potentially MacOS) platforms are supported at the moment.

For further information on UMAP, see the original implementation: https://github.com/lmcinnes/umap/.

How to use GPUMAP

The gpumap package inherits from sklearn classes, and thus drops in neatly next to other sklearn transformers with an identical calling API.

import gpumap
from sklearn.datasets import load_digits

digits = load_digits()

# fit_transform returns one low-dimensional point per input sample
embedding = gpumap.GPUMAP().fit_transform(digits.data)

There are a number of parameters that can be set for the GPUMAP class; the major ones are as follows:

  • n_neighbors: This determines the number of neighboring points used in local approximations of manifold structure. Larger values will result in more global structure being preserved at the loss of detailed local structure. In general this parameter should often be in the range 5 to 50, with a choice of 10 to 15 being a sensible default.

  • min_dist: This controls how tightly the embedding is allowed to compress points together. Larger values ensure embedded points are more evenly distributed, while smaller values allow the algorithm to optimise more accurately with regard to local structure. Sensible values are in the range 0.001 to 0.5, with 0.1 being a reasonable default.
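To build intuition for n_neighbors, the following sketch uses scikit-learn's NearestNeighbors to show the local neighborhoods that UMAP-style methods approximate. gpumap performs this step internally on the GPU via FAISS, so the snippet is illustrative only:

```python
from sklearn.datasets import load_digits
from sklearn.neighbors import NearestNeighbors

digits = load_digits()

# n_neighbors sets the size of each point's local neighborhood;
# larger values fold more global structure into the approximation.
n_neighbors = 15
nn = NearestNeighbors(n_neighbors=n_neighbors).fit(digits.data)
distances, indices = nn.kneighbors(digits.data)

# One row of neighbor indices per sample.
print(indices.shape)
```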

The metric parameter is supported to keep the interface aligned with UMAP; however, setting it to anything other than ‘euclidean’ causes a fallback to the sequential version. Sparse input matrices are likewise unsupported on the GPU and will similarly cause parts of the algorithm to fall back to the sequential version.
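The fallback conditions above can be summarised as a small predicate. Note that uses_gpu_path is a hypothetical helper written for illustration, not part of the gpumap API:

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Hypothetical helper mirroring the fallback rule described above:
# only dense input with the Euclidean metric takes the GPU path.
def uses_gpu_path(X, metric):
    return metric == "euclidean" and not issparse(X)

dense = np.zeros((10, 4))
sparse = csr_matrix(dense)
print(uses_gpu_path(dense, "euclidean"))   # True
print(uses_gpu_path(dense, "manhattan"))   # False
print(uses_gpu_path(sparse, "euclidean"))  # False
```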

Performance and Examples

GPUMAP, like UMAP, is very efficient at embedding large high-dimensional datasets. In particular, it scales well with both input dimension and embedding dimension. Performance depends strongly on the GPU used. For a problem such as the 784-dimensional MNIST digits dataset with 70000 data samples, GPUMAP can complete the embedding in around 30 seconds on an (outdated) NVIDIA GTX 745 graphics card; more recent hardware will scale accordingly. Despite this runtime efficiency, GPUMAP still produces high-quality embeddings.

The obligatory MNIST digits dataset, embedded in 29 seconds using a 3.6 GHz Intel Core i7 processor and an NVIDIA GTX 745 GPU (n_neighbors=10, min_dist=0.001):

GPUMAP embedding of MNIST digits

The MNIST digits dataset is fairly straightforward, however. A better test is the more recent “Fashion MNIST” dataset of images of fashion items (again 70000 data samples in 784 dimensions). GPUMAP produced this embedding in exactly 2 minutes (n_neighbors=5, min_dist=0.1):

GPUMAP embedding of "Fashion MNIST"

Installing

GPUMAP has the same dependencies as UMAP, namely scikit-learn, numpy, scipy, and numba. GPUMAP adds a requirement for faiss to perform nearest-neighbor search on the GPU.

Requirements:

  • scikit-learn

  • (numpy)

  • (scipy)

  • numba

  • faiss
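Before installing, you can check which of these dependencies are already importable in your environment (a small sketch; faiss will typically be missing until installed via one of the options below):

```python
import importlib.util

# Report which GPUMAP dependencies the current environment provides.
for pkg in ("sklearn", "numpy", "scipy", "numba", "faiss"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'found' if found else 'missing'}")
```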

Install Options

GPUMAP can be installed via Conda, via PyPI, or from source:

Option 1: Conda

Set up a new conda environment, if needed.

conda create -n env

conda activate env

conda install python

Install the dependencies: Numba, scikit-learn, and FAISS

conda install numba
conda install scikit-learn

conda install faiss-gpu cudatoolkit=10.0 -c pytorch # For CUDA10
# For older CUDA versions:
# conda install faiss-gpu cudatoolkit=8.0 -c pytorch # For CUDA8
# conda install faiss-gpu cudatoolkit=9.0 -c pytorch # For CUDA9

conda install -c conda-forge gpumap

Option 2: PyPI

GPUMAP is also available as a PyPI package.

pip install scikit-learn numba faiss gpumap

Note that the prebuilt FAISS library is not officially supported by upstream.

Option 3: Build

Building from source is easy: clone the repository (or get the code onto your computer by other means) and run the installer with:

python setup.py install

Note that the dependencies need to be installed beforehand: FAISS (https://github.com/facebookresearch/faiss/blob/master/INSTALL.md) and Numba (http://numba.pydata.org/numba-doc/latest/user/installing.html).

License

The gpumap package is based on the umap package and thus is also 3-clause BSD licensed.

Contributing

Contributions are always welcome! Fork away!

