
NeuralEE: a GPU-accelerated elastic embedding dimensionality reduction method for visualization of large-scale scRNA-seq data

Project description

Flowchart of NeuralEE

This is a usable implementation of NeuralEE.

  1. The dataset loading and preprocessing module is adapted from scVI v0.3.0.

  2. Defines the NeuralEE class and some auxiliary functions, mainly for CUDA computation; exceptions such as the entropic affinity calculation run considerably faster on the CPU.

  3. The general elastic embedding algorithm on CUDA is implemented based on the MATLAB code by Max Vladymyrov.

  4. Adds several demo notebooks to help reproduce the results.

Installation

  1. Install Python 3.7.

  2. Install PyTorch. If you have an NVIDIA GPU, be sure to install a version of PyTorch that supports it. NeuralEE runs much faster with a discrete GPU.

  3. Install NeuralEE through pip or from GitHub:

pip install neuralee

Or from GitHub:

git clone https://github.com/HiBearME/NeuralEE.git
cd NeuralEE
python setup.py install --user

How to use NeuralEE

  1. Data Loading

Our dataset loading and preprocessing module is based on scVI v0.3.0, so downloading online datasets and loading generic datasets work the same way as in scVI v0.3.0.

For example, load the online CORTEX Dataset, which consists of 3,005 mouse cortex cells profiled with the Smart-seq2 protocol. To facilitate comparison with other methods, we use a filtered set of 558 highly variable genes, as in the original paper.

from neuralee.dataset import CortexDataset
dataset = CortexDataset(save_path='../data/')
dataset.log_shift()
dataset.subsample_genes(558)
dataset.standardscale()
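The three helpers above correspond to standard scRNA-seq preprocessing steps. As a rough NumPy illustration of the same pipeline (not neuralee's internal code; in particular, ranking genes by log-space variance is only an assumption about what subsample_genes does):

```python
import numpy as np

def preprocess(counts, n_genes=558):
    """Log-shift, keep the most variable genes, then z-score each gene."""
    x = np.log(1 + counts)                      # log_shift
    order = np.argsort(x.var(axis=0))[::-1]     # rank genes by variance
    x = x[:, order[:n_genes]]                   # subsample_genes
    mu, sd = x.mean(axis=0), x.std(axis=0)
    return (x - mu) / np.where(sd == 0, 1, sd)  # standardscale

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(100, 2000)).astype(float)  # toy count matrix
x = preprocess(counts)                                     # shape (100, 558)
```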

Load the h5ad file for the BRAIN-LARGE Dataset, which consists of 1.3 million mouse brain cells and has already been preprocessed and reduced to 50 principal components.

from neuralee.dataset import GeneExpressionDataset
import anndata
adata = anndata.read_h5ad('../genomics_zheng17_50pcs.h5ad') # Your own local dataset is needed.
dataset = GeneExpressionDataset(adata.X)

For other generic datasets, it’s also convenient to use GeneExpressionDataset to load them.

  2. Embedding

There are a number of parameters that can be set for the NeuralEE class; the major ones are as follows:

  • d: This determines the dimension of the embedding space, with 2 being the default.

  • lam: This determines the trade-off parameter of the EE objective function. Larger values make the embedded points more spread out. This parameter should be non-negative, with 1.0 being the default.

  • perplexity: This determines the perplexity parameter for the calculation of the attractive matrix. It plays the same role as in t-SNE, with 30.0 being the default.

  • N_small: This determines the batch size for stochastic optimization. Larger values give a more accurate approximation of the original EE objective function, but need more memory to store the attractive and repulsive matrices and more time for optimization. It can be given as an integer or as a percentage, with 1.0 being the default, which means stochastic optimization is not applied. We recommend stochastic optimization only when necessary, such as on the BRAIN-LARGE Dataset, whose original attractive and repulsive matrices are too large for a normal computer to store; without stochastic optimization it can run out of memory.

  • maxit: This determines the maximum number of optimization iterations. Larger values make the embedded points more stable and convergent, but take longer, with 500 being the default.

  • pin_memory: This determines whether to transfer all matrices to the GPU at once if a GPU is available, with True being the default. If True, it saves a lot of time otherwise spent transferring data from CPU memory to GPU memory in each iteration; but if your GPU memory is limited, it must be set to False, in which case each iteration's matrices are transferred to the GPU at the beginning of the iteration and freed at the end.
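For reference, the objective that lam trades off can be written down directly: the elastic embedding energy (following Carreira-Perpiñán's formulation, on whose MATLAB code this project builds) is an attractive term plus lam times a repulsive term. A minimal NumPy sketch, where Wp and Wn stand in for the attractive and repulsive matrices:

```python
import numpy as np

def ee_objective(X, Wp, Wn, lam=1.0):
    """Elastic embedding energy: Wp-weighted squared distances pull
    neighbors together; lam times the Wn-weighted Gaussian term
    pushes all points apart."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise sq. distances
    return np.sum(Wp * D2) + lam * np.sum(Wn * np.exp(-D2))

rng = np.random.default_rng(0)
n = 20
Wp = rng.random((n, n))
Wp = (Wp + Wp.T) / 2                 # symmetric attractive weights
np.fill_diagonal(Wp, 0)
Wn = np.ones((n, n))                 # toy repulsive weights
np.fill_diagonal(Wn, 0)
X = rng.standard_normal((n, 2))      # a candidate 2-d embedding (d=2)
e1 = ee_objective(X, Wp, Wn, lam=1.0)
e10 = ee_objective(X, Wp, Wn, lam=10.0)
```

Because the repulsive term is strictly positive, increasing lam increases the energy of any fixed embedding, which is why larger lam values push the optimizer toward more spread-out layouts.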

The embedding steps are as follows:

a). Calculate attractive and repulsive matrices for the dataset.

If EE or NeuralEE without stochastic optimization will be used, the matrices can be calculated as:

dataset.affinity(perplexity=30.0)
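Entropic affinities are calibrated per point so that the conditional distribution over that point's neighbors has a fixed perplexity, exactly as in t-SNE. A self-contained sketch of that calibration for a single point (an illustration of the idea, not neuralee's implementation):

```python
import numpy as np

def entropic_affinity_row(d2, perplexity=30.0, tol=1e-5, max_iter=64):
    """Binary-search the bandwidth beta so that p_j ~ exp(-beta * d2_j)
    over one point's neighbors has entropy log(perplexity), as in t-SNE."""
    target = np.log(perplexity)
    lo, hi = 0.0, np.inf
    beta = 1.0
    for _ in range(max_iter):
        p = np.exp(-beta * d2)
        p /= p.sum()
        h = -np.sum(p * np.log(p + 1e-12))  # Shannon entropy
        if abs(h - target) < tol:
            break
        if h > target:      # too flat: sharpen by increasing beta
            lo = beta
            beta = beta * 2 if np.isinf(hi) else (beta + hi) / 2
        else:               # too peaked: flatten by decreasing beta
            hi = beta
            beta = (beta + lo) / 2
    return p

rng = np.random.default_rng(0)
d2 = rng.random(200) * 10.0        # squared distances to 200 neighbors
p = entropic_affinity_row(d2)      # affinities with perplexity ~ 30
```

This per-point search is cheap but branchy, which is consistent with the note above that the entropic affinity calculation is faster on the CPU.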

If NeuralEE with stochastic optimization will be used, for example with 10% of the samples in each batch, they can be calculated as:

dataset.affinity_split(perplexity=30.0, N_small=0.1, verbose=True)
# verbose=True determines whether to show the progress of calculation.
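The splitting itself amounts to partitioning shuffled sample indices into batches of N_small samples, so each batch only needs its own much smaller affinity matrices. A hypothetical sketch of that partitioning (split_batches is not a neuralee function, and neuralee's exact batching scheme may differ):

```python
import numpy as np

def split_batches(n_samples, n_small, seed=0):
    """Shuffle sample indices and split them into batches of size
    n_small (given as a fraction of the dataset or an absolute size)."""
    if isinstance(n_small, float):
        n_small = max(1, int(n_small * n_samples))
    idx = np.random.default_rng(seed).permutation(n_samples)
    return [idx[i:i + n_small] for i in range(0, n_samples, n_small)]

# e.g. 10% batches for a 3,005-cell dataset such as CORTEX
batches = split_batches(3005, 0.1)
```

Each batch's attractive and repulsive matrices are then only N_small x N_small, which is what makes the 1.3-million-cell BRAIN-LARGE dataset tractable.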

b). Initialize NeuralEE class.

import torch
from neuralee.embedding import NeuralEE

# detect whether a GPU is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
NEE = NeuralEE(dataset, d=2, lam=1, device=device)

c). Embedding.

If EE will be used, it can be run as:

results_EE = NEE.EE(maxit=50)

If NeuralEE will be used, it can be run as:

results_NeuralEE = NEE.fine_tune(maxit=50, verbose=True, pin_memory=False)

For reproduction of the original paper's results, see the Jupyter notebook files.

Examples

  1. HEMATO

HEMATO Dataset includes 4,016 cells, and provides a snapshot of hematopoietic progenitor cells differentiating into various lineages.

This dataset is quite small, so we directly apply NeuralEE.EE with (lam=10, perplexity=30). It finishes in several minutes on a CPU, and in several seconds on a GPU.

EE of HEMATO
  2. RETINA

RETINA Dataset includes 27,499 mouse retinal bipolar neurons. Cluster annotations use the 15 cell types from the original paper.

This dataset is of moderate size, and EE on a CPU can finish in several hours. On a normal GPU with 11 GB of memory, however, NeuralEE without stochastic optimization can finish in about 3 minutes, and on a GPU with limited memory, NeuralEE with (N_small=0.5, pin_memory=True) can finish in about 2 minutes. The following embedding shows the result of NeuralEE with (lam=10, perplexity=30, N_small=0.5, pin_memory=True).

NeuralEE of RETINA

To reproduce this, see the Jupyter notebook for the RETINA dataset.

  3. BRAIN-LARGE

BRAIN-LARGE Dataset consists of 1.3 million mouse brain cells, clustered by the Louvain algorithm.

This dataset is quite large, so it is very difficult to apply EE. Instead, we apply NeuralEE with (lam=10, perplexity=30, N_small=0.5, maxit=50, pin_memory=False) on a normal GPU with 11 GB of memory (with pin_memory set to False, it also works on a GPU with limited memory and uses less than 1 GB of GPU memory). It needs at least 64 GB of CPU memory to store the data, and it can finish in less than half an hour.

NeuralEE of BRAIN LARGE

To reproduce this, see the Jupyter notebook for the BRAIN-LARGE dataset.

