Marker selection library for single-cell RNA-seq data

Project description

MarkerMap

MarkerMap is a generative model for selecting the most informative gene markers by projecting cells into a shared, interpretable embedding without sacrificing accuracy.

Table of Contents

  1. Installation
    1. MacOS
    2. Windows
  2. Quick Start
    1. Simple Example
    2. Benchmark Example
  3. Features
  4. For Developers
  5. License

Installation

MacOS

  • Clone the repository git clone https://github.com/Computational-Morphogenomics-Group/MarkerMap.git
  • Navigate to the MarkerMap directory cd MarkerMap
  • Locally install the package pip install -e . (you may have to use pip3 if your system has both Python 2 and Python 3 installed)
  • You might have to install libomp with homebrew, brew install libomp

Windows

  • Coming soon!

Quick Start

Simple Example

Copy the code from below, or take a look at scripts/quick_start.py for a python script or notebooks/quick_start.ipynb for a Jupyter Notebook.

Imports

Data is handled with numpy and with scanpy, a library used for many computational biology datasets.

import numpy as np
import scanpy as sc

from markermap.vae_models import MarkerMap, train_model
from markermap.utils import (
    new_model_metrics,
    parse_adata,
    plot_confusion_matrix,
    split_data_into_dataloaders,
)

Set Parameters

Define some parameters that we will use when creating the MarkerMap.

  • z_size is the dimension of the latent space in the variational auto-encoder. We always use 16.
  • hidden_layer_size is the dimension of the hidden layers in the auto-encoder that come before and after the latent space layer. This is dependent on the data; a good rule of thumb is ~10% of the dimension of the input data. For the CITE-seq data, which has 500 columns, we will use 64.
  • k is the number of markers to extract
  • Set the file_path to wherever your data is
z_size = 16
hidden_layer_size = 64
k = 50

file_path = 'data/cite_seq/CITEseq.h5ad'

Data

Set file_path to wherever your data is located. We then read in the data using scanpy and break it into X and y using the parse_adata function. The text labels in adata.obs['annotation'] will be converted to number labels so that MarkerMap can use them properly.

We then split the data into training, validation, and test sets with a 70%, 10%, 20% split. MarkerMap uses a validation set during the training process.

file_path = '../data/cite_seq/CITEseq.h5ad'

adata = sc.read_h5ad(file_path)
adata.obs['annotation'] = adata.obs['names']
X, y, encoder = parse_adata(adata)

train_dataloader, val_dataloader, _, train_indices, val_indices, test_indices = split_data_into_dataloaders(
    X,
    y,
    train_size=0.7,
    val_size=0.1,
)
X.shape
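Under the hood, the 70%/10%/20% split comes down to partitioning the row indices of X. Here is a minimal, illustrative numpy sketch of that partition (not the library's actual implementation; split_data_into_dataloaders also wraps the splits in dataloaders):

```python
import numpy as np

def split_indices(n, train_size=0.7, val_size=0.1, seed=0):
    """Partition [0, n) into shuffled, disjoint train/val/test index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(train_size * n)
    n_val = int(val_size * n)
    train = perm[:n_train]
    val = perm[n_train:n_train + n_val]
    test = perm[n_train + n_val:]  # the remaining ~20%
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 100 200
```

Every row lands in exactly one of the three sets, which is what lets the test set stay untouched during training.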

Define and Train the Model

Now it is time to define the MarkerMap. There are many hyperparameters that can be tuned here, but the most important are k and loss_tradeoff. Choosing k may require some domain knowledge, but it is fairly easy to benchmark across different levels of k, as we will see in the later examples. The loss_tradeoff parameter is also important; see the paper for a further discussion. In general, we use 3 levels: 0 (supervised only), 0.5 (mixed supervised-unsupervised), and 1 (unsupervised only). This step may take a couple of minutes.

supervised_marker_map = MarkerMap(X.shape[1], hidden_layer_size, z_size, len(np.unique(y)), k, loss_tradeoff=0)
train_model(supervised_marker_map, train_dataloader, val_dataloader)
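The intuition behind loss_tradeoff can be pictured as a convex combination of the unsupervised (reconstruction) loss and the supervised (classification) loss. This is only an illustrative sketch of the idea, not MarkerMap's actual training code:

```python
def combined_loss(recon_loss, classification_loss, loss_tradeoff):
    """Blend the two objectives: 0 -> purely supervised, 1 -> purely unsupervised."""
    return loss_tradeoff * recon_loss + (1 - loss_tradeoff) * classification_loss

print(combined_loss(2.0, 0.5, 0.0))  # 0.5  -> only the classification loss matters
print(combined_loss(2.0, 0.5, 1.0))  # 2.0  -> only the reconstruction loss matters
print(combined_loss(2.0, 0.5, 0.5))  # 1.25 -> an even mix of both
```

With loss_tradeoff=0, as above, the model selects markers purely for their ability to predict the cell-type labels.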

Evaluate the model

Finally, we test the model. The new_model_metrics function trains a simple model, such as a RandomForestClassifier, on the training data restricted to the k markers, and then evaluates it on the test data. We then print the misclassification rate and the f1-score, and plot a confusion matrix.

misclass_rate, test_rep, cm = new_model_metrics(
    X[np.concatenate([train_indices, val_indices]), :],
    y[np.concatenate([train_indices, val_indices])],
    X[test_indices, :],
    y[test_indices],
    markers = supervised_marker_map.markers().clone().cpu().detach().numpy(),
)

print(misclass_rate)
print(test_rep['weighted avg']['f1-score'])
plot_confusion_matrix(cm, encoder.classes_)
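The confusion matrix itself is just a table of counts of (true label, predicted label) pairs, and the misclassification rate is one minus the fraction on its diagonal. A self-contained numpy sketch of these two quantities (illustrative, not the package's implementation):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i, j] counts the samples with true label i that were predicted as j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2])
cm = confusion_matrix(y_true, y_pred, 3)

# Correct predictions sit on the diagonal; everything else is an error.
misclass_rate = 1 - np.trace(cm) / cm.sum()
print(misclass_rate)  # 0.2 -> one of the five samples was misclassified
```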

Benchmark Example

Now we will do an example where we use the benchmarking tools of the package. Follow the steps from the Simple Example through the Data section, then pick up here. Alternatively, check out scripts/quick_start_benchmark.py for a python script or notebooks/quick_start_benchmark.ipynb for a Jupyter Notebook.

Define the Models

Now it is time to define all the models that we are benchmarking. For this tutorial, we will benchmark the three versions of MarkerMap: Supervised, Mixed, and Unsupervised. Each model in this repository comes with a getBenchmarker function where we specify all the parameters used for constructing the model and all the parameters used for training it. The benchmark function will then run and evaluate them all. We will also pass a training argument, max_epochs, which limits the number of epochs during training; make sure max_epochs and k_range are defined before running the code below, and that the benchmark, plot_benchmarks, and RandomBaseline helpers are imported from the markermap package.

k_range = [10, 25, 50]  # the k values we benchmark over
max_epochs = 100  # example cap on training epochs; adjust for your hardware

supervised_marker_map = MarkerMap.getBenchmarker(
  create_kwargs = {
    'input_size': X.shape[1],
    'hidden_layer_size': hidden_layer_size,
    'z_size': z_size,
    'num_classes': len(np.unique(y)),
    'k': k_range[0], # because we are benchmarking over k, this will get replaced by the benchmark function
    'loss_tradeoff': 0,
  },
  train_kwargs = {
    'max_epochs': max_epochs,
  }
)

mixed_marker_map = MarkerMap.getBenchmarker(
  create_kwargs = {
    'input_size': X.shape[1],
    'hidden_layer_size': hidden_layer_size,
    'z_size': z_size,
    'num_classes': len(np.unique(y)),
    'k': k_range[0],
    'loss_tradeoff': 0.5,
  },
  train_kwargs = {
    'max_epochs': max_epochs,
  }
)

unsupervised_marker_map = MarkerMap.getBenchmarker(
  create_kwargs = {
    'input_size': X.shape[1],
    'hidden_layer_size': hidden_layer_size,
    'z_size': z_size,
    'num_classes': None, #since it is unsupervised, we can just say that the number of classes is None
    'k': k_range[0],
    'loss_tradeoff': 1.0,
  },
  train_kwargs = {
    'max_epochs': max_epochs,
  },
)

Run the Benchmark

Finally, we run the benchmark by passing in all the models as a dictionary, the number of times to run each model, the data and labels, the type of benchmark, and the range of values we are benchmarking over. We will set the range of k values as [10, 25, 50], but you may want to go higher in practice. Note that we also add the RandomBaseline model. This model selects k markers at random; it is always a good idea to include it because it performs better than might be expected, and it is very fast, so there is little downside.

The benchmark function splits the data, runs each model with the specified k, then trains a simple model on just the k markers and evaluates it on a test set that was used neither to train the marker selection model nor the simple evaluation model. The reported results include accuracy and f1-score, and we can visualize them in a plot with the plot_benchmarks function.

This function will train each MarkerMap 3 times for a total of 9 runs, so it may take over 10 minutes depending on your hardware. Feel free to comment out some of the models.

k_range = [10, 25, 50]

results, benchmark_label, benchmark_range = benchmark(
  {
    'Unsupervised Marker Map': unsupervised_marker_map,
    'Supervised Marker Map': supervised_marker_map,
    'Mixed Marker Map': mixed_marker_map,
    'Baseline': RandomBaseline.getBenchmarker(train_kwargs = { 'k': k_range[0] }),
  },
  1, # num_times, how many different random train/test splits to run the models on
  X,
  y,
  benchmark='k',
  benchmark_range=k_range,
)

plot_benchmarks(results, benchmark_label, benchmark_range, mode='accuracy')

Features

  • The MarkerMap package provides functionality to easily benchmark different marker selection methods to evaluate performance under a number of metrics. Each model has a getBenchmarker function which takes model constructor parameters and trainer parameters and returns a model function. The benchmark function then takes all these model functions, a dataset, and the desired type of benchmarking and runs all the models to easily compare performance. See scripts/benchmark_k for examples.
  • Types of benchmarking:
    • k: The number of markers for each method to select
    • label_error: Given a range of percentages, pick that percent of points in the training + validation set and set their label to a random label from among the existing labels.
  • To load the data, you can make use of the following functions: get_citeseq, get_mouse_brain, get_paul, and get_zeisel. Note that both get_mouse_brain and get_paul do some pre-processing, including removing outliers and normalizing the data in the case of Mouse Brain.
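The corruption step behind the label_error benchmark can be sketched in a few lines of numpy. This is an illustrative stand-in, assuming integer-encoded labels, and not the package's actual implementation:

```python
import numpy as np

def corrupt_labels(y, error_rate, seed=0):
    """Replace error_rate of the labels with random draws from the existing label set."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    n_corrupt = int(error_rate * len(y))
    idx = rng.choice(len(y), size=n_corrupt, replace=False)  # which points to corrupt
    y_noisy[idx] = rng.choice(np.unique(y), size=n_corrupt)  # redraw their labels
    return y_noisy

y = np.array([0, 1, 2] * 100)
y_noisy = corrupt_labels(y, 0.2)
print((y_noisy != y).mean())  # at most 0.2, since a redrawn label may match the original
```

Benchmarking over increasing error rates shows how robust each marker selection method is to noisy annotations.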

For Developers

  • If you are going to be developing this package, also install the following: pip install pre-commit pytest
  • In the root directory, run pre-commit install. You should see a line like pre-commit installed at .git/hooks/pre-commit. Now when you commit to your local branch, it will run jupyter nbconvert --clean-output on all the local jupyter notebooks on that branch. This ensures that only clean notebooks are uploaded to GitHub.
  • To run tests, simply run pytest.

License

  • This project is licensed under the terms of the MIT License.

Download files

Download the file for your platform.

Source Distribution

markermap-1.0.1.tar.gz (42.5 kB)

Uploaded Source

Built Distribution


markermap-1.0.1-py3-none-any.whl (41.3 kB)

Uploaded Python 3

File details

Details for the file markermap-1.0.1.tar.gz.

File metadata

  • Download URL: markermap-1.0.1.tar.gz
  • Size: 42.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.10

File hashes

Hashes for markermap-1.0.1.tar.gz

  • SHA256: ae4b954b7c7cb81f0f399004cbbecfe936fe024abd96e8977d6c60f6444acf1f
  • MD5: 70f979960287adf977b13c6f64497b6e
  • BLAKE2b-256: 94d84942536cbf7e16cfecae775682a875909c590ba63c853c47ae3b05d30658


File details

Details for the file markermap-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: markermap-1.0.1-py3-none-any.whl
  • Size: 41.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.1 CPython/3.9.10

File hashes

Hashes for markermap-1.0.1-py3-none-any.whl

  • SHA256: 2396eb5d05e536c7c8e7e0ee4d910751861b5a952078487ab12b035409f55216
  • MD5: 3900120b1e70fb5c0607f6cdecb1b69b
  • BLAKE2b-256: 7db29c50e8030b9d61551e3b6265a5c2a305e2fcbf55f768dc630355e43a5df3

