MarkerMap
Marker selection library for single-cell RNA-seq data.
MarkerMap is a generative model for selecting the most informative gene markers by projecting cells into a shared, interpretable embedding without sacrificing accuracy.
Table of Contents
- Installation
- Quick Start
- Features
- For Developers
- Adding Your Model
- License
Installation
MacOS
- The easiest way to install is with pip from https://pypi.org/project/markermap/
- Simply do
pip install markermap
- To use the persist package, you will need to install it by following the instructions at https://github.com/iancovert/persist/
- Note: If using Google Colab, you may have to restart the runtime after installing, because scanpy installs a newer version of matplotlib than the default one.
- Note: Smashpy pins an older version of tensorflow (==2.5.0), which creates installation problems. We therefore do not require Smashpy to use this package, but it must be installed if you want to compare against that method.
Windows
- Coming soon!
Quick Start
Simple Example
Copy the code from here, or take a look at scripts/quick_start.py for a Python script or notebooks/quick_start.ipynb for a Jupyter notebook.
Imports
Data is handled with numpy and with scanpy, a package used for many computational biology datasets.
import numpy as np
import scanpy as sc
from markermap.vae_models import MarkerMap, train_model
from markermap.utils import (
new_model_metrics,
parse_adata,
plot_confusion_matrix,
split_data_into_dataloaders,
)
Set Parameters
Define some parameters that we will use when creating the MarkerMap.
- z_size is the dimension of the latent space in the variational auto-encoder. We always use 16.
- hidden_layer_size is the dimension of the hidden layers in the auto-encoder that come before and after the latent space layer. This depends on the data; a good rule of thumb is ~10% of the input dimension. For the CITE-seq data, which has 500 columns, ~10% is 50, so we use 64.
- k is the number of markers to extract.
- Set file_path to wherever your data is.
z_size = 16
hidden_layer_size = 64
k = 50
file_path = 'data/cite_seq/CITEseq.h5ad'
Data
Using the file_path we set above, we read in the data with scanpy and break it into X and y using the parse_adata function. The text labels in adata.obs['annotation'] are converted to number labels so that MarkerMap can use them properly.
We then split the data into training, validation, and test sets with a 70%, 10%, 20% split. MarkerMap uses the validation set during the training process.
adata = sc.read_h5ad(file_path)
adata = sc.read_h5ad(file_path)
adata.obs['annotation'] = adata.obs['names']
X, y, encoder = parse_adata(adata)
train_dataloader, val_dataloader, _, train_indices, val_indices, test_indices = split_data_into_dataloaders(
X,
y,
train_size=0.7,
val_size=0.1,
)
X.shape  # check the dimensions of the data (cells, genes)
Define and Train the Model
Now it is time to define the MarkerMap. There are many hyperparameters that can be tuned here, but the most important are k and loss_tradeoff. The k parameter may require some domain knowledge, but it is fairly easy to benchmark across different levels of k, as we will see in the later examples. loss_tradeoff is also important; see the paper for further discussion. In general, we use 3 levels: 0 (supervised only), 0.5 (mixed supervised-unsupervised), and 1 (unsupervised only). This step may take a couple of minutes.
supervised_marker_map = MarkerMap(X.shape[1], hidden_layer_size, z_size, len(np.unique(y)), k, loss_tradeoff=0)
train_model(supervised_marker_map, train_dataloader, val_dataloader)
Evaluate the model
Finally, we test the model. The new_model_metrics function trains a simple model, such as a RandomForestClassifier, on the training data restricted to the k markers, and then evaluates it on the test data. We then print the misclassification rate and the F1 score, and plot a confusion matrix.
misclass_rate, test_rep, cm = new_model_metrics(
X[np.concatenate([train_indices, val_indices]), :],
y[np.concatenate([train_indices, val_indices])],
X[test_indices, :],
y[test_indices],
markers=supervised_marker_map.markers().clone().cpu().detach().numpy(),
)
print(misclass_rate)
print(test_rep['weighted avg']['f1-score'])
plot_confusion_matrix(cm, encoder.classes_)
Benchmark Example
Now we will do an example where we use the benchmarking tools of the package. Follow the steps from the Simple Example through the Data section, then pick up here. Alternatively, check out scripts/quick_start_benchmark.py for a Python script or notebooks/quick_start_benchmark.ipynb for a Jupyter notebook.
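The benchmark example also uses benchmark, plot_benchmarks, and RandomBaseline, which are not in the imports from the Simple Example. As an assumption about where these live in the installed package (check your installed version if the import fails), the extra imports would look like:
from markermap.utils import benchmark, plot_benchmarks
from markermap.other_models import RandomBaseline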
Define the Models
Now it is time to define all the models that we are benchmarking. For this tutorial, we will benchmark the three versions of MarkerMap: supervised, mixed, and unsupervised. Each model in this repository comes with a getBenchmarker function where we specify all the parameters used for constructing the model and all the parameters used for training it. The benchmark function will then run and evaluate them all. For this tutorial we will also specify a train argument, max_epochs, which limits the number of epochs during training.
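The constructors below reference max_epochs and k_range, so define them first. The value of max_epochs here is just an illustrative assumption; k_range is the same list of marker counts used in the benchmark further down:
max_epochs = 100  # assumed training cap; adjust for your hardware
k_range = [10, 25, 50]  # marker counts we benchmark over (also shown below)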
supervised_marker_map = MarkerMap.getBenchmarker(
create_kwargs = {
'input_size': X.shape[1],
'hidden_layer_size': hidden_layer_size,
'z_size': z_size,
'num_classes': len(np.unique(y)),
'k': k_range[0], # because we are benchmarking over k, this will get replaced by the benchmark function
'loss_tradeoff': 0,
},
train_kwargs = {
'max_epochs': max_epochs,
}
)
mixed_marker_map = MarkerMap.getBenchmarker(
create_kwargs = {
'input_size': X.shape[1],
'hidden_layer_size': hidden_layer_size,
'z_size': z_size,
'num_classes': len(np.unique(y)),
'k': k_range[0],
'loss_tradeoff': 0.5,
},
train_kwargs = {
'max_epochs': max_epochs,
}
)
unsupervised_marker_map = MarkerMap.getBenchmarker(
create_kwargs = {
'input_size': X.shape[1],
'hidden_layer_size': hidden_layer_size,
'z_size': z_size,
'num_classes': None, #since it is unsupervised, we can just say that the number of classes is None
'k': k_range[0],
'loss_tradeoff': 1.0,
},
train_kwargs = {
'max_epochs': max_epochs,
},
)
Run the Benchmark
Finally, we run the benchmark by passing in all the models as a dictionary, the number of times to run each model, the data and labels, the type of benchmark, and the range of values we are benchmarking over. We will set the range of k values to [10, 25, 50], but you may want to go higher in practice. Note that we also add the RandomBaseline model. This model selects k markers at random, and it is always a good idea to include it because it performs better than might be expected. It is also very fast, so there is little downside.
The benchmark function splits the data, runs each model with the specified k, then trains a simple model on just the k markers and evaluates how it performs on a test set that was not used to train either the marker selection model or the simple evaluation model. The reported results include accuracy and F1 score, and we can visualize them with the plot_benchmarks function.
This will train each of the three MarkerMap variants 3 times (once per value of k) for a total of 9 runs, so it may take over 10 minutes depending on your hardware. Feel free to comment out some of the models.
k_range = [10, 25, 50]
results, benchmark_label, benchmark_range = benchmark(
{
'Unsupervised Marker Map': unsupervised_marker_map,
'Supervised Marker Map': supervised_marker_map,
'Mixed Marker Map': mixed_marker_map,
'Baseline': RandomBaseline.getBenchmarker(train_kwargs = { 'k': k_range[0] }),
},
1, # num_times, how many different random train/test splits to run the models on
X,
y,
benchmark='k',
benchmark_range=k_range,
)
plot_benchmarks(results, benchmark_label, benchmark_range, mode='accuracy')
Features
- The MarkerMap package provides functionality to easily benchmark different marker selection methods and evaluate performance under a number of metrics. Each model has a getBenchmarker function which takes model constructor parameters and trainer parameters and returns a model function. The benchmark function then takes all these model functions, a dataset, and the desired type of benchmarking, and runs all the models to easily compare performance. See scripts/benchmark_k for examples.
- Types of benchmarking:
  - k: the number of markers to select.
  - label_error: given a range of percentages, pick that percent of points in the training + validation set and set their label to a random label from among the existing labels.
- To load the data, you can make use of the following functions: get_citeseq, get_mouse_brain, get_paul, and get_zeisel. Note that both get_mouse_brain and get_paul do some pre-processing, including removing outliers and, in the case of Mouse Brain, normalizing the data. A sketch of how these pieces fit together follows this list.
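For example, a label_error benchmark might be set up as follows, reusing the supervised_marker_map benchmarker defined in the Quick Start benchmark example. This is a minimal sketch: the location and signature of get_citeseq and the exact format of benchmark_range for label_error are assumptions, so check the package source before relying on them.
from markermap.utils import get_citeseq  # assumed location of the data loaders

# load the CITE-seq data (assumed to take the path to the .h5ad file and
# return the same X, y, encoder triple as parse_adata)
X, y, encoder = get_citeseq('data/cite_seq/CITEseq.h5ad')

# randomize 10% and 20% of the training + validation labels and compare robustness
results, benchmark_label, benchmark_range = benchmark(
    {
        'Supervised Marker Map': supervised_marker_map,
        'Baseline': RandomBaseline.getBenchmarker(train_kwargs={ 'k': 50 }),
    },
    1,  # num_times
    X,
    y,
    benchmark='label_error',
    benchmark_range=[0.1, 0.2],  # assumed format: fraction of labels to corrupt
)
plot_benchmarks(results, benchmark_label, benchmark_range, mode='accuracy')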
For Developers
- You will want to set up this library as an editable install.
  - Clone the repository:
git clone https://github.com/Computational-Morphogenomics-Group/MarkerMap.git
  - Navigate to the MarkerMap directory:
cd MarkerMap
  - Locally install the package (you may have to use pip3 if your system has both python2 and python3 installed):
pip install -e .
- If you are going to be developing this package, also install the following:
pip install pre-commit pytest
- In the root directory, run pre-commit install. You should see a line like pre-commit installed at .git/hooks/pre-commit. Now when you commit to your local branch, it will run jupyter nbconvert --clean-output on all the local Jupyter notebooks on that branch. This ensures that only clean notebooks are uploaded to GitHub.
- To run tests, simply run:
pytest
Adding Your Model
- The MarkerMap project is built to allow for easy comparison between different types of marker selection algorithms. If you would like to benchmark your method against the others, the process is straightforward:
  - Suppose your method is called Acme. Create a class AcmeWrapper in other_models.py that extends Acme and BenchmarkableModel from other_models.py.
  - In AcmeWrapper, implement benchmarkerFunctional; check the other wrappers in that file for the full list of arguments. The two most important are create_kwargs and train_kwargs, which are dictionaries of arguments. The create_kwargs are the arguments used when initializing the class, i.e. Acme(**create_kwargs), and the train_kwargs are the arguments used when training the model. The benchmarkerFunctional method should return an array of the indices of the selected markers. A sketch is shown after this list.
  - Voila! Now you can define the model with AcmeWrapper.getBenchmarker(create_kwargs, train_kwargs). This will return a model functional that you pass to the benchmark function. See quick_start_benchmark.py for an example.
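Here is a minimal sketch of such a wrapper. The Acme class and its fit/selected_markers methods are hypothetical, and the benchmarkerFunctional signature below is an illustrative assumption; copy the real argument list from an existing wrapper in other_models.py.
import numpy as np
from markermap.other_models import BenchmarkableModel

# Acme is your own method's class (hypothetical here)
class AcmeWrapper(Acme, BenchmarkableModel):
    @classmethod
    def benchmarkerFunctional(cls, create_kwargs, train_kwargs, X, y, train_indices, val_indices, **kwargs):
        # construct the underlying model with the creation arguments
        model = cls(**create_kwargs)
        # train on the training split with the training arguments (hypothetical Acme API)
        model.fit(X[train_indices, :], y[train_indices], **train_kwargs)
        # return the indices of the selected markers as a numpy array
        return np.array(model.selected_markers())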
- If you would like to add your super duper model to the repository, create a branch and submit a Pull Request.
License
- This project is licensed under the terms of the MIT License.