Single-cell embedded topic model for integrated scRNA-seq data analysis.
scETM: single-cell Embedded Topic Model
A generative topic model that facilitates integrative analysis of large-scale single-cell RNA sequencing data.
A full description of scETM and its application to published single-cell RNA-seq datasets is available here.
This repository includes detailed installation instructions and requirements, demos, and the scripts used to benchmark seven other state-of-the-art methods.
1 Model Overview
(a) Probabilistic graphical model of scETM. We model the scRNA-seq read count matrix y_{d,g} for cell d and gene g across S subjects or studies by a multinomial distribution whose rate is parameterized by the cell topic mixture θ, topic embedding α, gene embedding ρ, and batch effects λ. (b) Matrix factorization view of scETM. (c) Encoder architecture for inferring the cell topic mixture θ.
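The caption above can be written out as a short set of equations. This is a sketch under stated assumptions: the logistic-normal prior on θ and the softmax link over genes follow the usual embedded topic model construction and are not spelled out in the caption, and the symbols s(d) (the batch of cell d), N_d (the total read count of cell d), K (number of topics), L (embedding dimension) and G (number of genes) are introduced here for illustration.

```latex
\begin{aligned}
\delta_d &\sim \mathcal{N}(0, I), \qquad \theta_d = \mathrm{softmax}(\delta_d), \\
r_{d,g} &= \frac{\exp\!\big(\theta_d\,\alpha\,\rho_g + \lambda_{s(d),g}\big)}
                {\sum_{g'} \exp\!\big(\theta_d\,\alpha\,\rho_{g'} + \lambda_{s(d),g'}\big)}, \\
y_d &\sim \mathrm{Multinomial}(N_d,\; r_d).
\end{aligned}
```

In the matrix factorization view, the unnormalized log-rates for all cells are the product θαρ, with θ of shape cells-by-K, α of shape K-by-L, and ρ of shape L-by-G, plus a per-batch bias row from λ.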
2 Installation
Python version: 3.7+
scETM is available on PyPI, so you can install it with
pip install scETM
To enable GPU computing (which significantly boosts the performance), please install PyTorch with GPU support before installing scETM.
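Before training, it can help to confirm that the installed PyTorch build actually sees a GPU. The snippet below is a minimal sketch: the helper name detect_device is hypothetical and not part of scETM; it relies only on the standard torch.cuda.is_available() call and degrades gracefully when PyTorch is absent.

```python
import importlib.util


def detect_device():
    """Report which device PyTorch would use (hypothetical helper, not part of scETM)."""
    # Avoid an ImportError on machines where PyTorch has not been installed yet.
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch

    # torch.cuda.is_available() is True only for a CUDA-enabled build with a visible GPU.
    return "cuda" if torch.cuda.is_available() else "cpu"


print(detect_device())
```

If this prints "cpu" on a machine with a GPU, the installed PyTorch wheel is probably a CPU-only build.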
3 Usage
A step-by-step scETM tutorial can be found here.
Required data
scETM requires a cells-by-genes matrix adata as input, in the format of an AnnData object. A detailed description of AnnData can be found here.
By default, scETM looks for batch information in the 'batch_indices' column of the adata.obs DataFrame, and cell type identity in the 'cell_types' column. If your data stores the batch and cell type information in different columns, pass their names to the batch_col and cell_types_col arguments, respectively, when calling scETM functions.
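A quick way to see whether the default column names apply to a dataset is to check the column labels of adata.obs before training. The helper below is a hypothetical sketch (missing_obs_columns is not an scETM function); it takes any iterable of column names, so in practice one would pass adata.obs.columns.

```python
def missing_obs_columns(obs_columns,
                        batch_col="batch_indices",
                        cell_types_col="cell_types"):
    """Return the expected column names that are absent from `obs_columns`.

    Hypothetical helper for illustration; the defaults mirror the column
    names that scETM looks for by default.
    """
    present = set(obs_columns)
    return [col for col in (batch_col, cell_types_col) if col not in present]


# Toy stand-in for adata.obs.columns: batch labels are stored under "donor".
columns = ["donor", "cell_types", "n_counts"]
print(missing_obs_columns(columns))                     # default names
print(missing_obs_columns(columns, batch_col="donor"))  # custom batch column
```

When the first call reports a missing name, pass the actual column name through batch_col or cell_types_col rather than renaming the columns in the data.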
scETM and p-scETM
There are two flavors, scETM and pathway-informed scETM (p-scETM). The difference is that, in p-scETM, the gene embedding \rho is fixed to a pathways-by-genes matrix, which can be downloaded from the pathDIP4 pathway database. We only keep pathways that contain more than 5 genes.
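The pathway filter described above can be sketched in a few lines. This is an illustration only, assuming pathway membership is available as a mapping from pathway name to gene list; the function names and the toy pathways are hypothetical, and the actual preprocessing used to build the pathway matrix may differ.

```python
def filter_pathways(pathway_to_genes, threshold=5):
    """Keep pathways that contain strictly more than `threshold` distinct genes."""
    return {
        name: sorted(set(genes))
        for name, genes in pathway_to_genes.items()
        if len(set(genes)) > threshold
    }


def membership_matrix(pathway_to_genes, gene_order):
    """Build a binary pathways-by-genes matrix as nested lists (rows = pathways)."""
    rows = []
    for genes in pathway_to_genes.values():
        members = set(genes)
        rows.append([1 if gene in members else 0 for gene in gene_order])
    return rows


# Toy data: only "big" has more than 5 genes and survives the filter.
pathways = {
    "big": ["G1", "G2", "G3", "G4", "G5", "G6"],
    "small": ["G1", "G2", "G3", "G4", "G5"],
}
kept = filter_pathways(pathways)
matrix = membership_matrix(kept, ["G1", "G2", "G7"])
print(list(kept), matrix)
```

Each row of the resulting matrix plays the role of one fixed dimension of the gene embedding \rho in p-scETM.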
- scETM
from scETM.model import scETM
from scETM.trainer import UnsupervisedTrainer, prepare_for_transfer
import anndata
# Prepare the source dataset, Mouse Pancreas
mp = anndata.read_h5ad("MousePancreas.h5ad")
# Initialize model
model = scETM(mp.n_vars, mp.obs.batch_indices.nunique(), enable_batch_bias=True)
# The trainer object will set up the random seed, optimizer, training and evaluation loop, checkpointing and logging.
trainer = UnsupervisedTrainer(model, mp, train_instance_name="MP_scETM", ckpt_dir="../results")
# Train the model on mp for 12000 epochs, evaluating every 1000 epochs. Use 4 threads to sample minibatches.
trainer.train(n_epochs=12000, eval_every=1000, n_samplers=4)
# Load the target dataset, Human Pancreas
hp = anndata.read_h5ad('HumanPancreas.h5ad')
# Align the source dataset's gene names (which are mouse genes) to the target dataset (which are human genes)
mp_genes = mp.var_names.str.upper()
# pandas Index.drop_duplicates has no inplace option; it returns a new Index
mp_genes = mp_genes.drop_duplicates()
# Generate a new model and a modified dataset from the previously trained model and the mp_genes
model, hp = prepare_for_transfer(model, mp_genes, hp,
    keep_tgt_unique_genes=True,  # Keep target-unique genes in the model and the target dataset
    fix_shared_genes=True  # Fix parameters related to shared genes in the model
)
# Instantiate another trainer to fine-tune the model
trainer = UnsupervisedTrainer(model, hp, train_instance_name="HP_all_fix", ckpt_dir="../results", init_lr=5e-4)
trainer.train(n_epochs=800, eval_every=200)
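To make the terms "shared genes" and "target-unique genes" in the transfer step concrete, the sketch below works through the name alignment on a toy gene list. It is an illustration only: align_gene_names is a hypothetical helper, not part of scETM, and it does not reproduce what prepare_for_transfer does internally. Note that upper-casing mouse symbols is only an approximate way to match human orthologs.

```python
def align_gene_names(source_genes, target_genes):
    """Upper-case source gene names, drop duplicates, and split the overlap.

    Returns (aligned_source, shared, target_unique), each in input order.
    """
    aligned, seen = [], set()
    for gene in source_genes:
        upper = gene.upper()
        if upper not in seen:  # keep the first occurrence only
            seen.add(upper)
            aligned.append(upper)

    target_set = set(target_genes)
    shared = [gene for gene in aligned if gene in target_set]
    target_unique = [gene for gene in target_genes if gene not in seen]
    return aligned, shared, target_unique


# Toy example: mouse symbols on the source side, human symbols on the target side.
mouse = ["Ins1", "Ins2", "Gcg", "Ppy"]
human = ["INS", "GCG", "PPY", "SST"]
aligned, shared, target_unique = align_gene_names(mouse, human)
print(aligned)        # ['INS1', 'INS2', 'GCG', 'PPY']
print(shared)         # ['GCG', 'PPY']
print(target_unique)  # ['INS', 'SST']
```

Here Ins1 and Ins2 fail to match the human INS gene, which shows why case conversion alone can miss orthologs whose symbols differ.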
- p-scETM
The gene-by-pathway matrix (with row and column names) is stored in a CSV file specified by the "pathway-csv-path" argument. The gene names in the gene-by-pathway matrix must match those in the scRNA-seq data so that the program can merge the two sources. To fix the gene embedding matrix \rho during training, set "trainable-gene-emb-dim" to zero. In this case, the gene set used to train the model is the intersection of the genes in the scRNA-seq data and the genes in the gene-by-pathway matrix. Otherwise, if "trainable-gene-emb-dim" is set to a positive value, all the genes in the scRNA-seq data are kept.
$ python train.py \
--model scETM \
--norm-cells \
--batch-scaling \
--h5ad-path data/HumanPancreas.h5ad \
--pathway-csv-path data/pathdipv4_morethan5.csv \
--n-epochs 800 \
--log-every 400 \
--ckpt-dir results/ \
--save-embeddings \
--trainable-gene-emb-dim 0 # fixing the gene embedding \rho
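The effect of "trainable-gene-emb-dim" on the training gene set, as described for p-scETM, can be summarized in a small function. This is a sketch of the stated rule only; select_training_genes is a hypothetical name and is not how train.py is actually organized.

```python
def select_training_genes(data_genes, pathway_genes, trainable_gene_emb_dim):
    """Return the genes used for training under the rule described above.

    With a fixed embedding (dimension 0), only genes present in both the
    scRNA-seq data and the gene-by-pathway matrix are kept; with a positive
    dimension, all genes in the scRNA-seq data are kept.
    """
    if trainable_gene_emb_dim < 0:
        raise ValueError("trainable_gene_emb_dim must be non-negative")
    if trainable_gene_emb_dim == 0:
        pathway_set = set(pathway_genes)
        return [gene for gene in data_genes if gene in pathway_set]
    return list(data_genes)


data_genes = ["INS", "GCG", "SST", "MALAT1"]
pathway_genes = ["GCG", "INS", "TP53"]
print(select_training_genes(data_genes, pathway_genes, 0))    # intersection
print(select_training_genes(data_genes, pathway_genes, 100))  # all data genes
```

Genes that appear only in the pathway matrix (TP53 in the toy example) are never part of the training set, since there are no counts for them.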
4 Benchmarking
The commands used for running Harmony, Scanorama, Seurat, scVAE-GM, scVI, LIGER, and scVI-LD are available in the baselines folder.