Single cell embedded topic model for integrated scRNA-seq data analysis.
scETM: single-cell Embedded Topic Model
A generative topic model that facilitates integrative analysis of large-scale single-cell RNA sequencing data.
A full description of scETM and its application to published single-cell RNA-seq datasets is available here.
This repository includes detailed installation instructions and requirements, demos, and the scripts used to benchmark 7 other state-of-the-art methods.
1 Model Overview
(a) Probabilistic graphical model of scETM. We model the scRNA-seq read count y_{d,g} of gene g in cell d, across S subjects or studies, by a multinomial distribution whose rate is parameterized by the cell topic mixture θ, topic embedding α, gene embedding ρ, and batch effects λ. (b) Matrix factorization view of scETM. (c) Encoder architecture for inferring the cell topic mixture θ.
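The generative model described in the caption can be written out roughly as follows. This is a sketch reconstructed from the caption's symbols, not a formula stated in this README: the softmax links, the standard-normal prior on the unnormalized cell embedding δ_d, the total count N_d, and the index s(d) for the subject or study of cell d are assumptions made for illustration.

```latex
\begin{aligned}
\delta_d &\sim \mathcal{N}(0, I), \qquad \theta_d = \operatorname{softmax}(\delta_d), \\
r_{d,g} &= \frac{\exp\left(\theta_d \, \alpha \, \rho_g + \lambda_{s(d),g}\right)}
               {\sum_{g'} \exp\left(\theta_d \, \alpha \, \rho_{g'} + \lambda_{s(d),g'}\right)}, \\
y_d &\sim \operatorname{Multinomial}\left(N_d, \; r_d\right)
\end{aligned}
```

Under these assumptions, θ_d is a 1-by-K topic mixture, α is a K-by-L topic embedding, and ρ_g is the length-L embedding of gene g, so θ α ρ is the matrix factorization view in panel (b).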
Python version: 3.7+
scETM is available on PyPI, so you can install it with:
pip install scETM
To enable GPU computing (which significantly boosts the performance), please install PyTorch with GPU support before installing scETM.
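Before installing scETM, it can help to confirm that PyTorch is present and can see a GPU. The following is a minimal sketch, not part of scETM: `describe_torch_device` is a hypothetical helper name, and the only PyTorch call it relies on is `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch_device():
    """Report whether PyTorch is installed and whether it can use a GPU."""
    # find_spec returns None when the package cannot be imported.
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"
    import torch

    # True only if PyTorch was built with CUDA support and a GPU is visible.
    return "cuda" if torch.cuda.is_available() else "cpu"


device = describe_torch_device()
print(device)
```

If this prints `cpu` on a machine with a GPU, the installed PyTorch build most likely lacks GPU support.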
A step-by-step scETM tutorial can be found here.
scETM requires a cells-by-genes matrix adata as input, in the format of an AnnData object. A detailed description of AnnData can be found here.
By default, scETM looks for batch information in the 'batch_indices' column of the adata.obs DataFrame, and cell type identity in the 'cell_types' column. If your data stores the batch and cell type information in different columns, pass their names to the batch_col and cell_type_col arguments, respectively, when calling scETM functions.
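As an illustration of the column convention above, the sketch below uses a plain pandas DataFrame as a stand-in for adata.obs. The column names "donor" and "annotation" and the cell labels are made up for this example. It shows one way to expose existing metadata under the default names, as an alternative to passing custom column names.

```python
import pandas as pd

# Stand-in for adata.obs; "donor" and "annotation" are hypothetical column names.
obs = pd.DataFrame(
    {
        "donor": ["d1", "d1", "d2"],
        "annotation": ["alpha", "beta", "alpha"],
    },
    index=["cell_0", "cell_1", "cell_2"],
)

# Copy the metadata into the columns scETM looks for by default.
obs["batch_indices"] = obs["donor"]
obs["cell_types"] = obs["annotation"]

# The number of distinct batches, computed the same way as in the usage example.
n_batches = obs["batch_indices"].nunique()
print(n_batches)
```

With a real AnnData object, the same assignments would be made on adata.obs.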
A taste of scETM
    from scETM import scETM, UnsupervisedTrainer, evaluate
    import anndata

    # Prepare the source dataset, Mouse Pancreas
    mp = anndata.read_h5ad("MousePancreas.h5ad")
    # Initialize model
    model = scETM(mp.n_vars, mp.obs.batch_indices.nunique(), enable_batch_bias=True)
    # The trainer object will set up the random seed, optimizer, training and evaluation loop, checkpointing and logging.
    trainer = UnsupervisedTrainer(model, mp, train_instance_name="MP", ckpt_dir="../results")
    # Train the model on adata for 12000 epochs, and evaluate every 1000 epochs. Use 4 threads to sample minibatches.
    trainer.train(n_epochs=12000, eval_every=1000, n_samplers=4)
    # Obtain scETM cell, gene and topic embeddings. Unnormalized cell embeddings will be stored at mp.obsm['delta'],
    # normalized cell embeddings at mp.obsm['theta'], gene embeddings at mp.varm['rho'], topic embeddings at mp.uns['alpha'].
    model.get_all_embeddings_and_nll(mp)
    # Evaluate the model and save the embedding plot
    evaluate(mp, embedding_key="delta", plot_fname="scETM_MP", plot_dir="figures/scETM_MP")
p-scETM is a variant of scETM where part or all of the gene embedding matrix ρ is fixed to a pathways-by-genes matrix, which can be downloaded from the pathDIP4 pathway database. We only keep pathways that contain more than 5 genes.
To fix the gene embedding matrix ρ during training, set trainable_gene_emb_dim to zero. In this case, the gene set used to train the model is the intersection of the genes in the scRNA-seq data and the genes in the pathways-by-genes matrix. Otherwise, if trainable_gene_emb_dim is set to a positive value, all the genes in the scRNA-seq data are kept.
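The gene-selection rule described above can be sketched in plain Python. This is an illustration only, not scETM code: `filter_pathways` and `genes_for_training` are hypothetical helpers, and the pathway and gene names are made up.

```python
# Toy pathway-to-genes mapping; all names are invented for illustration.
pathways = {
    "pathway_A": ["G1", "G2", "G3", "G4", "G5", "G6"],
    "pathway_B": ["G2", "G7"],
    "pathway_C": ["G1", "G3", "G5", "G7", "G8", "G9", "G10"],
}
data_genes = ["G1", "G2", "G3", "G8", "G11"]


def filter_pathways(pathways, min_genes=5):
    """Keep only pathways that contain more than `min_genes` genes."""
    return {name: genes for name, genes in pathways.items() if len(genes) > min_genes}


def genes_for_training(data_genes, pathways, trainable_gene_emb_dim):
    """Return the genes kept for training under the rule described in the text."""
    if trainable_gene_emb_dim > 0:
        # Part of the gene embedding is trainable, so every data gene is kept.
        return list(data_genes)
    # Fully fixed embedding: keep only genes that also appear in some pathway.
    pathway_genes = {g for genes in pathways.values() for g in genes}
    return [g for g in data_genes if g in pathway_genes]


kept_pathways = filter_pathways(pathways)
fixed_genes = genes_for_training(data_genes, kept_pathways, trainable_gene_emb_dim=0)
all_genes = genes_for_training(data_genes, kept_pathways, trainable_gene_emb_dim=16)
print(sorted(kept_pathways), fixed_genes, all_genes)
```

In this toy case pathway_B is dropped for having too few genes, and G11 is excluded only when the embedding is fully fixed, because it appears in no retained pathway.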
Transfer learning

    from scETM import scETM, UnsupervisedTrainer, prepare_for_transfer
    import anndata

    # Prepare the source dataset, Mouse Pancreas
    mp = anndata.read_h5ad("MousePancreas.h5ad")
    # Initialize model
    model = scETM(mp.n_vars, mp.obs.batch_indices.nunique(), enable_batch_bias=True)
    # The trainer object will set up the random seed, optimizer, training and evaluation loop, checkpointing and logging.
    trainer = UnsupervisedTrainer(model, mp, train_instance_name="MP", ckpt_dir="../results")
    # Train the model on adata for 12000 epochs, and evaluate every 1000 epochs. Use 4 threads to sample minibatches.
    trainer.train(n_epochs=12000, eval_every=1000, n_samplers=4)

    # Load the target dataset, Human Pancreas
    hp = anndata.read_h5ad('HumanPancreas.h5ad')
    # Align the source dataset's gene names (which are mouse genes) to the target dataset (which are human genes)
    mp_genes = mp.var_names.str.upper()
    # Index.drop_duplicates has no inplace option and returns a new Index, so reassign the result
    mp_genes = mp_genes.drop_duplicates()
    # Generate a new model and a modified dataset from the previously trained model and the mp_genes
    model, hp = prepare_for_transfer(model, hp, mp_genes,
        keep_tgt_unique_genes=True,  # Keep target-unique genes in the model and the target dataset
        fix_shared_genes=True        # Fix parameters related to shared genes in the model
    )
    # Instantiate another trainer to fine-tune the model
    trainer = UnsupervisedTrainer(model, hp, train_instance_name="HP_all_fix", ckpt_dir="../results", init_lr=5e-4)
    trainer.train(n_epochs=800, eval_every=200)
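The gene-name alignment step in the code above upper-cases mouse gene symbols so they can be matched against human symbols. The sketch below shows, on made-up gene lists, what that heuristic yields: which names collide after upper-casing, which genes end up shared, and which are unique to the target. It uses only the standard library and does not call scETM.

```python
from collections import Counter

# Invented example symbols; real datasets will differ.
mouse_genes = ["Ins1", "Ins2", "Gcg", "Sst", "GCG"]
human_genes = ["INS", "GCG", "SST", "PPY"]

# Upper-case the source (mouse) names to mimic the alignment heuristic.
aligned = [g.upper() for g in mouse_genes]

# Names that occur more than once after upper-casing need de-duplication.
duplicates = sorted(g for g, n in Counter(aligned).items() if n > 1)

# Genes present in both datasets after alignment, and genes only in the target.
shared = sorted(set(aligned) & set(human_genes))
target_unique = sorted(set(human_genes) - set(aligned))

print(duplicates, shared, target_unique)
```

Upper-casing is only a rough proxy for orthology: in this toy example INS1 and INS2 do not match the human symbol INS, so INS is treated as target-unique.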
If a Tensorboard SummaryWriter is passed to the writer argument of the UnsupervisedTrainer.train method, the package will store.