scmomat: a single-cell multi-omics mosaic integration method

scMoMaT: a unified framework for single cell mosaic integration and multi-modal bio-marker detection

scMoMaT v0.2.0

Description

scMoMaT is a single-cell data integration method that is able to:

  • integrate single cell multi-omics data under the mosaic scenario using matrix tri-factorization
  • uncover cell-type-specific bio-markers while simultaneously learning a unified cell representation
  • integrate cell batches with unequal cell type composition
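
To make the idea of tri-factorization under a mosaic layout concrete, the following is a minimal NumPy sketch. It is a generic illustration, not scMoMaT's actual objective or code: the per-batch cell factors `C`, per-modality feature factors `F`, per-modality association matrices `A`, and the squared-error loss are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4                                # latent dimensions (small, for illustration)
n_cells = [50, 60]                   # cells in batch 0 and batch 1
n_feats = {"rna": 30, "atac": 40}    # features per modality

# Assumed layout: one cell factor per batch, one feature factor and one
# association matrix per modality.
C = [rng.random((n, K)) for n in n_cells]
F = {m: rng.random((d, K)) for m, d in n_feats.items()}
A = {m: rng.random((K, K)) for m in n_feats}

def reconstruct(b, m):
    """Tri-factorized reconstruction of batch b in modality m: C_b A_m F_m^T."""
    return C[b] @ A[m] @ F[m].T

# Mosaic layout: batch 0 has RNA only, batch 1 has RNA and ATAC.
observed = {
    "rna": [reconstruct(0, "rna"), reconstruct(1, "rna")],
    "atac": [None, reconstruct(1, "atac")],
}

def mosaic_loss(observed):
    """Sum of squared reconstruction errors over the observed matrices only."""
    loss = 0.0
    for m, mats in observed.items():
        for b, X in enumerate(mats):
            if X is None:            # missing matrix: contributes nothing
                continue
            loss += np.linalg.norm(X - reconstruct(b, m)) ** 2
    return loss

print(mosaic_loss(observed))
```

Because the cell factors are shared across modalities and the feature factors across batches, a matrix that was never observed (here, ATAC for batch 0) can still be reconstructed from the learned factors.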

Dependency

Python >= 3.8.10
numpy >= 1.19.5
torch >= 1.11.0
scipy >= 1.7.3
sklearn >= 0.22.1
matplotlib >= 3.5.2
umap >= 0.5.2
leidenalg >= 0.9.1
pandas >= 1.4.0
adjustText >= 0.7.3 (optional)
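
One way to check an environment against the minimum versions above is sketched below, using only the standard library. The distribution names `scikit-learn` and `umap-learn` are assumptions about how sklearn and umap are published on PyPI, and the helpers `as_tuple` and `check` are hypothetical names introduced for this example.

```python
from importlib import metadata

# Minimum versions copied from the dependency list. The names "scikit-learn"
# and "umap-learn" are assumed PyPI distribution names for sklearn and umap.
MINIMUM = {
    "numpy": "1.19.5", "torch": "1.11.0", "scipy": "1.7.3",
    "scikit-learn": "0.22.1", "matplotlib": "3.5.2", "umap-learn": "0.5.2",
    "leidenalg": "0.9.1", "pandas": "1.4.0",
}

def as_tuple(version):
    """Turn '1.11.0+cu113' into (1, 11, 0), keeping leading digits of each part."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:            # pad so '1.4' compares equal to '1.4.0'
        parts.append(0)
    return tuple(parts)

def check(minimum=MINIMUM):
    """Return {distribution name: 'ok' | 'missing' | 'too old (...)'}."""
    report = {}
    for name, required in minimum.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = as_tuple(installed) >= as_tuple(required)
        report[name] = "ok" if ok else f"too old ({installed} < {required})"
    return report

if __name__ == "__main__":
    for name, status in check().items():
        print(f"{name}: {status}")
```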

Directory

  • src contains the main script of scMoMaT
  • test contains the test scripts of scMoMaT on the datasets in the manuscript, along with the scripts for running the baseline methods. Please refer to this folder to reproduce the test results. The testing datasets are available upon request.
  • data stores the datasets

Data

The data needed for PBMC (the first real dataset in the manuscript, ASAP-PBMC) and MOp (the second real dataset in the manuscript, MOp_5batches) are provided, which should suffice for running scMoMaT with demo_pbmc.py and demo_mop_5batches.py as examples. The datasets for the other demo scripts are available upon request.

Installation

Install the package with

pip install scmomat

Or install the latest version by cloning the repository with

git clone https://github.com/PeterZZQ/scMoMaT.git

Enter the root directory of the package

cd scMoMaT

And run

pip install .

Uninstall using

pip uninstall scmomat

Usage

Training pipeline of scMoMaT

Data object count

Make sure that the dataset is stored in a dictionary with the following keys:

  • nbatches: the total number of batches.
  • feats_name: a dictionary storing the feature names of each modality, e.g. {"rna": np.array([aaaa, bbbb,...]), "atac": np.array([ccc, ddd, ...]), "protein": np.array([eee, fff, ...])}.
  • modality 1 (e.g. rna in the example below): a list storing all data matrices of modality 1. Each element corresponds to one batch, and the elements follow the ordering of the batches. A batch that has no count matrix in this modality gets None as a placeholder.
  • modality 2 (e.g. atac in the example below): same requirements as above.
  • modality 3 (e.g. protein): same requirements as above.
  • ...
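
As a rough illustration of this layout, the sketch below builds such a dictionary from random toy matrices. The sizes, feature names, and the helper `toy` are hypothetical, and the random values merely stand in for real, preprocessed count matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 3 batches, 5 genes, 8 peaks, 2 proteins.
genes = np.array([f"gene_{i}" for i in range(5)])
peaks = np.array([f"peak_{i}" for i in range(8)])
proteins = np.array([f"protein_{i}" for i in range(2)])

def toy(ncells, nfeatures):
    """Random stand-in for a preprocessed (ncells, nfeatures) count matrix."""
    return rng.random((ncells, nfeatures))

counts = {
    "nbatches": 3,
    "feats_name": {"rna": genes, "atac": peaks, "protein": proteins},
    # Each list has one entry per batch, in batch order; None marks a
    # modality that was not measured in that batch.
    # batch 0: RNA + protein, batch 1: RNA + ATAC, batch 2: ATAC only
    "rna": [toy(10, 5), toy(12, 5), None],
    "atac": [None, toy(12, 8), toy(9, 8)],
    "protein": [toy(10, 2), None, None],
}
```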

Check demo_scmomat.ipynb for an example. Note:

  • The number of items in feats_name should match the number of modalities in counts.
  • The lists of modality 1, modality 2, modality 3, etc. should all have the same length, equal to nbatches (missing matrices are denoted as None, as explained above). The matrices in each list must be ordered by batch ID, so that scMoMaT can detect parallel sequenced batches.
  • The data matrices within each modality (each list) should share the same set of features. You can achieve this by (1) using the intersection of all genes/proteins across batches, or (2) remapping the chromatin regions according to the peaks of one batch.
  • Each data matrix has the shape (ncells, nfeatures) and is preprocessed with the utils.preprocess() function.
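
These rules can be checked before training with a small helper such as the sketch below. `check_counts` is a hypothetical function written for this example and is not part of scMoMaT; the check that parallel batches have the same number of cells in every modality is an additional assumption.

```python
import numpy as np

def check_counts(counts):
    """Raise ValueError if `counts` breaks one of the layout rules; return cells per batch."""
    nbatches = counts["nbatches"]
    feats_name = counts["feats_name"]
    modalities = [k for k in counts if k not in ("nbatches", "feats_name")]
    if set(modalities) != set(feats_name):
        raise ValueError("feats_name keys must match the modalities in counts")
    ncells = [None] * nbatches
    for mod in modalities:
        mats = counts[mod]
        if len(mats) != nbatches:
            raise ValueError(f"{mod}: expected {nbatches} entries, got {len(mats)}")
        for b, X in enumerate(mats):
            if X is None:                       # modality missing in this batch
                continue
            if X.shape[1] != len(feats_name[mod]):
                raise ValueError(f"{mod}, batch {b}: feature count does not match feats_name")
            if ncells[b] is None:
                ncells[b] = X.shape[0]
            elif ncells[b] != X.shape[0]:       # assumed: same cells across modalities
                raise ValueError(f"batch {b}: modalities disagree on the number of cells")
    empty = [b for b, n in enumerate(ncells) if n is None]
    if empty:
        raise ValueError(f"batches {empty} have no data in any modality")
    return ncells

# Toy input: 2 batches, 3 genes, 2 peaks.
counts = {
    "nbatches": 2,
    "feats_name": {"rna": np.array(["g1", "g2", "g3"]), "atac": np.array(["p1", "p2"])},
    "rna": [np.zeros((4, 3)), None],
    "atac": [np.zeros((4, 2)), np.zeros((5, 2))],
}
print(check_counts(counts))  # prints [4, 5]
```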

An example template is shown below. Note that the features of the data matrices are matched in advance. The connection matrix (GxR.npz) used to generate the pseudo-count is also provided in advance (it was produced with the code in calc_pseudo_count/calc_pseudo_count.R). Feel free to modify the template to use your own dataset.
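
To give an intuition for how a connection matrix can turn ATAC counts into pseudo-counts, here is a minimal NumPy sketch on toy data. The orientation of the matrix (genes by regions), the final binarization step, and all values are assumptions for illustration; in practice the matrix would typically be sparse and loaded from GxR.npz.

```python
import numpy as np

# Hypothetical toy inputs: 3 cells, 4 regions, 2 genes.
atac = np.array([[1, 0, 0, 1],
                 [0, 0, 1, 0],
                 [0, 0, 0, 0]])

# Assumed gene-by-region connection matrix: entry (g, r) is 1 when
# region r is linked to gene g.
GxR = np.array([[1, 1, 0, 0],
                [0, 0, 1, 1]])

# (cells x regions) @ (regions x genes) gives (cells x genes):
# each gene sums the accessibility of the regions linked to it.
pseudo_rna = atac @ GxR.T

# Optional binarization (an assumption): keep only whether any linked region is open.
pseudo_rna_bin = (pseudo_rna != 0).astype(int)

print(pseudo_rna)
```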

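First training stage

import torch
import scmomat

# training device: GPU if available, otherwise CPU
# (use e.g. "cuda:1" to select a specific GPU on a multi-GPU machine)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")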

# number of latent dimensions (20~30)
K = 30
# number of iterations
T = 4000
# [Optional] parameter lambda is set to be the default value 0.001

# 1st stage training, learning cell factors
model = scmomat.scmomat_model(counts = counts, K = K, device = device)
losses = model.train_func(T = T)

# extract cell factors
zs = model.extract_cell_factors()

Cell factor post-processing

# post-processing
# neighborhood size (30~50)
n_neighbors = 30
# [Optional] neighborhood radius (between 0 and 1)
r = None
# number of parallel CPU jobs used for the calculation, default 1
njobs = 8
# obtain post-processed graph
s_pair_dist, knn_indices, knn_dists = scmomat.post_process(zs, n_neighbors, njobs = njobs, r = r)

Cluster and retraining

# cluster with post-processed graph
# cluster resolution
resolution = 0.9
labels_leiden = scmomat.leiden_cluster(X = None, knn_indices = knn_indices, knn_dists = knn_dists, resolution = resolution)
# visualize graph with umap
umap_op = scmomat.UMAP(n_components = 2, n_neighbors = n_neighbors, min_dist = 0.20, random_state = 0, metric='precomputed', knn_dists=knn_dists, knn_indices=knn_indices)
x_umap = umap_op.fit_transform(s_pair_dist)
scmomat.plot_latent(x_umap, annos = labels_leiden, mode = "joint", save = None, figsize = (10,7), axis_label = "UMAP", markerscale = 6, s = 5, label_inplace = True, alpha = 0.7)

# 2nd stage training with the cluster result, to obtain feature factors
# [Optional] parameter lambda is set to be the default value 0.001
# [Optional] Update counts to include more training modalities (motif)
model2 = scmomat.scmomat_retrain(model = model, counts = counts, labels = labels_leiden, device = device)
# number of iterations is 2000
losses = model2.train(T = 2000)

# extract marker scores
score_feats = model2.extract_feature_factors()
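
Once marker scores are available, a common follow-up is to list the highest-scoring features per cluster. The sketch below does this for a plain NumPy array of shape (nfeatures, nclusters). That shape, the toy values, and the helper `top_features` are assumptions for illustration; the actual structure returned by extract_feature_factors() may differ.

```python
import numpy as np

def top_features(scores, feature_names, n_top=3):
    """Return {cluster index: names of the n_top highest-scoring features}."""
    scores = np.asarray(scores)
    # Sort each column in descending order and keep the first n_top row indices.
    order = np.argsort(-scores, axis=0)[:n_top]
    return {c: feature_names[order[:, c]].tolist() for c in range(scores.shape[1])}

# Hypothetical scores for 4 features and 2 clusters.
feature_names = np.array(["g1", "g2", "g3", "g4"])
scores = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.5, 0.3],
                   [0.1, 0.7]])

markers = top_features(scores, feature_names, n_top=2)
print(markers)  # prints {0: ['g1', 'g3'], 1: ['g2', 'g4']}
```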

Examples

The example above shows the key steps in scMoMaT and does not include the plotting functions. demo_scmomat.ipynb provides a detailed example run on the MOp_5batches dataset (running time: the training procedure takes ~5 min on an NVIDIA A40 GPU; the whole notebook takes ~30 min).

For more examples, please see the test scripts in the test folder; the necessary comments are included:

  • test_pbmc.py: The first real dataset in the manuscript (data provided).
  • test_bmmc_healthy.py: The third real dataset in the manuscript.
  • test_spleen.py: The fourth real dataset in the manuscript.
  • test_spleen_subsample.py: The fourth real dataset in the manuscript.
  • test_mop_5batches.py: The second real dataset in the manuscript.

Contact

Cite

@article{zhang2022scmomat,
  title={scMoMaT: Mosaic integration of single cell multi-omics matrices using matrix trifactorization},
  author={Zhang, Ziqi and Sun, Haoran and Chen, Xinyu and Mariappan, Ragunathan and Chen, Xi and Jain, Mika and Efremova, Mirjana and Rajan, Vaibhav and Teichmann, Sarah and Zhang, Xiuwei},
  journal={bioRxiv},
  year={2022},
  publisher={Cold Spring Harbor Laboratory},
}
