
A deep learning-based clustering method for single-cell multi-omics data

MoClust

A PyTorch implementation of the single-cell multi-omics clustering method MoClust.

Abstract

Single-cell multiomics sequencing techniques have developed rapidly in the past few years. Analyzing single-cell multiomics data may offer novel perspectives for dissecting cellular heterogeneity, yet integrative analysis remains challenging. Omics data are inherently high-dimensional and highly sparse, which makes it difficult to reduce the dimension of each omic. Moreover, existing integration methods mostly struggle to align the omic-specific latent features and to obtain a cell state representation well suited for clustering.

We present MoClust, a novel joint clustering method that can be applied to several types of single-cell multiomics data. By introducing a contrastive learning-based alignment technique, MoClust is able to learn common representations that are well suited for clustering while simultaneously considering the topological structure of the latent features. Furthermore, we propose a novel automatic doublet discovery module that can efficiently find doublets without manually setting a threshold. Extensive experiments demonstrate the powerful alignment and clustering ability of MoClust.

Environment

python >= 3.7

  • scanpy == 1.6.0
  • numpy == 1.21.6
  • pandas == 1.3.5
  • torch == 1.10.2+cu102
  • scikit-learn == 1.0.2 (imported as sklearn)
  • scipy == 1.4.1
  • seaborn == 0.9.0
  • tabulate == 0.8.9
  • typing
  • pydantic

Data Format

Before getting started, preprocess your CITE-seq or SNARE-seq data into the following files:

- RNA data -- a cell x gene csv file
- Protein data -- a cell x protein csv file
- ATAC data -- a cell x peak csv file
    - the column names of the ATAC file should look like chr1:56782095-56782395

A gtf file compatible with your data is also needed when training MoClust on SNARE-seq data.
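As a quick sanity check before training, the peak naming convention described above can be verified with a few lines of Python. This is a minimal sketch and not part of MoClust: the helper name `parse_peak` and the exact regular expression are assumptions, based only on the example column name chr1:56782095-56782395.

```python
import re

# Pattern for peak names such as "chr1:56782095-56782395".
# The pattern is an assumption derived from the single example in the text.
PEAK_RE = re.compile(r"^(chr[0-9A-Za-z_]+):(\d+)-(\d+)$")


def parse_peak(name):
    """Split a peak column name into (chromosome, start, end).

    Raises ValueError if the name does not follow the chr:start-end form
    or if the interval is empty or reversed.
    """
    match = PEAK_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unexpected peak name: {name!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start >= end:
        raise ValueError(f"peak start must be smaller than end: {name!r}")
    return chrom, start, end


def find_bad_peaks(columns):
    """Return the column names that do not parse as peaks."""
    bad = []
    for col in columns:
        try:
            parse_peak(col)
        except ValueError:
            bad.append(col)
    return bad


# Hypothetical header of an ATAC csv file (cell ID column already removed).
header = ["chr1:56782095-56782395", "chrX:100-600", "peak_42"]
print(find_bad_peaks(header))
```

Running the check on the header of your ATAC csv file before training makes malformed column names visible early, rather than during gene annotation with the gtf file.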

Train MoClust over Multi-Omics data

We provide example CITE-seq data with ground truth labels. You can train MoClust on it with

python main_citeseq --RNA_raw_matrix='/rna_mat.csv' --ADT_raw_matrix='/prt_mat.csv' --have_labels=True --labels_path='/labels.csv'

You can train MoClust on unannotated CITE-seq data with

python main_citeseq --RNA_raw_matrix='/rna.csv' --ADT_raw_matrix='/adt.csv'

You can train MoClust on unannotated SNARE-seq data with

python main_snareseq --RNA_raw_matrix='/rna.csv' --ATAC_raw_matrix='/atac.csv' --gtf='/gencode.v39.annotation.gtf'

Parameters of MoClust

The list of parameters is given below:

  • RNA_raw_matrix: the path of the RNA matrix csv file

  • ADT_raw_matrix: the path of the protein matrix csv file

  • have_labels: whether ground truth labels are available

  • labels_path: the path of the ground truth labels csv file

  • highly_genes: the number of highly variable genes to be selected

  • device: the index of the CUDA device to be used

  • model_savepath: the path of the pth file to save the trained model

  • results_savepath: the path of a folder to save results
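The general parameters above could be wired into a command line roughly as follows. This is a hypothetical sketch and not MoClust's own parser: the function names, the use of `required=True`, and every default value shown (`None`, `0`, `False`) are placeholders chosen for illustration, since the list above does not state defaults for these options. It also illustrates one pitfall with flags written as `--have_labels=True`: argparse's `type=bool` treats any non-empty string, including "False", as true, so an explicit converter is used.

```python
import argparse


def str2bool(value):
    """Convert strings such as 'True' or 'False' to a bool.

    Using type=bool in argparse would map every non-empty string to True.
    """
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Build a parser mirroring the general parameters (defaults are placeholders)."""
    parser = argparse.ArgumentParser(description="Sketch of a MoClust-style CLI")
    parser.add_argument("--RNA_raw_matrix", required=True, help="path of the RNA matrix csv file")
    parser.add_argument("--ADT_raw_matrix", required=True, help="path of the protein matrix csv file")
    parser.add_argument("--have_labels", type=str2bool, default=False, help="ground truth available or not")
    parser.add_argument("--labels_path", default=None, help="path of the ground truth csv file")
    parser.add_argument("--highly_genes", type=int, default=None, help="number of highly variable genes")
    parser.add_argument("--device", type=int, default=0, help="index of the CUDA device")
    parser.add_argument("--model_savepath", default=None, help="path of the pth file for the trained model")
    parser.add_argument("--results_savepath", default=None, help="folder in which to save results")
    return parser


def parse_cli(argv):
    """Parse argv and check that labels_path is given when have_labels is set."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.have_labels and args.labels_path is None:
        parser.error("--labels_path is required when --have_labels=True")
    return args


args = parse_cli([
    "--RNA_raw_matrix=/rna_mat.csv",
    "--ADT_raw_matrix=/prt_mat.csv",
    "--have_labels=True",
    "--labels_path=/labels.csv",
])
print(args.have_labels, args.labels_path)
```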

MoClust Model Parameters:

  • nclusters: the number of clusters

  • encoder_rna_layer: the dimensions of hidden layers of RNA encoder, default as [256,64,32]

  • encoder_adt_layer: the dimensions of hidden layers of protein encoder, default as [32]

  • use_bn: whether to use batch normalization in the DDC module

  • nhidden: the dimension of the hidden layer in DDC module, default as 16

Training settings:

  • batch_size: batch size, default as 256

  • lr: learning rate, default as 1e-3

  • max_epoch: max training epoch, default as 200

  • test_interval: test frequency, default as 10

Hyper-parameters:

  • loss_weights: the weights of loss terms ddc_1|ddc_2|ddc_3|zinb_1|contrast, default as [1.0,1.0,1.0,1.0,1.0]

  • rel_sigma: sigma value used when calculating the similarity matrix K in Eqs. (9) and (10), default as 0.1

  • tau: tau value used when calculating cosine similarity between latent representations in Eq (6), default as 0.1

  • delta: constrains the strength of contrastive loss in Eq (13), default as 0.1
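To make the roles of loss_weights and tau more concrete, the sketch below shows one common way such hyper-parameters are used: a weighted sum over the five named loss terms, and a cosine similarity divided by a temperature tau. It is an illustration under stated assumptions, not MoClust's implementation. The equations referenced above are not reproduced here, so the exact form used by MoClust may differ, and the function names `total_loss` and `scaled_cosine` are hypothetical.

```python
import math

# Order of the loss terms as listed for loss_weights.
LOSS_NAMES = ["ddc_1", "ddc_2", "ddc_3", "zinb_1", "contrast"]


def total_loss(loss_values, loss_weights):
    """Weighted sum of the named loss terms (assumed combination rule).

    loss_values maps each name in LOSS_NAMES to a float;
    loss_weights is a list in the same order as LOSS_NAMES.
    """
    if len(loss_weights) != len(LOSS_NAMES):
        raise ValueError("loss_weights must have one entry per loss term")
    missing = [name for name in LOSS_NAMES if name not in loss_values]
    if missing:
        raise ValueError(f"missing loss terms: {missing}")
    return sum(w * loss_values[name] for name, w in zip(LOSS_NAMES, loss_weights))


def scaled_cosine(u, v, tau=0.1):
    """Cosine similarity of two vectors divided by the temperature tau.

    A smaller tau stretches the similarities, which sharpens the contrast
    between similar and dissimilar pairs in a contrastive objective.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v * tau)


# Made-up loss values for illustration only.
example_losses = {"ddc_1": 0.5, "ddc_2": 0.25, "ddc_3": 0.25, "zinb_1": 2.0, "contrast": 1.0}
print(total_loss(example_losses, [1.0, 1.0, 1.0, 1.0, 1.0]))
print(scaled_cosine([1.0, 0.0], [1.0, 0.0], tau=0.1))
```

Setting an entry of loss_weights to 0 in this sketch drops the corresponding term, which is one way to run ablations over the individual losses.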
