Neural network classifiers for doublets

These details have not been verified by PyPI

Project links

Environment
- Console
Intended Audience
- Science/Research
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

solo -- Doublet detection via semi-supervised deep learning

Why

Cells subjected to single cell RNA-seq have been through a lot, and they'd really just like to be alone now, please. If they cannot escape the cell social scene, you end up sequencing RNA from more than one cell to a barcode, creating a doublet when you expected single cell profiles. https://www.cell.com/cell-systems/fulltext/S2405-4712(20)30195-2

solo is a neural network framework to classify doublets, so that you can remove them from your data and clean your single cell profile.

We benchmarked solo against other doublet detection tools such as DoubletFinder and Scrublet, and found that it consistently outperformed them in terms of average precision. Additionally, Solo performed much better on a more complex tissue, mouse kidney.

Quick set up

Run the following to clone and set up ve. git clone git@github.com:calico/solo.git && cd solo && conda create -n solo python=3.6 && conda activate solo && pip install -e .

Or install via pip conda create -n solo python=3.7 && conda activate solo && pip install solo-sc

If you don't have conda follow the instructions here: https://docs.conda.io/projects/conda/en/latest/user-guide/install/

How to solo

usage: solo [-h] [-d DOUBLET_DEPTH] [-g] [-o OUT_DIR] [-r DOUBLET_RATIO]
            [-s SEED] [-k KNOWN_DOUBLETS] [-t {multinomial,average,sum}]
            [-e EXPECTED_NUMBER_OF_DOUBLETS] [-p]
            model_json_file data_file

positional arguments:
  model_json_file       json file to pass VAE parameters
  data_file             path to h5ad, loom or 10x directory containing cell by
                        genes counts

optional arguments:
  -h, --help            show this help message and exit
  -d DOUBLET_DEPTH      Depth multiplier for a doublet relative to the average
                        of its constituents (default: 2.0)
  -g                    Run on GPU (default: True)
  -o OUT_DIR
  -r DOUBLET_RATIO      Ratio of doublets to true cells (default: 2.0)
  -s SEED               Path to previous solo output directory. Seed VAE
                        models with previously trained solo model. Directory
                        structure is assumed to be the same as solo output
                        directory structure. should at least have a vae.pt a
                        pickled object of vae weights and a latent.npy an
                        np.ndarray of the latents of your cells. (default:
                        None)
  -k KNOWN_DOUBLETS     Experimentally defined doublets tsv file. Should be a
                        single column of True/False. True indicates the cell
                        is a doublet. No header. (default: None)
  -t {multinomial,average,sum}
                        Please enter multinomial, average, or sum (default:
                        multinomial)
  -e EXPECTED_NUMBER_OF_DOUBLETS
                        Experimentally expected number of doublets (default:
                        None)
  -p                    Plot outputs (default: True)

Warning: If you are going directly from cellranger 10x output you may want to manually inspect your data prior to running solo.

model_json example:

{
  "n_hidden": 384,
  "n_latent": 64,
  "n_layers": 1,
  "cl_hidden": 128,
  "cl_layers": 1,
  "dropout_rate": 0.2,
  "learning_rate": 0.001,
  "valid_pct": 0.10
}

Outputs:

is_doublet.npy np boolean array, true if a cell is a doublet, differs from preds.npy if -e expected_number_of_doublets parameter was used
vae.pt scVI weights for vae
classifier.pt scVI weights for classifier
latent.npy latent embedding for each cell
preds.npy doublet predictions
softmax_scores.npy updated softmax of doublet scores (see paper)
no_update_softmax_scores.npy raw softmax of doublet scores
logit_scores.npy logit of doublet scores
real_cells_dist.pdf histogram of distribution of doublet scores
accuracy.pdf accuracy plot test vs train
train_v_test_dist.pdf doublet scores of test vs train
roc.pdf roc of test vs train
softmax_scores_sim.npy see above but for simulated doublets
logit_scores_sim.npy see above but for simulated doublets
preds_sim.npy see above but for simulated doublets
is_doublet_sim.npy see above but for simulated doublets

How to demultiplex cell hashing data using HashSolo CLI

Demultiplexing takes as input an h5ad file with only hashing counts. Counts can be obtained from your fastqs by using kite. See tutorial here: https://github.com/pachterlab/kite

usage: hashsolo [-h] [-j MODEL_JSON_FILE] [-o OUT_DIR] [-c CLUSTERING_DATA]
                [-p PRE_EXISTING_CLUSTERS] [-q PLOT_NAME]
                [-n NUMBER_OF_NOISE_BARCODES]
                data_file

positional arguments:
  data_file             h5ad file containing cell hashing counts

optional arguments:
  -h, --help            show this help message and exit
  -j MODEL_JSON_FILE    json file to pass optional arguments (default: None)
  -o OUT_DIR            Output directory for results (default:
                        hashsolo_output)
  -c CLUSTERING_DATA    h5ad file with count transcriptional data to perform
                        clustering on (default: None)
  -p PRE_EXISTING_CLUSTERS
                        column in cell_hashing_data_file.obs to specifying
                        different cell types or clusters (default: None)
  -q PLOT_NAME          name of plot to output (default: hashing_qc_plots.pdf)
  -n NUMBER_OF_NOISE_BARCODES
                        Number of barcodes to use to create noise distribution
                        (default: None)

model_json example:

{
  "priors": [0.01, 0.5, 0.49]
}

Priors is a list of the probability of the three hypotheses, negative, singlet, or doublet that we test when demultiplexing cell hashing data. A negative cell's barcodes doesn't have enough signal to identify its sample of origin. A singlet has enough signal from single hashing barcode to associate the cell with ins originating sample. A doublet is a cell barcode which has signal for more than one hashing barcode. Depending on how you processed your cell hashing matrix before hand you may want to set different priors. Under the assumption that you have subset your cell barcodes using typical QC on your cell by genes matrix, e.g. min UMI counts, percent mitochondrial reads, etc. We found the above setting of prior performed well (see paper). If you have only done relatively light QC in transcriptome space I'd suggest an even prior, e.g. [1./3., 1./3., 1./3.].

Outputs:

hashsoloed.h5ad anndata with demultiplexing information in .obs
hashing_qc_plots.png plots of probabilites for each cell

How to demultiplex cell hashing data using HashSolo in line

>>> from solo import hashsolo
>>> import anndata
>>> cell_hashing_data = anndata.read("cell_hashing_counts.h5ad")
>>> hashsolo.hashsolo(cell_hashing_data)
>>> cell_hashing_data.obs.head()
                  most_likeli_hypothesis  cluster_feature  negative_hypothesis_probability  singlet_hypothesis_probability  doublet_hypothesis_probability         Classification
index                                                                                                                                                                            
CCTTTCTGTCCGAACC                       2                0                     1.203673e-16                        0.000002                        0.999998                Doublet
CTGATAGGTGACTCAT                       1                0                     1.370633e-09                        0.999920                        0.000080  BatchF-GTGTGACGTATT_x
AGCTCTCGTTGTCTTT                       1                0                     2.369380e-13                        0.996992                        0.003008  BatchE-GAGGCTGAGCTA_x
GTGCGGTAGCGATGAC                       1                0                     1.579405e-09                        0.999879                        0.000121  BatchB-ACATGTTACCGT_x
AAATGCCTCTAACCGA                       1                0                     1.867626e-13                        0.999707                        0.000293  BatchB-ACATGTTACCGT_x
>>> demultiplex.plot_qc_checks_cell_hashing(cell_hashing_data)

most_likeli_hypothesis 0 == Negative, 1 == Singlet, 2 == Doublet
cluster_feature how the cell hashing data was divided if specified or done automatically by giving a cell by genes anndata object to the cluster_data argument when calling demultiplex_cell_hashing
negative_hypothesis_probability
singlet_hypothesis_probability
doublet_hypothesis_probability
Classification The sample of origin for the cell or whether it was a negative or doublet cell.

Project details

These details have not been verified by PyPI

Project links

Environment
- Console
Intended Audience
- Science/Research
Topic
- Scientific/Engineering :: Bio-Informatics

Release history Release notifications | RSS feed

1.3

Apr 4, 2022

1.2

Dec 8, 2021

1.1

Oct 15, 2021

1.0

Jun 30, 2021

This version

0.6

Jul 2, 2020

0.5

Jun 2, 2020

0.4

Jun 2, 2020

0.3

Apr 8, 2020

0.2

Jan 15, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

solo-sc-0.6.tar.gz (21.3 kB view details)

Uploaded Jul 2, 2020 Source

File details

Details for the file solo-sc-0.6.tar.gz.

File metadata

Download URL: solo-sc-0.6.tar.gz
Upload date: Jul 2, 2020
Size: 21.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.1.3.post20200325 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.6

File hashes

Hashes for solo-sc-0.6.tar.gz
Algorithm	Hash digest
SHA256	`d929612a3452f44f2585d6b49317ee46e703a5104d282c90555d7854827d42a3`
MD5	`272b4f78cf5a8aa0258f9a835b9d49d1`
BLAKE2b-256	`f18c0907f97db8fcb7aa5ba9be2c7a7f82fcf5b77185f3114891ca3172da8906`

See more details on using hashes here.

solo-sc 0.6

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

solo -- Doublet detection via semi-supervised deep learning

Why

Quick set up

How to solo

How to demultiplex cell hashing data using HashSolo CLI

How to demultiplex cell hashing data using HashSolo in line

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

File details

File metadata

File hashes