Cellqc standardizes the quality control of single-cell RNA-Seq (scRNA) data to render clean feature count matrices.

cellqc: standardized quality control pipeline of single-cell RNA-Seq data

Cellqc standardizes the quality control of single-cell RNA-Seq (scRNA) data to render clean feature count matrices from Cell Ranger outputs. Cellqc is implemented using the Snakemake workflow management system to enhance the reproducibility and scalability of data analysis. Briefly, the QC pipeline starts from the raw feature count matrices from Cell Ranger. Dropkick filters out predicted empty droplets, and SoupX purifies the transcriptome measurement by subtracting the background transcripts. DoubletFinder then detects potential doublets, and clean feature count matrices are retained for singlets. Cell types are annotated for the clean cells against a reference database using scPred.

[Figure: workflow]

Installation

Cellqc can be installed via conda from https://anaconda.org/bioconda/cellqc. To use the full functionality of cellqc, several dependencies must also be installed outside conda. Using mamba, the C++ reimplementation of conda, is encouraged to speed up the installation. For example:

conda create -n cellqc cellqc
conda activate cellqc
Rscript -e "remotes::install_github(c('chris-mcginnis-ucsf/DoubletFinder', 'mojaveazure/seurat-disk'))"
conda install anndata=0.7.8
conda install numpy=1.21 # required by dropkick
pip install dropkick

The software dependencies are summarized below.

Software URL
DoubletFinder https://github.com/chris-mcginnis-ucsf/DoubletFinder
DropletUtils https://bioconductor.org/packages/release/bioc/html/DropletUtils.html
Seurat https://satijalab.org/seurat
SeuratDisk https://github.com/mojaveazure/seurat-disk
SoupX https://github.com/constantAmateur/SoupX
scPred https://github.com/powellgenomicslab/scPred
Snakemake https://github.com/snakemake/snakemake
Scanpy https://scanpy.readthedocs.io/en/stable
dropkick https://github.com/KenLauLab/dropkick

To test the installation, simply run

$ cellqc -h

Run the pipeline

There are two ways to run the pipeline. One is to call the global rule through the installed cellqc command, and the other is to copy a local rule and run the pipeline manually with snakemake. Both ways require a configuration file in YAML format for the pipeline parameters, as well as a sample file listing the input Cell Ranger directories. See the example below.

Inspection of configuration

The configuration file is in YAML format. An example configuration can be found in the example directory.

  1. samples

This is a tab-delimited sample file (e.g., samples.txt) with the headers sample, cellranger, and optionally nreaction. The sample column holds the sample ID, and the cellranger column holds that sample's Cell Ranger output directory. The optional third column, nreaction, is the number of reactions in the library preparation; it is used to infer the expected doublets for a sample whose Cell Ranger analysis combined raw reads from multiple reactions. If the nreaction column is absent from the sample file, the default of 1 reaction is used for all samples.
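As a minimal sketch of how such a sample file can be read, the Python snippet below parses a tab-delimited table with the headers described above and falls back to nreaction=1 when that column is absent or blank. The helper name read_samples is hypothetical and not part of cellqc; the pipeline's own parser may behave differently.

```python
import csv
import io


def read_samples(handle):
    """Parse a tab-delimited sample table into a list of dicts.

    Hypothetical helper, not part of cellqc. Requires the 'sample' and
    'cellranger' columns; 'nreaction' defaults to 1 when missing or blank.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"sample", "cellranger"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing column(s): {sorted(missing)}")
    samples = []
    for row in reader:
        samples.append({
            "sample": row["sample"],
            "cellranger": row["cellranger"],
            # Fall back to a single reaction when the column is absent or empty.
            "nreaction": int(row.get("nreaction") or 1),
        })
    return samples


# In-memory stand-in for a two-column samples.txt without nreaction.
example = "sample\tcellranger\nAMD1\t/path/to/cellqc_test_data/AMD1\n"
print(read_samples(io.StringIO(example)))
```

To read a real file, pass an open file handle instead of the in-memory string, e.g. open("samples.txt", newline="").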

  2. dropkick

This section defines three parameters for empty droplet removal by dropkick.

Parameter Description
dropkick.skip If true, skip Dropkick and use only the cells estimated by Cell Ranger (using EmptyDrops). If false, Dropkick further removes predicted empty droplets. Be cautious: Dropkick might predict a significant number of false negatives for a poor library.
dropkick.method The thresholding method for labeling the training data for true cells, such as multiotsu, otsu, li, or mean.
dropkick.numthreads Number of threads. Dropkick uses significant memory, so one thread is suggested for this step.

  3. filterbycount

This section filters cells by nCount, nFeature, and the percentage of mitochondrial reads.

Parameter Description
filterbycount.mincount Minimum counts for a cell.
filterbycount.minfeature Minimum features for a cell.
filterbycount.mito Maximum percentage of mitochondrial transcripts.
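
To illustrate how the three thresholds combine, here is a minimal pure-Python sketch. The function name passes_filter and the per-cell fields (ncount, nfeature, pct_mito) are hypothetical, the cell values are made up, and treating the thresholds as inclusive is an assumption; cellqc's own filtering step may differ in these details.

```python
def passes_filter(cell, mincount=500, minfeature=300, mito=5.0):
    """Return True if a cell meets all three thresholds (assumed inclusive)."""
    return (
        cell["ncount"] >= mincount
        and cell["nfeature"] >= minfeature
        and cell["pct_mito"] <= mito
    )


# Made-up cells for illustration only.
cells = [
    {"id": "c1", "ncount": 1200, "nfeature": 800, "pct_mito": 2.1},
    {"id": "c2", "ncount": 450, "nfeature": 320, "pct_mito": 1.0},   # too few counts
    {"id": "c3", "ncount": 900, "nfeature": 600, "pct_mito": 12.5},  # too much mito
]
kept = [c["id"] for c in cells if passes_filter(c)]
print(kept)
```

A cell is kept only when it satisfies all three conditions at once, so tightening any single threshold can only shrink the retained set.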

  4. doubletfinder

This section includes three parameters for doublet removal by DoubletFinder.

Parameter Description
doubletfinder.findpK If true, estimate the neighborhood size (pK) from mean-variance bimodality coefficients. If false, skip the estimation and use the preset pK value.
doubletfinder.numthreads Number of threads.
doubletfinder.pK A preset neighborhood size (pK). Used only if doubletfinder.findpK=false.
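
The interplay between findpK and pK can be sketched as follows. The function choose_pk and the bc_by_pk mapping (candidate pK to its bimodality coefficient) are hypothetical, and the coefficients are made up. Picking the candidate with the largest coefficient follows the usual DoubletFinder guidance, but cellqc's actual selection code is not shown here and may differ.

```python
def choose_pk(find_pk, preset_pk, bc_by_pk=None):
    """Pick the neighborhood size pK.

    If find_pk is False, return the preset value unchanged. Otherwise
    return the candidate pK whose bimodality coefficient is largest.
    """
    if not find_pk:
        return preset_pk
    if not bc_by_pk:
        raise ValueError("bc_by_pk is required when find_pk is True")
    return max(bc_by_pk, key=bc_by_pk.get)


# Made-up coefficients for illustration only.
sweep = {0.005: 0.8, 0.01: 2.4, 0.05: 1.1}
print(choose_pk(False, 0.005))
print(choose_pk(True, 0.005, sweep))
```

With findpK disabled the preset is returned without touching the sweep, which is why that setting skips the estimation step entirely.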

  5. scpred

This section configures cell-type annotation by scPred using a pre-trained classifier.

Parameter Description
scpred.skip Skip the automated cell type prediction by scPred if true. This is useful for a sample without a pre-trained reference.
scpred.reference The pre-trained reference classifier saved in an RDS file. See https://github.com/powellgenomicslab/scPred
scpred.threshold Probability threshold for a positive prediction.
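
As a rough illustration of what the threshold does, the sketch below assigns the most probable label only when its probability reaches the threshold. The function name call_cell_type, the labels A and B, the fallback label "unassigned", and the inclusive comparison are all assumptions made for illustration; this is not scPred's or cellqc's code.

```python
def call_cell_type(probabilities, threshold=0.9, fallback="unassigned"):
    """Return the most probable label, or fallback if it is below threshold."""
    label, prob = max(probabilities.items(), key=lambda item: item[1])
    return label if prob >= threshold else fallback


# Made-up class probabilities for two cells.
print(call_cell_type({"A": 0.97, "B": 0.03}))
print(call_cell_type({"A": 0.55, "B": 0.45}))
```

A higher threshold leaves more cells without a label but makes the assigned labels more conservative.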

Result files

Three types of result files are generated under the result directory. The result/*.h5seurat and result/*.h5ad files are the processed count matrices with QC metrics such as "pANN" (proportion of artificial nearest neighbors) and/or "scpred_prediction" (predicted cell type). The report file result/qc_report.html summarizes the QC metrics.

An example

This example demonstrates the pipeline on two AMD samples. The test data consist of the Cell Ranger output directories of the two AMD samples, as well as a pre-trained classifier for cell-type annotation.

https://bcm.box.com/s/nnlmgxh8avagje93cih20g1dsxx14if4

Given the file locations, an example configuration file config.yaml and a sample file sample.txt are shown below.

$ cat config.yaml
# samples with Cell Ranger output directories
samples: /path/to/samples.txt

## configuration for dropkick
dropkick:
  skip: true
  method: multiotsu
  numthreads: 1

## Filter cells by nCount, nFeature, and mito
filterbycount:
  mincount: 500
  minfeature: 300
  mito: 5

## configuration for DoubletFinder
doubletfinder:
  findpK: false
  numthreads: 5
  pK: 0.005

## configuration for scPred
scpred:
  skip: false
  reference: /path/to/scPred_reference.rds
  threshold: 0.9

$ cat sample.txt
sample	cellranger
AMD1	/path/to/cellqc_test_data/AMD1
AMD2	/path/to/cellqc_test_data/AMD2
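
Before launching the pipeline, a configuration like the one above can be sanity-checked. The sketch below takes an already-parsed configuration (a plain dict, such as a YAML parser would return) and reports any section or key from the tables above that is missing. The helper check_config is hypothetical and not part of cellqc, and treating every documented key as required is an assumption.

```python
# Sections and keys as documented in the configuration tables.
EXPECTED = {
    "dropkick": {"skip", "method", "numthreads"},
    "filterbycount": {"mincount", "minfeature", "mito"},
    "doubletfinder": {"findpK", "numthreads", "pK"},
    "scpred": {"skip", "reference", "threshold"},
}


def check_config(config):
    """Return a list of problems found in a parsed config dict."""
    problems = []
    if "samples" not in config:
        problems.append("missing key: samples")
    for section, keys in EXPECTED.items():
        present = set(config.get(section) or {})
        for key in sorted(keys - present):
            problems.append(f"missing key: {section}.{key}")
    return problems


# The example config.yaml above, written out as a dict.
config = {
    "samples": "/path/to/samples.txt",
    "dropkick": {"skip": True, "method": "multiotsu", "numthreads": 1},
    "filterbycount": {"mincount": 500, "minfeature": 300, "mito": 5},
    "doubletfinder": {"findpK": False, "numthreads": 5, "pK": 0.005},
    "scpred": {
        "skip": False,
        "reference": "/path/to/scPred_reference.rds",
        "threshold": 0.9,
    },
}
print(check_config(config))
```

An empty list means every documented key is present; the check does not verify value types or that the paths exist.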

The command below runs the pipeline through the installed cellqc entry point.

$ cellqc -c config.yaml

A directed acyclic graph (DAG) of jobs will be generated. For example,

[Figure: DAG]

A report of the result files, such as report.html, is also produced.

