Cellqc standardizes the quality control of single-cell RNA-Seq (scRNA) data to render clean feature count matrices.
cellqc: standardized quality control pipeline of single-cell RNA-Seq data
Cellqc standardizes the quality control of single-cell RNA-Seq (scRNA) data to render clean feature count matrices from Cell Ranger outputs. Cellqc is implemented using the Snakemake workflow management system to enhance the reproducibility and scalability of data analysis. Briefly, the QC pipeline starts from the raw feature count matrices produced by Cell Ranger. Dropkick filters out predicted empty droplets, and SoupX purifies the transcriptome measurement by subtracting the background transcripts. DoubletFinder then detects potential doublets and retains clean feature count matrices for singlets. Finally, scPred annotates cell types for the clean cells using a reference database.
Installation
Cellqc can be installed via conda from https://anaconda.org/bioconda/cellqc. To use the full functionality of cellqc, several dependencies must also be installed outside conda. Using mamba, the C++ reimplementation of conda, is encouraged to speed up the installation. For example:
conda config --add channels defaults --add channels bioconda --add channels conda-forge
# Downgrade Seurat to v4 for SeuratDisk, as Seurat v5 is not supported in SeuratDisk.
mamba create -y -n cellqc python=3.10 cellqc r-seurat=4 r-seuratobject=4 r-matrix=1.6.1 dropkick r-hdf5r hdf5 r-leidenbase libxml2 r-xml r-xml2 zlib bioconductor-rsamtools bioconductor-genomicfeatures bioconductor-rtracklayer 'pandas<2'
conda activate cellqc
# Build from source
Rscript -e "remotes::install_github(c('mojaveazure/seurat-disk', 'immunogenomics/harmony', 'powellgenomicslab/scPred', 'powellgenomicslab/DropletQC'), upgrade=F)"
# Install the lijinbio fork, which fixes the @counts bug for Seurat objects, instead of chris-mcginnis-ucsf/DoubletFinder
Rscript -e "remotes::install_github('lijinbio/DoubletFinder', upgrade=F, force=T)"
pip install -U cellqc # Optional: to install the latest version from PyPI
The software dependencies are summarized below.
| Software | URL |
|---|---|
| DoubletFinder | https://github.com/chris-mcginnis-ucsf/DoubletFinder |
| DropletUtils | https://bioconductor.org/packages/release/bioc/html/DropletUtils.html |
| Seurat | https://satijalab.org/seurat |
| SeuratDisk | https://github.com/mojaveazure/seurat-disk |
| SoupX | https://github.com/constantAmateur/SoupX |
| scPred | https://github.com/powellgenomicslab/scPred |
| DropletQC | https://github.com/powellgenomicslab/DropletQC |
| Snakemake | https://github.com/snakemake/snakemake |
| Scanpy | https://scanpy.readthedocs.io/en/stable |
| dropkick | https://github.com/KenLauLab/dropkick |
To test the installation, simply run
cellqc -h
Run the pipeline
Cellqc requires a sample file for sample information and an optional configuration file for pipeline parameters.
- The sample file (e.g., samples.txt) is a tab-delimited file with headers: sample, cellranger, and/or nreaction.
  - The sample column is the ID of each sample.
  - The cellranger column is the Cell Ranger output directory. See Cell Ranger Outputs for an example directory.
  - The optional third column, nreaction, is the number of reactions in the library preparation. It is used to infer the expected doublets for a sample whose Cell Ranger analysis combines raw reads from multiple reactions. If the nreaction column is not specified in the sample file, the default of 1 reaction is used for all samples.
- The configuration file is in YAML format and is optional. The default parameters are listed below. See the next section for a description of each parameter.
nuclear_fraction:
  numthreads: 12
  cbtag: CB
  retag: RE
dropkick:
  skip: false
  method: multiotsu
  numthreads: 1
filterbycount:
  mincount: 500
  minfeature: 300
  mito: 10
doubletfinder:
  skip: false
  findpK: false
  numthreads: 5
  pK: 0.01
scpred:
  skip: true
  reference: /path_to_reference/scPred_trainmodel_RNA_svmRadialWeights_scpred.rds
  threshold: 0.9
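As an illustration of the sample file layout described above, the following minimal Python sketch writes and re-reads a tab-delimited table with the sample, cellranger, and nreaction headers. The sample IDs and directory paths are hypothetical placeholders, and the helper functions are not part of cellqc. The reader only mirrors the documented behavior of defaulting to 1 reaction when the nreaction column is missing.

```python
import csv
import io

# Hypothetical sample IDs and Cell Ranger output directories, for illustration only.
SAMPLES = [
    {"sample": "sampleA", "cellranger": "/path/to/sampleA_cellranger", "nreaction": 1},
    {"sample": "sampleB", "cellranger": "/path/to/sampleB_cellranger", "nreaction": 2},
]


def write_samples(rows, handle):
    """Write rows as a tab-delimited table with a header line."""
    writer = csv.DictWriter(
        handle,
        fieldnames=["sample", "cellranger", "nreaction"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


def read_samples(handle):
    """Read the table back, defaulting nreaction to 1 when the column is absent."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["nreaction"] = int(row.get("nreaction") or 1)
        rows.append(row)
    return rows


# Round-trip through an in-memory buffer; write to open("samples.txt", "w") instead
# to produce a file on disk.
buffer = io.StringIO()
write_samples(SAMPLES, buffer)
sample_table = buffer.getvalue()
parsed = read_samples(io.StringIO(sample_table))
```

Writing through the csv module rather than joining strings by hand keeps the delimiter consistent if a path ever contains unusual characters.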
Inspection of configuration
The configuration file is in YAML format. An example configuration can be found in the example directory.
- dropkick
This section defines parameters for empty droplet removal by dropkick.
| Parameter | Description |
|---|---|
| dropkick.skip | If true, skip Dropkick and use only the cells estimated by Cell Ranger (using EmptyDrops). If false, use Dropkick to identify additional empty droplets. Be cautious: Dropkick might predict a significant number of false negatives for a poor library. |
| dropkick.method | The thresholding method for labeling the training data for true cells, such as multiotsu, otsu, li, or mean. |
| dropkick.numthreads | Number of threads. Dropkick will use significant memory. One thread is suggested for this step. |
- filterbycount
This section filters cells by nCount, nFeature, and the percentage of mitochondrial reads.
| Parameter | Description |
|---|---|
| filterbycount.mincount | Minimum counts for a cell. |
| filterbycount.minfeature | Minimum features for a cell. |
| filterbycount.mito | Maximum percentage of mitochondrial transcripts. |
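The three thresholds above can be pictured as a simple per-cell predicate. The sketch below applies the default values to made-up QC metrics. The barcodes and numbers are invented, and treating the thresholds as inclusive bounds is an assumption rather than something this document specifies.

```python
# Thresholds copied from the default filterbycount section.
MINCOUNT = 500
MINFEATURE = 300
MITO = 10

# Toy per-cell QC metrics (made-up values): total counts, detected features,
# and percentage of mitochondrial transcripts.
cells = {
    "AAAC": {"ncount": 1200, "nfeature": 650, "pct_mito": 3.2},
    "AAAG": {"ncount": 420, "nfeature": 310, "pct_mito": 1.0},     # too few counts
    "AAAT": {"ncount": 900, "nfeature": 250, "pct_mito": 2.5},     # too few features
    "AACA": {"ncount": 3000, "nfeature": 1400, "pct_mito": 18.0},  # too much mito
}


def passes_qc(metrics, mincount=MINCOUNT, minfeature=MINFEATURE, mito=MITO):
    """Return True when a cell meets all three thresholds.

    Assumption: the minimum and maximum bounds are inclusive.
    """
    return (
        metrics["ncount"] >= mincount
        and metrics["nfeature"] >= minfeature
        and metrics["pct_mito"] <= mito
    )


kept = [barcode for barcode, metrics in cells.items() if passes_qc(metrics)]
```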
- doubletfinder
This section includes four parameters for doublet removal by DoubletFinder.
| Parameter | Description |
|---|---|
| doubletfinder.skip | Skip doublet detection and removal. |
| doubletfinder.findpK | If true, estimate the neighbor size (pK) using mean-variance bimodality coefficients. If false, skip the estimation and use the preset pK value. |
| doubletfinder.numthreads | Number of threads. |
| doubletfinder.pK | A preset neighbor size (pK). Will be used if doubletfinder.findpK=false. |
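To show why the nreaction column of the sample file matters for doublet detection, here is a rough sketch of how an expected doublet count could scale with the number of reactions. The rate of about 0.8% doublets per 1,000 recovered cells is a commonly quoted rule of thumb for droplet-based libraries and is an assumption here. This document does not state the formula cellqc uses, and expected_doublets is a hypothetical helper.

```python
def expected_doublets(ncells, nreaction=1, rate_per_1000=0.008):
    """Estimate the number of doublets among ncells recovered cells.

    Assumptions (not taken from cellqc): the doublet rate grows linearly with
    the number of cells loaded per reaction, at rate_per_1000 per 1,000 cells,
    and cells are split evenly across reactions.
    """
    cells_per_reaction = ncells / nreaction
    doublet_rate = rate_per_1000 * cells_per_reaction / 1000
    return round(doublet_rate * ncells)


# The same 10,000 cells imply fewer expected doublets when they come from
# two reactions, because each reaction was loaded with fewer cells.
single_reaction = expected_doublets(10000, nreaction=1)
two_reactions = expected_doublets(10000, nreaction=2)
```

Under these assumptions, treating a two-reaction library as a single reaction would double the expected doublet count, which is the situation the nreaction column is meant to avoid.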
- scpred
A pre-trained classifier for cell-type annotation by scPred.
| Parameter | Description |
|---|---|
| scpred.skip | Skip the automated cell type prediction by scPred if true. This is useful for a sample without a pre-trained reference. |
| scpred.reference | The pre-trained reference classifier saved in an RDS file. See https://github.com/powellgenomicslab/scPred |
| scpred.threshold | Threshold for a positive prediction. |
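The role of scpred.threshold can be sketched as follows: a cell receives the label of its most probable class only when that probability clears the threshold. This is a simplified Python illustration rather than scPred's R implementation. The class names are invented, and both the "unassigned" fallback label and the inclusive comparison are assumptions to check against the scPred documentation.

```python
def call_cell_type(probabilities, threshold=0.9):
    """Pick the most probable class, or "unassigned" if it is below threshold.

    Assumption: a probability equal to the threshold counts as a positive call.
    """
    best_label = max(probabilities, key=probabilities.get)
    if probabilities[best_label] >= threshold:
        return best_label
    return "unassigned"


# Hypothetical per-class probabilities for two cells.
confident_cell = {"B cell": 0.95, "T cell": 0.05}
ambiguous_cell = {"B cell": 0.60, "T cell": 0.40}

labels = [call_cell_type(confident_cell), call_cell_type(ambiguous_cell)]
```

A higher threshold leaves more cells unassigned but reduces the chance of a wrong label.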
Result files
Three result files are generated under a result subdirectory. The result/*.h5seurat and result/*.h5ad files are the processed count matrices, annotated with QC metrics such as "pANN" for the proportion of artificial nearest neighbors and/or "scpred_prediction" for the predicted cell type. The report file result/report.html summarizes the QC metrics. A postproc subdirectory with postproc/*.h5ad files is also generated for basic post-processing. This includes adding a prefix to the cell barcodes, ensuring unique variable names, and cleaning the raw layer from the .h5ad file.
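Two of the post-processing steps mentioned above, prefixing cell barcodes and making variable names unique, can be sketched in plain Python. The separators and the suffix scheme are assumptions chosen for illustration. They may differ from what cellqc writes to the postproc files, and both helper functions are hypothetical.

```python
from collections import Counter


def prefix_barcodes(barcodes, sample_id, sep="_"):
    """Prepend the sample ID so barcodes stay distinct after merging samples."""
    return [f"{sample_id}{sep}{barcode}" for barcode in barcodes]


def make_unique(names, sep="-"):
    """Append a running suffix to repeated names; the first occurrence is kept.

    This simple version does not guard against a suffixed name colliding with
    a name that already exists in the input.
    """
    seen = Counter()
    unique = []
    for name in names:
        if seen[name]:
            unique.append(f"{name}{sep}{seen[name]}")
        else:
            unique.append(name)
        seen[name] += 1
    return unique


# Hypothetical barcodes and gene names.
barcodes = prefix_barcodes(["AAACCTG-1", "AAACGGG-1"], "sampleA")
genes = make_unique(["GeneA", "GeneB", "GeneA", "GeneA"])
```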