Toolkit for analysis and identification of cell types from heterogeneous single cell RNA-seq data

Digital Cell Sorter

Digital Cell Sorter (DCS): a single cell RNA-seq analysis toolkit for clustering, cell type identification, and anomaly detection.

Note: We are currently preparing a manuscript describing the toolkit located in this repository. If you want to access the package detailed in our latest publication on Polled Digital Cell Sorter, go to https://zenodo.org/record/2603265 and download the package (v1.1).

The latest publication describing the methodology of cell type identification: Polled Digital Cell Sorter (p-DCS): Automatic identification of hematological cell types from single cell RNA-sequencing clusters. Sergii Domanskyi, Anthony Szedlak, Nathaniel T Hawkins, Jiayin Wang, Giovanni Paternostro & Carlo Piermarocchi, BMC Bioinformatics volume 20, Article number: 369 (2019).

The documentation is available at https://digital-cell-sorter.readthedocs.io/.

Getting Started
Functionality
- Overall
- Visualization
Demo
- Usage
  - Main cell types
  - Cell sub-types
- Output

Getting Started

These instructions will get you a copy of the project up and running on your machine for data analysis, development or testing purposes.

Prerequisites

The code runs in a Python >= 3.7 environment.

It is highly recommended to install Anaconda. Installers are available at https://www.anaconda.com/distribution/

Our software uses the packages numpy, pandas, matplotlib, scikit-learn, scipy, mygene, fftw, fitsne, adjustText and a few other standard Python packages. Most of these are installed automatically when you install the latest release of DigitalCellSorter with pip:

pip install DigitalCellSorter

Alternatively, you can install this module directly from GitHub using:

pip install git+https://github.com/sdomanskyi/DigitalCellSorter

You can also create a local copy of this project for development purposes and install the package from the cloned directory:

git clone https://github.com/sdomanskyi/DigitalCellSorter
cd DigitalCellSorter
python setup.py install

Some of the packages used in DigitalCellSorter are not installed by default and should be installed separately if you use certain functionality. For example, network-based clustering requires the packages pynndescent, networkx and python-louvain. To use the UMAP layout install umap-learn, and for PHATE install phate.

To use the tSNE layout, the following packages need to be installed. First, add conda-forge to your channels and install fftw from it:

conda config --add channels conda-forge
conda install fftw

Then install FI-tSNE. On Linux:

pip install fitsne

On macOS Mojave:

env CC=clang CXX=clang++ pip install fitsne

On Windows, FI-tSNE is already included with DigitalCellSorter. Note that if none of fitsne, umap and phate is installed, DigitalCellSorter defaults to the first two principal components of PCA for the visualization layout.

To use the Sankey diagrams that are part of Digital Cell Sorter, install plotly and orca:

conda install -c plotly plotly-orca
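
Before running an analysis it can help to check which of the optional packages listed above are present. The following is a minimal sketch using only the standard library; it is not part of DigitalCellSorter. The helper names (missing_modules, report_optional_features) and the grouping of features are assumptions made for illustration. Note that the import names differ from some pip names: umap-learn is imported as umap and python-louvain as community. On Windows the fitsne entry may be reported as missing even though FI-tSNE ships with the package.

```python
import importlib.util

# Import names (not pip names) of the optional packages mentioned above.
OPTIONAL_MODULES = {
    "network-based clustering": ["pynndescent", "networkx", "community"],
    "UMAP layout": ["umap"],
    "PHATE layout": ["phate"],
    "tSNE layout": ["fitsne"],
    "Sankey diagrams": ["plotly"],
}


def missing_modules(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


def report_optional_features(features=OPTIONAL_MODULES):
    """Map each feature to the list of modules it still needs."""
    return {feature: missing_modules(names)
            for feature, names in features.items()}


if __name__ == "__main__":
    for feature, missing in report_optional_features().items():
        status = "ready" if not missing else "missing " + ", ".join(missing)
        print(f"{feature}: {status}")
```

An empty list for a feature means every module it needs was found; anything else names what still has to be installed.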

Loading the package

In your script import the package:

import DigitalCellSorter

Create an instance of the class DigitalCellSorter. Here, for simplicity, we use default parameter values:

DCS = DigitalCellSorter.DigitalCellSorter()

A number of parameters can be specified during initialization. For a detailed list, see the documentation. Many of these parameters are transferred to DCS attributes and can therefore be modified after initialization, e.g.:

DCS.toggleMakeStackedBarplot = False

Gene Expression Data Format

The input gene expression data is expected in one of the following formats:

1. Spreadsheet of comma-separated values (csv) containing a condensed matrix of the form ('cell', 'gene', 'expr'). If there are batches in the data, the matrix has to be of the form ('batch', 'cell', 'gene', 'expr'). The column order can be arbitrary.

Examples:

cell	gene	expr
C1	G1	3
C1	G2	2
C1	G3	1
C2	G1	1
C2	G4	5
...	...	...

or:

batch	cell	gene	expr
batch0	C1	G1	3
batch0	C1	G2	2
batch0	C1	G3	1
batch1	C2	G1	1
batch1	C2	G4	5
...	...	...	...

2. Spreadsheet of comma-separated values (csv) where rows are genes and columns are cells, with gene expression counts as values. If there are batches in the data, the first row of the spreadsheet should be 'batch' and the second 'cell'.

Examples:

cell	C1	C2	C3	C4
G1		3	1	7
G2	2	2		2
G3	3	1		5
G4	10		5	4
...	...	...	...	...

or:

batch	batch0	batch0	batch1	batch1
cell	C1	C2	C3	C4
G1		3	1	7
G2	2	2		2
G3	3	1		5
G4	10		5	4
...	...	...	...	...

3. Pandas DataFrame where axis 0 is genes and axis 1 is cells. If there are batches in the data, the index of axis 1 should have two levels, e.g. ('batch', 'cell'), with the first level indicating the patient, batch or experiment where that cell was sequenced, and the second level containing cell barcodes for identification.

Examples:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[2,np.nan],[3,8],[3,5],[np.nan,1]],
                  index=['G1','G2','G3','G4'],
                  columns=pd.MultiIndex.from_arrays([['batch0','batch1'],['C1','C2']], names=['batch', 'cell']))

4. Pandas Series where the index has two levels, e.g. ('cell', 'gene'). If there are batches in the data, the index should have three levels: the first indicating the patient, batch or experiment where that cell was sequenced, the second containing cell barcodes for identification, and the third containing gene names.

Examples:

import pandas as pd

se = pd.Series(data=[1,8,3,5,5],
               index=pd.MultiIndex.from_arrays([['batch0','batch0','batch1','batch1','batch1'],
                                                ['C1','C1','C1','C2','C2'],
                                                ['G1','G2','G3','G1','G4']], names=['batch', 'cell', 'gene']))
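
The DataFrame and Series layouts above carry the same information. As a minimal sketch using plain pandas (this is not the DigitalCellSorter API, and prepare() may handle the conversion differently), a Series indexed by ('batch', 'cell', 'gene') can be unstacked into the genes-by-cells DataFrame layout:

```python
import pandas as pd

# Series in the condensed layout: one value per (batch, cell, gene)
se = pd.Series(
    data=[1, 8, 3, 5, 5],
    index=pd.MultiIndex.from_arrays(
        [["batch0", "batch0", "batch1", "batch1", "batch1"],
         ["C1", "C1", "C1", "C2", "C2"],
         ["G1", "G2", "G3", "G1", "G4"]],
        names=["batch", "cell", "gene"]),
)

# Move the 'batch' and 'cell' levels to the columns; genes stay on the rows.
# Gene/cell combinations absent from the Series become NaN.
df = se.unstack(["batch", "cell"])
print(df)
```

Note that the same barcode C1 appears in both batches here, which is why the batch level is needed to tell the two cells apart.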

Data in any of the formats outlined above needs to be prepared and validated with the function prepare(). Let us demonstrate this on an input of type 1:

df_expr = DCS.prepare('data/testData/dataFileCondensedWithBatches.tsv')

Other Data

markersDCS.xlsx: An Excel workbook with marker data. Rows are markers and columns are cell types. '1' means that the gene is a marker for that cell type, '-1' means that the gene is not expressed in that cell type, and '0' otherwise. This gene marker file is included in the package and is used by default. If you use your own file, it has to be prepared in the same format (including the two-line header). Note that only the first worksheet will be read, and its name can be arbitrary. The first column should contain gene names. The second row should contain cell types, and the first row should indicate how those cell types are grouped. To skip a cell type, put "NA" in the first-row cell of that cell type's column.

Example:

A	B	C	D	E	F	G	H	I	J	K	L	M	...
	B cells	B cells	B cells	T cells	T cells	T cells	T cells	T cells	T cells	T cells	NK cells	NK cells	...
Marker	B cells naive	B cells memory	Plasma cells	T cells CD8	T cells CD4 naive	T cells CD4 memory resting	T cells CD4 memory activated	T cells follicular helper	T cells regulatory (Tregs)	T cells gamma delta	NK cells resting	NK cells activated	...
ABCB4	1	0	0	0	0	0	0	0	0	0	0	0	...
ABCB9	0	0	1	0	0	0	0	0	0	0	0	0	...
ACAP1	0	0	0	0	1	0	0	0	0	0	0	0	...
ACHE	0	0	0	0	0	0	0	0	0	0	0	0	...
ACP5	0	0	0	0	0	0	0	0	0	0	0	0	...
ADAM28	1	1	0	0	0	0	0	0	0	0	0	0	...
ADAMDEC1	0	0	0	0	0	0	0	0	0	0	0	0	...
ADAMTS3	0	0	0	0	0	0	0	0	0	0	0	0	...
ADRB2	0	0	0	0	0	0	0	0	0	0	0	0	...
AIF1	0	0	0	0	0	0	0	0	0	0	0	0	...
AIM2	0	1	0	0	0	0	0	0	0	0	0	0	...
ALOX15	0	0	0	0	0	0	0	0	0	0	0	0	...
ALOX5	0	1	0	0	0	0	0	0	0	0	0	0	...
AMPD1	0	0	1	0	0	0	0	0	0	0	0	0	...
ANGPT4	0	0	1	0	0	0	0	0	0	0	0	0	...
...	...	...	...	...	...	...	...	...	...	...	...	...	...
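
To make the two-line header concrete, here is a minimal sketch that reads a tiny stand-in for the worksheet and collects, for each cell type, the genes flagged '1' and '-1', skipping any column whose first-row entry is "NA". It uses tab-separated text and the standard library only; the sample values and the variable names are made up for illustration, and this is not how DigitalCellSorter itself reads the file.

```python
import csv
import io

# Tiny stand-in for the first worksheet, exported as tab-separated text.
# Row 1: cell type groups, row 2: cell types, remaining rows: one gene each.
sheet = (
    "\tB cells\tB cells\tNA\n"
    "Marker\tB cells naive\tPlasma cells\tSkipped type\n"
    "ABCB4\t1\t0\t0\n"
    "ABCB9\t-1\t1\t0\n"
    "ADAM28\t1\t0\t1\n"
)

rows = list(csv.reader(io.StringIO(sheet), delimiter="\t"))
groups, cell_types, marker_rows = rows[0][1:], rows[1][1:], rows[2:]

positive = {}  # cell type -> genes flagged with '1'
negative = {}  # cell type -> genes flagged with '-1'
for column, (group, cell_type) in enumerate(zip(groups, cell_types), start=1):
    if group == "NA":
        continue  # cell types marked "NA" in the first row are skipped
    positive[cell_type] = [row[0] for row in marker_rows if row[column] == "1"]
    negative[cell_type] = [row[0] for row in marker_rows if row[column] == "-1"]

print(positive)
print(negative)
```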

Human.MitoCarta2.0.csv: A csv spreadsheet with human mitochondrial genes, created as part of the work MitoCarta2.0: an updated inventory of mammalian mitochondrial proteins. Sarah E. Calvo, Karl R. Clauser, Vamsi K. Mootha, Nucleic Acids Research, Volume 44, Issue D1, 4 January 2016.

Functionality

Overall

The main class, DigitalCellSorter, includes tools for:

- Pre-processing
- Quality control
- Batch effects correction
- Cells anomaly score evaluation
- Dimensionality reduction
- Clustering
- Annotating cell types
- Visualization
- Post-processing

Visualization

Function visualize() will produce most of the necessary files for post-analysis of the data.

See examples of the visualization tools below.

The visualization tools include:

makeMarkerExpressionPlot(): a heatmap that shows all markers and their expression levels in the clusters; this figure also contains relative (%) and absolute (cell counts) cluster sizes

getIndividualGeneExpressionPlot(): 2D layout colored by an individual gene's expression

makeVotingResultsMatrixPlot(): z-scores of the voting results for each input cell type and each cluster; this figure also contains relative (%) and absolute (cell counts) cluster sizes

makeHistogramNullDistributionPlot(): null distribution for each cluster and each cell type illustrating the "machinery" of the Digital Cell Sorter

makeQualityControlHistogramPlot(): Quality control histogram plots

makeProjectionPlot(): 2D layout colored by the number of unique genes expressed, the number of counts measured, and the fraction of mitochondrial genes

Effect of batch correction demonstrated by combining BM1, BM2 and BM3 and processing the data jointly without (left) and with (right) the batch correction option:

makeStackedBarplot(): plot with fractions of various cell types

makeSankeyDiagram(): river plot to compare various results

getAnomalyScoresPlot(): plot with anomaly scores per cell

Calculate and plot anomaly scores for an arbitrary cell type or cluster:

getIndividualGeneTtestPlot(): heatmap of t-test p-values calculated gene-pair-wise from the annotated clusters

makePlotOfNewMarkers(): genes significantly expressed in the annotated cell types
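
The voting z-scores and null distributions mentioned in the list above can be illustrated with a generic sketch. It assumes a simplified scoring scheme that is not taken from the DigitalCellSorter source: the score of a cell type in a cluster is the number of its marker genes expressed there, and the null distribution comes from scoring randomly drawn gene sets of the same size. All names and data are hypothetical.

```python
import random
import statistics


def z_score_against_null(observed, null_scores):
    """Standardize an observed score by the mean and spread of a null sample."""
    mean = statistics.mean(null_scores)
    spread = statistics.pstdev(null_scores)
    return (observed - mean) / spread


rng = random.Random(0)

# Hypothetical cluster: 1 marks a gene that is expressed, 0 one that is not
expressed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical cell type whose markers are genes 0, 1 and 2
markers = [0, 1, 2]

observed = sum(expressed[i] for i in markers)

# Null distribution: scores of random gene sets of the same size
null_scores = [
    sum(expressed[i] for i in rng.sample(range(len(expressed)), len(markers)))
    for _ in range(1000)
]

z = z_score_against_null(observed, null_scores)
print(z)
```

A large positive z-score means the marker set is expressed in the cluster far more often than a random gene set of the same size would be.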

Demo

Usage

We have made an example execution file demo.py that shows how to use DigitalCellSorter.

In the demo, the folder data is intentionally left empty. The data file (cc95ff89-2e68-4a08-a234-480eca21ce79.homo_sapiens.mtx.zip) is about 2.4 GB in size and will be downloaded by the demo.py script.

Previously the HCA preview data was consolidated in the file ica_bone_marrow_h5.h5, downloadable from https://preview.data.humancellatlas.org/ (Raw Counts Matrix - Bone Marrow). That file was ~485 MB and contained 378000 cells from 8 bone marrow donors (BM1-BM8).

For details of the script demo.py, see: Example walkthrough of demo.py script

To execute the complete script demo.py run:

python demo.py

Note that the HCA BM1 data contains ~50000 sequenced cells and requires more than 60 GB of RAM to process (we recommend using a high-performance computer). To run our example on a regular PC or a laptop, use a randomly chosen subset of cells:

df_expr = df_expr.sample(n=5000, axis=1)
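
A random draw changes between runs unless it is seeded. The sketch below shows the same subsampling on a toy genes-by-cells matrix with pandas' random_state argument, which makes the selection repeatable. The toy data, the seed values and the name df_small are illustrative assumptions and are not taken from demo.py.

```python
import numpy as np
import pandas as pd

# Toy genes x cells matrix standing in for the prepared expression data
rng = np.random.default_rng(0)
cells = pd.MultiIndex.from_product(
    [["batch0", "batch1"], [f"C{i}" for i in range(50)]],
    names=["batch", "cell"])
genes = [f"G{i}" for i in range(20)]
df_expr = pd.DataFrame(rng.poisson(1.0, size=(20, 100)),
                       index=genes, columns=cells)

# Keep 10 randomly chosen cells (columns); random_state fixes the draw
df_small = df_expr.sample(n=10, axis=1, random_state=0)
print(df_small.shape)
```

Sampling along axis=1 keeps every gene and drops whole cells, so the batch labels of the retained cells are preserved.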

Output

All output files are saved in the output directory inside the directory that contains the demo.py script. If you specify another directory, the results will be generated in it. If you do not provide any directory, the results will appear in the directory where the script was executed.
