Single-Cell Analysis in Python.

Getting started | Features | Installation | References

Scanpy – Single-Cell Analysis in Python

Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing and simulation of gene regulatory networks. The Python-based implementation efficiently deals with data sets of more than one million cells and enables easy integration of advanced machine learning algorithms.

Getting started

With Python 3.5 or 3.6 installed, get releases from PyPI (more information on installation here):

pip install scanpy

To work with the latest version on GitHub, clone the repository (green button at the top of the page), cd into its root directory and type:

pip install --editable .

You can now run import scanpy.api as sc anywhere on your system and use the scanpy command on the command line.

Then go through the use cases compiled in scanpy_usage, in particular the recent additions:

17-05-05

We reproduce most of the Guided Clustering tutorial of Seurat [Satija15].

17-05-03

Analyzing 68 000 cells from [Zheng17], we find that Scanpy is about 5 to 16 times faster and more memory efficient than the Cell Ranger R kit for secondary analysis.

17-05-02

We reproduce the results of the Diffusion Pseudotime (DPT) paper of [Haghverdi16]. Note that DPT has recently been very favorably discussed by the authors of Monocle.

Features

Let us give an Overview of the top-level user functions, followed by a few words on Scanpy’s Basic Features and more details.

Overview

Scanpy user functions are grouped into the following modules:

sc.tools

Machine Learning and statistics tools. Abbreviation sc.tl.

sc.preprocessing

Preprocessing. Abbreviation sc.pp.

sc.plotting

Plotting. Abbreviation sc.pl.

sc.settings

Settings.

Preprocessing
pp.*

Filtering of highly-variable genes, batch-effect correction, per-cell (UMI) normalization, preprocessing recipes.

Visualizations
tl.pca

PCA [Pedregosa11].

tl.diffmap

Diffusion Maps [Coifman05] [Haghverdi15] [Wolf17].

tl.tsne

t-SNE [Maaten08] [Amir13] [Pedregosa11].

tl.draw_graph

Force-directed graph drawing [Csardi06] [Weinreb17].

Branching trajectories and pseudotime, clustering, differential expression
tl.dpt

Infer progression of cells, identify branching subgroups [Haghverdi16] [Wolf17].

tl.louvain

Cluster cells into subgroups [Blondel08] [Traag17].

tl.rank_genes_groups

Rank genes according to differential expression [Wolf17].

Simulations
tl.sim

Simulate dynamic gene expression data [Wittmann09] [Wolf17].

Basic Features

The typical workflow consists of subsequent calls of data analysis tools of the form:

sc.tl.tool(adata, **params)

where adata is an AnnData object and params is a dictionary of optional parameters. Each of these calls adds annotation to the expression matrix X, which stores n d-dimensional gene expression measurements (one row per cell, one column per gene). By default, Scanpy tools operate in place and return None. To leave adata unchanged and get an annotated copy back instead, pass the copy argument:

adata_copy = sc.tl.tool(adata, copy=True, **params)
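
The in-place versus copy convention can be illustrated without Scanpy itself. The following is a minimal sketch under stated assumptions: ToyAnnData, toy_tool and the scale parameter are hypothetical names invented for this illustration and are not part of Scanpy.

```python
from copy import deepcopy


class ToyAnnData:
    """Hypothetical stand-in for AnnData: a matrix plus a dict of annotations."""

    def __init__(self, X):
        self.X = X        # list of rows, one row per cell
        self.add = {}     # unstructured annotation written by tools


def toy_tool(adata, copy=False, **params):
    """Toy tool following the convention described above (not a Scanpy function)."""
    target = deepcopy(adata) if copy else adata
    scale = params.get("scale", 1.0)  # hypothetical optional parameter
    # "Annotate" the object: store the scaled per-cell sums.
    target.add["row_sums"] = [scale * sum(row) for row in target.X]
    # In-place calls return None; copy=True returns the annotated copy.
    return target if copy else None


adata = ToyAnnData([[1.0, 2.0], [3.0, 4.0]])
result = toy_tool(adata)                             # annotates adata in place
adata_copy = toy_tool(adata, copy=True, scale=2.0)   # leaves adata untouched
```

Returning None from the in-place call makes it hard to confuse the two modes by accident: a caller who wants a new object has to ask for it explicitly.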

Reading and writing data files and AnnData objects

One usually calls:

adata = sc.read(filename)

to initialize an AnnData object, possibly adds further annotation using, e.g., np.genfromtxt or pd.read_csv:

annotation = pd.read_csv(filename_annotation)
adata.smp['cell_groups'] = annotation['cell_groups']  # categorical annotation of type str or int
adata.smp['time'] = annotation['time']                # numerical annotation of type float

and uses:

sc.write(filename, adata)

to save the adata to file. Reading supports filenames with the extensions h5, xlsx, mtx, txt, csv and others. Writing supports h5, csv and txt. Instead of a filename, you can provide a filekey, i.e., any string that does not end in a valid file extension.
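
The distinction between a filename and a filekey can be sketched as follows. This is only an illustration: is_filekey and KNOWN_EXTENSIONS are hypothetical names, the extension set contains just the extensions named above (the original list ends with "and others"), and how Scanpy maps a filekey to an actual path is not covered here.

```python
import os

# Extensions named in the text above; deliberately incomplete.
KNOWN_EXTENSIONS = {"h5", "xlsx", "mtx", "txt", "csv"}


def is_filekey(name, known_extensions=KNOWN_EXTENSIONS):
    """Return True if `name` does not end in one of the known file extensions."""
    ext = os.path.splitext(name)[1].lstrip(".").lower()
    return ext not in known_extensions


print(is_filekey("data/matrix.mtx"))   # False: ends in a known extension
print(is_filekey("my_experiment"))     # True: no extension, treated as a filekey
```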

AnnData objects

An AnnData instance stores an array-like data matrix as adata.X, dict-like sample annotation as adata.smp, dict-like variable annotation as adata.var and additional unstructured dict-like annotation as adata.add. While adata.add is a conventional dictionary, adata.smp and adata.var are instances of a low-level Pandas dataframe-like class.

Values can be retrieved and appended via adata.smp[key] and adata.var[key]. Sample and variable names can be accessed via adata.smp_names and adata.var_names, respectively. AnnData objects can be sliced like Pandas dataframes, for example, adata = adata[:, list_of_gene_names]. The AnnData class is similar to R’s ExpressionSet [Huber15]; the latter, though, is not implemented for sparse data.
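
To make the layout concrete, here is a simplified toy container with the same four parts (X, smp, var, add) and a variable selection that keeps the annotation aligned with the matrix. It is a sketch only: ToyAnnotatedMatrix and select_vars are hypothetical names, plain lists and dicts replace the real array and dataframe-like types, and select_vars stands in for the slicing syntax adata[:, list_of_gene_names].

```python
class ToyAnnotatedMatrix:
    """Hypothetical, simplified stand-in for AnnData (not the real class)."""

    def __init__(self, X, smp_names, var_names):
        self.X = X                        # n samples x d variables, list of rows
        self.smp_names = list(smp_names)
        self.var_names = list(var_names)
        self.smp = {}                     # per-sample annotation: key -> n values
        self.var = {}                     # per-variable annotation: key -> d values
        self.add = {}                     # unstructured annotation

    def select_vars(self, names):
        """Return a new object restricted to the given variables, in that order."""
        idx = [self.var_names.index(name) for name in names]
        sub = ToyAnnotatedMatrix(
            [[row[j] for j in idx] for row in self.X],
            self.smp_names,
            [self.var_names[j] for j in idx],
        )
        # Sample annotation is unchanged; variable annotation follows the columns.
        sub.smp = {key: list(values) for key, values in self.smp.items()}
        sub.var = {key: [values[j] for j in idx] for key, values in self.var.items()}
        sub.add = dict(self.add)
        return sub


adata = ToyAnnotatedMatrix(
    X=[[1, 0, 5], [2, 3, 0]],
    smp_names=["cell_a", "cell_b"],
    var_names=["gene_1", "gene_2", "gene_3"],
)
adata.smp["time"] = [0.0, 1.5]
adata.var["is_marker"] = [True, False, True]

subset = adata.select_vars(["gene_3", "gene_1"])
```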

Plotting

For each tool, there is an associated plotting function:

sc.pl.tool(adata)

that retrieves and plots the elements of adata that were previously written by sc.tl.tool(adata). Scanpy’s plotting module can be viewed similarly to Seaborn: an extension of matplotlib that allows visualizing operations on AnnData objects with one-line commands. Detailed configuration has to be done via matplotlib functions, which is easy as Scanpy’s plotting functions accept and return a matplotlib Axes object.

Visualization

pca

[source] Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn [Pedregosa11].

tsne

[source] t-distributed stochastic neighbor embedding (t-SNE) [Maaten08] has been proposed for single-cell data by [Amir13]. By default, Scanpy uses the implementation of scikit-learn [Pedregosa11]. You can achieve a huge speedup if you install Multicore-tSNE by [Ulyanov16], which will be automatically detected by Scanpy.

diffmap

[source] Diffusion maps [Coifman05] have been proposed for visualizing single-cell data by [Haghverdi15]. The tool uses the adapted Gaussian kernel suggested by [Haghverdi16]. Uses the implementation of [Wolf17].

draw_graph

[source] Force-directed graph drawing describes a class of long-established algorithms for visualizing graphs. It has been suggested for visualizing single-cell data by [Weinreb17]. Here, by default, the Fruchterman & Reingold [Fruchterman91] algorithm is used; many other layouts are available. Uses the igraph implementation [Csardi06].

Discrete clustering of subgroups, continuous progression through subgroups, differential expression

dpt

[source] Reconstruct the progression of a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis has been introduced by [Haghverdi16]. Here, we use a further developed version, which is able to detect multiple branching events [Wolf17].

The possibilities of diffmap and dpt are similar to those of the R package destiny of [Angerer16]. The Scanpy tools, though, run faster and scale to much larger numbers of cells.

Examples: See this use case.

louvain

[source] Cluster cells using the Louvain algorithm [Blondel08] in the implementation of [Traag17]. The Louvain algorithm has been proposed for single-cell analysis by [Levine15].

Examples: See this use case.
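
The Louvain algorithm of [Blondel08] greedily optimizes the modularity of a partition. The sketch below only evaluates that objective for a given partition of a small undirected, unweighted graph; it does not perform the optimization and is not the implementation used by Scanpy. The function name modularity and the example graph are chosen for illustration.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = {}   # community -> number of edges with both ends inside it
    degree = {}     # community -> summed degree of its nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Q = sum over communities of (internal edges / m) - (degree / 2m)^2
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}

q_two = modularity(edges, two_groups)   # 5/14, about 0.357
q_one = modularity(edges, one_group)    # 0.0
```

Splitting the graph into its two triangles scores higher than lumping all nodes together, which is the kind of partition a modularity-based method favors.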

rank_genes_groups

[source] Rank genes by differential expression.

Examples: See this use case.
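
The text above does not state which statistic is used for the ranking. As an illustration only, the following sketch assumes a Welch-type t-score comparing the cells of one group against all remaining cells; rank_genes_by_score and the example data are hypothetical and do not reproduce Scanpy's implementation.

```python
from math import sqrt
from statistics import mean, variance


def rank_genes_by_score(expr, in_group):
    """Rank genes by a Welch-type t-score of group cells versus all other cells.

    expr: dict mapping gene name -> list of per-cell values.
    in_group: list of booleans, True for cells in the group of interest.
    Each side of the comparison needs at least two cells.
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, g in zip(values, in_group) if g]
        b = [v for v, g in zip(values, in_group) if not g]
        se = sqrt(variance(a) / len(a) + variance(b) / len(b))
        scores[gene] = (mean(a) - mean(b)) / se if se > 0 else 0.0
    # Highest score first: genes most strongly up-regulated in the group.
    return sorted(scores, key=scores.get, reverse=True)


expr = {
    "gene_flat": [2.0, 2.5, 1.5, 2.0, 2.5, 1.5],
    "gene_down": [0.5, 1.0, 0.0, 4.0, 5.0, 4.5],
    "gene_up": [5.0, 6.0, 5.5, 1.0, 1.5, 0.5],
}
in_group = [True, True, True, False, False, False]

ranking = rank_genes_by_score(expr, in_group)
```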

Simulation

sim

[source] Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by [Wittmann09]. The Scanpy implementation is due to [Wolf17].

The tool is similar to the Matlab tool Odefy of [Krumsiek10].

Examples: See this use case.
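
The idea of sampling from a stochastic differential equation derived from boolean rules can be sketched with a toy example. Everything below is an assumption made for illustration: the two-gene network (A = NOT B, B = NOT A), the Hill-type function used as a continuous replacement for NOT, the parameter values, and the names hill_repression and simulate_toggle. It uses plain Euler-Maruyama integration and is not the model or code of Scanpy's sim tool.

```python
import math
import random


def hill_repression(x, threshold=0.5, exponent=4):
    """Continuous version of boolean NOT: near 1 for small x, near 0 for large x."""
    return 1.0 / (1.0 + (x / threshold) ** exponent)


def simulate_toggle(n_steps=2000, dt=0.01, noise=0.05, seed=0):
    """Euler-Maruyama integration of a toy two-gene mutual-inhibition network."""
    rng = random.Random(seed)
    a, b = 0.6, 0.4          # initial expression levels
    path = [(a, b)]
    for _ in range(n_steps):
        # Drift: relax towards the continuous version of the boolean rule.
        da = (hill_repression(b) - a) * dt
        db = (hill_repression(a) - b) * dt
        # Diffusion: Gaussian increments with standard deviation noise * sqrt(dt).
        a += da + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        b += db + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Keep expression levels non-negative.
        a, b = max(a, 0.0), max(b, 0.0)
        path.append((a, b))
    return path


path = simulate_toggle()
deterministic_path = simulate_toggle(noise=0.0)
```

With noise set to zero the trajectory settles into the state where gene A is high and gene B is low, matching one of the two stable states of the boolean rules; with noise, each call with a fixed seed yields one reproducible sample path.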

Installation

If you use Windows or Mac OS X and do not have Python 3.5 or 3.6, download and install Miniconda (see below). If you use Linux, use your package manager to obtain a current Python distribution.

Get releases on PyPI via:

pip install scanpy

To work with the latest version on GitHub, clone the repository (green button at the top of the page) and cd into its root directory. To install with symbolic links, so that the installation stays up to date with your clone whenever you run git pull, call:

pip install --editable .

You can now run import scanpy.api as sc anywhere on your system and use the scanpy command on the command line.

Installing Miniconda

After downloading Miniconda, run the following in a Unix shell (Linux, Mac):

cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh

and accept all suggestions. Then either open a new terminal or run source ~/.bashrc on Linux or source ~/.bash_profile on Mac. The whole process takes only a couple of minutes.

Troubleshooting

If you have both python 2 and python 3 installed:

pip3 install scanpy

If you do not have sudo rights (you get a Permission denied error):

pip install --user scanpy

On macOS, you probably need to install the C core of igraph via Homebrew first:

  • brew install igraph

  • If python-igraph still fails to install, see here or consider installing gcc via brew install gcc --without-multilib and then setting export CC="/usr/local/Cellar/gcc/X.x.x/bin/gcc-X"; export CXX="/usr/local/Cellar/gcc/X.x.x/bin/gcc-X", where X and x refer to the version of gcc; in my case, the path reads /usr/local/Cellar/gcc/6.3.0_1/bin/gcc-6.

References

[Amir13]

Amir et al. (2013), viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia, Nature Biotechnology.

[Angerer16]

Angerer et al. (2016), destiny – diffusion maps for large-scale single-cell data in R, Bioinformatics.

[Blondel08]

Blondel et al. (2008), Fast unfolding of communities in large networks, J. Stat. Mech.

[Coifman05]

Coifman et al. (2005), Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS.

[Csardi06]

Csardi et al. (2006), The igraph software package for complex network research, InterJournal Complex Systems.

[Ester96]

Ester et al. (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR.

[Fruchterman91]

Fruchterman & Reingold (1991), Graph drawing by force-directed placement, Software: Practice & Experience.

[Hagberg08]

Hagberg et al. (2008), Exploring Network Structure, Dynamics, and Function using NetworkX, Scipy Conference.

[Haghverdi15]

Haghverdi et al. (2015), Diffusion maps for high-dimensional single-cell analysis of differentiation data, Bioinformatics.

[Haghverdi16]

Haghverdi et al. (2016), Diffusion pseudotime robustly reconstructs branching cellular lineages, Nature Methods.

[Huber15]

Huber et al. (2015), Orchestrating high-throughput genomic analysis with Bioconductor, Nature Methods.

[Krumsiek10]

Krumsiek et al. (2010), Odefy – From discrete to continuous models, BMC Bioinformatics.

[Krumsiek11]

Krumsiek et al. (2011), Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, PLoS ONE.

[Levine15]

Levine et al. (2015), Data-Driven Phenotypic Dissection of AML Reveals Progenitor–like Cells that Correlate with Prognosis, Cell.

[Maaten08]

Maaten & Hinton (2008), Visualizing data using t-SNE, JMLR.

[Satija15]

Satija et al. (2015), Spatial reconstruction of single-cell gene expression data, Nature Biotechnology.

[Moignard15]

Moignard et al. (2015), Decoding the regulatory network of early blood development from single-cell gene expression measurements, Nature Biotechnology.

[Pedregosa11]

Pedregosa et al. (2011), Scikit-learn: Machine Learning in Python, JMLR.

[Paul15]

Paul et al. (2015), Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors, Cell.

[Traag17]

Traag (2017), Louvain, GitHub.

[Ulyanov16]

Ulyanov (2016), Multicore t-SNE, GitHub.

[Weinreb17]

Weinreb et al. (2016), SPRING: a kinetic interface for visualizing high dimensional single-cell expression data, bioRxiv.

[Wittmann09]

Wittmann et al. (2009), Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling, BMC Systems Biology.

[Wolf17]

Wolf et al. (2017), TBD.

[Zheng17]

Zheng et al. (2017), Massively parallel digital transcriptional profiling of single cells, Nature Communications.
