Single-Cell Analysis in Python.

Getting started | Examples | Docs | Installation | References

Scanpy – Single-Cell Analysis in Python

Scanpy is a scalable toolkit for analyzing single-cell gene expression data. It includes preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing and simulation of gene regulatory networks. The Python-based implementation efficiently deals with data sets of more than one million cells and enables easy integration of advanced machine learning algorithms.

For conceptual ideas and context, see our draft; comments are highly appreciated.

Getting started

With Python 3.5 or 3.6 installed, get releases from PyPI (see Installation for more information):

pip3 install scanpy

To work with the latest version on GitHub, clone the repository (green button at the top of the page), cd into its root directory and type:

pip3 install -e .

You can now import scanpy.api as sc anywhere on your system and work with the command scanpy on the command-line.


Examples

Examples are collected in the repo scanpy_usage. Good starting points are the following use cases:

We reproduce most of the Guided Clustering tutorial of Seurat [Satija15].
Analyzing 68,000 cells from [Zheng17], we find that Scanpy is about 5 to 16 times faster and more memory efficient than the Cell Ranger R kit for secondary analysis.
We reproduce the results of the Diffusion Pseudotime (DPT) paper of [Haghverdi16]. Note that DPT has recently been very favorably discussed by the authors of Monocle.


Docs

Here, we give an Overview of the top-level user functions, describe Basic Features and the context of the Tools. For detailed help on the functions, use Python’s help. A separate docs page will soon be established.


Overview

Scanpy user functions are grouped into the following modules:

Machine Learning and statistics tools. Abbreviation sc.tl.
Preprocessing. Abbreviation sc.pp.
Plotting. Abbreviation sc.pl.

Preprocessing

Filtering of highly variable genes, batch-effect correction, per-cell (UMI) normalization, preprocessing recipes.

Branching trajectories and pseudotime, clustering, differential expression

Infer progression of cells, identify branching subgroups [Haghverdi16] [Wolf17].
Cluster cells into subgroups [Blondel08] [Levine15] [Traag17].
Rank genes according to differential expression [Wolf17].

Simulation

Simulate dynamic gene expression data [Wittmann09] [Wolf17].
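
The per-cell (UMI) normalization listed under preprocessing can be pictured with a small, library-free sketch. It is not Scanpy's implementation: the function name normalize_per_cell, the target_sum parameter and the choice of the median total as default are assumptions made for illustration only.

```python
import statistics


def normalize_per_cell(counts, target_sum=None):
    """Scale each cell (row) so that its counts sum to target_sum.

    counts: list of rows, one row of per-gene counts per cell.
    target_sum: hypothetical parameter; defaults to the median of the
    non-zero per-cell totals. Cells with a zero total are left unchanged.
    """
    totals = [sum(row) for row in counts]
    positive = [t for t in totals if t > 0]
    if not positive:
        return [list(row) for row in counts]
    if target_sum is None:
        target_sum = statistics.median(positive)
    normalized = []
    for row, total in zip(counts, totals):
        if total == 0:
            normalized.append(list(row))
        else:
            normalized.append([x * target_sum / total for x in row])
    return normalized


# Two cells with the same composition but a tenfold difference in depth,
# plus one empty cell.
counts = [[1, 3], [10, 30], [0, 0]]
normalized = normalize_per_cell(counts, target_sum=4)
```

After normalization, the first two cells have identical profiles, so differences in sequencing depth no longer dominate downstream comparisons.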

Basic Features

The typical workflow consists of subsequent calls of data analysis tools of the form:

sc.tl.tool(adata, **params)

where adata is an AnnData object and params are optional parameters. Each of these calls adds annotation to an expression matrix X, which stores n d-dimensional gene expression measurements. To facilitate writing memory-efficient pipelines, Scanpy tools by default operate in place on adata and return None. If you want to copy the AnnData object, pass the copy argument:

adata_copy = sc.tl.tool(adata, copy=True, **params)
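
The in-place convention can be mimicked without Scanpy. The following is a minimal sketch in which a plain dictionary stands in for the AnnData object; toy_tool, the "add" key layout and the value parameter are hypothetical and only illustrate the calling pattern.

```python
from copy import deepcopy


def toy_tool(adata, copy=False, **params):
    """Hypothetical stand-in for a tool: annotate in place or on a copy."""
    target = deepcopy(adata) if copy else adata
    # Store the "result" as unstructured annotation.
    target["add"]["toy_result"] = params.get("value", 1)
    # Mirror the convention described above: None when working in place.
    return target if copy else None


adata = {"X": [[1, 2], [3, 4]], "add": {}}
in_place_result = toy_tool(adata)                 # modifies adata, returns None
adata_copy = toy_tool(adata, copy=True, value=5)  # adata itself is left untouched
```

Returning None for in-place calls avoids holding two copies of a large expression matrix in memory unless the caller explicitly asks for one.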

Reading and writing data files and AnnData objects

One usually calls:

adata = sc.read(filename)

to initialize an AnnData object, possibly adds further annotation using, e.g., np.genfromtxt or pd.read_csv:

annotation = pd.read_csv(filename_annotation)
adata.smp['cell_groups'] = annotation['cell_groups']  # categorical annotation of type str or int
adata.smp['time'] = annotation['time']                # numerical annotation of type float

and uses:

sc.write(filename, adata)

to save adata as a collection of data arrays to a file in a platform- and language-independent way. Reading supports filenames with the extensions h5, xlsx, mtx, txt, csv and others. Writing supports h5, csv and txt. Instead of providing a filename, you can provide a filekey, i.e., any string that does not end in a valid file extension.
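
The distinction between a filename and a filekey can be sketched as a simple extension check. This is an assumption about how such a check might look, not Scanpy's actual logic; is_filename and KNOWN_EXTENSIONS are hypothetical names, and the set only contains the extensions named in the text.

```python
import os

# Extensions named in the text; the real reader is said to support others too.
KNOWN_EXTENSIONS = {"h5", "xlsx", "mtx", "txt", "csv"}


def is_filename(name):
    """Return True if name ends in a known extension, else treat it as a filekey."""
    extension = os.path.splitext(name)[1].lstrip(".").lower()
    return extension in KNOWN_EXTENSIONS


for name in ["data/pbmc3k.h5", "counts.csv", "paul15", "run.v2"]:
    kind = "filename" if is_filename(name) else "filekey"
    print(name, "->", kind)
```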

AnnData objects

An AnnData instance stores an array-like data matrix as adata.X, dict-like sample annotation as adata.smp, dict-like variable annotation as adata.var and additional unstructured dict-like annotation as adata.add. While adata.add is a conventional dictionary, adata.smp and adata.var are instances of a low-level Pandas dataframe-like class.

Values can be retrieved and appended via adata.smp[key] and adata.var[key]. Sample and variable names can be accessed via adata.smp_names and adata.var_names, respectively. AnnData objects can be sliced like Pandas dataframes, for example, adata = adata[:, list_of_gene_names]. The AnnData class is similar to R’s ExpressionSet [Huber15]; the latter, though, is not implemented for sparse data.
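
To make the layout concrete, here is a toy container with the same four parts (X, smp, var, add) and a variable-wise selection. It is a simplified illustration, not the AnnData class: ToyAnnData and select_vars are hypothetical names, plain lists and dicts replace the array and dataframe-like types, and the selection method stands in for the slicing syntax.

```python
class ToyAnnData:
    """Toy stand-in: data matrix plus sample, variable and unstructured annotation."""

    def __init__(self, X, smp_names, var_names):
        self.X = [list(row) for row in X]   # samples (cells) x variables (genes)
        self.smp_names = list(smp_names)
        self.var_names = list(var_names)
        self.smp = {}                       # one value per sample for each key
        self.var = {}                       # one value per variable for each key
        self.add = {}                       # unstructured annotation

    def select_vars(self, names):
        """Return a new object restricted to the given variable names."""
        idx = [self.var_names.index(name) for name in names]
        sub = ToyAnnData(
            [[row[j] for j in idx] for row in self.X],
            self.smp_names,
            list(names),
        )
        sub.smp = {key: list(values) for key, values in self.smp.items()}
        sub.var = {key: [values[j] for j in idx] for key, values in self.var.items()}
        sub.add = dict(self.add)
        return sub


toy = ToyAnnData([[1, 2, 3], [4, 5, 6]], ["cell1", "cell2"], ["g1", "g2", "g3"])
toy.smp["time"] = [0.0, 1.5]
toy.var["highly_variable"] = [True, False, True]
subset = toy.select_vars(["g3", "g1"])
```

Selecting variables subsets the matrix columns and the variable annotation together, while the sample annotation is carried over unchanged.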


Plotting

For each tool, there is an associated plotting function:

sc.pl.tool(adata)

that retrieves and plots annotation in adata that has been added by sc.tl.tool(adata). Scanpy’s plotting module can be viewed as similar to Seaborn: an extension of matplotlib that allows visualizing operations on AnnData objects with one-line commands. Detailed configuration has to be done via matplotlib functions, which is easy as Scanpy’s plotting functions accept and return a Matplotlib Axes object.



Computes PCA coordinates, loadings and variance decomposition. Uses the implementation of scikit-learn [Pedregosa11].


t-distributed stochastic neighbor embedding (tSNE) [Maaten08] has been proposed for single-cell data by [Amir13]. By default, Scanpy uses the implementation of scikit-learn [Pedregosa11]. You can achieve a substantial speedup if you install Multicore-tSNE by [Ulyanov16], which will be automatically detected by Scanpy.


Diffusion maps [Coifman05] have been proposed for visualizing single-cell data by [Haghverdi15]. The tool uses the adapted Gaussian kernel suggested by [Haghverdi16]. Uses the implementation of [Wolf17].


Force-directed graph drawing describes a class of long-established algorithms for visualizing graphs. It has been suggested for visualizing single-cell data by [Weinreb17]. Here, by default, the Fruchterman & Reingold [Fruchterman91] algorithm is used; many other layouts are available. Uses the igraph implementation [Csardi06].

Discrete clustering of subgroups, continuous progression through subgroups, differential expression


Reconstruct the progression of a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis has been introduced by [Haghverdi16]. Here, we use a further developed version, which is able to detect multiple branching events [Wolf17].

The possibilities of diffmap and dpt are similar to those of the R package destiny of [Angerer16]. The Scanpy tools, though, run faster and scale to much higher cell numbers.

Examples: See this use case.


Cluster cells using the Louvain algorithm [Blondel08] in the implementation of [Traag17]. The Louvain algorithm has been proposed for single-cell analysis by [Levine15].

Examples: See this use case.
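
The Louvain algorithm searches for a partition of a graph that maximizes modularity. The sketch below only evaluates that quantity for a given partition of a small undirected, unweighted graph; it does not perform the optimization and is not the implementation of [Traag17]. The function name and the toy graph are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs without self-loops or duplicates.
    membership: dict mapping each node to its community label.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # summed node degrees per community
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the graph into its two triangles scores higher than keeping all nodes in one community, which is the kind of partition a modularity-based method favors.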


Rank genes by differential expression.

Examples: See this use case.
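
Ranking genes by differential expression can be illustrated with a generic score. The text does not say which statistic Scanpy uses, so the sketch below assumes a Welch-type t-score between one group of cells and the rest; rank_genes, its arguments and the toy data are hypothetical.

```python
import math
import statistics


def rank_genes(X, gene_names, in_group, eps=1e-9):
    """Rank genes by a Welch-type t-score of group versus rest (highest first).

    X: list of rows (cells) with one value per gene.
    in_group: list of booleans, True for cells in the group of interest.
    Each side needs at least two cells so that a variance can be computed.
    """
    scores = []
    for j, name in enumerate(gene_names):
        group = [row[j] for row, flag in zip(X, in_group) if flag]
        rest = [row[j] for row, flag in zip(X, in_group) if not flag]
        diff = statistics.mean(group) - statistics.mean(rest)
        spread = math.sqrt(
            statistics.variance(group) / len(group)
            + statistics.variance(rest) / len(rest)
        )
        scores.append((name, diff / (spread + eps)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


X = [
    [10, 1, 5], [12, 2, 4], [11, 1, 6],  # cells in the group
    [1, 1, 5], [2, 2, 4], [1, 2, 6],     # remaining cells
]
in_group = [True, True, True, False, False, False]
ranking = rank_genes(X, ["A", "B", "C"], in_group)
```

Gene A, which is much higher in the group, ends up at the top of the ranking, while gene B, which is slightly lower in the group, ends up at the bottom.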



Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by [Wittmann09]. The Scanpy implementation is due to [Wolf17].

The tool is similar to the Matlab tool Odefy of [Krumsiek10].

Examples: See this use case.


Installation

If you use Windows or Mac OS X and do not have Python 3.5 or 3.6, download and install Miniconda (see below). If you use Linux, use your package manager to obtain a current Python distribution.

Get releases on PyPI via:

pip3 install scanpy

To work with the latest version on GitHub, clone the repository (green button at the top of the page) and cd into its root directory. To install with symbolic links, so that the installation stays up to date with your cloned version after you update with git pull, call:

pip3 install -e .

You can now import scanpy.api as sc anywhere on your system and work with the command scanpy on the command-line.

Installing Miniconda

After downloading Miniconda, in a unix shell (Linux, Mac), run

chmod +x

and accept all suggestions. Then either open a new terminal or run source ~/.bashrc on Linux or source ~/.bash_profile on Mac. The whole process takes just a couple of minutes.

Troubleshooting

If you do not have sudo rights (you get a Permission denied error):

pip install --user scanpy

On macOS, you probably need to install the C core of igraph via Homebrew first:

  • brew install igraph
  • If python-igraph still fails to install, see here or consider installing gcc via brew install gcc --without-multilib and exporting export CC="/usr/local/Cellar/gcc/X.x.x/bin/gcc-X"; export CXX="/usr/local/Cellar/gcc/X.x.x/bin/gcc-X", where X and x refer to the version of gcc; in my case, the path reads /usr/local/Cellar/gcc/6.3.0_1/bin/gcc-6.


References

[Amir13] Amir et al. (2013), viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia, Nature Biotechnology.
[Angerer16] Angerer et al. (2016), destiny – diffusion maps for large-scale single-cell data in R, Bioinformatics.
[Blondel08] Blondel et al. (2008), Fast unfolding of communities in large networks, J. Stat. Mech.
[Coifman05] Coifman et al. (2005), Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS.
[Csardi06] Csardi et al. (2006), The igraph software package for complex network research, InterJournal Complex Systems.
[Ester96] Ester et al. (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR.
[Fruchterman91] Fruchterman & Reingold (1991), Graph drawing by force-directed placement, Software: Practice & Experience.
[Hagberg08] Hagberg et al. (2008), Exploring Network Structure, Dynamics, and Function using NetworkX, Scipy Conference.
[Haghverdi15] Haghverdi et al. (2015), Diffusion maps for high-dimensional single-cell analysis of differentiation data, Bioinformatics.
[Haghverdi16] Haghverdi et al. (2016), Diffusion pseudotime robustly reconstructs branching cellular lineages, Nature Methods.
[Huber15] Huber et al. (2015), Orchestrating high-throughput genomic analysis with Bioconductor, Nature Methods.
[Krumsiek10] Krumsiek et al. (2010), Odefy – From discrete to continuous models, BMC Bioinformatics.
[Krumsiek11] Krumsiek et al. (2011), Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, PLoS ONE.
[Levine15] Levine et al. (2015), Data-Driven Phenotypic Dissection of AML Reveals Progenitor–like Cells that Correlate with Prognosis, Cell.
[Maaten08] Maaten & Hinton (2008), Visualizing data using t-SNE, JMLR.
[Satija15] Satija et al. (2015), Spatial reconstruction of single-cell gene expression data, Nature Biotechnology.
[Moignard15] Moignard et al. (2015), Decoding the regulatory network of early blood development from single-cell gene expression measurements, Nature Biotechnology.
[Pedregosa11] Pedregosa et al. (2011), Scikit-learn: Machine Learning in Python, JMLR.
[Paul15] Paul et al. (2015), Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors, Cell.
[Traag17] Traag (2017), Louvain, GitHub.
[Ulyanov16] Ulyanov (2016), Multicore t-SNE, GitHub.
[Weinreb17] Weinreb et al. (2016), SPRING: a kinetic interface for visualizing high dimensional single-cell expression data, bioRxiv.
[Wittmann09] Wittmann et al. (2009), Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling, BMC Systems Biology.
[Wolf17] Wolf et al. (2017), TBD.
[Zheng17] Zheng et al. (2017), Massively parallel digital transcriptional profiling of single cells, Nature Communications.
