pyscenic

Python implementation of the SCENIC pipeline for transcription factor inference from single-cell transcriptomics experiments.

pySCENIC is a lightning-fast Python implementation of the SCENIC pipeline (Single-CEll regulatory Network Inference and Clustering), which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.

The pioneering work was done in R and results were published in Nature Methods [1].

pySCENIC can be run on a single desktop machine but easily scales to multi-core clusters to analyze thousands of cells in no time. The latter is achieved via the dask framework for distributed computing [2].

The pipeline has three steps:

First, transcription factors (TFs) and their target genes, i.e. targetomes, are derived using gene inference methods that rely solely on correlations between the expression of genes across cells. The arboretum package is used for this step.
These targetomes are then refined by pruning targets that do not have an enrichment for a corresponding motif of the TF, effectively separating direct from indirect targets based on the presence of cis-regulatory footprints.
Finally, the original cells are differentiated and clustered on the activity of these discovered targetomes.

Features

All the functionality of the original R implementation is available and in addition:

You can leverage multi-core and multi-node clusters using dask and its distributed scheduler.
We implemented a version of the recovery of input genes that takes into account weights associated with these genes.
Regulomes with targets that are repressed are now also derived and used for cell enrichment analysis.

Installation

The package itself can be installed via pip install pyscenic.

You can also install this package directly from the source:

git clone https://github.com/aertslab/pySCENIC.git
cd pySCENIC/
pip install .

To successfully use this pipeline you also need auxiliary datasets:

Databases ranking the whole genome of your species of interest based on regulatory features (i.e. transcription factors). Ranking databases are typically stored in the feather format.
Motif annotation database providing the missing link between an enriched motif and the transcription factor that binds this motif. This pipeline needs a TSV text file where every line represents a particular annotation.

To acquire these datasets please contact LCB.
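A motif annotation table of this kind can be reduced to a lookup from motif to transcription factors. The sketch below only illustrates the idea: the column names `motif_id` and `gene_name`, the sample rows and the helper `load_motif_annotations` are hypothetical, and the layout of the real annotation file may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical annotation table: one line per (motif, TF) annotation.
# Column names and rows are made up for illustration.
SAMPLE_TSV = (
    "motif_id\tgene_name\n"
    "motif_A\tSox10\n"
    "motif_A\tSox9\n"
    "motif_B\tNeurod1\n"
)

def load_motif_annotations(handle):
    """Return a dict mapping each motif to the set of TFs annotated to it."""
    motif2tfs = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        motif2tfs[row["motif_id"]].add(row["gene_name"])
    return dict(motif2tfs)

motif2tfs = load_motif_annotations(io.StringIO(SAMPLE_TSV))
print(sorted(motif2tfs["motif_A"]))
```

With a real file, the same function could be given an open file handle instead of the in-memory string.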

Tutorial

For this tutorial 3,005 single cell transcriptomes taken from the mouse brain (somatosensory cortex and hippocampal regions) are used as an example [3]. The analysis is done in a Jupyter notebook.

First we import the necessary modules and declare some constants:

import glob
import os
import pandas as pd
import numpy as np

from arboretum.utils import load_tf_names
from arboretum.algo import grnboost2

from pyscenic.rnkdb import FeatherRankingDatabase as RankingDatabase
from pyscenic.utils import modules_from_adjacencies, save_to_yaml
# df2regulomes is used in Phase II; it is assumed here to live in pyscenic.prune.
from pyscenic.prune import prune, prune2df, df2regulomes
from pyscenic.aucell import aucell

import seaborn as sns

# glob.glob does not expand "~", so the folders are expanded up front.
DATA_FOLDER = os.path.expanduser("~/tmp")
RESOURCES_FOLDER = os.path.expanduser("~/resources")
DATABASE_FOLDER = os.path.expanduser("~/databases/")
FEATHER_GLOB = os.path.join(DATABASE_FOLDER, "mm9-*.feather")
MOTIF_ANNOTATIONS_FNAME = os.path.join(RESOURCES_FOLDER, "motifs-v9-nr.mgi-m0.001-o0.0.tbl")
MM_TFS_FNAME = os.path.join(RESOURCES_FOLDER, 'mm_tfs.txt')
SC_EXP_FNAME = os.path.join(RESOURCES_FOLDER, "GSE60361_C1-3005-Expression.txt")
REGULOMES_FNAME = os.path.join(DATA_FOLDER, "regulomes.yaml")
NOMENCLATURE = "MGI"

Preliminary work

The scRNA-Seq data is downloaded from GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60361 and loaded into memory:

ex_matrix = pd.read_csv(SC_EXP_FNAME, sep='\t', header=0, index_col=0)

Subsequently duplicate genes are removed:

ex_matrix = ex_matrix[~ex_matrix.index.duplicated(keep='first')]
ex_matrix.shape

(19970, 3005)
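The effect of the duplicate filter can be checked on a toy matrix. The gene and cell names and the values below are invented; only the `read_csv` call and the `index.duplicated(keep='first')` idiom are taken from the tutorial.

```python
import io
import pandas as pd

# Toy genes-by-cells table in the same tab-separated layout (values invented).
TOY_TSV = (
    "gene\tcell_1\tcell_2\n"
    "Gad1\t0\t3\n"
    "Sox10\t5\t0\n"
    "Gad1\t1\t1\n"  # duplicate gene symbol
)

toy = pd.read_csv(io.StringIO(TOY_TSV), sep="\t", header=0, index_col=0)

# Keep only the first row for every gene symbol.
toy_unique = toy[~toy.index.duplicated(keep="first")]
print(toy_unique.shape)
```

Note that `keep='first'` silently discards the later rows of a duplicated gene rather than merging them.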

Next, the list of transcription factors (TFs) for Mus musculus is read from file. The list of known TFs for Mm was prepared from TFCat (cf. notebooks section).

tf_names = load_tf_names(MM_TFS_FNAME)

Finally the ranking databases are loaded:

db_fnames = glob.glob(FEATHER_GLOB)
def name(fname):
    return os.path.basename(fname).split(".")[0]
dbs = [RankingDatabase(fname=fname, name=name(fname), nomenclature="MGI") for fname in db_fnames]
dbs

[FeatherRankingDatabase(name="mm9-tss-centered-10kb-10species",nomenclature=MGI),
 FeatherRankingDatabase(name="mm9-500bp-upstream-7species",nomenclature=MGI),
 FeatherRankingDatabase(name="mm9-500bp-upstream-10species",nomenclature=MGI),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-10species",nomenclature=MGI),
 FeatherRankingDatabase(name="mm9-tss-centered-10kb-7species",nomenclature=MGI),
 FeatherRankingDatabase(name="mm9-tss-centered-5kb-7species",nomenclature=MGI)]
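The database names in this listing come from the `name` helper, which keeps the file name up to its first dot. A small standalone check, using a hypothetical path (only the naming pattern follows the listing):

```python
import os

def name(fname):
    """Database name = base file name up to the first dot."""
    return os.path.basename(fname).split(".")[0]

# Hypothetical path; the folder is made up for illustration.
example = "/data/databases/mm9-tss-centered-10kb-10species.feather"
print(name(example))
```

Because the split happens at the first dot, a file name that contains additional dots is truncated at that point, so database files should differ in the part before the first dot.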

Phase I: Inference of co-expression modules

In the initial phase of the pySCENIC pipeline, co-expression modules are inferred from the single-cell expression profiles.

Run GENIE3 or GRNBoost from arboretum to infer co-expression modules

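The arboretum package is used for this phase of the pipeline. The co-expression module inference can be run on a random sample of cells: N_SAMPLES sets the number of cells drawn. The snippet below uses the full dataset; setting N_SAMPLES to 1,000 restricts the inference to a sample of 1,000 cells.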

N_SAMPLES = ex_matrix.shape[1] # Full dataset
adjacencies = grnboost2(expression_data=ex_matrix.T.sample(n=N_SAMPLES, replace=False),
                        tf_names=tf_names, verbose=True)
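The sampling step itself can be tried on a toy matrix without running the inference. The data below is invented, and `random_state` is an addition made here only so that the sketch is reproducible.

```python
import pandas as pd

# Toy genes-by-cells matrix (values invented).
toy = pd.DataFrame(
    [[1, 0, 2, 4], [0, 3, 1, 0], [5, 5, 0, 1]],
    index=["Gad1", "Sox10", "Neurod1"],
    columns=["cell_1", "cell_2", "cell_3", "cell_4"],
)

N_SAMPLES = 2  # number of cells to keep

# Transpose to cells-by-genes, then draw cells without replacement.
subset = toy.T.sample(n=N_SAMPLES, replace=False, random_state=0)
print(subset.shape)
```

The result keeps every gene as a column and contains N_SAMPLES distinct cells as rows, which is the cells-by-genes orientation passed to `grnboost2` in the tutorial.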

Derive potential regulomes from these co-expression modules

Regulomes are derived from adjacencies based on three methods.

The first method to create the TF-modules is to select the best targets for each transcription factor:

Targets with weight > 0.001
Targets with weight > 0.005

The second method is to select the top targets for a given TF:

Top 50 targets (targets with highest weight)

The third method is to select the best regulators for each gene (this is how GENIE3 works internally). These targets can then be assigned back to each TF to form the TF-modules. This creates three more gene sets:

Targets for which the TF is within its top 5 regulators
Targets for which the TF is within its top 10 regulators
Targets for which the TF is within its top 50 regulators
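The three selection methods can be sketched on a toy adjacency list. This is an illustration of the idea only, not the code behind `modules_from_adjacencies`; the function names, gene names and weights are all hypothetical.

```python
from collections import defaultdict

# Toy adjacencies: (TF, target, importance weight). Values are invented.
ADJACENCIES = [
    ("Sox10", "Plp1", 0.9), ("Sox10", "Mbp", 0.4), ("Sox10", "Gad1", 0.002),
    ("Neurod1", "Gad1", 0.7), ("Neurod1", "Mbp", 0.5), ("Neurod1", "Plp1", 0.003),
]

def by_threshold(adjacencies, threshold):
    """Method 1: per TF, keep targets whose weight exceeds the threshold."""
    modules = defaultdict(set)
    for tf, target, weight in adjacencies:
        if weight > threshold:
            modules[tf].add(target)
    return dict(modules)

def by_top_targets(adjacencies, n):
    """Method 2: per TF, keep the n targets with the highest weight."""
    per_tf = defaultdict(list)
    for tf, target, weight in adjacencies:
        per_tf[tf].append((weight, target))
    return {tf: {t for _, t in sorted(pairs, reverse=True)[:n]}
            for tf, pairs in per_tf.items()}

def by_top_regulators(adjacencies, n):
    """Method 3: per target, keep its n best TFs, then regroup by TF."""
    per_target = defaultdict(list)
    for tf, target, weight in adjacencies:
        per_target[target].append((weight, tf))
    modules = defaultdict(set)
    for target, pairs in per_target.items():
        for _, tf in sorted(pairs, reverse=True)[:n]:
            modules[tf].add(target)
    return dict(modules)

print(by_threshold(ADJACENCIES, 0.005))
print(by_top_targets(ADJACENCIES, 1))
print(by_top_regulators(ADJACENCIES, 1))
```

In the toy data Mbp is a top target of neither TF, yet the third method assigns it to Neurod1 because Neurod1 is its strongest regulator. This shows how the methods can yield different modules from the same adjacencies.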

A distinction is made between modules that contain targets that are being activated and modules that contain targets that are being repressed. The relationship between a TF and its target, i.e. activator or repressor, is derived from the original expression profiles using the Pearson product-moment correlation coefficient.

In addition, the transcription factor is added to the module, and modules that have fewer than 20 genes are removed.
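A minimal sketch of this step, under stated assumptions: targets with a positive correlation to the TF are treated as activated and all others as repressed. The cut at zero, the helper names and the toy expression values are assumptions made for illustration and are not taken from pySCENIC.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def split_by_sign(tf, targets, expression, min_genes):
    """Split targets into activated and repressed modules, add the TF itself,
    and drop modules with fewer than min_genes genes."""
    activated, repressed = {tf}, {tf}
    for target in targets:
        rho = pearson(expression[tf], expression[target])
        (activated if rho > 0 else repressed).add(target)
    modules = {"activating": activated, "repressing": repressed}
    return {kind: genes for kind, genes in modules.items() if len(genes) >= min_genes}

# Toy expression profiles across four cells (values invented).
EXPRESSION = {
    "Sox10": [1.0, 2.0, 3.0, 4.0],
    "Plp1": [2.0, 4.0, 6.0, 9.0],
    "Mbp": [1.0, 3.0, 2.0, 5.0],
    "Gad1": [4.0, 3.0, 2.0, 0.0],
}

# min_genes is 3 only because the toy data is tiny; the text uses 20.
modules = split_by_sign("Sox10", ["Plp1", "Mbp", "Gad1"], EXPRESSION, min_genes=3)
print(modules)
```

Here the repressing module contains only Sox10 and Gad1, so it falls below the size limit and is dropped, while the activating module survives. The sketch does not handle genes with constant expression, for which the correlation is undefined.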

modules = list(modules_from_adjacencies(adjacencies, ex_matrix, nomenclature=NOMENCLATURE))

Phase II: Prune modules for targets with cis regulatory footprints (aka RcisTarget)

df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME)
regulomes = df2regulomes(df, NOMENCLATURE)

Directly calculating regulomes without the intermediate dataframe of enriched features is also possible:

regulomes = prune(dbs, modules, MOTIF_ANNOTATIONS_FNAME)
save_to_yaml(regulomes, REGULOMES_FNAME)

Multi-core systems and clusters can be leveraged in the following way:

# The fastest multi-core implementation:
df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME,
                    client_or_address="custom_multiprocessing", num_workers=8)
# or alternatively:
regulomes = prune(dbs, modules, MOTIF_ANNOTATIONS_FNAME,
                    client_or_address="custom_multiprocessing", num_workers=8)

# The clusters can be leveraged via the dask framework:
df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME, client_or_address="local")
# or alternatively:
regulomes = prune(dbs, modules, MOTIF_ANNOTATIONS_FNAME, client_or_address="local")

Phase III: Cellular regulome enrichment matrix (aka AUCell)

We characterize the different cells in a single-cell transcriptomics experiment via the enrichment of the previously discovered regulomes. Enrichment of a regulome is measured as the Area Under the recovery Curve (AUC) of the genes that define this regulome.

auc_mtx = aucell(ex_matrix.T, regulomes, num_workers=4)
sns.clustermap(auc_mtx, figsize=(8,8))
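To make the enrichment measure concrete, the sketch below computes a normalised area under the recovery curve for one cell. It is a simplified illustration and not the `aucell` implementation: it ignores ties and gene weights, and the function name, the rank cutoff and the toy expression values are assumptions.

```python
def recovery_auc(ranked_genes, gene_set, rank_cutoff):
    """Normalised area under the recovery curve for one cell.

    ranked_genes: genes ordered from highest to lowest expression.
    gene_set: the genes that define the regulome.
    rank_cutoff: only the first rank_cutoff positions are considered.
    """
    recovered = 0  # height of the recovery curve at the current rank
    area = 0       # running sum of curve heights
    for gene in ranked_genes[:rank_cutoff]:
        if gene in gene_set:
            recovered += 1
        area += recovered
    # Best case: the set genes occupy the top ranks.
    n = min(len(gene_set), rank_cutoff)
    max_area = sum(min(k, n) for k in range(1, rank_cutoff + 1))
    return area / max_area if max_area else 0.0

# Toy expression values for a single cell (invented).
cell = {"Plp1": 9, "Mbp": 7, "Gad1": 0, "Sox10": 5, "Actb": 8}
ranking = sorted(cell, key=cell.get, reverse=True)

auc_active = recovery_auc(ranking, {"Sox10", "Plp1", "Mbp"}, rank_cutoff=4)
auc_silent = recovery_auc(ranking, {"Gad1"}, rank_cutoff=4)
print(auc_active, auc_silent)
```

A regulome whose genes sit near the top of the cell's expression ranking scores close to 1, while one whose genes fall outside the cutoff scores 0. Computing this value for every cell and regulome gives a cells-by-regulomes matrix like the `auc_mtx` that is clustered in the tutorial.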

Command Line Interface

A command line version of the tool is included. This tool is available after proper installation of the package via pip.

$ pyscenic
usage: SCENIC - Single-CEll regulatory Network Inference and Clustering
           [-h] [-o OUTPUT] {grn,motifs,prune,aucell} ...

positional arguments:
  {grn,motifs,prune,aucell}
                        sub-command help
    grn                 Derive co-expression modules from expression matrix.
    motifs              Find enriched motifs for gene signatures.
    prune               Prune targets from a co-expression module based on
                        cis-regulatory cues.
    aucell              b help

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output file/stream.

Website

For more information, please visit LCB and SCENIC.

License

GNU General Public License v3

References

Project details

This version: 0.6.5 (uploaded Mar 19, 2018)
