Functional enrichment terms aggregator.
Project description
restring
Easy to use functional enrichment terms retriever and aggregator, designed for the wet biology researcher
Overview
restring
works on user-supplied differentially expressed (DE) genes list, and automatically pulls and aggregates functional enrichment data from STRING.
It returns, in table-friendly format, aggregated results from analyses of multiple comparisons.
Results can readily be visualized via highly customizable heat/clustermaps to produce beautiful publication-grade pics. Plus, it's got a GUI!
What KEGG pathway was found in which comparisons? What pvalues? What DE genes annotated in that pathway were shared in those comparisons? How can I simultaneously show results for all my experimental groups, for all terms, all at once? This can all be managed by restring
.
Use case
Modern high-throughput -omic approaches generate huge lists of differentially expressed (DE) genes/proteins, which can in turn be used for functional enrichment studies. Manualy reviewing a large number of such analyses is time consuming, especially for experimental designs with more than a few groups. Let's consider this experimental setup:
This represents a fairly common experimental design, but manually inspecting functional enrichment results for such all possible combinations would require substantial effort. Let's take a look at how we can tackle this issue with restring
.
Our sample experimental setup has two treatments, given at two time points to three different sample types. Let's assume those samples are cells of different genotypes, and we'd like to mainly investigate genotype comparisons. After quantifying gene expression by RNAseq, we have DE genes for every comparison. As in many experimental pipelines, each list of DE genes is investigated with functional enrichment tools, such as String. But every comparison generates one or more tables. restring
makes it easy to generate summary reports from all of them, automatically.
Installation
The installation via pip is still work in progress, as is this installation section (short but absolutely complete). Yet you can download restring
repo and unzip everything inside a folder. That's it.
Of course you need to have a Python3 installation that your OS knows about, plus all dependencies. Quickest way: get Anaconda. You'll likely need to manually install something else:
pip install seaborn
To run restring, simply open a terminal in the folder where you unzipped everything and type:
python resting.py
The program should start, this is what it looks like in MacOS:
Procedure
restring
can be used via its graphical user interface (recommended). A full protocol, with sample data and examples, is detailed below.
Alternatively, it can be imported as a Python module. This hands-on procedure is detailed at the end of this document.
restring
GUI
1. Prepping the files
All restring
requires is a gene list of choice per experimental condition. This gene list needs to be in tabular form, arranged like this sample data. This is very easily managed with any spreadsheet editor, such as Microsoft's Excel or Libre Office's Calc.
2. Set input files and output path
In the menu, choose File > Open...
, or hit the Open files..
button.
Then, choose an existing directory where all putput files will be placed: choose File > Set output folder
or hit Set folder
button (Choose a different output folder each time the analysis parameters are varied, see section 5).
3. Running the analysis (with default settings)
In the menu, choose Analysis > New analysis
, or hit the New analysis
button.
restring
will look for genes in the files you have specified, interrogate STRING to get functional enrichment data back (these tables, looking exactly the same to the ones you would manually retrieve, will be saved into subfolders of the output folder), then write aggregated results and summaries.
These are found in the specified output directory, and take the form of results- or summary-type tables, in .tsv
(tab separated values) format, that can be opened out-of-the-box by Excel or Calc. Let's take a look at the anatomy of these tables.
Results tables
The table contains all terms cumulatively retrieved from all comparisons (each one of the inpt files containing the genes of interest between any two experimental conditions). For every term, common genes (if any) are listed. These common genes only include comparisons where the term actually shows up. If the term just appears in exactly one comparison, this is explicitly stated: n/a (just one condition)
. P-values are the ones retrieved from the STRING tables (the lower, the better). Missing p-values are represented with 1
(that is, in that specific comparison the term is 100% likely not enriched).
Summary tables
These can be useful to find the most interesting terms across all comparisons: better p-value, presence in most/selected comparisons), as well as finding the most recurring DE genes for each term.
4. Visualizing the results
restring
makes it easy to inspect the results by visualizing results-type tables as clustermaps.
In the menu, choose Analysis > Draw clustermap
to open the Draw clustermap window:
Options
readable: if flagged, the output clustermap will be drawn as tall as required to fully display all the terms contained in it. Be warned that this might get very tall, depending on the number of terms.
log transform: if flagged, the p-values are minus log-transformed with the specified base: -log(number, base chosen). Hit Apply
to apply.
cluster rows: if flagged, the rows are clustered (by distance) as per Scipy defaults. The column order is overridden.
cluster columns: if flagged, the columns are clustered (by distance) as per Scipy defaults.
P-value cutoff: For each term (row), if all values are higher than the specified threshold, the term is not included in the clustermap. For log-transformed heatmaps, for each term (row), if all values are lower than the specified threshold, the term is not included in the clustermap.
Insert a new value and hit Apply
to see how many terms are retained/discarded by the new threshold.
Note that the default value of 1
will include all terms of a non-transformed table, as all terms are necessarily 1 or lower (moreover, there should automatically be at least a term per row that was significant, at P=0.05, in the files retrieved from STRING, otherwise the term would not appear in the table in the first place).
To set a new threshold, for instance at P=0.001, one should input 0.001
, or 3
when log-transforming in base 10. Always hit Apply
.
Log base: choose the base for the logarithm.
DPI: choose the output image resolution in DPI (dot per inch). The higher, the larger the image.
Apply: Applies the current settings to the table, and shows how the settings impact on the table.
Choose terms..: This button opens a dialog to choose the terms. An example:
In this example, the results table contains terms that are irrelevant in the analysis being made. When loading a new table, all terms are automatically included, but the user chan choose to untick the terms that are unwanted. If a new P-value cutoff is applied, restring
remembers the user choice even if some of the terms are now removed from the term list and are added back to the table at a later time.
Hit Apply & OK
to apply the choice and close the window.
Choose col order: The user can reorder the column order by dragging the column names. Multiple adjacent columns can be selected and dragged together (this is ineffective if Cluster rows is flagged).
Hit OK
to apply and close the window.
Draw clustermap: Draws, saves and opens the clustermap.
Reset: Reloads the input table and clears term selection.
Help: Opens a dialog that briefly outlines the procedure. (Currently not implemented).
Help: Opens the default browser at this repo's main page. (Currently not implemented).
Close: Closes the window.
5. Configuring the analysis.
Species
restring
defaults to Mus musculus. To choose another species, choose Analysis > Set species
to open the dialog:
STRING accepts species in the form of taxonomy IDs. Hit the button of your species of choice or supply a custom TaxID and hit Set
.
Head over to STRING's doc to know if your species is supported.
DE genes settings
restring
accepts as gene lists input something like this sample data.
The input contains information of the gene name (we developed restring
having the official gene name in mind as the preferred gene identifier, as that's always the case among researchers in our experience) and information about the direction of the change of expression with respect to experimental groups. This is the implied convention:
gene ID | cond1 | cond2 | cond1_vs_cond2 | log2FC |
---------|---------------------------------------------|
gene1 | 143 | 748 | 0.191 | -2.38 |
---------|---------------------------------------------|
gene2 | 50 | 4 | 12.5 | 3.64 |
In this example, cond1
and cond2
are the two experimental condition where the abundance of the transcript has been estimated. Every RNAseq analysis contains at least the log2FC (base 2, log-transformed ratio of the expression values), that tells if the gene is upregulated or downregulated.
restring
follows this convention:
UP is upregulated in cond1
versus cond2
. That's the case of gene2
. Log2FC
is > 0.
DOWN is downregulated in cond1
versus cond2
. That's the case of gene1
. Log2FC
is < 0.
restring
does not actually care about the magnitude of Log2FC
: that's from the pre-processing of the genes that the researcher is interested in.
Depending on how to treat the directionality information, there are four types of different analyses:
Upregulated genes only: Functional enrichment info is searched for upregulated genes only. Enrichment is performed on upregulated genes only (Arrange the input data so as "upregulated" in your experiment matches the implied convention).
Downregulated genes only: Functional enrichment info is searched for downregulated genes only. Enrichment is performed on downregulated genes only (Arrange the input data so as "downregulated" in your experiment matches the implied convention).
Upregulated and Downregulated, separately: This is the default option. For every comparison, both upregulated and downregulated genes are considered, but separately. This means that functional enrichment info is retrieved for upregulated and downregulated genes separately, but the terms are aggregated from both.
If a term shows up in both UP and DOWN gene lists, then the lowest P-value one is recorded.
All genes together: Functional enrichment info is searched for all genes together, and the resulting aggregation will reflect the functional enrichment analysis retrieved with all genes together (still supply a gene list that has a number, for each gene ID, in the second column. Just write any number.)
Tip: To avoid accumulating STRING files, consider setting a different output folder any time the analysis parameters are varied. Notwithstanding, restring
clearly labels what enrichment files come from which gene lists: UP
, DOWN
or ALL
are prepended to each table retrieved from STRING.
restring
as a Python module
1. Prepping the files
Head over to String, and analyze your gene/protein list. Please refer to the String documentation for help. After running the analysis, hit the button at the bottom of the page. This allows to download the results as tab delimited text files.
restring
is designed to work with the results types highlighted in green. For each one of your experimental settings, create a folder with a name that will serve as a label for it. Here's how our sample data is arranged:
Not all comparisons resulted in a DE gene list that's long enough to generate functional enrichment results (see image above), thus a few comparisons (folders) are missing. When the DE gene list was sufficiently long to generate results for all analyses, this is what the folder content looks like (example of one folder):
For each enrichment (KEGG
, Component
, Function
, Process
and RCTM
), we fed String with DE genes that were either up- or downregulated with respect of one of the genotypes of the analysis. restring
makes use of the UP
and DOWN
labels in the filenames to know what direction the analysis went (it's possible to aggregate UP
and DOWN
DE genes together).
It's OK to have folders that don't contain all files (if there were insufficient DE genes to produce some), like in the folder ctrl_t1_KO_vs_DKO
:
2. Aggregating the results
Once everything is set up, we can run restring
to aggregate info from all the sparse results. The following example makes use of the String results that can be found in sample data.
import restring
dirs = restring.get_dirs()
print(dirs)
['ctrl_t0_wt_vs_DKO', 'ctrl_t0_wt_vs_KO', 'ctrl_t1_KO_vs_DKO',
'ctrl_t1_wt_vs_DKO', 'ctrl_t1_wt_vs_KO', 'treatment_t0_wt_vs_DKO',
'treatment_t0_wt_vs_KO', 'treatment_t1_KO_vs_DKO', 'treatment_t1_wt_vs_DKO',
'treatment_t1_wt_vs_KO']
get_dirs()
returns a list
of all folders within the current directory, to the excepion of folders beginning with __
or .
. We can start aggregating results with default parameters (KEGG
pathways for both UP
and DOWN
regulated genes).
db = restring.aggregate_results(dirs)
Start walking the directory structure.
Parameters
----------
folders: 10
kind=KEGG
directions=['UP', 'DOWN']
Processing directory: ctrl_t0_wt_vs_DKO
Processing file DOWN_wt_enrichment.KEGG.tsv
Processing file UP_wt_enrichment.KEGG.tsv
Processing directory: ctrl_t0_wt_vs_KO
Processing file DOWN_wt_enrichment.KEGG.tsv
Processing file UP_wt_enrichment.KEGG.tsv
Processing directory: ctrl_t1_KO_vs_DKO
Processing directory: ctrl_t1_wt_vs_DKO
Processing file DOWN_wt_enrichment.KEGG.tsv
Processing file UP_wt_enrichment.KEGG.tsv
Processing directory: ctrl_t1_wt_vs_KO
Processing file DOWN_wt_enrichment.KEGG.tsv
Processing file UP_wt_enrichment.KEGG.tsv
Processing directory: treatment_t0_wt_vs_DKO
Processing file DOWN_wt_enrichment.KEGG.tsv
Processing file UP_wt_enrichment.KEGG.tsv
Processing directory: treatment_t0_wt_vs_KO
Processing file DOWN_wt_enrichment.KEGG.tsv
Processing directory: treatment_t1_KO_vs_DKO
Processing file UP_KO_enrichment.KEGG.tsv
Processing file DOWN_KO_enrichment.KEGG.tsv
Processing directory: treatment_t1_wt_vs_DKO
Processing file DOWN_wt_enrichment.KEGG.tsv
Processing file UP_wt_enrichment.KEGG.tsv
Processing directory: treatment_t1_wt_vs_KO
Processing file DOWN_wt_enrichment.KEGG.tsv
Processing file UP_wt_enrichment.KEGG.tsv
Processed 10 directories and 17 files.
Found a total of 165 KEGG elements.
Running aggregate_results()
with other parameters is possible:
help(restring.aggregate_results)
# truncated output
Walks the given <directories> list, and reads the String .tsv files of
defined <kind>.
Params:
=======
directories: <list> of directories where to look for String files
kind: <str> Defines the String filetype to process. Kinds defined in
settings.file_types
directions: <list> contaning the up- or down-regulated genes in a comparison.
Info is retrieved from either UP and/or DOWN lists.
* Prerequisite *: generating files form String with UP and/or DOWN regulated
genes separately.
verbose: <bool>; turns verbose mode on or off
The kind
parameter is picked from the 5 supported String result tables:
print(restring.settings.file_types)
('Component', 'Function', 'KEGG', 'Process', 'RCTM')
To manipulate the aggregated results, it's convenient to put them into a table:
df = restring.tableize_aggregated(db)
This functions wraps the results into a handy pandas.DataFrame
object, that can be saved as a table for further inspection:
df.to_csv("results.csv")
The table contains all terms cumulatively retrieved from all comparisons (directories/columns). For every term, common genes (if any) are listed. These common genes only include comparisons where the term actually shows up. If the term just appears in exactly one comparison, this is explicitly stated: n/a (just one condition)
. P-values are the ones retrieved from the String tables (the lower, the better). Missing p-values are represented with 1
(this can be set to anything str()
accepts when calling tableize_aggregated()
with the not_found
parameter).
These 'aggregated' tables are useful for charting the results (see later). There's other info that can be extracted form aggregated results, in the form of a 'summary' table:
res = restring.summary(db)
res.to_csv("summary.csv")
These can be useful to find the most interesting terms across all comparisons: better p-value, presence in most/selected comparisons), as well as finding the most recurring DE genes for each term.
3. Visualizing the results
'aggregated'-type tables can be readily transformed into beautiful clustermaps. This is simply done by passing either the df
object previolsly created with tableize_aggregated()
, or the file name of the table that was saved from that object to the draw_clustermap()
function:
clus = restring.draw_clustermap("results.csv")
(The output may vary depending on your version of plotting libraries) Note that draw_clustermap()
actually returns the seaborn.matrix.ClusterGrid
object that it generates internally, this might come handy to retrieve the reordered (clustered) elements (more on that later).
The drawing function is basically a wrapper for seaborn.clustermap()
, of which retains all the flexible customization options, but allows for an immediate tweaking of the picture. For instance, we might just want to plot the highest-ranking terms and have all of them clearly written on a readable heatmap:
clus = restring.draw_clustermap("results.csv", pval_min=6, readable=True)
4. Polishing up
More tweaking is possibile:
help(restring.draw_clustermap)
Help on function draw_clustermap in module restring.restring:
draw_clustermap(data, figsize=None, sort_values=None, log_transform=True, log_base=10,
log_na=0, pval_min=None, custom_index=None, custom_cols=None,
unwanted_terms=None, title=None, title_size=24, savefig=False,
outfile_name='aggregated results.png', dpi=300, readable=False,
return_table=False, **kwargs)
Draws a clustermap of an 'aggregated'-type table.
This functions expects this table layout (example):
terms │ exp. cond 1 | exp. cond 2 | exp cond n ..
──────┼─────────────┼─────────────┼────────────..
term 1│ 0.01 | 1 | 0.00023
------┼-------------┼-------------┼------------..
term 2│ 1 | 0.05 | 1
------┼-------------┼-------------┼------------..
.. │ .. | .. | ..
*terms* must be indices of the table. If present, 'common' column
will be ignored.
Params
------
data A <pandas.DataFrame object>, or a filename. If a filename is
given, this will try to load the table as if it were produced
by tableize_aggregated() and saved with pandas.DataFrame.to_csv()
as a .csv file, with ',' as separator.
sort_values A <str> or <list> of <str>. Sorts the table accordingly. Please
*also* set col_cluster=False (see below, seaborn.clustermap()
additional parameters), otherwise columns will still be clustered.
log_transform If True, values will be log-transformed. Defaults to True.
note: values are turned into -log values, thus p-value of 0.05
gets transformed into 1.3 (with default parameters), as
10^-1.3 ~ 0.05
log_base If log_transform, this base will be used as the logarithm base.
Defaults to 10
log_na When unable to compute the logarithm, this value will be used
instead. Defaults to 0 (10^0 == 1)
pval_min Trims values to the ones matching *at least* this p value. If
log_transform with default values, this needs to be set accordingly
For example, if at least p=0.01 is desired, then aim for
pval_min=2 (10^-2 == 0.01)
custom_cols It is possible to pass a <list> of <str> to draw only specified
columns of input table
custom_index It is possible to pass a <list> of <str> to draw only specified
rows of input table
unwanted_terms If a <list> of <str> is supplied,
readable If set to False (default), the generated heatmap will be of reasonable
size, possibly not showing all term descriptors. Setting readable=True
will result into a heatmap where all descriptors are visible (this might
result into a very tall heatmap)
savefig If True, a picture of the heatmap will be saved in the current directory.
outfile_name The file name of the picture saved.
dpi dpi resolution of saved picture. Defaults to 300.
return_table If set to True, it also returns the table that was manipulated
internally to draw the heatmap (with all modifications applied).
Returns: <seaborn.matrix.ClusterGrid>, <pandas.core.frame.DataFrame>
**kwargs The drawing is performed by seaborn.clustermap(). All additional
keyword arguments are passed directly to it, so that the final picture
can be precisely tuned. More at:
https://seaborn.pydata.org/generated/seaborn.clustermap.html
With a few brush strokes we can obtain we picture we're looking for. Example:
bad = [
"Alzheimer's disease",
"Huntington's disease",
"Parkinson's disease",
"Retrograde endocannabinoid signaling",
"Staphylococcus aureus infection",
"Tuberculosis",
"Leishmaniasis",
"Herpes simplex infection",
"Kaposi's sarcoma-associated herpesvirus infection",
"Proteoglycans in cancer",
"Pertussis",
"Malaria",
"I'm not in the index"
]
clus = restring.draw_clustermap(
"results.csv", pval_min=8, title="Freakin' good results", unwanted_terms=bad,
sort_values="treatment_t1_wt_vs_DKO", row_cluster=False, annot=True,
)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file restring-0.1.2.tar.gz
.
File metadata
- Download URL: restring-0.1.2.tar.gz
- Upload date:
- Size: 45.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.0 importlib_metadata/3.7.3 packaging/20.8 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.56.0 CPython/3.8.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | aa412bed3d280ae54484b4d172e64a110bd3d88bab377845db6804c00254c80e |
|
MD5 | ce48f85114ed6df696d63ffe2becd3d7 |
|
BLAKE2b-256 | 8af0fd105236f0a5a1bf2d2ea0a1743024aa572bb7ad5ff3b112671913fa2d2a |