BiGAnts - a package for network-constrained biclustering of omics data
Project description
BiGAnts: network-constrained biclustering of patients and multi-omics data
Table of contents
General info
PyPI package for conjoint clustering of networks and omics data. BiGants allows to conjointly cluster patients and genes such that (i) biclusters are restricted to functionally related genes connected in molecular interaction networks and (ii) the expression difference between two subgroups of patients is maximized.
Installation
To install the package from PyPI please run:
pip install bigants
To install the package from git:
git clone https://github.com/biomedbigdata/BiGAnts-PyPI-package
python setup.py install
Data input
The algorithm needs as an input one CSV matrix with gene expression/methylation/any other numerical data and one CSV file with a network.
Numerical data
Numerical data is accepted in the following format:
- genes as rows.
- patients as columns.
- first column - genes IDs (can be any IDs).
For instance:
Unnamed: 0 | GSM748056 | GSM748059 | ... | GSM748278 | GSM748279 | GSM1465989 | |
---|---|---|---|---|---|---|---|
0 | 1454 | 0.053769 | 0.117412 | ... | -0.392363 | -1.870838 | -1.432554 |
1 | 201931 | -0.618279 | 0.278637 | ... | 0.803541 | -0.514947 | 2.361925 |
2 | 8761 | 0.215820 | -0.343865 | ... | 0.700430 | 0.073281 | -0.977656 |
3 | 2703 | -0.504701 | 1.295049 | ... | 1.861972 | 0.601808 | 0.191013 |
4 | 26207 | -0.626415 | -0.646977 | ... | 2.331724 | 2.339122 | -0.100924 |
There are 2 examples of gene expression datasets that can be placed in the "data" folder
- GSE30219 - a Non-Small Cell Lung Cancer dataset from GEO for patients with either adenocarcinoma or squamous cell carcinoma.
- TCGA pan-cancer dataset with patients that have luminal or basal breast cancer. Both can be found here
Network
An interaction network should be present as a CSV table with two columns that represent two interacting genes. Without a header!
For instance:
6416 | 2318 | |
---|---|---|
0 | 6416 | 5371 |
1 | 6416 | 351 |
2 | 6416 | 409 |
3 | 6416 | 5932 |
4 | 6416 | 1956 |
There is an example of a PPI network from Bioigrid with experimentally validated interactions here.
Main functions
- bigants.data_preprocessing(path_expr, path_net, log2 = False, size = 2000)
Parameters:
- path_to_expr: string, path to the numerical data
- path_to_net: string, path to the network file
- log2: bool, (default = False), indicates if log2 transformation should be applied to the data
- size: int, optional (default = 2000) determines the number of genes that should be pre-selected by variance for the analysis. Shouldn't be higher than 5000.
Returns:
- GE: pandas data frame, processed expression data
- G: networkX graph, processed network data
- labels: dict, for mapping between real genes/patients IDs and the internal ones
- rev_labels: dict, additional dictionary for mapping between real genes/patients IDs and the internal ones
- bigants.BiGAnts(GE,G,L_g_min,L_g_max) creates a model for the given data:
Parameters:
- GE: pandas dataframe, processed expression data
- G: networkX graph, processed network data
- L_g_min: int, minimal solution subnetwork size
- L_g_max: int, maximal solution subnetwork size
Methods:
bigants.BiGAnts.run(self, n_proc = 1, K = 20, evaporation = 0.5, show_plot = False)
- K: int, default = 20, number of ants. Fewer ants - less space exploration. Usually set between 20 and 50
- n_proc: int, default = 1, number of processes that should be used
- evaporation, float, default = 0.5, the rate at which pheromone evaporates
- show_plot: bool, default = False, set true if convergence plots should be shown during the analysis
Example
Import the package:
import pandas as pd
from bigants import data_preprocessing
from bigants import BiGAnts
Set the paths to the expression matrix and the PPI network:
path_expr,path_net ='../bigants/data/gse30219_lung.csv', '../bigants/data/biogrid.human.entrez.tsv'
Load and process the data:
GE,G,labels, _= data_preprocessing(path_expr, path_net)
Set the size of subnetworks:
L_g_min = 10
L_g_max = 15
Set the model and run the search:
model = BiGAnts(GE,G,L_g_min,L_g_max)
solution,sc= model.run_search()
Results analysis
BiGAnts package also allows a user to save the results and perform an initial analysis.
The examples below show the basic usage, for more details please use python help() method, e.g. help(results.save)
.
- First of all, the object for results analysis must be created:
results = results_analysis(solution, labels)
This will allow to easily access the resulting biclusters in their initial IDs as well as perform a more complicated analysis.
To access IDs of patients in the first bicluster run:
results.patients1
To access IDs of genes IDs in the first bicluster run:
results.genes1
Same logic applies to the second bicluster.
- To save the solution:
#with the initial IDs
results.save(output = "results/results.csv")
#with gene names
results.save(output = "results/results.csv", gene_names = True)
- Visualise the resulting networks coloured with respect to their difference in expression patterns in patients clusters:
results.show_networks(GE, G, output = "results/network.png")
- Visualise a clustermap of the achieved solution alone or also along with the known patients' groups. Just with the BiGAnts results:
results.show_clustermap(GE, G, solution, labels, output = "results/clustermap.png")
If you have a patient's phenotype you would like to use for comparison, please make sure that patients IDs are exactly (!) matching the IDs that were used as an input. The IDs should be represented as a list of two lists, e.g.:
true_classes = ['GSM748056', 'GSM748059',..], ['GSM748278', 'GSM748279', 'GSM1465989']
results.show_clustermap(GE, G, solution, labels, output = "results/clustermap.png", true_labels = true_classes)
- Given a known phenotype in a format described above, BiGAnts can also return Jaccard index of the achieved patients clustering with a given phenotype:
results.jaccard_index(true_labels = true_classes)
- BiGAnts is using gseapy module to provide a user with a python wrapper for Enrichr database.
results.enrichment_analysis(solution, labels, library = 'GO_Biological_Process_2018', "results")
After the execution of the given above code, in the /results directory a user can find a table with enriched pathways as well as enrichment plots. Other available libraries can be used as well, e.g. 'GO_Molecular_Function_2018' and 'GO_Cellular_Component_2018'. In total there are 159 libraries available at the moment and the full list can be found by typing:
import gseapy
gseapy.get_library_name()
Cite
If you use BiGAnts in your research, we kindly ask you to cite the following publication:
Citation details to be announced
Contact
If you want to contact us regarding BiGAnts, please write an email to Olga Lazareva at olga.lazareva@wzw.tum.de
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.