BiCoN - a package for network-constrained biclustering of omics data

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

BiCoN: network-constrained biclustering of patients and multi-omics data

General info
Installation
Data input
Main functions
Example
Quality control
Cite
Contact

General info

Unsupervised learning approaches are frequently employed to identify patient subgroups and biomarkers such as disease-associated genes. Biclustering is a powerful technique often used with expression data to cluster genes along with patients. However, the genes forming biclusters are often not functionally related, complicating interpretation of the results.

To alleviate this, we developed BiCoN, a network-constrained biclustering approach that (i) restricts biclusters to functionally related genes connected in molecular interaction networks and (ii) maximizes the expression difference between two subgroups of patients.


Installation

To install the package from PyPI please run:

pip install bicon

To install the package from git:

git clone https://github.com/biomedbigdata/BiCoN
cd BiCoN
python setup.py install

Data input

The algorithm requires two inputs: a matrix with gene expression, methylation, or any other numerical data, and a file with a network.

Any gene IDs can be used (see Results analysis).

Numerical data

Numerical data is accepted in the following format:

genes as rows.
patients as columns.
first column - gene IDs (can be any IDs).

For instance:

Unnamed: 0	GSM748056	GSM748059	...	GSM748278	GSM748279	GSM1465989
1454	0.053769	0.117412	...	-0.392363	-1.870838	-1.432554
201931	-0.618279	0.278637	...	0.803541	-0.514947	2.361925
8761	0.215820	-0.343865	...	0.700430	0.073281	-0.977656
2703	-0.504701	1.295049	...	1.861972	0.601808	0.191013
26207	-0.626415	-0.646977	...	2.331724	2.339122	-0.100924
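A matrix in this layout can be produced with pandas. The sketch below is only an illustration: the values are copied from the table above, the in-memory buffer stands in for a file on disk, and the variable names are hypothetical.

```python
import io
import pandas as pd

# Genes as rows (Entrez IDs here, but any IDs work), patients as columns.
expr = pd.DataFrame(
    {
        "GSM748056": [0.053769, -0.618279, 0.215820],
        "GSM748059": [0.117412, 0.278637, -0.343865],
    },
    index=[1454, 201931, 8761],
)

# Write the matrix so that the first column holds the gene IDs.
buffer = io.StringIO()  # stands in for a path such as "data/my_expression.csv"
expr.to_csv(buffer)

# Read it back the same way to check the layout.
buffer.seek(0)
loaded = pd.read_csv(buffer, index_col=0)
print(loaded.shape)  # (number of genes, number of patients)
```

Writing the index without a name is what produces the "Unnamed: 0" header seen in the first column of the example.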

Two example gene expression datasets are available and can be placed in the "data" folder:

GSE30219 - a Non-Small Cell Lung Cancer dataset from GEO for patients with either adenocarcinoma or squamous cell carcinoma.
TCGA pan-cancer dataset with patients that have luminal or basal breast cancer.

Both can be found here.

Network

An interaction network should be provided as a table with two columns, where each row lists two interacting genes. The file must not contain a header.

For instance:

6416	2318
6416	5371
6416	351
6416	409
6416	5932
6416	1956
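As a rough illustration, such a file can be written and checked with the standard csv module. The tab separator is an assumption based on the example above, the edge list is a small excerpt of it, and the in-memory buffer stands in for a real file.

```python
import csv
import io

# Each pair is one interaction between two genes (excerpt of the example above).
edges = [(6416, 2318), (6416, 5371), (6416, 351)]

buffer = io.StringIO()  # stands in for a path such as "data/my_network.tsv"
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(edges)  # no header row is written

# Reading it back: every row must contain exactly two gene IDs.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter="\t"))
assert all(len(row) == 2 for row in rows), "expected two columns per row"
genes_in_network = {gene for row in rows for gene in row}
print(len(rows), "interactions,", len(genes_in_network), "genes")
```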

There is an example of a PPI network from BioGRID with experimentally validated interactions here.

Main functions

Here we give a general description of the main functions provided. Please note that all functions are annotated with docstrings, so the full information can be found with the help() function, e.g. help(results.save).

1. data_preprocessing(path_expr, path_net, log2 = False, zscores = True, size = 2000, no_zero = None, formats = [])

Parameters:

path_expr: string, path to the numerical data
path_net: string, path to the network file
log2: bool (default = False), indicates if log2 transformation should be applied to the data
size: int, optional (default = 2000), determines the number of genes that should be pre-selected by variance for the analysis. Shouldn't be higher than 5000.
no_zero: (default = None), threshold on the fraction of patients in which a gene has no expression. For instance, no_zero = 0.8 means that all genes which have no expression for at least 80% of patients will be removed from the analysis
formats: list (default = determined automatically), list of formats for the gene expression matrix and the PPI table. It is advisable to provide formats to avoid mistakes when reading the files. For example, if the gene expression matrix is given in .csv format and the PPI table in .tsv format, then formats = ["csv", "tsv"]

Returns:

GE: pandas data frame, processed expression data
G: networkX graph, processed network data
labels: dict, for mapping between real genes/patients IDs and the internal ones
rev_labels: dict, additional dictionary for mapping between real genes/patients IDs and the internal ones
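To make the preprocessing options more concrete, here is a minimal sketch of what variance-based pre-selection, log2 transformation and z-scoring could look like in pandas. It is not BiCoN's implementation: the function name, the +1 offset in the log and the choice to standardise each gene across patients are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def preselect_and_scale(expr, size=2000, log2=False, zscores=True):
    """Illustrative only: mimics the documented options, not BiCoN's code."""
    data = expr.copy()
    if log2:
        # log2(x + 1) keeps zeros finite; the offset is an assumption.
        data = np.log2(data + 1)
    # Keep the `size` genes (rows) with the highest variance across patients.
    top_genes = data.var(axis=1).nlargest(size).index
    data = data.loc[top_genes]
    if zscores:
        # Standardise each gene to mean 0 and standard deviation 1 (assumed per gene).
        data = data.sub(data.mean(axis=1), axis=0).div(data.std(axis=1), axis=0)
    return data

expr = pd.DataFrame(
    [[1.0, 2.0, 3.0], [5.0, 5.1, 5.2], [0.0, 4.0, 8.0]],
    index=["geneA", "geneB", "geneC"],
    columns=["p1", "p2", "p3"],
)
processed = preselect_and_scale(expr, size=2)
print(processed.index.tolist())  # the two most variable genes
```

In this toy matrix geneB barely varies between patients, so it is the one dropped when size = 2.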

2. BiCoN(GE, G, L_g_min, L_g_max) creates a model for the given data:

Parameters:

GE: pandas dataframe, processed expression data
G: networkX graph, processed network data
L_g_min: int, minimal solution subnetwork size
L_g_max: int, maximal solution subnetwork size

Methods:

BiCoN.run_search(self, n_proc = 1, K = 20, evaporation = 0.5, show_plot = False)

K: int, default = 20, number of ants. Fewer ants mean less exploration of the search space. Usually set between 20 and 50
n_proc: int, default = 1, number of processes that should be used (cannot be more than K)
evaporation: float, default = 0.5, the rate at which pheromone evaporates
show_plot: bool, default = False, set true if convergence plots should be shown during the iterations
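For readers unfamiliar with ant colony optimisation, the role of the evaporation parameter can be sketched with the textbook pheromone update, where old trails fade by the evaporation rate and choices that were part of good solutions are reinforced. This is the generic rule, not necessarily the exact update used inside BiCoN; the function, the dictionary keys and the deposit values are hypothetical.

```python
def evaporate_and_deposit(pheromone, deposits, evaporation=0.5):
    """Textbook ant colony update: old trails fade, good choices are reinforced."""
    updated = {}
    for choice, level in pheromone.items():
        # A fraction `evaporation` of the old pheromone is lost each iteration.
        updated[choice] = (1.0 - evaporation) * level + deposits.get(choice, 0.0)
    return updated

# Hypothetical state: two candidate choices with equal pheromone.
pheromone = {"choice_a": 1.0, "choice_b": 1.0}
deposits = {"choice_a": 0.4}  # only choice_a was part of a good solution
pheromone = evaporate_and_deposit(pheromone, deposits, evaporation=0.5)
print(pheromone)
```

A higher evaporation rate makes the search forget earlier iterations faster, while a lower rate keeps reinforcing choices that were found early.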

Example

Import the package:

from bicon import data_preprocessing
from bicon import BiCoN
from bicon import results_analysis

Set the paths to the expression matrix and the PPI network:

path_expr, path_net = '/data/gse30219_lung.csv', '/data/biogrid.human.entrez.tsv'

Load and process the data:

GE, G, labels, _ = data_preprocessing(path_expr, path_net)

Set the size of subnetworks:

L_g_min = 10
L_g_max = 15

Set the model and run the search:

model = BiCoN(GE, G, L_g_min, L_g_max)
solution, scores = model.run_search()

Results analysis

The BiCoN package also allows a user to save the results and perform an initial analysis. The examples below show the basic usage; for more details please use the Python help() function, e.g. help(results.save).

First, the object for results analysis must be created:

results = results_analysis(solution, labels)

This makes it easy to access the resulting biclusters and their original IDs, as well as to perform more complex analyses.

To access IDs of patients in the first bicluster:

results.patients1

To access IDs of genes in the first bicluster:

results.genes1

Same logic applies to the second bicluster.

If you would like to use gene names in further analysis, please set 'convert' to True and specify the original gene IDs, i.e.:

results = results_analysis(solution, labels, convert = True, origID = 'entrezgene')

Some other options for the original gene ID: 'ensembl.gene', 'symbol', 'refseq', 'unigene', etc. For all possible options, please check the reference for the MyGene.info gene query web service.

To save the solution:

results.save(output = "results/results.csv")

Visualise the resulting networks colored with respect to their difference in expression patterns between the patient clusters:

results.show_networks(GE, G, output = "results/network.png")

Visualise a clustermap of the achieved solution, either alone or along with the known patient groups. With the BiCoN results only:

results.show_clustermap(GE, G, output = "results/clustermap.png")

If you have a patient phenotype you would like to use for comparison, please make sure that the patient IDs exactly (!) match the IDs that were used as input. The IDs should be represented as a list of two lists, e.g.:

true_classes = [['GSM748056', 'GSM748059',..], ['GSM748278', 'GSM748279', 'GSM1465989']]
results.show_clustermap(GE, G, output = "results/clustermap.png", true_labels = true_classes)
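Because the IDs have to match exactly, it can help to check them before plotting. The helper below is a hypothetical sketch, not part of BiCoN: it lists every phenotype ID that is absent from the input IDs, which you would normally take from the column names of your own expression matrix.

```python
def check_patient_ids(true_classes, input_patients):
    """Return the IDs in true_classes that are absent from the input matrix."""
    known = set(input_patients)
    return [pid for group in true_classes for pid in group if pid not in known]

# Hypothetical example: column names of the input matrix and two phenotype groups.
input_patients = ["GSM748056", "GSM748059", "GSM748278", "GSM748279"]
example_classes = [["GSM748056", "GSM748059"], ["GSM748278", "gsm748279"]]

missing = check_patient_ids(example_classes, input_patients)
print(missing)  # IDs that differ, for instance in letter case, show up here
```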

Given a known phenotype in the format described above, BiCoN can also return the Jaccard index between the achieved patient clustering and the given phenotype:

results.jaccard_index(true_labels = true_classes)
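For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below computes it for two lists of patient IDs. How BiCoN combines the values of the two clusters internally is not shown here, and the function name and sample IDs are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two collections of IDs: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)

predicted = ["p1", "p2", "p3", "p4"]
known = ["p2", "p3", "p4", "p5", "p6"]
print(jaccard(predicted, known))  # 3 shared IDs out of 6 distinct IDs
```

A value of 1 means the two groups contain exactly the same patients, and 0 means they share none.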

BiCoN uses the gseapy module to provide a Python wrapper for the Enrichr database:

results.enrichment_analysis(library = 'GO_Biological_Process_2018', output="results")

After executing the code above, the /results directory contains a table with enriched pathways as well as enrichment plots. Other available libraries can be used as well, e.g. 'GO_Molecular_Function_2018' and 'GO_Cellular_Component_2018'. In total there are 159 libraries available at the moment, and the full list can be found by typing:

import gseapy
gseapy.get_library_name()

Quality control

Algorithm convergence

The best way to check if the algorithm produced high-quality results and there are no issues with the parameters is to analyse the convergence plot:

results.convergence_plot(scores)

The algorithm converged:

If the maximum score has stabilised for several iterations in a row (default is 6).

If the average score became equal (or nearly equal) to the maximal score.

The algorithm did not converge:

If the average and the maximal score improve over the iterations but do not stabilize, increase the maximum number of allowed iterations.

If the scores do not stabilize even after 60-100 iterations, please contact us.
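The two convergence criteria above can also be checked numerically. The sketch below assumes that you have the maximal and average score of each iteration as plain Python lists. The exact structure of the scores object returned by the search is not described here, so extracting those lists is left to the reader, and the function names and tolerances are hypothetical.

```python
def max_score_stable(max_scores, times=6, eps=1e-9):
    """True if the best score did not change over the last `times` iterations."""
    if len(max_scores) < times:
        return False
    recent = max_scores[-times:]
    return max(recent) - min(recent) <= eps

def average_reached_max(avg_score, max_score, tolerance=0.02):
    """True if the average score is within a relative `tolerance` of the maximum."""
    return abs(max_score - avg_score) <= tolerance * abs(max_score)

# Hypothetical score history: the maximum stops improving after four iterations.
max_scores = [3.1, 4.0, 4.6, 5.2, 5.2, 5.2, 5.2, 5.2, 5.2]
print(max_score_stable(max_scores))
print(average_reached_max(5.15, 5.2))
```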

Bad probability update

If you got the following error message:

AssertionError: bad probability update

It can mean one of the following issues:

The settings of the algorithm are too restrictive for your problem. You can try to fix it by repeating the analysis with th = 0, or even th = -1, e.g.:

model = BiCoN(GE, G, L_g_min, L_g_max)
solution, scores = model.run_search(th = 0)

Otherwise, the problem might be related to the way you have processed your data. Please make sure that your data does not contain genes that are not expressed in the majority of the patients.
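One way to check for such genes before running the analysis is to filter the raw matrix by the fraction of zero values per gene. This is a sketch with a hypothetical helper and threshold, meant for the untransformed data with genes as rows and patients as columns.

```python
import pandas as pd

def drop_mostly_unexpressed(expr, max_zero_fraction=0.5):
    """Remove genes (rows) with zero expression in more than max_zero_fraction of patients."""
    zero_fraction = (expr == 0).mean(axis=1)
    return expr.loc[zero_fraction <= max_zero_fraction]

expr = pd.DataFrame(
    [[0.0, 0.0, 0.0, 1.2], [2.1, 0.0, 3.4, 1.9], [0.5, 0.7, 0.0, 0.0]],
    index=["geneA", "geneB", "geneC"],
    columns=["p1", "p2", "p3", "p4"],
)
filtered = drop_mostly_unexpressed(expr)
print(filtered.index.tolist())  # geneA is zero in 3 of 4 patients and is removed
```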

Cite

BiCoN was developed by the Big Data in BioMedicine group and Computational Systems Medicine group at the Chair of Experimental Bioinformatics.

If you use BiCoN in your research, we kindly ask you to cite the following manuscript: Lazareva, O., Van Do, H., Canzar, S., Yuan, K., Tieri, P., Baumbach, J., Kacprowski, T., List, M.: BiCoN: Network-constrained biclustering of patients and omics data. [Submitted]

Contact

If you have difficulties using BiCoN, please open an issue at our GitHub repository. Alternatively, you can write an email to:


Release history

Version	Release date
1.3.4 (this version)	Oct 12, 2022
1.3.3	Mar 15, 2022
1.3.2	Oct 18, 2021
1.3.1	Oct 18, 2021
1.3.0	Apr 14, 2021
1.2.14	Jan 24, 2021
1.2.13	Jan 14, 2021
1.2.12	Jan 14, 2021
1.2.11	Dec 15, 2020
1.2.10	Dec 8, 2020
1.2.9	Dec 4, 2020
1.2.8	Dec 4, 2020
1.2.7	Dec 4, 2020
1.2.6	Nov 23, 2020
1.2.5	Nov 3, 2020
1.2.4	Sep 14, 2020
1.2.3	Jul 24, 2020
1.2.2	Apr 21, 2020
1.2.1	Apr 19, 2020
1.1.8	Apr 6, 2020
1.1.7	Apr 3, 2020
1.1.6	Apr 2, 2020
1.1.5	Apr 2, 2020
1.1.4	Apr 1, 2020
1.1.3	Apr 1, 2020
1.0.11	Apr 1, 2020
1.0.10	Mar 12, 2020
1.0.9	Mar 5, 2020
1.0.8	Feb 27, 2020
1.0.7	Jan 30, 2020
1.0.6	Jan 30, 2020
1.0.5	Jan 30, 2020
1.0.4	Jan 29, 2020
1.0.3	Jan 29, 2020
1.0.2	Jan 29, 2020
1.0.1	Jan 22, 2020
1.0.0	Jan 21, 2020

Download files


Source Distribution

bicon-1.3.4.tar.gz (36.5 kB view details)

Uploaded Oct 12, 2022 Source

File details

Details for the file bicon-1.3.4.tar.gz.

File metadata

Download URL: bicon-1.3.4.tar.gz
Upload date: Oct 12, 2022
Size: 36.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.9.7

File hashes

Hashes for bicon-1.3.4.tar.gz
Algorithm	Hash digest
SHA256	`258160234200ee3531bd22796752124f734861f91a2edf4d266eabe87cc3ac2b`
MD5	`646e8942cc2aa163e9e95e6ec1e51ccc`
BLAKE2b-256	`25ba4a15daba032f6f72f798fdce771d0ef6b020c52d8d5d3ce2fa13cc207153`

