A package implementing a Principal Components Analysis (PCA)-based approach to identify biological pathways driving inter-tumour heterogeneity in gene regulatory networks.
Project description
PORCUPINE
Principal Components Analysis to Obtain Regulatory Contributions Using Pathway-based Interpretation of Network Estimates is an R package to identify biological pathway which drive inter-tumour heterogeneity in a population of gene regulatory networks. It is a Principal Components Analysis (PCA)-based approach that can be used to determine whether a specific set of variables—for example a set of genes in a specific pathway—have coordinated variability in their regulation.
Method
PORCUPINE uses as input individual patient networks, for example networks modeled using PANDA and LIONESS, as well as a .gmt file that includes biological pathways and the genes belonging to them. For each pathway, it extracts all edges connected to the genes belonging to that pathway and scales each edge across individuals. It then performs a PCA analysis on these edge weights, as well as on a null background that is based on random pathways. To identify significant pathways, PORCUPINE applies a one-tailed t-test and calculates the effect size (ES). Here we provide an example of how we analyzed heterogeneity among single gene regulatory asmple networks in Leiomyosarcomas (LMS). Networks were obtained with PANDA and LIONESS algorithms. Our dataset contains data for 80 TCGA LMS patient-specific gene regulatory networks.
Setup
The requirements are provided in a requirements.txt file. Install package
pip install netzoo_porcupine
Usage
import netzoo_porcupine as pcp
import statsmodels.stats.multitest as smm
First, we load the network data (80_tcga_lms_net.RData) and edges information (rand_genes.RData). Data files are available here : https://zenodo.org/records/8105729. In this example, we have patient-specific gene regulatory networks for 80 TCGA leiomyosarcoma patients. The first three columns in the data provide information on the regulators (TFs) and target genes. Edges information includes three columns: reg (the transcription factor's gene symbol),tar (Ensembl ID), prior (whether an edge is prior (1) or not (0)).
net_file_path = "/div/pythagoras/u1/rosaliec/80_tcga_lms_net.RData"
net = pcp.load_net_file(net_file_path)
print(net.head())
# 0E244FE2-7C17-4642-A51F-2CCA796D9C70 75435ED8-93E8-45FB-8480-98D8EB2EF8CB \
#0 0.76 0.10
#1 0.94 1.43
#2 1.09 2.78
#3 1.13 2.60
#4 -0.71 -1.42
#
# B6D11678-15A9-4F43-A0A2-225067DCAF1C B7F5A41E-9559-4329-81F5-1B88A74730B7 \
#0 -1.27 0.01
#1 0.30 0.91
#2 1.01 2.13
#3 1.66 1.71
#4 0.02 0.27
#
# 04823F53-A12D-4852-8F34-77B9DCBB7DF0 49684C2B-D31C-4B45-A400-3497C3CCEC01 \
#0 -7.18 2.74
#1 -5.69 2.43
#2 -6.09 3.23
#3 -6.04 2.88
#4 5.31 -1.02
#
# FFDD7A12-DDEF-4974-8D60-64B7EEAAC994 830DFA6F-A85A-4317-82B2-791FAB998A01 \
#0 -0.80 -0.31
#1 -1.55 -2.56
#2 1.04 -2.07
#...
#3 0.13
#4 -0.06
edges_file_path = "/div/pythagoras/u1/rosaliec/edges.RData"
edges = pcp.load_edges_file(edges_file_path)
print(edges.head())
# reg tar prior
#0 AIRE A1BG 0.0
#1 ALX1 A1BG 0.0
#2 ALX3 A1BG 0.0
#3 ALX4 A1BG 0.0
#4 AR A1BG 0.0
print(len(edges['reg'].unique()))
#623
print(len(edges['tar'].unique()))
#17899
Our individual networks are represented by interactions between 623 TFs and 17,899 target genes. Then, we need to load in pathway file (.gmt file) available in the data folder.
pathways_file_path = "/div/pythagoras/u1/rosaliec/c2.cp.reactome.v7.1.symbols.gmt"
pathways = pcp.load_gmt(pathways_file_path)
print(len(pathways))
#1532
We need to filter pathways in order to include only pathways with genes that are present in our network file.
pathways_filt=pcp.filter_pathways(pathways,edges)
print(len(pathways_filt))
#1531
Then, we filter pathways based on their size, and all pathways with less than 10 and more than 150 genes are filtered out.
pathways_to_use=pcp.filter_pathways_size(pathways_filt,5,150)
print(len(pathways_to_use))
#1411
We select the top 10 pathways for the analysis (just to reduce computational time).
pathways_to_use=pathways_to_use[0:10]
Next, we perform PCA analysis for the 10 selected pathawys and extract the information on the variance explained by the first principal component in each pathway.
pca_res_pathways=pcp.pca_pathway(pathways_to_use, net, edges,scale_data = True, center_data = True)
print(pca_res_pathways)
# pathway pc1 n_edges \
#0 REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS 16.787642 8099
#1 REACTOME_PYRIMIDINE_CATABOLISM 23.986634 7476
#2 REACTOME_INHIBITION_OF_THE_PROTEOLYTIC_ACTIVIT... 18.545908 12460
#3 REACTOME_PYRUVATE_METABOLISM_AND_CITRIC_ACID_T... 20.999699 32396
#4 REACTOME_APOPTOTIC_CLEAVAGE_OF_CELLULAR_PROTEINS 16.808329 21182
#5 REACTOME_ACTIVATION_OF_THE_PRE_REPLICATIVE_COM... 20.919336 20559
#6 REACTOME_PROCESSING_OF_INTRONLESS_PRE_MRNAS 18.129761 10591
#7 REACTOME_GAP_JUNCTION_DEGRADATION 37.223203 7476
#8 REACTOME_BASE_EXCISION_REPAIR 14.640845 27412
#9 REACTOME_BILE_ACID_AND_BILE_SALT_METABOLISM 21.326899 26789
#
# pathway_size
#0 13
#1 12
#2 20
#3 52
#4 34
#5 33
#6 17
#7 12
#8 44
#9 43
Then we perform a PCA analysis based on random gene sets. In this case we create 50 random gene sets for each pathway and run PCA for each gene set. This is just an example, for real comparisons we recommend to set n_perm to 1000.
pca_res_random = pcp.pca_random(net, edges, pca_res_pathways, pathways_filt, n_perm = 50, scale_data= True, center_data = True)
Then to identify significant pathways we run PORCUPINE, which compares the observed PCA score for a pathway to a set of PCA scores of random gene sets of the same size as pathway. Calculates p-value and effect size. The p-values can differ because of the random nature of this test.
res_porcupine = pcp.porcupine(pca_res_pathways,pca_res_random)
#P_adjust calculation using FDR method (Benjamini-Hochberg)
res_porcupine['p.adjust'] = smm.multipletests(res_porcupine['pval'], method='fdr_bh')[1]
print(res_porcupine)
#Pathways with size 13
#Pathways with size 12
#Pathways with size 20
#Pathways with size 52
#Pathways with size 34
#Pathways with size 33
#Pathways with size 17
#Pathways with size 44
#Pathways with size 43
# pathway pathway_size pc1 \
#0 REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS 13 16.787642
#1 REACTOME_PYRIMIDINE_CATABOLISM 12 23.986634
#2 REACTOME_GAP_JUNCTION_DEGRADATION 12 37.223203
#3 REACTOME_INHIBITION_OF_THE_PROTEOLYTIC_ACTIVIT... 20 18.545908
#4 REACTOME_PYRUVATE_METABOLISM_AND_CITRIC_ACID_T... 52 20.999699
#5 REACTOME_APOPTOTIC_CLEAVAGE_OF_CELLULAR_PROTEINS 34 16.808329
#6 REACTOME_ACTIVATION_OF_THE_PRE_REPLICATIVE_COM... 33 20.919336
#7 REACTOME_PROCESSING_OF_INTRONLESS_PRE_MRNAS 17 18.129761
#8 REACTOME_BASE_EXCISION_REPAIR 44 14.640845
#9 REACTOME_BILE_ACID_AND_BILE_SALT_METABOLISM 43 21.326899
#
# pval es p.adjust
#0 1.000000e+00 0.963420 1.000000e+00
#1 2.007815e-05 0.638230 5.019538e-05
#2 9.465539e-36 4.740924 9.465539e-35
#3 9.997105e-01 0.520523 1.000000e+00
#4 3.203126e-07 0.808265 1.601563e-06
#5 1.000000e+00 1.275528 1.000000e+00
#6 6.475769e-03 0.364771 1.295154e-02
#7 9.998534e-01 0.551486 1.000000e+00
#8 1.000000e+00 1.724658 1.000000e+00
#9 2.898786e-06 0.718975 9.662619e-06
Significant pathways can be selected based on adjusted p-value, explained variance and effect size.
To obtain pathway-based patient heterogeneity scores on the first two principal components.
ind_res = pcp.get_pathway_ind_scores(pathways_to_use, net,edges, scale_data = True, center_data = True)
print(ind_res[0:10])
# pathway \
#0E244FE2-7C17-4642-A51F-2CCA796D9C70 REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS
#75435ED8-93E8-45FB-8480-98D8EB2EF8CB REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS
#B6D11678-15A9-4F43-A0A2-225067DCAF1C REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS
#B7F5A41E-9559-4329-81F5-1B88A74730B7 REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS
#04823F53-A12D-4852-8F34-77B9DCBB7DF0 REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS
#49684C2B-D31C-4B45-A400-3497C3CCEC01 REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS
#FFDD7A12-DDEF-4974-8D60-64B7EEAAC994 REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS
#830DFA6F-A85A-4317-82B2-791FAB998A01 REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS
#58578614-E4A3-4655-BBAB-F65851625E0A REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS
#4139E0C9-F712-4A25-8B59-587533B93B3E REACTOME_GLYCOGEN_BREAKDOWN_GLYCOGENOLYSIS
#
# Dim.1 Dim.2
#0E244FE2-7C17-4642-A51F-2CCA796D9C70 -0.124027 -7.807079
#75435ED8-93E8-45FB-8480-98D8EB2EF8CB 20.641956 -6.844048
#B6D11678-15A9-4F43-A0A2-225067DCAF1C 6.240503 -8.771506
#B7F5A41E-9559-4329-81F5-1B88A74730B7 24.644962 -0.103552
#04823F53-A12D-4852-8F34-77B9DCBB7DF0 -8.556204 -4.645454
#49684C2B-D31C-4B45-A400-3497C3CCEC01 2.916440 -21.810588
#FFDD7A12-DDEF-4974-8D60-64B7EEAAC994 12.886658 -15.456415
#830DFA6F-A85A-4317-82B2-791FAB998A01 39.929420 5.942546
#58578614-E4A3-4655-BBAB-F65851625E0A -10.705200 -19.711465
#4139E0C9-F712-4A25-8B59-587533B93B3E -95.639682 60.061360
PORCUPINE allows to identify patient subtypes based on gene regulatory networks for each of the significant pathways. For this, K-means clustering is applied to the pathway-based patient heterogeneity scores on the first two principal components. The optimal number of clusters can be determined prior to clustering using the Average Silhouette Method.
Here we provide example of stratifying patients based on the “REACTOME VITAMIN B5 PANTOTHENATE METABOLISM” pathway.
print(pathways[37])
#('REACTOME_VITAMIN_B5_PANTOTHENATE_METABOLISM', ['COASY', 'VNN1', 'VNN2', 'PANK3', 'PDZD11', 'SLC25A16', 'PANK2', 'PPCS', 'ENPP2', 'SLC5A6', 'PPCDC', 'AASDHPPT', 'PANK1', 'ENPP3', 'PANK4', 'FASN', 'ENPP1'])
pcp.select_number_clusters(pathways[37],net, edges)
#The optimal number of clusters is 2. To visualize clusters
pcp.visualize_clusters(pathways[37], net, edges, number_of_clusters = 2)
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file netzoo_porcupine-1.0.3.tar.gz.
File metadata
- Download URL: netzoo_porcupine-1.0.3.tar.gz
- Upload date:
- Size: 17.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8920b2d0e2eb52d00978dc31a627d0d458b99388ccf99a0b335a2d7c23b1affc
|
|
| MD5 |
4ef817bd408fbff5770d1bc1c9936222
|
|
| BLAKE2b-256 |
20a48d4d9818c7009e6b6704e6377d65493d7daa3bf4999c32ff8b49269bd2be
|
File details
Details for the file netzoo_porcupine-1.0.3-py3-none-any.whl.
File metadata
- Download URL: netzoo_porcupine-1.0.3-py3-none-any.whl
- Upload date:
- Size: 15.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b2a06808369139f7be618ebe8cf991927fd7a2bd2b5694e86ebe35734b9ddde6
|
|
| MD5 |
6f26936c783c473422c945482765feea
|
|
| BLAKE2b-256 |
2f23e45018aaf1f8a8cb9cfff6593ab4813d355c08668eb5eb7ae9934ec62fb8
|