A python package for modeling GSL metabolism and performing transcriptomic integration
Project description
pyGSLModel
pyGSLModel is a python package designed to facilitate performing metabolic simulations of human GlycoSphingoLipid (GSL) metabolism. It also provides an easy to use strategy for performing transcriptomic integration with iMAT on TCGA and GTEX expression data (accessed from the Xena database).
Quick Start Guide
Model Preparation
pyGSLModel includes functions to download the Human-GEM () genome scale metabolic model of human metabolism and to subsequently convert gene names (with mygene API) and prune the model with fastcore (with pyfastcore) to create a smaller GSL specific model if desired. This has notable performance benefits.
# Imports
from pyGSLModel import download_model, convert_genes, prune_model
# Downloading the HUMAN_GEM
model = download_model()
# Converting genes from ensembl IDs to gene names
model = convert_genes(model)
# Pruning the model with fastcore to focus on GSL metabolism
model = prune_model(model)
# Removing unwanted transport reactions
model = remove_GSL_transport(model)
The pruning step takes longer as the function runs a number of test simulations on the full-scale model with different obejctives to ensure it doesn't miss any core reactions, it then subsequently performs fastcore pruning with the pyfastcore package. If you know you want to use the pruned GSL focused model, we made a dedicated function that downloads a pre-pruned and tidied model.
# Imports
from pyGSLModel import download_GSL_model
# Downloading the pruned GSL model
model = download_GSL_model()
Standard Simulations of Metabolism
With a prepared model, you are ready to run some metabolic simulations!
To do this you can make use of the run_metabolic_model function. This function allows you to simulate your model given a particular method and objective function.
The methods include standard FBA which would be enforced via method = "FBA", a multiple-FBA approach in which multiple solutions are found for different lienar paths within GSL metabolism based on the selected objective and recombined into a distribution representative ensemble model: method = "mFBA". To specify the GSL objective function for simulation you select one of D14_Neuron for a Day 14 I3Neuron, D28_Neuron for a Day 28 I3Neuron, AC for an IAstrocyte or MG for an IMicroglia in the objective_choice= argument. These define the set of reactions in the objective function that the solution will optimise for.
# Imports
from pyGSLModel import run_metabolic_model
# Running a simulation
solution = run_metabolic_model(model, method="mFBA", objective_choice="D14_Neuron")
The run_metabolic_model function also allows for easy analysis of gene knockouts with the knockout argument. This is by default set to "WT" which won't change the model, however if you input a gene symbol, this will be knocked out of the model before the simulation is performed e.g., knockout = "B4GALNT1".
# Running a simulation with a gene knocked out.
solution_ko = run_metabolic_model(model, method="mFBA", objective_choice="D14_Neuron", knockout="B4GALNT1")
Analysing and Plotting Results
As pyGSLModels focus is to simplify analysis of GSL metabolism, there are also included functions to analyse the metabolic simulation outputs. These include a tabulation function, a standard bar chart plotting function, and also a network visualisation function that draws the graph of GSL metabolism, with edge widths weighted by reaction flux from the simulation. This network graph is saved to a target folder as an interactive .html file which can be opened in a browser. This allows the user to customise the network to their liking.
# Imports
from pyGSLModel import tabulate_model_results, plot_model_results, visualise_flux_network
# First you can tabule the model results for key GSLs
results_df = tabulate_model_results(model, solution)
# You can then use that data frame as input to the default plotting function
results_fig = plot_model_results(results_df)
# You can also produce an interactive .html representation of the GSL metabolic network with edges weighted by flux
visualise_flux_network(model,solution,file_path="C:/path/to/folder/my_network.html",height="1080px",width="100%",rxn_col="#1f77b4",met_col="#abdcff")
Transcriptomic Integration
pyGSLModel includes a number of functions to integrate transcriptomic data with your metabolic model to generate fluxes.
There are four functions to integrate transcriptomic data with the iMAT method (implemented here with imatpy). This method doesn't require a conventional objective function, instead it uses the gene expression data to drive the simulation. This is useful for complex multicellular organisms and tissues, where a clear biological objective is difficult to discern.
Two of the functions utilise TCGA and GTEX data from the TCGA_TARGET_GTEX dataset (Xena).
The first of these functions: TCGA_iMAT_Integrate() takes TCGA data averaged over multiple samples for each cancer/normal tissue and performs iMAT based on genes of GSL metabolism. You can tune the function by adjusting the upper/lower quantiles for how the data is processed. An upper_quantile = 0.25 and a lower_quantile = 0.75 means the top 25% of expression values will be assigned a +1, the bottom 25% will be assigned -1 and the remaining values assigned a 0. This is the format needed for iMAT to optimise the model to prioritise high expression gene associated reactions. This function gives two outputs, a dataframe with the flux results in each tissue for each key lipid, but also a dictionary containing the solution objects for further analysis.
The second of these functions: TCGA_iMAT_sample_integrate() performs simarly to TCGA_iMAT_Integrate(), however it doesn't average the data for a particular cancer/tissue, instead it calculates fluxes for each individual sample in the data set for the given tissue argument. You can also decide if you want to run simulations for both TCGA and GTEX data or just TCGA data with the dataset argument. This produces a lot more data and needs to run a lot of simulations so can take some time depending on the tissue you choose. Due to large numbers of samples this function only gives one output, a dataframe with flux results for each sample.
# Imports
from pyGSLModel import TCGA_iMAT_integrate, TCGA_iMAT_sample_integrate
# Performing iMAT on TCGA and GTEX data where expression has been averaged across samples of the same disease/tissue
results_avg_df, sol_dict_TCGA = TCGA_iMAT_integrate(model, upper_quantile = 0.25, lower_quantile = 0.75, epsilon=1, threshold=0.01)
# Performing iMAT on TCGA and GTEX data for each individual sample for a given tissue
results_sample_df = TCGA_iMAT_sample_integrate(model, tissue="Brain", dataset="TCGA", upper_quantile = 0.25, lower_quantile=0.75, epsilon=1, threshold=0.01)
The third and fourth iMAT integration methods allows the user to input their own data. The required format is a dataframe where each column represents a different sample and the index contains the gene symbols. The first of these methods (iMAT_integrate) take a dataframe with only a single column for the sample and returns a cobrapy solution object, the second of these methods (iMAT_multi_integrate) can take multiple sample columns but returns a dataframe with the fluxes for each sample stored.
# Imports
import pandas as pd
from pyGSLModel import iMAT_integrate
### Single Integration ###
import pandas as pd
# Making a dummy dataframe
d = {
"Gene" : ["B4GALNT1", "ST3GAL5", "ST8SIA1","A4GALT"],
"Sample_1" : [8,6,4,2]
}
# Converting the dictionary to a pandas dataframe and setting the index to Gene
iMAT_df_1 = pd.DataFrame(d)
iMAT_df_1 = iMAT_df_1.set_index("Gene").copy()
sol_4 = iMAT_integrate(model_3,iMAT_df_1)
### Multi Integration ###
from pyGSLModel import iMAT_multi_integrate
# Making a mock dataframe
d = {
"Gene" : ["B4GALNT1", "ST3GAL5", "ST8SIA1"],
"Sample_1" : [10,5,1],
"Sample_2" : [1,5,10],
"Sample_3" : [5,1,10],
}
mock_df = pd.DataFrame(d)
mock_df = mock_df.set_index("Gene")
# Perfoming iMAT
results_custom_df = iMAT_multi_integrate(model, mock_df, upper_quantile = 0.25, lower_quantile = 0.75, epsilon=1, threshold=0.01)
Calculating GSL risk scores for prognosis prediction in lower grade glioma based on RNA-Seq data
pyGSLModel enables the calculation of GSl risk scores for lower grade glioma RNA-Seq data. The risk scores are calculated through an ensemble machine learning approach. For more details see the github repository: JACKWJW Github LGG Prognosis Prediction. Check the repository for updates as to when the paper is published!
The Ensemble machine learning approach was validated by training on TCGA data, before validation against CGGA data. Final models were then trained using both TCGA and CGGA data. Models can be accessed via huggingface: JACKWJW Hugging Face
To calculate the GSL risk score your your data, you can use the calculate_GSL_score function. Data should be provided in the form of pandas dataframe with gene symbols as index and samples as columns. It is essential that expression values are recorded as TPMs (the function will log2 normalise the TPM values internally, so raw TPMs must be provided).
# Imports
from pyGSLModel import calculate_GSL_score
# Data should be in the format with gene symbols as index and samples as columns.
# You can provide your full RNA-seq datasets in this format, the function will trim the data to include only the relevant columns internally.
# Expression data should be in the TPM format
d = {
"Gene" : ["B4GALNT1", "ST3GAL5", "ST8SIA1","A4GALT"],
"Sample_1" : [8,6,4,2],
"Sample_2" : [2,4,6,8]
}
# Converting the dictionary to a pandas dataframe and setting the index to Gene
test_df = pd.DataFrame(d)
test_df = test_df.set_index("Gene").copy()
# Perfoming GSL risk score calculation
results_df = calculate_GSL_score(data=test_df)
Dependencies and License
pyGSLModel is under a GNU license. Dependecies include:
- requests (Apache)
- cobra (GNU)
- pyfastcore (MIT)
- mygene (BSD)
- pandas (BSD)
- matplotlib (PSF)
- seaborn (BSD)
- imatpy (MIT)
- numpy (BSD)
- networkx (BSD)
- pyvis (BSD)
- pytorch (BSD)
- sklearn (BSD)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pygslmodel-1.0.0.tar.gz.
File metadata
- Download URL: pygslmodel-1.0.0.tar.gz
- Upload date:
- Size: 33.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b99c7333a39987839704d0cfe5153024c8a072032a0d10417029862baa3e3b2b
|
|
| MD5 |
0cdae769df832b21bcff44b4a767e34a
|
|
| BLAKE2b-256 |
b20d73e3b9070e4af62b96932cacc25278c65b7f51e8bb9df489340a6172ce03
|
File details
Details for the file pygslmodel-1.0.0-py3-none-any.whl.
File metadata
- Download URL: pygslmodel-1.0.0-py3-none-any.whl
- Upload date:
- Size: 31.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
35bc886d6ff2a3b7f435e85d8c5afa5db8dd74a63c7d269e670c214edd209873
|
|
| MD5 |
76e7a6e3547665d1106928b21824d2f8
|
|
| BLAKE2b-256 |
1856ffbea50bb7b18f44784ec0f3f528c7cc85607778bd31a8636ebe43fe9391
|