Skip to main content

A python package for modeling GSL metabolism and performing transcriptomic integration

Project description

pyGSLModel

pyGSLModel is a python package designed to facilitate performing metabolic simulations of human GlycoSphingoLipid (GSL) metabolism. It also provides an easy to use strategy for performing transcriptomic integration with iMAT on TCGA and GTEX expression data (accessed from the Xena database).

Quick Start Guide

Model Preparation

pyGSLModel includes functions to download the Human-GEM () genome scale metabolic model of human metabolism and to subsequently convert gene names (with mygene API) and prune the model with fastcore (with pyfastcore) to create a smaller GSL specific model if desired. This has notable performance benefits.

# Imports
from pyGSLModel import download_model, convert_genes, prune_model

# Downloading the HUMAN_GEM
model = download_model()

# Converting genes from ensembl IDs to gene names
model = convert_genes(model)

# Pruning the model with fastcore to  focus on GSL metabolism
model = prune_model(model)

# Removing unwanted transport reactions
model = remove_GSL_transport(model)

The pruning step takes longer as the function runs a number of test simulations on the full-scale model with different obejctives to ensure it doesn't miss any core reactions, it then subsequently performs fastcore pruning with the pyfastcore package. If you know you want to use the pruned GSL focused model, we made a dedicated function that downloads a pre-pruned and tidied model.

# Imports
from pyGSLModel import download_GSL_model

# Downloading the pruned GSL model
model = download_GSL_model()

Standard Simulations of Metabolism

With a prepared model, you are ready to run some metabolic simulations! To do this you can make use of the run_metabolic_model function. This function allows you to simulate your model given a particular method and objective function. The methods include standard FBA which would be enforced via method = "FBA", a multiple-FBA approach in which multiple solutions are found for different lienar paths within GSL metabolism based on the selected objective and recombined into a distribution representative ensemble model: method = "mFBA". To specify the GSL objective function for simulation you select one of D14_Neuron for a Day 14 I3Neuron, D28_Neuron for a Day 28 I3Neuron, AC for an IAstrocyte or MG for an IMicroglia in the objective_choice= argument. These define the set of reactions in the objective function that the solution will optimise for.

# Imports
from pyGSLModel import run_metabolic_model

# Running a simulation
solution = run_metabolic_model(model, method="mFBA", objective_choice="D14_Neuron")

The run_metabolic_model function also allows for easy analysis of gene knockouts with the knockout argument. This is by default set to "WT" which won't change the model, however if you input a gene symbol, this will be knocked out of the model before the simulation is performed e.g., knockout = "B4GALNT1".

# Running a simulation with a gene knocked out.
solution_ko = run_metabolic_model(model, method="mFBA", objective_choice="D14_Neuron", knockout="B4GALNT1")

Analysing and Plotting Results

As pyGSLModels focus is to simplify analysis of GSL metabolism, there are also included functions to analyse the metabolic simulation outputs. These include a tabulation function, a standard bar chart plotting function, and also a network visualisation function that draws the graph of GSL metabolism, with edge widths weighted by reaction flux from the simulation. This network graph is saved to a target folder as an interactive .html file which can be opened in a browser. This allows the user to customise the network to their liking.

# Imports
from pyGSLModel import tabulate_model_results, plot_model_results, visualise_flux_network

# First you can tabule the model results for key GSLs
results_df = tabulate_model_results(model, solution)

# You can then use that data frame as input to the default plotting function
results_fig = plot_model_results(results_df)

# You can also produce an interactive .html representation of the GSL metabolic network with edges weighted by flux
visualise_flux_network(model,solution,file_path="C:/path/to/folder/my_network.html",height="1080px",width="100%",rxn_col="#1f77b4",met_col="#abdcff")

Transcriptomic Integration

pyGSLModel includes a number of functions to integrate transcriptomic data with your metabolic model to generate fluxes.

There are four functions to integrate transcriptomic data with the iMAT method (implemented here with imatpy). This method doesn't require a conventional objective function, instead it uses the gene expression data to drive the simulation. This is useful for complex multicellular organisms and tissues, where a clear biological objective is difficult to discern.

Two of the functions utilise TCGA and GTEX data from the TCGA_TARGET_GTEX dataset (Xena).

The first of these functions: TCGA_iMAT_Integrate() takes TCGA data averaged over multiple samples for each cancer/normal tissue and performs iMAT based on genes of GSL metabolism. You can tune the function by adjusting the upper/lower quantiles for how the data is processed. An upper_quantile = 0.25 and a lower_quantile = 0.75 means the top 25% of expression values will be assigned a +1, the bottom 25% will be assigned -1 and the remaining values assigned a 0. This is the format needed for iMAT to optimise the model to prioritise high expression gene associated reactions. This function gives two outputs, a dataframe with the flux results in each tissue for each key lipid, but also a dictionary containing the solution objects for further analysis.

The second of these functions: TCGA_iMAT_sample_integrate() performs simarly to TCGA_iMAT_Integrate(), however it doesn't average the data for a particular cancer/tissue, instead it calculates fluxes for each individual sample in the data set for the given tissue argument. You can also decide if you want to run simulations for both TCGA and GTEX data or just TCGA data with the dataset argument. This produces a lot more data and needs to run a lot of simulations so can take some time depending on the tissue you choose. Due to large numbers of samples this function only gives one output, a dataframe with flux results for each sample.

# Imports
from pyGSLModel import TCGA_iMAT_integrate, TCGA_iMAT_sample_integrate

# Performing iMAT on TCGA and GTEX data where expression has been averaged across samples of the same disease/tissue
results_avg_df, sol_dict_TCGA = TCGA_iMAT_integrate(model, upper_quantile = 0.25, lower_quantile = 0.75, epsilon=1, threshold=0.01)

# Performing iMAT on TCGA and GTEX data for each individual sample for a given tissue
results_sample_df = TCGA_iMAT_sample_integrate(model, tissue="Brain", dataset="TCGA", upper_quantile = 0.25, lower_quantile=0.75, epsilon=1, threshold=0.01)

The third and fourth iMAT integration methods allows the user to input their own data. The required format is a dataframe where each column represents a different sample and the index contains the gene symbols. The first of these methods (iMAT_integrate) take a dataframe with only a single column for the sample and returns a cobrapy solution object, the second of these methods (iMAT_multi_integrate) can take multiple sample columns but returns a dataframe with the fluxes for each sample stored.

# Imports
import pandas as pd
from pyGSLModel import iMAT_integrate

### Single Integration ###

import pandas as pd
# Making a dummy dataframe
d = {
    "Gene" : ["B4GALNT1", "ST3GAL5", "ST8SIA1","A4GALT"],
    "Sample_1" : [8,6,4,2]
}

# Converting the dictionary to a pandas dataframe and setting the index to Gene
iMAT_df_1 = pd.DataFrame(d)
iMAT_df_1 = iMAT_df_1.set_index("Gene").copy()

sol_4 = iMAT_integrate(model_3,iMAT_df_1)


### Multi Integration ###
from pyGSLModel import iMAT_multi_integrate

# Making a mock dataframe
d = {
    "Gene" : ["B4GALNT1", "ST3GAL5", "ST8SIA1"],
    "Sample_1" : [10,5,1],
    "Sample_2" : [1,5,10],
    "Sample_3" : [5,1,10],
}

mock_df = pd.DataFrame(d)
mock_df = mock_df.set_index("Gene")

# Perfoming iMAT
results_custom_df = iMAT_multi_integrate(model, mock_df, upper_quantile = 0.25, lower_quantile = 0.75, epsilon=1, threshold=0.01)

Calculating GSL risk scores for prognosis prediction in lower grade glioma based on RNA-Seq data

pyGSLModel enables the calculation of GSl risk scores for lower grade glioma RNA-Seq data. The risk scores are calculated through an ensemble machine learning approach. For more details see the github repository: JACKWJW Github LGG Prognosis Prediction. Check the repository for updates as to when the paper is published!

The Ensemble machine learning approach was validated by training on TCGA data, before validation against CGGA data. Final models were then trained using both TCGA and CGGA data. Models can be accessed via huggingface: JACKWJW Hugging Face

To calculate the GSL risk score your your data, you can use the calculate_GSL_score function. Data should be provided in the form of pandas dataframe with gene symbols as index and samples as columns. It is essential that expression values are recorded as TPMs (the function will log2 normalise the TPM values internally, so raw TPMs must be provided).

# Imports
from pyGSLModel import calculate_GSL_score

# Data should be in the format with gene symbols as index and samples as columns.
# You can provide your full RNA-seq datasets in this format, the function will trim the data to include only the relevant columns internally.
# Expression data should be in the TPM format
d = {
    "Gene" : ["B4GALNT1", "ST3GAL5", "ST8SIA1","A4GALT"],
    "Sample_1" : [8,6,4,2],
    "Sample_2" : [2,4,6,8]
}

# Converting the dictionary to a pandas dataframe and setting the index to Gene
test_df = pd.DataFrame(d)
test_df = test_df.set_index("Gene").copy()

# Perfoming GSL risk score calculation
results_df = calculate_GSL_score(data=test_df)

Dependencies and License

pyGSLModel is under a GNU license. Dependecies include:

  • requests (Apache)
  • cobra (GNU)
  • pyfastcore (MIT)
  • mygene (BSD)
  • pandas (BSD)
  • matplotlib (PSF)
  • seaborn (BSD)
  • imatpy (MIT)
  • numpy (BSD)
  • networkx (BSD)
  • pyvis (BSD)
  • pytorch (BSD)
  • sklearn (BSD)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pygslmodel-1.0.0.tar.gz (33.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pygslmodel-1.0.0-py3-none-any.whl (31.7 kB view details)

Uploaded Python 3

File details

Details for the file pygslmodel-1.0.0.tar.gz.

File metadata

  • Download URL: pygslmodel-1.0.0.tar.gz
  • Upload date:
  • Size: 33.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for pygslmodel-1.0.0.tar.gz
Algorithm Hash digest
SHA256 b99c7333a39987839704d0cfe5153024c8a072032a0d10417029862baa3e3b2b
MD5 0cdae769df832b21bcff44b4a767e34a
BLAKE2b-256 b20d73e3b9070e4af62b96932cacc25278c65b7f51e8bb9df489340a6172ce03

See more details on using hashes here.

File details

Details for the file pygslmodel-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: pygslmodel-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 31.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for pygslmodel-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 35bc886d6ff2a3b7f435e85d8c5afa5db8dd74a63c7d269e670c214edd209873
MD5 76e7a6e3547665d1106928b21824d2f8
BLAKE2b-256 1856ffbea50bb7b18f44784ec0f3f528c7cc85607778bd31a8636ebe43fe9391

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page