
Scalable gene regulatory network inference using tree-based ensemble regressors with p-values

Project description

SignifiKANTE builds upon the arboreto software library to enable regression-based gene regulatory network inference and efficient, permutation-based empirical P-value computation for predicted regulatory links.

Installation

SignifiKANTE is installable via pip from PyPI using

pip install signifikante

or locally from this repository with

git clone git@github.com:bionetslab/SignifiKANTE.git
cd SignifiKANTE
pip install -e .

To install with pixi, first download and install pixi, then run

git clone git@github.com:bionetslab/SignifiKANTE.git
cd SignifiKANTE
pixi install

To create a Jupyter kernel from the custom environment defined in pixi.toml/pyproject.toml (which includes ipython), run

git clone git@github.com:bionetslab/SignifiKANTE.git
cd SignifiKANTE
pixi run -e kernel install-kernel

Example workflow of SignifiKANTE’s FDR control

SignifiKANTE provides efficient FDR control for regulatory links predicted by any regression-based GRN inference method. Currently, SignifiKANTE includes GRNBoost2, GENIE3, xgboost, and lasso regression for GRN inference. To integrate further regression-based GRN inference methods, see the manual in the section “Integration of additional regression-based GRN inference methods” below. The following minimal working example runs SignifiKANTE with GRNBoost2 on a simulated dataset:

import pandas as pd
import numpy as np
from signifikante.algo import signifikante_fdr

if __name__ == "__main__":

    # Simulate expression dataset with 100 samples and 10 genes.
    expression_data = np.random.randn(100, 10)
    expression_df = pd.DataFrame(expression_data, columns=[f"Gene{i}" for i in range(10)])
    # Simulate three artificial TFs.
    tf_list = [f"Gene{i}" for i in range(3)]

    # Run SignifiKANTE's approximate FDR control.
    fdr_grn = signifikante_fdr(
                expression_data=expression_df,
                normalize_gene_expression=True,
                tf_names=tf_list,
                cluster_representative_mode="random",
                num_target_clusters=2,
                inference_mode="grnboost2",
                apply_bh_correction=True)
    print(fdr_grn)

Parameter descriptions

Below is a more detailed description of the parameters of signifikante_fdr, SignifiKANTE’s central function for FDR control. The two required input parameters are:

  • expression_data [pd.DataFrame]: Expression matrix with genes as columns and samples as rows.

  • cluster_representative_mode [str]: How to draw representatives from target gene clusters. Can be one of “random” or “medoid” for approximate P-value computation, or “all_genes” for exact (DIANE-like) P-values.
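To make the difference between the “random” and “medoid” modes concrete, here is a minimal sketch of how one representative could be drawn from a single target gene cluster. It is an illustration only, not SignifiKANTE's implementation: the helper names wasserstein_1d and pick_representative are hypothetical, and the sketch assumes that genes are compared by the 1-D Wasserstein distance between their expression distributions (in line with the documented default “wasserstein” of target_cluster_mode).

```python
import numpy as np


def wasserstein_1d(x, y):
    """1-D Wasserstein distance between two equally sized samples.

    For equal sample sizes this is the mean absolute difference
    of the sorted values.
    """
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))


def pick_representative(expression, mode="medoid", rng=None):
    """Return the column index of one representative gene of a cluster.

    expression: array of shape (samples, genes) holding only the
    genes of one cluster. Hypothetical helper for illustration.
    """
    n_genes = expression.shape[1]
    if mode == "random":
        rng = np.random.default_rng() if rng is None else rng
        return int(rng.integers(n_genes))
    if mode == "medoid":
        # Pairwise distances between all genes of the cluster.
        dist = np.zeros((n_genes, n_genes))
        for i in range(n_genes):
            for j in range(i + 1, n_genes):
                d = wasserstein_1d(expression[:, i], expression[:, j])
                dist[i, j] = dist[j, i] = d
        # The medoid minimizes the summed distance to all other genes.
        return int(np.argmin(dist.sum(axis=1)))
    raise ValueError(f"Unknown mode: {mode}")


# Toy cluster with three genes (columns) and three samples (rows).
cluster = np.array([[0.0, 1.0, 5.0],
                    [1.0, 2.0, 6.0],
                    [2.0, 3.0, 7.0]])
medoid_index = pick_representative(cluster, mode="medoid")
random_index = pick_representative(cluster, mode="random",
                                   rng=np.random.default_rng(0))
```

In this picture, P-values computed for the representative stand in for the whole cluster, which is why these two modes are approximate, while “all_genes” skips the shortcut.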

Additional parameters of SignifiKANTE’s FDR control:

  • normalize_gene_expression [bool]: Whether or not to apply z-score normalization to the gene columns of the input expression matrix.

  • inference_mode [str]: Which GRN inference method to use under the hood. Can be one of “grnboost2”, “genie3”, “xgboost”, and “lasso”. Defaults to “grnboost2”.

  • num_permutations [int]: How many permutations to perform for the random background model used in empirical P-value computation. Defaults to 1000.

  • tf_names [list]: List of strings representing TF names. Should be a subset of the gene names contained in expression_data. Defaults to None. If no list is given, all genes are treated as potential TFs.

  • apply_bh_correction [bool]: Whether or not to additionally return Benjamini-Hochberg adjusted P-values.

  • input_grn [pd.DataFrame]: Reference GRN to use for FDR control. Must contain the columns ‘TF’, ‘target’, and ‘importance’. Should only be used when it is clear that this GRN was inferred with the same method indicated in inference_mode. Defaults to None. If no reference GRN is given, a new one is inferred at the beginning.

  • target_subset [list]: Subset of target genes to consider for FDR control. Only compatible with “all_genes” FDR mode.

  • num_target_clusters [int]: Number of target gene clusters. If set to -1, no target gene clustering will be applied. Defaults to -1.

  • num_tf_clusters [int]: Experimental feature. Number of desired TF clusters. If set to -1, no TF clustering will be applied. Defaults to -1.

  • target_cluster_mode [str]: Experimental feature. Which clustering mode to use for target gene clustering. Defaults to “wasserstein”.

  • tf_cluster_mode [str]: Experimental feature. Which clustering mode to use for TF clustering. Defaults to “correlation”.

  • scale_for_tf_sampling [bool]: Experimental feature. Whether or not to keep track of occurrences of edges in permuted GRNs. Defaults to False.
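As a rough illustration of what num_permutations and apply_bh_correction control, the sketch below computes an empirical P-value for one edge from a permutation background and applies the Benjamini-Hochberg adjustment to a list of P-values. It is a generic textbook version under stated assumptions, not SignifiKANTE's internal code: the function names are hypothetical, and the "+1" pseudocount in the empirical P-value is one common convention that SignifiKANTE may or may not use.

```python
import numpy as np


def empirical_pvalue(observed, permuted):
    """Empirical P-value of an observed edge importance.

    permuted: importances of the same edge under permuted data.
    Assumption: the "+1" pseudocount convention, which keeps
    P-values strictly above zero.
    """
    permuted = np.asarray(permuted, dtype=float)
    return (1 + np.sum(permuted >= observed)) / (1 + permuted.size)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, in the input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest P-value by n / i.
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest P-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


# Toy background of four permutations for a single edge.
p_edge = empirical_pvalue(3.0, [1.0, 2.0, 3.0, 4.0])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

With this convention the smallest attainable P-value is 1 / (num_permutations + 1), so more permutations give finer resolution at higher computational cost.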

Further, more technical parameters:

  • client [str, Dask.Client]: Either a Dask Client object on which to perform the computation, or “local” to create a new local one. Defaults to “local”.

  • early_stop_window_length [int]: Window length to use for early stopping. Defaults to 25.

  • seed [int]: Random seed for regressor models. Defaults to None.

  • verbose [bool]: Whether or not to print detailed additional information. Defaults to False.

  • output_dir [str]: Where to save additional intermediate data to. Defaults to None, i.e. saves no intermediate results.

The function returns a pandas DataFrame representing the reference GRN with the columns ‘TF’, ‘target’, and ‘importance’. The column ‘pvalue’ stores the empirical P-value of each edge. If apply_bh_correction=True, an additional column ‘pvalue_bh’ with the Benjamini-Hochberg adjusted P-values is returned.
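A typical next step is to keep only the significant edges. The sketch below does this on a mock DataFrame that merely mimics the documented output columns; the values and the threshold alpha are made up for illustration, and in practice fdr_grn would be the return value of signifikante_fdr.

```python
import pandas as pd

# Mock result with the documented columns (illustrative values only).
fdr_grn = pd.DataFrame({
    "TF": ["Gene0", "Gene1", "Gene2"],
    "target": ["Gene5", "Gene7", "Gene9"],
    "importance": [4.2, 1.3, 0.4],
    "pvalue": [0.001, 0.02, 0.6],
    "pvalue_bh": [0.003, 0.03, 0.6],
})

alpha = 0.05  # hypothetical significance threshold

# Keep edges whose adjusted P-value passes the threshold,
# ordered by decreasing importance.
significant = (
    fdr_grn[fdr_grn["pvalue_bh"] <= alpha]
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(significant)
```

If apply_bh_correction was not set, the same filter can be applied to the ‘pvalue’ column instead, keeping in mind that those P-values are not corrected for multiple testing.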

Integration of additional regression-based GRN inference methods

In order to integrate new regression-based GRN inference methods into SignifiKANTE, simply use the following steps, which exemplify the integration of lasso regression as implemented in the GRENADINE package:

  1. Give your regression-based method an abbreviated string name (regressor_type) and name the variable that stores its model-specific parameters (regressor_args). Then add both to the accepted values of the inference_mode parameter in the function signifikante_fdr in the file algo.py, directly below the line stating UPDATE FOR NEW GRN METHOD. For lasso regression, we simply added the regressor type “LASSO” and the regressor parameters stored in LASSO_KWARGS to the respective code block:

# UPDATE FOR NEW GRN METHOD
if inference_mode == "grnboost2":
    regressor_type = "GBM"
    regressor_args = SGBM_KWARGS
# other existing methods...
elif inference_mode == "lasso":
    regressor_type = "LASSO"
    regressor_args = LASSO_KWARGS

Since the actual parameters of LASSO_KWARGS will be defined in another file, you need to make sure to import the variable into algo.py. To achieve this, simply add your new regressor’s arguments variable at the top of algo.py, directly below the indicated line stating UPDATE FOR NEW GRN METHOD, just like this:

# UPDATE FOR NEW GRN METHOD
from signifikante.core import (
    create_graph, SGBM_KWARGS, RF_KWARGS, EARLY_STOP_WINDOW_LENGTH, ET_KWARGS, XGB_KWARGS, LASSO_KWARGS
)

  2. Now we switch to the file core.py. At the top of the file, add any import statements your regression method requires (e.g. imports from sklearn). Below the import statements, create a dictionary named exactly like the regressor’s arguments variable you imported in algo.py. You can include it directly below the line stating # UPDATE FOR NEW GRN METHOD, analogously to how we did it for lasso regression:

from sklearn.linear_model import Lasso
# ... other code in between
LASSO_KWARGS = {
    'alpha': 0.01
}

The actual logic of your new regression-based inference method is implemented in the function fit_model. There, add a new local function that contains the logic of your model, given a tf_matrix and a target_gene_expression vector, as we did for lasso regression:

def do_lasso_regression():
    regressor = Lasso(**regressor_kwargs, random_state=seed)
    regressor.fit(tf_matrix, target_gene_expression)
    return regressor

Directly below, add another case distinction for your regressor_type which calls your locally defined function. The exact position is indicated by the line stating # UPDATE FOR NEW GRN METHOD:

# UPDATE FOR NEW GRN METHOD
if is_sklearn_regressor(regressor_type):
    return do_sklearn_regression()
# other methods...
elif is_lasso_regressor(regressor_type):
    return do_lasso_regression()
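The helper is_lasso_regressor used above is not shown in this manual. A minimal sketch of what such a check could look like, assuming it mirrors the case-insensitive string comparison that this manual uses for lasso in to_feature_importances (the actual helper in core.py may differ):

```python
def is_lasso_regressor(regressor_type):
    """Return True if the regressor type denotes lasso regression.

    Assumption: a case-insensitive comparison against "LASSO";
    the real helper in core.py is not shown in this manual.
    """
    return regressor_type.upper() == "LASSO"


print(is_lasso_regressor("lasso"))
print(is_lasso_regressor("GBM"))
```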

Finally, in the function to_feature_importances, you have to implement the extraction of feature importances or model coefficients from your trained_regressor, which are supposed to represent edge weights in the inferred GRN. To accomplish that, add another case for your new regressor in the case distinction below the line stating # UPDATE FOR NEW GRN METHOD. For lasso regression this looks like:

# UPDATE FOR NEW GRN METHOD
if is_oob_heuristic_supported(regressor_type, regressor_kwargs):
    # other code...
elif regressor_type.upper() == "LASSO":
    scores = np.abs(trained_regressor.coef_)
    return scores

Done! You have successfully added your new regression method for GRN inference.
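To see what the lasso pieces above compute for a single target gene, the following standalone sketch fits scikit-learn's Lasso on a simulated TF matrix and turns the absolute coefficients into edge weights. It runs outside SignifiKANTE: the simulated data, the seed, and the variable tf_names are assumptions made for illustration, while LASSO_KWARGS, tf_matrix, and target_gene_expression follow the names used in this manual.

```python
import numpy as np
from sklearn.linear_model import Lasso

LASSO_KWARGS = {"alpha": 0.01}

# Simulated data: 100 samples, three candidate TFs (illustrative only).
rng = np.random.default_rng(0)
tf_names = ["Gene0", "Gene1", "Gene2"]
tf_matrix = rng.standard_normal((100, 3))

# The target depends strongly on Gene0 and weakly on Gene2.
target_gene_expression = (
    2.0 * tf_matrix[:, 0]
    - 0.5 * tf_matrix[:, 2]
    + 0.1 * rng.standard_normal(100)
)

# Same two steps as do_lasso_regression and to_feature_importances.
regressor = Lasso(**LASSO_KWARGS, random_state=0)
regressor.fit(tf_matrix, target_gene_expression)
scores = np.abs(regressor.coef_)

# One edge weight per TF -> target link.
for tf, score in zip(tf_names, scores):
    print(tf, round(float(score), 3))
```

Taking the absolute value treats activating and repressing links alike, so the sign of the regulation is not part of the edge weight.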

Unit tests

Unit tests for arboreto-based functionalities, additional tests for SignifiKANTE’s FDR control functionality, and a comparison of our efficiently parallelized Wasserstein distance computation against SciPy can be found under tests/. The tests are based on Python’s unittest module and can be run all together from the repository’s root directory with

python -m unittest discover -s tests -v

