
regulAS


A Bioinformatics Tool for the Integrative Analysis of Alternative Splicing Regulome using RNA-Seq data across cancer and tissue types

Overview

regulAS is computational software for the analysis of regulatory mechanisms of alternative splicing. It enables dissecting the complex relationships between alternative splicing events and thousands of their prospective regulatory RNA-binding proteins (RBPs) using large-scale RNA-seq data derived from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) projects.

regulAS is a powerful machine learning experiment management tool written in Python. It combines a flexible configuration system based on YAML and the shell, fast and reliable data storage driven by the SQLite relational database engine, and an extension-friendly API compatible with scikit-learn.

regulAS is designed as a "low-code" experiment management solution for researchers of alternative splicing regulatory mechanisms. It simplifies the computational workflow, helping to narrow down the number of prospective regulatory candidates behind splicing changes for further in-depth bioinformatic and experimental analysis.

Structure

regulAS is supplied with a set of pre-defined modules that introduce support for data acquisition, fitting of machine learning models, and persistence and export of results.

Workflow

The regulAS package encapsulates ETL, ML, and report-generation workflows. The ETL workflow covers data Extraction, Transformation and Loading, preparing the input for the next step, the machine learning (ML) part. The ML workflow incorporates predictive modeling, hyper-parameter optimization, performance evaluation and feature-ranking tasks for identifying candidate regulators of alternative splicing events across tumor and tissue types. The ML outputs can subsequently be used to generate summary reports in tabular and visual form, supporting interpretation of the findings and knowledge sharing.
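The hand-off between the three workflows can be pictured as a simple function chain. The sketch below is a toy illustration of that ETL → ML → report flow, not regulAS code; all names (`run_experiment`, `extract`, the `mean_model` pipeline) are hypothetical:

```python
def run_experiment(extract, pipelines, report):
    """Toy ETL -> ML -> report chain: extract data, fit each pipeline, summarize."""
    X, y = extract()                                               # ETL: load samples and targets
    scores = {name: fit(X, y) for name, fit in pipelines.items()}  # ML: one result per pipeline
    return report(scores)                                          # reporting: summarize results

# stand-ins for each stage
extract = lambda: ([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
pipelines = {"mean_model": lambda X, y: sum(y) / len(y)}
report = lambda scores: {name: round(s, 2) for name, s in scores.items()}

summary = run_experiment(extract, pipelines, report)
```

Each stage only sees the previous stage's output, which is what lets regulAS swap loaders, models, and reports independently via configuration.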

regulAS workflow

Data loading

After the transformation step, regulAS loads the data into a pandas.DataFrame, the default container for intermediate results. Given an additional argument denoting the target for supervised model training, the data loader splits the DataFrame into samples and targets that are used by the ML part.
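The split might look like the following sketch. `split_features_target` is a hypothetical helper illustrating the described behavior, not the regulAS loader itself, and the column names are made up (`psi` mirrors the `objective` key used in the sample configuration below):

```python
import pandas as pd

def split_features_target(df: pd.DataFrame, objective: str):
    """Split a DataFrame into samples (X) and targets (y); `objective` names the target column."""
    y = df[objective]
    X = df.drop(columns=[objective])
    return X, y

# toy expression table: RBP columns are features, `psi` is the splicing target
df = pd.DataFrame({"RBP1": [1.2, 0.4], "RBP2": [0.5, 0.1], "psi": [0.3, 0.7]})
X, y = split_features_target(df, "psi")
```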

Configuration

Experimental setups are configured through YAML files using the Facebook Hydra framework.

Evaluation of ML-models

The machine learning sub-flow manages hyper-parameter tuning, performance evaluation, and scoring of feature relevance. To tune hyper-parameters, regulAS performs a cross-validation procedure similar to scikit-learn's grid search (GridSearchCV). Model performance evaluation requires comparing ground-truth targets to the actual predictions, so both true and predicted values are stored in a database that serves as the data source for performance-evaluation reports. Finally, regulAS detects whether a model provides information on feature importance and, if so, intercepts these scores and stores them in the database for later analysis.
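In plain scikit-learn terms, the procedure is roughly equivalent to the sketch below, which reuses the splitter and hyper-parameter grid from the sample configuration further down. This is an illustration of the comparable scikit-learn workflow, not regulAS internals, and the synthetic data stands in for the RNA-seq matrix:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# synthetic data in place of the RNA-seq feature matrix
X, y = make_regression(n_samples=100, n_features=10, random_state=0)

# cross-validated search over `alpha`, analogous to the `_varargs_` grid in the YAML
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
search = GridSearchCV(ElasticNet(l1_ratio=0.2), {"alpha": [0.1, 0.5, 1.0]},
                      cv=cv, scoring="neg_mean_squared_error")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
coefs = search.best_estimator_.coef_  # feature-relevance scores of the kind regulAS intercepts
```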

Storing results

regulAS tracks and stores a wide variety of experimental data that includes:

  • Dataset description
  • Experimental configuration
  • Actual hyper-parameters
  • Source code for models
  • True and predicted target values
  • Feature importance

Typically, a single experiment can generate several hundred thousand records, which makes storing them a non-trivial task. To address this challenge, regulAS relies on a relational database powered by the SQLite engine, which ensures data integrity and provides fast and reliable storage.
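SQLite handles record volumes of this size comfortably when inserts are batched inside a transaction. The sketch below illustrates the idea with Python's built-in `sqlite3` module; the `Prediction` table name comes from the schema described later, but the column names here are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # regulAS writes to regulAS.db on disk
conn.execute("CREATE TABLE Prediction (pipeline_id INTEGER, sample_id TEXT, "
             "true_value REAL, predicted_value REAL)")

# bulk-insert many records inside a single transaction for speed and atomicity
rows = [(1, f"sample_{i}", 0.5, 0.48) for i in range(100_000)]
with conn:
    conn.executemany("INSERT INTO Prediction VALUES (?, ?, ?, ?)", rows)

n_rows, = conn.execute("SELECT COUNT(*) FROM Prediction").fetchone()
```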

Report generation

The report-generation workflow assembles results and summarizes them in an appropriate form (e.g., text- or image-based output). Moreover, reports can be chained so that the output of a preceding report is fed into the next one that depends on it.
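The chaining idea can be sketched in a few lines. This toy runner is not the regulAS implementation; it simply illustrates how a dependent report (here a ranking step, mirroring the `_depends_on_` key in the sample report setup below) can consume a preceding report's output:

```python
def run_reports(reports):
    """Toy chained-report runner: each report receives the outputs it depends on.

    Assumes `reports` is ordered so that dependencies appear before dependents.
    """
    outputs = {}
    for name, (depends_on, fn) in reports.items():
        outputs[name] = fn(*(outputs[dep] for dep in depends_on))
    return outputs

reports = {
    "MSE": ([], lambda: {"model_a": 0.12, "model_b": 0.09}),        # per-model losses
    "FeatureRanking": (["MSE"], lambda mse: min(mse, key=mse.get)), # best model by MSE
}
outputs = run_reports(reports)
```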

Configuration

How to use YAML configuration files and (optionally) combine them with command line arguments.

A single experimental setup defines a scope named experiment that encapsulates the following sub-scopes:

  • name – name of the experiment
  • dataset – configuration of a data source
  • split – cross validation parameters
  • pipelines – list of ML-configurations to compare
  • reports – data export configurations

Using the Facebook Hydra framework, regulAS provides the ability to override configuration keys from the command line, to define arguments that are required at runtime (using the reserved value ???), and to interpolate values from another field of a configuration file (${experiment.another_field}).

Sample experimental setup

A regulAS configuration can define model evaluation tasks only, keeping the list of reports empty.

experiment_tasks.yaml
name: lr_svr_tmr

dataset:
  _target_: regulAS.utils.PickleLoader
  name: RNA-Seq
  meta: some fancy description
  path_to_file: ???
  objective: psi

split:
  _target_: sklearn.model_selection.ShuffleSplit
  n_splits: 5
  test_size: 0.2
  train_size: null
  random_state: ${random_state}

pipelines:
  - transformations:
      ZScore:
        _target_: sklearn.preprocessing.StandardScaler
    model:
      LinearRegression:
        _target_: sklearn.linear_model.ElasticNet
        l1_ratio: 0.2
        _varargs_:
          alpha: [0.1, 0.5, 1.0]

  - transformations:
      MinMax:
        _target_: sklearn.preprocessing.MinMaxScaler
    model:
      SupportVectorMachine:
        _target_: sklearn.svm.SVR
        kernel: linear
        _varargs_:
          C: [0.1, 0.5, 1.0]

reports: null

Sample report setup

Similarly, an experimental setup can define report tasks only.

report_tasks.yaml
name: lr_svr_tmr

dataset: null
split: null
pipelines: null

reports:
  MSE:
    _target_: regulAS.reports.ModelPerformanceReport
    experiment_name: ${experiment.name}
    loss_fn: sklearn.metrics.mean_squared_error

  PearsonR:
    _target_: regulAS.reports.ModelPerformanceReport
    experiment_name: ${experiment.name}
    score_fn: scipy.stats.pearsonr

  FeatureRanking:
    _depends_on_: ["MSE"]
    _target_: regulAS.reports.FeatureRankingReport
    experiment_name: ${experiment.name}
    sort_by: "${experiment.reports.MSE.loss_fn}:test:mean"
    sort_ascending: true
    top_k_models: 3
    top_k_features: null

  PerformanceCSV:
    _depends_on_: ["MSE", "PearsonR"]
    _target_: regulAS.reports.ExportCSV
    output_dir: reports
    sep: ";"

  RankingCSV:
    _depends_on_: ["FeatureRanking"]
    _target_: regulAS.reports.ExportCSV
    output_dir: reports
    sep: ";"

Database

The database structure is dedicated to reliably storing experimental results and serving as a data source for reports. The schema of the regulAS database was designed to preserve consistency and minimize redundancy of the stored records. The database includes the following tables:

  • Experiment – contains high-level summary of the experiments
  • Data – contains details on the dataset used in Experiment
  • Pipeline – contains high-level summary on specific pipelines used in Experiment
  • Transformation – contains details on ML-models and transformations
  • HyperParameter – contains hyper-parameters of models
  • Prediction – contains true and predicted values for target
  • FeatureRanking – contains feature importance scores
  • TransformationSequence – contains details on the use of models and transformations
  • HyperParameterValue – contains details on the specific values of hyper-parameters

regulAS database
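Because the schema is normalized, report queries reassemble results by joining across tables. The sketch below uses a hypothetical minimal subset of the schema (the table names match the list above, but the columns and keys are assumptions) to show how predictions could be joined back to their experiment:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Experiment (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Pipeline   (id INTEGER PRIMARY KEY,
                             experiment_id INTEGER REFERENCES Experiment(id));
    CREATE TABLE Prediction (pipeline_id INTEGER REFERENCES Pipeline(id),
                             true_value REAL, predicted_value REAL);
""")
conn.execute("INSERT INTO Experiment VALUES (1, 'lr_svr_tmr')")
conn.execute("INSERT INTO Pipeline VALUES (1, 1)")
conn.executemany("INSERT INTO Prediction VALUES (?, ?, ?)",
                 [(1, 0.30, 0.28), (1, 0.70, 0.66)])

# join predictions back to their experiment, as a report query would
rows = conn.execute("""
    SELECT e.name, p.true_value, p.predicted_value
    FROM Prediction p
    JOIN Pipeline   pl ON pl.id = p.pipeline_id
    JOIN Experiment e  ON e.id  = pl.experiment_id
""").fetchall()
```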

Usage

To use regulAS for experiment management, the user should define a project directory containing YAML configuration files and, if necessary, data. The experiment can then be run from the command line.

# install the regulAS package from PyPI
pip install regulAS
# navigate to the user's project folder
cd /path/to/the/project
# run the experimental setup on different dataset files
python -m regulAS.app --multirun \
experiment=experiments/experiment_tasks \
+dataset.path_to_file=data/data_cNormal.pkl,data/data_cTumor.pkl

After regulAS finishes all tasks, an SQLite database file, regulAS.db, will be stored in the project directory, along with any reports that were configured.
