Open Biomedical Network Benchmark

A Python toolkit for biological network learning evaluation

Installation

Clone the repository, then install it via pip:

git clone https://github.com/krishnanlab/NetworkLearningEval && cd NetworkLearningEval
pip install -e .

The -e flag installs the package in editable mode, so changes to the source code take effect without reinstalling. If you do not plan to modify the source code, you can omit -e and simply run pip install . instead.

Optional Pytorch Geometric installation

Users need to install PyTorch Geometric (PyG) to enable some GNN-related features. PyG requires PyTorch to be installed first; see the links above for full installation instructions. Assuming the system has Python 3.8 or above with CUDA 10.2, use the following to install both PyTorch and PyG.

conda install pytorch=1.12.1 torchvision cudatoolkit=10.2 -c pytorch
pip install torch-geometric==2.0.4 torch-scatter torch-sparse torch-cluster -f https://data.pyg.org/whl/torch-1.12.1+cu102.html

Quick install using the installation script

source install.sh cu102  # other options are [cpu,cu113]

Quick Demonstration

Construct default datasets

We provide a high-level dataset constructor to help users effortlessly set up an ML-ready dataset for a given combination of network and label. In particular, the dataset is set up with a study-bias holdout split (6/2/2): the 60% most-studied genes, ranked by number of associated PubMed publications, are used for training, the 20% least-studied genes are used for testing, and the remaining 20% are used for validation. For more customizable data loading and processing options, see the customized dataset construction section below.

from obnb.util.dataset_constructors import default_constructor

root = "datasets"  # save dataset and cache under the datasets/ directory
version = "nledata-v0.1.0-dev3"  # archive data version, use 'latest' to pull latest data from source instead

# Download and process network/label data. Use the adjacency matrix as the ML feature
dataset = default_constructor(root=root, version=version, graph_name="BioGRID", label_name="DisGeNET",
                              graph_as_feature=True, use_dense_graph=True)
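For intuition, the 6/2/2 study-bias holdout described above amounts to ranking genes by publication count and slicing the ranking. A minimal NumPy sketch (toy publication counts; illustrative only, not the obnb implementation):

```python
import numpy as np

# Toy example: 10 genes with made-up PubMed publication counts.
pubmed_counts = np.array([523, 4, 87, 310, 12, 960, 45, 201, 3, 78])

# Rank gene indices from most to least studied.
order = np.argsort(-pubmed_counts)

# 6/2/2 study-bias split: most-studied 60% train,
# next 20% validation, least-studied 20% test.
n = len(order)
train_idx = order[: int(0.6 * n)]
valid_idx = order[int(0.6 * n) : int(0.8 * n)]
test_idx = order[int(0.8 * n) :]

print(train_idx, valid_idx, test_idx)
```

The actual splitter additionally handles ties and unmatched gene IDs, but the ordering logic is the same.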

Evaluating standard models

Simple machine learning methods such as logistic regression and label propagation can be evaluated easily using trainer objects. A trainer takes a dictionary of metrics as input for evaluating model performance, and can be set up as follows.

from obnb.metric import auroc
from obnb.model_trainer import SupervisedLearningTrainer, LabelPropagationTrainer

metrics = {"auroc": auroc}  # use AUROC as our default evaluation metric
sl_trainer = SupervisedLearningTrainer(metrics)
lp_trainer = LabelPropagationTrainer(metrics)

Then, use the eval_multi_ovr method of the trainer to evaluate a given ML model over all tasks in a one-vs-rest setting.

from sklearn.linear_model import LogisticRegression
from obnb.model.label_propagation import OneHopPropagation

# Initialize models
sl_mdl = LogisticRegression(penalty="l2", solver="lbfgs")
lp_mdl = OneHopPropagation()

# Evaluate the models over all tasks
sl_results = sl_trainer.eval_multi_ovr(sl_mdl, dataset)
lp_results = lp_trainer.eval_multi_ovr(lp_mdl, dataset)
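Conceptually, one-vs-rest evaluation fits and scores one binary classifier per labelset (task). The following self-contained sketch mimics that loop on toy data (random features and labels; a simplified stand-in, not the obnb API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))           # stand-in node features
Y = rng.integers(0, 2, size=(100, 3))    # three binary tasks (labelsets)
train, test = np.arange(60), np.arange(60, 100)

results = {}
for task in range(Y.shape[1]):
    # Fit one binary classifier per task and score it on the held-out split.
    mdl = LogisticRegression(penalty="l2", solver="lbfgs")
    mdl.fit(X[train], Y[train, task])
    y_score = mdl.predict_proba(X[test])[:, 1]
    results[f"task_{task}"] = roc_auc_score(Y[test, task], y_score)

print(results)
```

The trainer performs this loop over all labelsets in the dataset and collects the requested metrics for each one.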

Evaluating GNN models

Training and evaluation of Graph Neural Network (GNN) models can be done in a very similar fashion.

from torch_geometric.nn import GCN
from obnb.model_trainer.gnn import SimpleGNNTrainer

# Use 1-dimensional trivial node feature
dataset = default_constructor(root=root, version=version, graph_name="BioGRID", label_name="DisGeNET")

# Train and evaluate a GCN; n_tasks is the number of labelsets (prediction tasks)
gcn_mdl = GCN(in_channels=1, hidden_channels=64, num_layers=5, out_channels=n_tasks)
gcn_trainer = SimpleGNNTrainer(metrics, device="cuda", metric_best="auroc")
gcn_results = gcn_trainer.train(gcn_mdl, dataset)

Customized dataset construction

Load network and labels

from obnb import data

root = "datasets"  # save dataset and cache under the datasets/ directory

# Load processed BioGRID data from archive.
# Alternatively, set version="latest" to get and process the newest data from scratch.
g = data.BioGRID(root, version="nledata-v0.1.0-dev3")

# Load DisGeNET gene set collections.
lsc = data.DisGeNET(root, version="latest")

Setting up data and splits

from obnb.util.converter import GenePropertyConverter
from obnb.label.split import RatioHoldout

# Load PubMed count gene property converter and use it to set up study-bias holdout split
pubmedcnt_converter = GenePropertyConverter(root, name="PubMedCount")
splitter = RatioHoldout(0.6, 0.4, ascending=False, property_converter=pubmedcnt_converter)

Filter labeled data based on network genes and splits

from obnb.label import filters

# Apply in-place filters to the labelset collection
lsc.iapply(
    filters.Compose(
        # Only use genes that are present in the network
        filters.EntityExistenceFilter(list(g.node_ids)),
        # Remove any labelsets with fewer than 50 network genes
        filters.LabelsetRangeFilterSize(min_val=50),
        # Make sure each split has at least 10 positive examples
        filters.LabelsetRangeFilterSplit(min_val=10, splitter=splitter),
    ),
)
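For intuition, the filters behave like a pipeline of functions applied to the labelset collection in order. A minimal sketch with hypothetical stand-in classes (mirroring the names above, but not the actual obnb implementation):

```python
# Hypothetical sketch of the filter-composition pattern: each filter maps a
# dict of labelsets (name -> set of gene IDs) to a filtered dict, and
# Compose chains them in order.
class EntityExistenceFilter:
    def __init__(self, valid_ids):
        self.valid_ids = set(valid_ids)

    def __call__(self, labelsets):
        # Keep only genes that exist in the network.
        return {name: genes & self.valid_ids for name, genes in labelsets.items()}


class SizeFilter:
    def __init__(self, min_val):
        self.min_val = min_val

    def __call__(self, labelsets):
        # Drop labelsets that became too small.
        return {n: g for n, g in labelsets.items() if len(g) >= self.min_val}


class Compose:
    def __init__(self, *filters):
        self.filters = filters

    def __call__(self, labelsets):
        for f in self.filters:
            labelsets = f(labelsets)
        return labelsets


labelsets = {
    "disease_a": {"g1", "g2", "g3"},
    "disease_b": {"g4", "g9"},  # g9 is not in the network
}
pipeline = Compose(EntityExistenceFilter(["g1", "g2", "g3", "g4"]), SizeFilter(min_val=2))
print(pipeline(labelsets))
```

Here "disease_b" shrinks to a single network gene and is dropped by the size filter; the real filters apply the same idea in place on the labelset collection.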

Combine into dataset

from obnb import Dataset
dataset = Dataset(graph=g, feature=g.to_dense_graph().to_feature(), label=lsc, splitter=splitter)

Data preparation and releasing notes

First, bump the data version in __init__.py to the next data release version, e.g., nledata-v0.1.0 -> nledata-v0.1.1-dev. Then download and process all the latest data by running

python script/release_data.py

By default, the data ready to be uploaded (e.g., to Zenodo) is saved under data_release/archived. After inspecting the archive, if everything looks good, upload and publish the new archived data.

Note: dev data should be uploaded to the sandbox instead.

Check items:

  • Update __data_version__
  • Run release_data.py
  • Upload archived data to Zenodo (be sure to edit the data version there also)
  • Update url dict in config (will improve in the future to get info from Zenodo directly)
  • Update network stats in data test

Finally, commit and push the bumped version.
