A framework for evaluating link prediction models on heterogeneous biomedical graph data
OpenBioLink
OpenBioLink is a resource and evaluation framework for link prediction models on heterogeneous biomedical graph data. It contains benchmark datasets as well as the underlying scripts to create them and to evaluate a custom model on them.
Installation
Pip
- Install a pytorch version suitable for your system https://pytorch.org/
pip install openbiolink
Source
- clone the git repository or download the project
- Create a new Python 3.7 or Python 3.6 virtual environment (note: on Windows, only Python 3.6 works)
e.g.:
python3 -m venv my_venv
- activate the virtual environment
- windows:
my_venv\Scripts\activate
- linux/mac:
source my_venv/bin/activate
- Install a pytorch version suitable for your system https://pytorch.org/
- Install the requirements listed in requirements.txt, e.g.:
pip install -r requirements.txt
Benchmark Dataset
The OpenBioLink2020 Dataset is a highly challenging benchmark dataset containing over 5 million positive and negative edges. To provide a more realistic edge prediction scenario, the test set contains no trivially predictable inverse edges from the training set, and it covers all edge types.
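To make the inverse-edge point concrete, here is a minimal sketch of dropping test edges whose reverse already appears in the training set. It is not the script used to build OpenBioLink2020: the (head, relation, tail) triple layout, the function name `drop_inverse_edges`, and the example node names are assumptions for illustration, and an edge is treated as trivially predictable only when the same relation holds in the opposite direction.

```python
def drop_inverse_edges(train, test):
    """Keep only test triples whose reversed form (tail, relation, head) is absent from train."""
    train_set = set(train)  # set lookup keeps the filter linear in the size of test
    return [(h, r, t) for (h, r, t) in test if (t, r, h) not in train_set]


# Hypothetical toy data: the first test edge is the inverse of a training edge.
train = [("geneA", "interacts", "geneB"), ("drugX", "treats", "diseaseY")]
test = [("geneB", "interacts", "geneA"), ("drugX", "treats", "diseaseZ")]

filtered = drop_inverse_edges(train, test)
print(filtered)
```

Without such a filter, a model could score well on the inverse edge simply by memorizing the training edge in the other direction.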
Leaderboard
model | hits@10 | hits@1 | paper | code |
---|---|---|---|---|
TransE (Baseline) | 0.0749 | 0.0125 | (under review) | Code |
TransR (Baseline) | 0.0639 | 0.0096 | (under review) | Code |
To allow analysis of the effect of data quality and of the directionality of the evaluation graph, further settings of OpenBioLink2020 are provided: directed and undirected, with and without quality cutoff.
- OpenBioLink2020: directed, high quality (default dataset)
- OpenBioLink2020: undirected, high quality
- OpenBioLink2020: directed, no quality cutoff
- OpenBioLink2020: undirected, no quality cutoff
Manual
The OpenBioLink framework consists of three parts, called actions:
- graph creation
- train-test split creation
- training and evaluation
With the graph creation and the train-test split actions, custom datasets can be created to suit individual needs. The last action serves as an interface to train and evaluate link prediction models.
Calling via GUI
Calling the program without any parameters starts the GUI, which provides a handy interface for defining the required parameters. In the last step, the corresponding command line options are displayed.
Calling via command line
From the folder src:
python -m openbiolink.openBioLink -p WORKING_DIR_PATH [-action] [--options] ...
Action: Graph Creation
-g
--undir Output graph should be undirected (default = directed)
--qual quality cutoff of the output-graph, options = [hq, mq, lq], (default = None -> all entries are used)
--no_interact Disables interactive mode - existing files will be replaced (default = interactive)
--skip Existing files will be skipped - in combination with --no_interact (default = replace)
--no_dl No download is being performed (e.g. when local data is used)
--no_in No input_files are created (e.g. when local data is used)
--no_create No graph is created (e.g. when only in-files should be created)
--out_format [Format] [Sep] Format of graph output, takes 2 arguments: list of file formats
[s= single file, m=multiple files] and list of separators
(e.g. t=tab, n=newline, or any other character) (default= s t)
--no_qscore The output files will contain no scores
--dbs [Cls] custom source databases selection to be used, full class name, options --> see doc
--mes [Cls] custom meta edges selection to be used, full class name, options --> see doc
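As a rough illustration of how the graph creation options combine, the sketch below assembles a command line from a few of the flags listed above. The helper `graph_creation_command` and its parameter names are hypothetical and not part of OpenBioLink; only the module path, `-p`, `-g`, `--undir`, `--qual` and `--no_interact` come from this manual. The command is only built and printed, not executed.

```python
import shlex


def graph_creation_command(working_dir, undirected=False, quality=None, interactive=True):
    """Build the argument list for the graph creation action (-g)."""
    cmd = ["python", "-m", "openbiolink.openBioLink", "-p", working_dir, "-g"]
    if undirected:
        cmd.append("--undir")
    if quality is not None:
        if quality not in ("hq", "mq", "lq"):
            raise ValueError("quality must be one of hq, mq, lq")
        cmd += ["--qual", quality]
    if not interactive:
        cmd.append("--no_interact")  # existing files will be replaced
    return cmd


cmd = graph_creation_command("my_working_dir", undirected=True, quality="hq", interactive=False)
# Quote each argument so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the command as a list means it could be passed directly to `subprocess.run` from the src folder without extra quoting.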
Action: Train-Test Split Generation
-s
--edges Path Path to edges.csv file (required with action -s)
--tn_edges Path Path to true_negatives_edges.csv file (required with action -s)
--nodes Path Path to nodes.csv file (required with action -s)
--tts_sep [Sep] Separator of edge, tn-edge and nodes file (e.g. t=tab, n=newline,
or any other character) (default=t)
--mode rand|time Mode of train-test-set split, options=[rand, time], (default=rand)
--test_frac F Fraction of test set as float (default= 0.2)
--crossval Multiple train-validation-sets are generated
--val F fraction of validation set as float (default= 0.2) or number of folds as int
--tmo_edges Path Path to edges.csv file of t-minus-one graph (required for --mode time)
--tmo_tn_edges Path Path to true_negatives_edges.csv file of t-minus-one graph (required for --mode time)
--tmo_nodes Path Path to nodes.csv file of t-minus-one graph (required for --mode time)
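The idea behind `--mode rand` with `--test_frac` can be sketched as a seeded shuffle followed by a cut. This is an illustrative assumption, not the framework's implementation: the function `random_split`, the seed handling, and the toy triples are hypothetical, and the sketch ignores true negative edges, node files, and the time-based mode.

```python
import random


def random_split(edges, test_frac=0.2, seed=0):
    """Shuffle the edges with a fixed seed and split off a test fraction."""
    shuffled = list(edges)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


# Hypothetical toy edge list with 10 triples.
edges = [("n%d" % i, "rel", "n%d" % (i + 1)) for i in range(10)]
train, test = random_split(edges, test_frac=0.2)
print(len(train), len(test))
```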
Action: Training and Evaluation
-e
--model_cls Cls class of the model to be trained/evaluated (required with -e)
--config Path Path to the models config file
--no_train No training is being performed, trained model is provided via --trained_model
--trained_model Path Path to trained model (required with --no_train)
--no_eval No evaluation is being performed, only training
--test Path Path to test set file (required with -e)
--train Path Path to training set file
--eval_nodes Path Path to the nodes file (required for ranked triples if no corrupted triples
file is provided and nodes cannot be taken from graph creation)
--metrics [Metric] list of evaluation metrics
--ks [K] k's for hits@k metric (integer list)
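For readers unfamiliar with the hits@k metric behind `--ks` and the leaderboard columns, here is a minimal sketch of the usual definition: the fraction of test triples whose true entity is ranked within the top k. The function name `hits_at_k` and the example ranks are made up for illustration; the framework's own evaluation code may differ in details such as filtering and tie handling.

```python
def hits_at_k(ranks, k):
    """Fraction of ranks (1 = best) that are less than or equal to k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical ranks of the true entity for five test triples.
ranks = [1, 3, 12, 7, 50]
scores = {k: hits_at_k(ranks, k) for k in (1, 10)}
print(scores)
```

In this toy case one of five ranks is 1 and three of five are at most 10, so hits@1 is 0.2 and hits@10 is 0.6.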