Package which contains implementations of published collaborative filtering-based algorithms for drug repurposing.
BENCHmark for drug Screening with COllaborative FIltering (benchscofi) Python Package
This repository is a part of the EU-funded RECeSS project (#101102016), and hosts the implementations and / or wrappers to published implementations of collaborative filtering-based algorithms for easy benchmarking.
Statement of need
As of 2022, drug development pipelines last around 10 years and cost $2 billion on average, while drug commercialization failure rates go up to 90%. These issues can be mitigated by drug repurposing, where chemical compounds are systematically screened for new therapeutic indications. In prior works, this approach has been implemented through collaborative filtering, a semi-supervised learning framework which leverages known drug-disease matchings in order to recommend new ones.
There is no standard pipeline to train, validate and compare collaborative filtering-based repurposing methods, which considerably limits the impact of this research field. In benchscofi, the estimated improvement over the state-of-the-art (implemented in the package) can be measured through adequate and quantitative metrics tailored to the problem of drug repurposing across a large set of publicly available drug repurposing datasets.
Install the latest release
The fastest way to get access to all functionalities of benchscofi is to run the following command:
## Using the Docker image: download it before opening a container
docker pull recessproject/benchscofi:1.0.1
Documentation about benchscofi (and a manual installation) can be found at this page. The complete list of dependencies for benchscofi is given in requirements.txt (pip).
Licence
This repository is under an OSI-approved MIT license.
Citation
If you use benchscofi in academic research, please cite it as follows:
Réda, Clémence, Jill-Jênn Vie, and Olaf Wolkenhauer. "A new standard for drug repurposing by collaborative filtering: stanscofi and benchscofi." (2023).
Community guidelines with respect to contributions, issue reporting, and support
You are more than welcome to add your own algorithm to the package!
1. Add a novel implementation / algorithm
Add a new Python file (extension .py) in src/benchscofi/ named <model> (where model is the name of the algorithm), which contains a subclass of stanscofi.models.BasicModel with the same name as your Python file. At a minimum, implement the methods preprocessing, model_fit and model_predict_proba, and provide a default set of parameters (which is used for testing purposes). Please have a look at the placeholder file Constant.py, which implements a classification algorithm that labels all datapoints as positive. It is highly recommended to document your class and its methods properly. When a new algorithm is pushed to benchscofi, it is automatically tested (tests/test_models.py and TemplateTest.py are run). To run this test locally, run the following command in the tests/ folder:
python3 -m test_models <model> <dataset:default=Synthetic>
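As a rough illustration of the contract described above, the sketch below shows the shape such a subclass might take. To stay runnable without stanscofi installed, it uses a hypothetical stand-in base class (StandInBasicModel) in place of stanscofi.models.BasicModel. The method signatures, the params dictionary, the decision_threshold parameter and the MyModel name are all assumptions made for this example, and should be checked against Constant.py and the stanscofi documentation before writing a real model.

```python
# Hypothetical sketch: StandInBasicModel only mimics the role of
# stanscofi.models.BasicModel so that this example runs on its own.
class StandInBasicModel:
    def __init__(self, params):
        self.params = dict(params)


class MyModel(StandInBasicModel):  # would live in src/benchscofi/MyModel.py
    def __init__(self, params=None):
        # Fall back to the default parameters (used for testing purposes).
        if params is None:
            params = self.default_parameters()
        super().__init__(params)
        self.name = "MyModel"
        self.positive_rate_ = None

    def default_parameters(self):
        return {"decision_threshold": 0.5}  # assumed parameter name

    def preprocessing(self, ratings):
        # Assumption: convert the input into whatever model_fit and
        # model_predict_proba expect; here, a flat list of ratings.
        return [r for row in ratings for r in row]

    def model_fit(self, values):
        # Toy "training": remember the fraction of positive ratings.
        self.positive_rate_ = sum(1 for v in values if v > 0) / len(values)

    def model_predict_proba(self, values):
        # Toy prediction: the same score for every drug-disease pair.
        return [self.positive_rate_ for _ in values]


model = MyModel()
values = model.preprocessing([[1, 0], [0, 1], [1, 1]])
model.model_fit(values)
scores = model.model_predict_proba(values)
```

The toy model is deliberately trivial, in the spirit of the Constant.py placeholder: the point is only the set of methods and the default parameters that the automated test relies on.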
2. Rules for contributors
Pull requests and issue flagging are welcome, and can be made through the GitHub interface. Support can be provided by reaching out to recess-project[at]proton.me. However, please note that contributors and users must abide by the Code of Conduct.
Benchmark AUC and NDCG@items values (default parameters, single random training/testing set split) [updated 08/11/23]
These values (rounded to three decimal places) can be reproduced by running the following command in the tests/ folder:
python3 -m test_models <algorithm> <dataset:default=Synthetic> <batch_ratio:default=1>
:no_entry: marks a failure to train or to predict, and N/A marks a combination that has not been tested yet. When present, the percentage in parentheses is the value of batch_ratio that was used (to avoid a memory crash on some of the datasets).
[mem]: memory crash
[err]: error
Algorithm (global AUC) | Synthetic* | TRANSCRIPT [a] | Gottlieb [b] | Cdataset [c] | PREDICT [d] | LRSSL [e] |
---|---|---|---|---|---|---|
PMF | 0.922 | 0.579 | 0.598 | 0.604 | 0.656 | 0.611 |
PulearnWrapper | 1.000 | :no_entry: | N/A | :no_entry: | :no_entry: | :no_entry: |
ALSWR | 0.971 | 0.507 | 0.677 | 0.724 | 0.693 | 0.685 |
FastaiCollabWrapper | 1.000 | 0.876 | 0.856 | 0.837 | 0.835 | 0.851 |
SimplePULearning | 0.995 | 0.949 (40%) | :no_entry:[err] | :no_entry:[err] | 0.994 (4%) | :no_entry: |
SimpleBinaryClassifier | 0.876 | :no_entry:[mem] | 0.855 | 0.938 (40%) | 0.998 (1%) | :no_entry: |
NIMCGCN | 0.907 | 0.854 | 0.843 | 0.841 | 0.914 (60%) | 0.873 |
FFMWrapper | 0.924 | :no_entry:[mem] | 1.000 (40%) | 1.000 (20%) | :no_entry:[mem] | :no_entry: |
VariationalWrapper | :no_entry:[err] | :no_entry:[err] | 0.851 | 0.851 | :no_entry:[err] | :no_entry: |
DRRS | :no_entry:[err] | 0.662 | 0.838 | 0.878 | :no_entry:[err] | 0.892 |
SCPMF | 0.853 | 0.680 | 0.548 | 0.538 | :no_entry:[err] | 0.708 |
BNNR | 1.000 | 0.922 | 0.949 | 0.959 | 0.990 (1%) | 0.972 |
LRSSL | 0.127 | 0.581 (90%) | 0.159 | 0.846 | 0.764 (1%) | 0.665 |
MBiRW | 1.000 | 0.913 | 0.954 | 0.965 | :no_entry:[err] | 0.975 |
LibMFWrapper | 1.000 | 0.919 | 0.892 | 0.912 | 0.923 | 0.873 |
LogisticMF | 1.000 | 0.910 | 0.941 | 0.955 | 0.953 | 0.933 |
PSGCN | 0.767 | :no_entry:[err] | 0.802 | 0.888 | :no_entry: | 0.887 |
DDA_SKF | 0.779 | 0.453 | 0.544 | 0.264 (20%) | 0.591 | 0.542 |
HAN | 1.000 | 0.870 | 0.909 | 0.905 | 0.904 | 0.923 |
PUextraTrees (n_estimators=10) | 0.045 (50%) | 0.325 (50%) | 0.246 (20%) | :no_entry:[mem] | 0.309 (5%) | |
XGBoost (n_estimators=100) | 0.500 | 0.500 (20%) | 0.500 | 0.500 | 0.500 (1%) | 0.500 (60%) |
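For readers unfamiliar with the metric in the table above, here is a minimal sketch of an AUC computed over all drug-disease pairs pooled together, which is one way to read "global". It uses the standard pairwise definition (the probability that a positive pair is scored above a negative one, with ties counting for one half). The function name global_auc is hypothetical, and the way stanscofi handles unlabeled pairs may differ from this sketch.

```python
def global_auc(y_true, y_score):
    """Pairwise AUC over all pairs pooled together (ties count for 1/2)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative pair")
    # Count, for every (positive, negative) couple, whether the positive wins.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
perfect = global_auc(labels, [0.9, 0.8, 0.2, 0.1])   # positives ranked first
constant = global_auc(labels, [1.0, 1.0, 1.0, 1.0])  # identical scores
mixed = global_auc(labels, [0.9, 0.3, 0.4, 0.1])     # one inversion
```

Under this definition, a model that returns the same score for every pair obtains exactly 0.5, and one misordered couple out of four gives 0.75.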
The NDCG score is computed across all diseases (global), at k=#items.
Algorithm (global NDCG@k) | Synthetic@300* | TRANSCRIPT@613[a] | Gottlieb@593[b] | Cdataset@663[c] | PREDICT@1577[d] | LRSSL@763[e] |
---|---|---|---|---|---|---|
PMF | 0.070 | 0.019 | 0.015 | 0.011 | 0.005 | 0.007 |
PulearnWrapper | N/A | :no_entry: | N/A | :no_entry: | :no_entry: | :no_entry: |
ALSWR | 0.000 | 0.177 | 0.236 | 0.406 | 0.193 | 0.424 |
FastaiCollabWrapper | 1.000 | 0.035 | 0.012 | 0.003 | 0.001 | 0.000 |
SimplePULearning | 1.000 | 0.059 (40%) | :no_entry:[err] | :no_entry:[err] | 0.025 (4%) | :no_entry:[err] |
SimpleBinaryClassifier | 0.000 | :no_entry:[mem] | 0.002 | 0.005 (40%) | 0.070 (1%) | :no_entry:[err] |
NIMCGCN | 0.568 | 0.022 | 0.006 | 0.005 | 0.007 (60%) | 0.014 |
FFMWrapper | 1.000 | :no_entry:[mem] | 1.000 (40%) | 1.000 (20%) | :no_entry:[mem] | :no_entry: |
VariationalWrapper | :no_entry:[err] | :no_entry:[err] | 0.011 | 0.010 | :no_entry:[err] | :no_entry: |
DRRS | :no_entry:[err] | 0.484 | 0.301 | 0.426 | :no_entry:[err] | 0.182 |
SCPMF | 0.528 | 0.102 | 0.025 | 0.011 | :no_entry:[err] | 0.008 |
BNNR | 1.000 | 0.466 | 0.417 | 0.572 | 0.217 (1%) | 0.508 |
LRSSL | 0.206 | 0.032 (90%) | 0.009 | 0.004 | 0.103 (1%) | 0.012 |
MBiRW | 1.000 | 0.085 | 0.267 | 0.352 | :no_entry:[err] | 0.457 |
LibMFWrapper | 1.000 | 0.419 | 0.431 | 0.605 | 0.502 | 0.430 |
LogisticMF | 1.000 | 0.323 | 0.106 | 0.101 | 0.076 | 0.078 |
PSGCN | 0.969 | :no_entry:[err] | 0.074 | 0.052 | :no_entry:[err] | 0.110 |
DDA_SKF | 1.000 | 0.039 | 0.069 | 0.078 (20%) | 0.065 | 0.069 |
HAN | 1.000 | 0.075 | 0.007 | 0.000 | 0.001 | 0.002 |
PUextraTrees (n_estimators=10) | 0.000 (50%) | 0.198 (50%) | 0.162 (20%) | :no_entry:[mem] | 0.235 (5%) | |
XGBoost (n_estimators=100) | 0.061 | 0.000 (20%) | 0.002 | 0.000 | 0.000 (1%) | 0.000 (60%) |
:no_entry: Note that results from LibMFWrapper are not reproducible, and the resulting metrics might vary slightly across runs.
:no_entry: XGBoost and SimpleBinaryClassifier do not take unlabeled points into account (they treat them as negative points).
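As a reference for the NDCG@k columns above, the following sketch computes NDCG@k with the common binary-relevance formulation (gain divided by log2 of the rank plus one, normalized by the best achievable ordering). The function name ndcg_at_k is hypothetical, and the exact variant implemented in stanscofi may differ, for instance in how ties or unlabeled pairs are handled.

```python
import math


def ndcg_at_k(y_true, y_score, k):
    """NDCG@k for binary relevance labels, ranking pairs by decreasing score."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    ranked = [y_true[i] for i in order[:k]]
    ideal = sorted(y_true, reverse=True)[:k]
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    dcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ranked))
    idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


labels = [1, 0, 1, 0]
ideal_ranking = ndcg_at_k(labels, [0.9, 0.1, 0.8, 0.2], k=4)
one_swap = ndcg_at_k(labels, [0.9, 0.8, 0.7, 0.1], k=4)
```

Because the discount decays with rank, a method can reach a high AUC while its NDCG stays low when the top of its ranking contains few known positives.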
Datasets
*Synthetic dataset created with the function generate_dummy_dataset in stanscofi.datasets and the following arguments:
npositive=200 #number of positive pairs
nnegative=100 #number of negative pairs
nfeatures=50 #number of pair features
mean=0.5 #mean for the distribution of positive pairs, resp. -mean for the negative pairs
std=1 #standard deviation for the distribution of positive and negative pairs
random_seed=124565 #random seed
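To make the meaning of these arguments concrete, the sketch below draws feature vectors in the way the comments describe: positive pairs centered on mean, negative pairs centered on -mean, all with standard deviation std. This is not the code of generate_dummy_dataset. The function sketch_dummy_pairs is hypothetical, the Gaussian distribution is an assumption, and the actual stanscofi function returns a different data structure.

```python
import random


def sketch_dummy_pairs(npositive, nnegative, nfeatures, mean, std, random_seed):
    """Illustrative sampler only; assumes Gaussian pair features."""
    rng = random.Random(random_seed)
    positives = [[rng.gauss(mean, std) for _ in range(nfeatures)]
                 for _ in range(npositive)]
    negatives = [[rng.gauss(-mean, std) for _ in range(nfeatures)]
                 for _ in range(nnegative)]
    return positives, negatives


positives, negatives = sketch_dummy_pairs(
    npositive=200, nnegative=100, nfeatures=50,
    mean=0.5, std=1, random_seed=124565,
)
# Average feature value in each class, to check the two clusters are apart.
pos_mean = sum(map(sum, positives)) / (200 * 50)
neg_mean = sum(map(sum, negatives)) / (100 * 50)
```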
[a] Réda, Clémence. (2023). TRANSCRIPT drug repurposing dataset (2.0.0) [Data set]. Zenodo. doi:10.5281/zenodo.7982976
[b] Gottlieb, A., Stein, G. Y., Ruppin, E., & Sharan, R. (2011). PREDICT: a method for inferring novel drug indications with application to personalized medicine. Molecular systems biology, 7(1), 496.
[c] Luo, H., Li, M., Wang, S., Liu, Q., Li, Y., & Wang, J. (2018). Computational drug repositioning using low-rank matrix approximation and randomized algorithms. Bioinformatics, 34(11), 1904-1912.
[d] Réda, Clémence. (2023). PREDICT drug repurposing dataset (2.0.1) [Data set]. Zenodo. doi:10.5281/zenodo.7983090
[e] Liang, X., Zhang, P., Yan, L., Fu, Y., Peng, F., Qu, L., … & Chen, Z. (2017). LRSSL: predict and interpret drug–disease associations based on data integration using sparse subspace learning. Bioinformatics, 33(8), 1187-1196.