Skip to main content

No project description provided

Project description

GeneDisco: A benchmark for active learning in drug discovery

Python version Library version

In vitro cellular experimentation with genetic interventions, using for example CRISPR technologies, is an essential step in early-stage drug discovery and target validation that serves to assess initial hypotheses about causal associations between biological mechanisms and disease pathologies. With billions of potential hypotheses to test, the experimental design space for in vitro genetic experiments is extremely vast, and the available experimental capacity - even at the largest research institutions in the world - pales in relation to the size of this biological hypothesis space.

GeneDisco (published at ICLR-22) is a benchmark suite for evaluating active learning algorithms for experimental design in drug discovery. GeneDisco contains a curated set of multiple publicly available experimental data sets as well as open-source i mplementations of state-of-the-art active learning policies for experimental design and exploration.

GeneDisco ICLR-22 Challenge

Learn more about the GeneDisco challenge (with a total of £20'000 of prizes) for experimental design for optimally exploring the vast genetic intervention space here.

Install

pip install genedisco

Use

How to Run the Full Benchmark Suite?

Experiments (all baselines, acquisition functions, input and target datasets, multiple seeds) included in GeneDisco can be executed sequentially for e.g. acquired batch size 64, 8 cycles and a bayesian_mlp model using:

run_experiments \
  --cache_directory=/path/to/genedisco_cache  \
  --output_directory=/path/to/genedisco_output  \
  --acquisition_batch_size=64  \
  --num_active_learning_cycles=8  \
  --max_num_jobs=1

Results are written to the folder at /path/to/genedisco_cache, and processed datasets will be cached at /path/to/genedisco_cache (please replace both with your desired paths) for faster startup in future invocations.

Note that due to the number of experiments being run by the above command, we recommend execution on a compute cluster.
The GeneDisco codebase also supports execution on slurm compute clusters (the slurm command must be available on the executing node) using the following command and using dependencies in a Python virtualenv available at /path/to/your/virtualenv (please replace with your own virtualenv path):

run_experiments \
  --cache_directory=/path/to/genedisco_cache  \
  --output_directory=/path/to/genedisco_output  \
  --acquisition_batch_size=64  \
  --num_active_learning_cycles=8  \
  --schedule_on_slurm \
  --schedule_children_on_slurm \
  --remote_execution_virtualenv_path=/path/to/your/virtualenv

Other scheduling systems are currently not supported by default.

How to Run A Single Isolated Experiment (One Learning Cycle)?

To run one active learning loop cycle, for example, with the "topuncertain" acquisition function, the "achilles" feature set and the "schmidt_2021_ifng" task, execute the following command:

active_learning_loop  \
    --cache_directory=/path/to/genedisco/genedisco_cache \
    --output_directory=/path/to/genedisco/genedisco_output \
    --model_name="bayesian_mlp" \
    --acquisition_function_name="topuncertain" \
    --acquisition_batch_size=64 \
    --num_active_learning_cycles=8 \
    --feature_set_name="achilles" \
    --dataset_name="schmidt_2021_ifng" 

How to Evaluate a Custom Acquisition Function?

To run a custom acquisition function, set --acquisition_function_name="custom" and --acquisition_function_path to the file path that contains your custom acquisition function.

active_learning_loop  \
    --cache_directory=/path/to/genedisco/genedisco_cache \
    --output_directory=/path/to/genedisco/genedisco_output \
    --model_name="bayesian_mlp" \
    --acquisition_function_name="custom" \
    --acquisition_function_path=/path/to/custom_acquisition_function.py \
    --acquisition_batch_size=64 \
    --num_active_learning_cycles=8 \
    --feature_set_name="achilles" \
    --dataset_name="schmidt_2021_ifng" 

...where "/path/to/custom_acquisition_function.py" contains code for your custom acquisition function corresponding to the BaseBatchAcquisitionFunction interface, e.g.:

import numpy as np
from typing import AnyStr, List
from slingpy import AbstractDataSource
from slingpy.models.abstract_base_model import AbstractBaseModel
from genedisco.active_learning_methods.acquisition_functions.base_acquisition_function import \
    BaseBatchAcquisitionFunction

class RandomBatchAcquisitionFunction(BaseBatchAcquisitionFunction):
    def __call__(self,
                 dataset_x: AbstractDataSource,
                 batch_size: int,
                 available_indices: List[AnyStr], 
                 last_selected_indices: List[AnyStr] = None, 
                 model: AbstractBaseModel = None,
                 temperature: float = 0.9,
                 ) -> List:
        selected = np.random.choice(available_indices, size=batch_size, replace=False)
        return selected

Note that the last class implementing BaseBatchAcquisitionFunction is loaded by GeneDisco if there are multiple valid acquisition functions present in the loaded file.

Citation

Please consider citing, if you reference or use our methodology, code or results in your work:

@inproceedings{mehrjou2022genedisco,
    title={{GeneDisco: A Benchmark for Experimental Design in Drug Discovery}},
    author={Mehrjou, Arash and Soleymani, Ashkan and Jesson, Andrew and Notin, Pascal and Gal, Yarin and Bauer, Stefan and Schwab, Patrick},
    booktitle={{International Conference on Learning Representations (ICLR)}},
    year={2022}
}

License

License

Authors

Patrick Schwab, GlaxoSmithKline plc
Arash Mehrjou, GlaxoSmithKline plc
Andrew Jesson, University of Oxford
Ashkan Soleymani, MIT

Acknowledgements

PS and AM are employees and shareholders of GlaxoSmithKline plc.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

genedisco-1.0.0.tar.gz (726.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

genedisco-1.0.0-py3-none-any.whl (755.3 kB view details)

Uploaded Python 3

File details

Details for the file genedisco-1.0.0.tar.gz.

File metadata

  • Download URL: genedisco-1.0.0.tar.gz
  • Upload date:
  • Size: 726.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.4 tqdm/4.62.3 importlib-metadata/3.10.0 keyring/22.3.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.8

File hashes

Hashes for genedisco-1.0.0.tar.gz
Algorithm Hash digest
SHA256 5d41161b097a559aa393cd096713f982a2eeb6f6d9c92fffb6e692db0e8da603
MD5 4714e2499b577e657a0ef81cc03dd353
BLAKE2b-256 63e5035382f92420332d2dba6b4a81bcd538e058ba7bd285ccdf4f08ee57f8a3

See more details on using hashes here.

File details

Details for the file genedisco-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: genedisco-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 755.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/32.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.4 tqdm/4.62.3 importlib-metadata/3.10.0 keyring/22.3.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.8

File hashes

Hashes for genedisco-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b81087de3333d2974b83c5216fb5ad28f8d4a1962c8d766821fef41c81be0d1a
MD5 d4eae0d612772697989689acc2e1b32d
BLAKE2b-256 66a4f0a728a95842495bc2f03c17bbe69441758b1058560c06d573833bafad72

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page