Generative Toolkit for Scientific Discovery (GT4SD).

GT4SD (Generative Toolkit for Scientific Discovery)

The GT4SD (Generative Toolkit for Scientific Discovery) is an open-source platform to accelerate hypothesis generation in the scientific discovery process. It provides a library for making state-of-the-art generative AI models easier to use.

For full details on the library API and examples see the docs.

Installation

requirements

Currently gt4sd relies on:

  • python>=3.7,<3.8
  • pip>=19.1,<20.3

We are actively working on relaxing these, so stay tuned or help us with this by contributing to the project.
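
As a quick sanity check before installing, the version constraints above can be verified from Python. This is a minimal sketch and not part of gt4sd: the helper names `parse_version` and `in_range` are hypothetical, and only the bounds come from the list above.

```python
import re
import sys


def parse_version(text):
    """Turn a dotted version string such as '19.3.1' into a tuple of ints."""
    parts = []
    for piece in text.split("."):
        # Keep only the leading digits of each piece, so '3b1' counts as 3.
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def in_range(version, lower, upper):
    """Return True if lower <= version < upper (all dotted strings)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


if __name__ == "__main__":
    python_version = "{}.{}.{}".format(*sys.version_info[:3])
    print("python ok:", in_range(python_version, "3.7", "3.8"))
    try:
        import pip

        print("pip ok:", in_range(pip.__version__, "19.1", "20.3"))
    except ImportError:
        print("pip is not importable in this environment")
```

The comparison relies on tuple ordering, so `3.8.0` is rejected by the `<3.8` bound while any `3.7.x` release passes.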

pip

If you simply want to use gt4sd in your projects, install it via pip from PyPI:

pip install gt4sd

You can also install gt4sd directly from GitHub:

pip install git+https://github.com/GT4SD/gt4sd-core

NOTE: As of now (keep an eye on the related issue for changes), some dependencies require installation from GitHub, so for a complete setup install them with:

pip install -r vcs_requirements.txt

Development setup & installation

If you would like to contribute to the package, we recommend the following development setup:

git clone git@github.com:GT4SD/gt4sd-core.git
cd gt4sd-core
conda env create -f conda.yml
conda activate gt4sd
pip install --no-deps -e .

Learn more in CONTRIBUTING.md

Getting started

After installation, you can use gt4sd right away in your discovery workflows.

Running inference pipelines

Running an algorithm is as easy as typing:

from gt4sd.algorithms.conditional_generation.paccmann_rl.core import (
    PaccMannRLProteinBasedGenerator, PaccMannRL
)
target = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'
# algorithm configuration with default parameters
configuration = PaccMannRLProteinBasedGenerator()
# instantiate the algorithm for sampling
algorithm = PaccMannRL(configuration=configuration, target=target)
items = list(algorithm.sample(10))
print(items)

Or you can use the ApplicationsRegistry to run an algorithm instance using a serialized representation of the algorithm:

from gt4sd.algorithms.registry import ApplicationsRegistry
target = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'
algorithm = ApplicationsRegistry.get_application_instance(
    target=target,
    algorithm_type='conditional_generation',
    domain='materials',
    algorithm_name='PaccMannRL',
    algorithm_application='PaccMannRLProteinBasedGenerator',
    generated_length=32,
    # include additional configuration parameters as **kwargs
)
items = list(algorithm.sample(10))
print(items)
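
Because the registry call takes only plain keyword arguments, the serialized representation can live in a JSON document. The sketch below assumes exactly that: the JSON layout, the check for the four identifying keys, and the helper names `load_algorithm_spec` and `instantiate` are illustrative and not part of gt4sd, while the keys mirror the keyword arguments used above. gt4sd is imported lazily so that the parsing step runs without it installed.

```python
import json

# Keys that the registry call above uses to identify the algorithm
# (treating them as mandatory is an assumption of this sketch).
REQUIRED_KEYS = (
    "algorithm_type",
    "domain",
    "algorithm_name",
    "algorithm_application",
)


def load_algorithm_spec(serialized):
    """Parse a JSON string and check that the identifying keys are present."""
    spec = json.loads(serialized)
    missing = [key for key in REQUIRED_KEYS if key not in spec]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    return spec


def instantiate(spec):
    """Build the algorithm from a parsed spec (requires gt4sd to be installed)."""
    from gt4sd.algorithms.registry import ApplicationsRegistry

    return ApplicationsRegistry.get_application_instance(**spec)


serialized = json.dumps(
    {
        "target": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT",
        "algorithm_type": "conditional_generation",
        "domain": "materials",
        "algorithm_name": "PaccMannRL",
        "algorithm_application": "PaccMannRLProteinBasedGenerator",
        "generated_length": 32,
    }
)
spec = load_algorithm_spec(serialized)
# With gt4sd installed, the spec can then be turned into samples:
# algorithm = instantiate(spec)
# items = list(algorithm.sample(10))
```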

Running training pipelines via the CLI command

GT4SD provides a trainer client based on the gt4sd-trainer CLI command. The trainer currently supports training pipelines for language modeling (language-modeling-trainer), PaccMann (paccmann-vae-trainer) and Granular (granular-trainer, multimodal compositional autoencoders).

$ gt4sd-trainer --help
usage: gt4sd-trainer [-h] --training_pipeline_name TRAINING_PIPELINE_NAME
                     [--configuration_file CONFIGURATION_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --training_pipeline_name TRAINING_PIPELINE_NAME
                        Training type of the converted model, supported types:
                        granular-trainer, language-modeling-trainer, paccmann-
                        vae-trainer. (default: None)
  --configuration_file CONFIGURATION_FILE
                        Configuration file for the trainining. It can be used
                        to completely by-pass pipeline specific arguments.
                        (default: None)

To launch a training you have two options.

You can either specify the training pipeline and the path of a configuration file that contains the needed training parameters:

gt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --configuration_file ${CONFIGURATION_FILE}

Or you can provide the needed parameters directly as arguments:

gt4sd-trainer --training_pipeline_name language-modeling-trainer --type mlm --model_name_or_path mlm --training_file /path/to/train_file.jsonl --validation_file /path/to/valid_file.jsonl

To get more information on the arguments of a specific training pipeline, simply type:

gt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --help
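
When launching trainings from Python, for example in a parameter sweep, it can help to assemble the command line programmatically. This is a minimal sketch that assumes only the flags shown above; the helper `build_trainer_command` is hypothetical and not part of gt4sd, and pipeline-specific parameters are passed through as `--name value` pairs without validation.

```python
import shlex


def build_trainer_command(pipeline_name, configuration_file=None, **parameters):
    """Return the gt4sd-trainer invocation as a list of arguments."""
    command = ["gt4sd-trainer", "--training_pipeline_name", pipeline_name]
    if configuration_file is not None:
        command += ["--configuration_file", str(configuration_file)]
    # Pipeline-specific parameters are appended in the order they were given.
    for name, value in parameters.items():
        command += ["--" + name, str(value)]
    return command


command = build_trainer_command(
    "language-modeling-trainer",
    type="mlm",
    model_name_or_path="mlm",
    training_file="/path/to/train_file.jsonl",
    validation_file="/path/to/valid_file.jsonl",
)
# Print a shell-safe rendering of the command for inspection.
print(" ".join(shlex.quote(part) for part in command))
# To actually run it: subprocess.run(command, check=True)
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.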

Additional examples

Find more examples in the notebooks.

Supported packages

Beyond implementing various generative modeling inference and training pipelines, GT4SD is designed to provide a high-level API that implements a harmonized interface for several existing packages:

  • GuacaMol: inference pipelines for the baseline models.
  • MOSES: inference pipelines for the baseline models.
  • TAPE: encoder modules compatible with the protein language models.
  • PaccMann: inference pipelines for all algorithms of the PaccMann family as well as training pipelines for the generative VAEs.
  • transformers: training and inference pipelines for generative models from the HuggingFace Models.

References

If you use gt4sd in your projects, please consider citing the following:

@software{GT4SD,
author = {GT4SD Team},
month = {2},
title = {{GT4SD (Generative Toolkit for Scientific Discovery)}},
url = {https://github.com/GT4SD/gt4sd-core},
version = {main},
year = {2022}
}

License

The gt4sd codebase is released under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.

Download files

Source Distribution

gt4sd-0.23.0.tar.gz (135.9 kB)
