AIDO.ModelGenerator

AIDO.ModelGenerator is a software stack for adapting pretrained models and generating finetuned models for downstream tasks in an AI-driven Digital Organism (AIDO). To read more about AIDO.ModelGenerator's integral role in building the world's first AI-driven Digital Organism, see AIDO.

AIDO.ModelGenerator is open-sourced as an opinionated plug-and-play research framework for cross-disciplinary teams in ML & Bio. It is designed to enable rapid and reproducible prototyping with four kinds of experiments in mind:

  1. Applying pre-trained foundation models to new data
  2. Developing new finetuning and inference tasks for foundation models
  3. Benchmarking foundation models and creating leaderboards
  4. Testing new architectures for finetuning performance

while also scaling with hardware and integrating with larger data pipelines or research workflows.

AIDO.ModelGenerator is built on PyTorch, HuggingFace, and Lightning, and works seamlessly with these ecosystems.

See the AIDO.ModelGenerator documentation for installation, usage, tutorials, and API reference.

Who uses ModelGenerator?

🧬 Biologists

  • Intuitive one-command CLIs for in silico experiments
  • Pre-trained model zoo
  • Broad data compatibility
  • Pipeline-oriented workflows

🤖 ML Researchers

  • Reproducible-by-design experiments
  • Architecture A/B testing
  • Automatic hardware scaling
  • Integration with PyTorch, Lightning, HuggingFace, and WandB

☕ Software Engineers

  • Extensible and modular models, tasks, and data
  • Strict typing and documentation
  • Fail-fast interface design
  • Continuous integration and testing

🤝 Everyone benefits from

  • A collaborative hub and focal point for multidisciplinary work on experiments, models, software, and data
  • Community-driven development
  • Permissive license for academic and non-commercial use

Projects using AIDO.ModelGenerator

Installation

git clone https://github.com/genbio-ai/ModelGenerator.git
cd ModelGenerator
pip install -e .

Source installation is necessary to add new backbones, finetuning tasks, and data transformations, as well as to use the convenience configs and scripts. If you only need to run inference, reproduce published experiments, or finetune on new data, you can instead use

pip install modelgenerator
pip install git+https://github.com/genbio-ai/openfold.git@c4aa2fd0d920c06d3fd80b177284a22573528442
pip install git+https://github.com/NVIDIA/dllogger.git@0540a43971f4a8a16693a9de9de73c1072020769

Quick Start

Get embeddings from a pre-trained model

mgen predict --model Embed --model.backbone aido_dna_dummy \
  --data SequencesDataModule --data.path genbio-ai/100m-random-promoters \
  --data.x_col sequence --data.id_col sequence --data.test_split_size 0.0001 \
  --config configs/examples/save_predictions.yaml

Get token probabilities from a pre-trained model

mgen predict --model Inference --model.backbone aido_dna_dummy \
  --data SequencesDataModule --data.path genbio-ai/100m-random-promoters \
  --data.x_col sequence --data.id_col sequence --data.test_split_size 0.0001 \
  --config configs/examples/save_predictions.yaml

Finetune a model

mgen fit --model ConditionalDiffusion --model.backbone aido_dna_dummy \
  --data ConditionalDiffusionDataModule --data.path "genbio-ai/100m-random-promoters"

Evaluate a model checkpoint

mgen test --model ConditionalDiffusion --model.backbone aido_dna_dummy \
  --data ConditionalDiffusionDataModule --data.path "genbio-ai/100m-random-promoters" \
  --ckpt_path logs/lightning_logs/version_X/checkpoints/<your_model>.ckpt

Save predictions

mgen predict --model ConditionalDiffusion --model.backbone aido_dna_dummy \
  --data ConditionalDiffusionDataModule --data.path "genbio-ai/100m-random-promoters" \
  --ckpt_path logs/lightning_logs/version_X/checkpoints/<your_model>.ckpt \
  --config configs/examples/save_predictions.yaml

Configify your experiment

This command

mgen fit --model ConditionalDiffusion --model.backbone aido_dna_dummy \
  --data ConditionalDiffusionDataModule --data.path "genbio-ai/100m-random-promoters"

is equivalent to mgen fit --config my_config.yaml with

# my_config.yaml
model:
  class_path: ConditionalDiffusion
  init_args:
    backbone: aido_dna_dummy
data:
  class_path: ConditionalDiffusionDataModule
  init_args:
    path: "genbio-ai/100m-random-promoters"

Use composable configs to customize workflows

mgen fit --model SequenceRegression --data PromoterExpressionRegression \
  --config configs/defaults.yaml \
  --config configs/examples/lora_backbone.yaml \
  --config configs/examples/wandb.yaml

We provide some useful examples in configs/examples. When the same attribute is set in multiple configs, the LAST value wins. Check the full configuration logged with each experiment in logs/lightning_logs/your-experiment/config.yaml, or, if using WandB, in logs/config.yaml.
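For instance, if two stacked configs set the same attribute, the value from the config passed later on the command line is the one used. A minimal sketch with hypothetical file excerpts and values:

```yaml
# configs/defaults.yaml (excerpt; hypothetical value)
trainer:
  max_epochs: 10

# configs/examples/wandb.yaml (excerpt; hypothetical value)
# Passed after defaults.yaml on the command line, so this value wins:
trainer:
  max_epochs: 50
```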

Use LoRA for parameter-efficient finetuning

This also avoids saving the full model; only the LoRA weights are saved.

mgen fit --data PromoterExpressionRegression \
  --model SequenceRegression --model.backbone.use_peft true \
  --model.backbone.lora_r 16 \
  --model.backbone.lora_alpha 32 \
  --model.backbone.lora_dropout 0.1
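As elsewhere, the same LoRA options can be captured in a config file using the class_path/init_args pattern. A sketch; the nesting of the backbone options is an assumption inferred from the CLI flag names above:

```yaml
# lora_regression.yaml (sketch; nesting inferred from the CLI flags)
model:
  class_path: SequenceRegression
  init_args:
    backbone:
      use_peft: true
      lora_r: 16
      lora_alpha: 32
      lora_dropout: 0.1
data:
  class_path: PromoterExpressionRegression
```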

Use continued pretraining for finetuning domain adaptation

First, run the pretraining objective on the finetuning data:

# https://arxiv.org/pdf/2310.02980
mgen fit --model MLM --model.backbone aido_dna_dummy \
  --data MLMDataModule --data.path leannmlindsey/GUE \
  --data.config_name prom_core_notata

Then, finetune using the adapted model:

mgen fit --model SequenceClassification --model.strict_loading false \
  --data SequenceClassificationDataModule --data.path leannmlindsey/GUE \
  --data.config_name prom_core_notata \
  --ckpt_path logs/lightning_logs/version_X/checkpoints/<your_adapted_model>.ckpt

Make sure to turn off strict_loading to replace the adapter!
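The adapted-model finetuning step above can also be written as a config file. A sketch; the placement of strict_loading under the model's init_args and the checkpoint path are assumptions based on the CLI flags:

```yaml
# finetune_adapted.yaml (sketch; layout inferred from the CLI flags)
model:
  class_path: SequenceClassification
  init_args:
    strict_loading: false  # allow the pretraining head to be replaced
data:
  class_path: SequenceClassificationDataModule
  init_args:
    path: leannmlindsey/GUE
    config_name: prom_core_notata
ckpt_path: logs/lightning_logs/version_X/checkpoints/<your_adapted_model>.ckpt
```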

Use the head/adapter/decoder that comes with the backbone

mgen fit --model SequenceClassification --data GUEClassification \
  --model.use_legacy_adapter true

Project details


Download files


Source Distribution

modelgenerator-0.1.1.post3.tar.gz (1.2 MB)

Uploaded Source

Built Distribution


modelgenerator-0.1.1.post3-py3-none-any.whl (336.5 kB)

Uploaded Python 3

File details

Details for the file modelgenerator-0.1.1.post3.tar.gz.

File metadata

  • Download URL: modelgenerator-0.1.1.post3.tar.gz
  • Size: 1.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for modelgenerator-0.1.1.post3.tar.gz:

  • SHA256: 5b9b2da9cfe82ad15b9be6c97183f5a673993635b4d14a0308f2e231efa66024
  • MD5: 7775742d5116be6c774e41256f77d042
  • BLAKE2b-256: aaa41a03051e07a2bcabf57445253304e735bf871e54f52fdc3bf11ca2895fa5

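To check a downloaded file against the digests above, you can hash it locally. A minimal sketch using only Python's standard library; the filename passed in is a placeholder for wherever you saved the archive:

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, hashed in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large archives don't need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare the result against the SHA256 digest listed above, e.g.:
# file_sha256("modelgenerator-0.1.1.post3.tar.gz")
```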

Provenance

The following attestation bundles were made for modelgenerator-0.1.1.post3.tar.gz:

Publisher: publish.yml on genbio-ai/ModelGenerator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file modelgenerator-0.1.1.post3-py3-none-any.whl.

File metadata

File hashes

Hashes for modelgenerator-0.1.1.post3-py3-none-any.whl:

  • SHA256: 4ab6cf38fc08e0dcd73c6135ccd94f75f8589d321d0d2b5a03f5b2c7b1613df8
  • MD5: 81c025f76abd49c2413beb24f964a5cc
  • BLAKE2b-256: 2ec93d38b804970432319f1db03167fa08833cd8f7c70adc2fe00dc5726a7f05


Provenance

The following attestation bundles were made for modelgenerator-0.1.1.post3-py3-none-any.whl:

Publisher: publish.yml on genbio-ai/ModelGenerator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
