Biological prediction models made simple.

Biotrainer

Overview

Biotrainer is an open-source framework that simplifies machine learning model development for protein analysis. It provides:

  • Easy-to-use training and inference pipelines for protein feature prediction
  • Standardized data formats for various prediction tasks
  • Built-in support for protein language models and embeddings
  • Flexible configuration through simple YAML files

Quick Start

1. Installation

Install using pip:

pip install biotrainer

Manual installation from a local clone of the repository, using uv:

# First, install uv if you haven't already:
pip install uv

# Create and activate a virtual environment
uv venv
source .venv/bin/activate  # On Unix/macOS
# OR
.venv\Scripts\activate  # On Windows

# Basic installation
uv pip install -e .

# Installing with jupyter notebook support:
uv pip install -e ".[jupyter]"

# Installing with onnxruntime support (for onnx embedders and inference):
uv pip install -e ".[onnx-cpu]"    # CPU version
uv pip install -e ".[onnx-gpu]"    # CUDA version
uv pip install -e ".[onnx-mac]"    # CoreML version (for Apple Silicon)

# You can also combine extras:
uv pip install -e ".[jupyter,onnx-cpu]"

# For Windows users with CUDA support:
# Visit https://pytorch.org/get-started/locally/ and follow GPU-specific installation, e.g.:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
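
After installing, it can help to confirm which optional dependencies are actually importable in the active environment. The following is a minimal sketch using only the standard library. The helper name `installed_optional_packages` and the mapping of features to module names (`torch`, `onnxruntime`, `notebook`) are assumptions made for illustration and are not part of biotrainer.

```python
from importlib.util import find_spec

# Assumed module names for the optional features mentioned above (illustrative only).
OPTIONAL_MODULES = {
    "pytorch": "torch",
    "onnx": "onnxruntime",
    "jupyter": "notebook",
}


def installed_optional_packages(modules=OPTIONAL_MODULES):
    """Return {feature: True/False} depending on whether the module can be found."""
    status = {}
    for feature, module_name in modules.items():
        try:
            status[feature] = find_spec(module_name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially initialised installs.
            status[feature] = False
    return status


if __name__ == "__main__":
    for feature, available in installed_optional_packages().items():
        print(f"{feature}: {'available' if available else 'not installed'}")
```

This only checks that a module can be located; it does not verify that a GPU is usable.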

2. Basic Usage

# Training
biotrainer train --config examples/sequence_to_class/config.yml

# Inference
python3
>>> from biotrainer.inference import Inferencer
>>> inferencer, _ = Inferencer.create_from_out_file('output/out.yml')
>>> predictions = inferencer.from_embeddings(your_embeddings)

3. Quick Start Datasets

  • Secondary Structure Prediction
  • Subcellular Localization Prediction

Features

Supported Prediction Tasks

  • Residue-level classification (residue_to_class)
  • Residue-level regression (residue_to_value) [BETA]
  • Sequence-level classification (sequence_to_class)
  • Sequence-level regression (sequence_to_value)
  • Residues-level classification (residues_to_class, like sequence_to_class with per-residue embeddings)
  • Residues-level regression (residues_to_value, like sequence_to_value with per-residue embeddings)
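
The protocols differ in whether the model input is one embedding per residue or one per sequence, and in whether there is one target per residue or one per protein. The sketch below summarises this as a lookup table. The class `ProtocolShape` and the function `expected_shapes` are hypothetical helpers, and the assumption that `sequence_to_*` protocols consume a single per-sequence embedding is inferred from the list above rather than taken from biotrainer's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolShape:
    per_residue_input: bool   # True: one embedding vector per residue
    per_residue_output: bool  # True: one prediction per residue
    target: str               # "class" or "value"


# Summary of the protocol list above; illustrative, not biotrainer code.
PROTOCOLS = {
    "residue_to_class": ProtocolShape(True, True, "class"),
    "residue_to_value": ProtocolShape(True, True, "value"),
    "sequence_to_class": ProtocolShape(False, False, "class"),
    "sequence_to_value": ProtocolShape(False, False, "value"),
    "residues_to_class": ProtocolShape(True, False, "class"),
    "residues_to_value": ProtocolShape(True, False, "value"),
}


def expected_shapes(protocol, seq_len, emb_dim):
    """Return (input_shape, number_of_targets) for a single protein."""
    p = PROTOCOLS[protocol]
    input_shape = (seq_len, emb_dim) if p.per_residue_input else (emb_dim,)
    n_targets = seq_len if p.per_residue_output else 1
    return input_shape, n_targets
```

For a protein of 120 residues and 1024-dimensional embeddings, `expected_shapes("residues_to_class", 120, 1024)` returns `((120, 1024), 1)`, whereas `residue_to_class` would return `((120, 1024), 120)`.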

Built-in Capabilities

  • Multiple embedding methods (ProtT5, ESM-2, ONNX, etc.)
  • Various neural network architectures
  • Cross-validation and model evaluation
  • Performance metrics and visualization
  • Sanity checks and automatic calculation of baselines (such as random or mean predictions)
  • Docker support for reproducible environments
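
To illustrate what a baseline comparison means, here is a minimal standard-library sketch of a mean baseline for regression and a random baseline for classification. It is not biotrainer's implementation; the function names and the choice of mean squared error and accuracy as metrics are assumptions for this example.

```python
import random
from statistics import mean


def mean_baseline_mse(train_targets, test_targets):
    """MSE obtained by always predicting the mean of the training targets."""
    prediction = mean(train_targets)
    return mean((y - prediction) ** 2 for y in test_targets)


def random_baseline_accuracy(train_labels, test_labels, seed=0):
    """Accuracy obtained by guessing uniformly among the classes seen in training."""
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    guesses = [rng.choice(classes) for _ in test_labels]
    return sum(g == y for g, y in zip(guesses, test_labels)) / len(test_labels)
```

A trained model that does not clearly beat such baselines on the test set is a sign that something is wrong with the data, the labels, or the training setup.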

Documentation

Example Configuration

protocol: residue_to_class
input_file: input.fasta
model_choice: CNN
optimizer_choice: adam
learning_rate: 1e-3
loss_choice: cross_entropy_loss
use_class_weights: True
num_epochs: 200
batch_size: 128
embedder_name: Rostlab/prot_t5_xl_uniref50
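
Because the configuration is a flat list of `key: value` pairs, it is easy to generate from a script, for example when trying several settings. The sketch below renders the example above from a Python dictionary using only the standard library. `render_flat_yaml` is a hypothetical helper and handles only simple scalar values; it is not a general YAML writer.

```python
# Values copied from the example configuration above.
config = {
    "protocol": "residue_to_class",
    "input_file": "input.fasta",
    "model_choice": "CNN",
    "optimizer_choice": "adam",
    "learning_rate": "1e-3",  # kept as a string to preserve the literal
    "loss_choice": "cross_entropy_loss",
    "use_class_weights": True,
    "num_epochs": 200,
    "batch_size": 128,
    "embedder_name": "Rostlab/prot_t5_xl_uniref50",
}


def render_flat_yaml(options):
    """Render a flat mapping as `key: value` lines (simple scalars only)."""
    return "".join(f"{key}: {value}\n" for key, value in options.items())


if __name__ == "__main__":
    print(render_flat_yaml(config), end="")
```

The rendered text can be saved to a file such as `config.yml` and passed to `biotrainer train --config config.yml`.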

Docker Support

# Run using pre-built image
docker run --gpus all --rm \
    -v "$(pwd)/examples/docker":/mnt \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    ghcr.io/sacdallago/biotrainer:latest /mnt/config.yml

For more information on running Docker with GPUs, see the NVIDIA Container Toolkit documentation.

Citation

@inproceedings{
sanchez2022standards,
title={Standards, tooling and benchmarks to probe representation learning on proteins},
author={Joaquin Gomez Sanchez and Sebastian Franz and Michael Heinzinger and Burkhard Rost and Christian Dallago},
booktitle={NeurIPS 2022 Workshop on Learning Meaningful Representations of Life},
year={2022},
url={https://openreview.net/forum?id=adODyN-eeJ8}
}
