Biotrainer
Biological prediction models made simple.
Overview
Biotrainer is an open-source framework that simplifies machine learning model development for protein analysis. It provides:
- Easy-to-use training and inference pipelines for protein feature prediction
- Standardized data formats for various prediction tasks
- Built-in support for protein language models and embeddings
- Flexible configuration through simple YAML files
Quick Start
1. Installation
Install using pip:
pip install biotrainer
Manual installation using uv:
# First, install uv if you haven't already:
pip install uv
# Create and activate a virtual environment
uv venv
source .venv/bin/activate # On Unix/macOS
# OR
.venv\Scripts\activate # On Windows
# Basic installation
uv pip install -e .
# Installing with jupyter notebook support:
uv pip install -e ".[jupyter]"
# Installing with onnxruntime support (for onnx embedders and inference):
uv pip install -e ".[onnx-cpu]" # CPU version
uv pip install -e ".[onnx-gpu]" # CUDA version
uv pip install -e ".[onnx-mac]" # CoreML version (for Apple Silicon)
# You can also combine extras:
uv pip install -e ".[jupyter,onnx-cpu]"
# For Windows users with CUDA support:
# Visit https://pytorch.org/get-started/locally/ and follow GPU-specific installation, e.g.:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
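After installation, it can help to confirm which optional backends are importable in the active environment. The following is a minimal sketch using only the Python standard library. The helper name `check_optional_backends` is hypothetical and not part of biotrainer, and the module list is an assumption based on the installation options above.

```python
from importlib.util import find_spec

# Modules assumed to matter for the installation options above
# (an assumption, not an official list): the package itself,
# PyTorch, and ONNX Runtime.
MODULES = ("biotrainer", "torch", "onnxruntime")


def check_optional_backends(modules=MODULES):
    """Return {module_name: bool} telling whether each module can be found."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in check_optional_backends().items():
        print(f"{name}: {'installed' if available else 'missing'}")
```

A missing `onnxruntime` entry would suggest that none of the onnx extras has been installed yet.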
2. Basic Usage
# Training
biotrainer train --config examples/sequence_to_class/config.yml
# Inference
python3
>>> from biotrainer.inference import Inferencer
>>> inferencer, _ = Inferencer.create_from_out_file('output/out.yml')
>>> predictions = inferencer.from_embeddings(your_embeddings)
3. Quick Start Datasets
- Subcellular Localization Prediction
  - Protocol: sequence_to_class/residues_to_class
  - Citations and Download
- Secondary Structure Prediction
  - Protocol: residue_to_class
  - Citations and Download
Features
Supported Prediction Tasks
- Residue-level classification (residue_to_class)
- Residue-level regression (residue_to_value) [BETA]
- Sequence-level classification (sequence_to_class)
- Sequence-level regression (sequence_to_value)
- Residues-level classification (residues_to_class, like sequence_to_class with per-residue embeddings)
- Residues-level regression (residues_to_value, like sequence_to_value with per-residue embeddings)
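The protocol names encode what goes into the model and what comes out. The sketch below summarizes the list above as a lookup table. `ProtocolInfo`, `PROTOCOLS`, and `expected_num_targets` are hypothetical names for illustration only and are not biotrainer objects. The embedding level for the residue_to_* and sequence_to_* protocols is inferred from the descriptions above rather than stated explicitly.

```python
from typing import NamedTuple


class ProtocolInfo(NamedTuple):
    embedding_level: str  # "per-residue" or "per-sequence" input embeddings
    target_level: str     # one target per "residue" or one per "sequence"
    task: str             # "classification" or "regression"


# Illustrative summary of the supported prediction tasks (not a biotrainer API).
PROTOCOLS = {
    "residue_to_class": ProtocolInfo("per-residue", "residue", "classification"),
    "residue_to_value": ProtocolInfo("per-residue", "residue", "regression"),  # BETA
    "sequence_to_class": ProtocolInfo("per-sequence", "sequence", "classification"),
    "sequence_to_value": ProtocolInfo("per-sequence", "sequence", "regression"),
    "residues_to_class": ProtocolInfo("per-residue", "sequence", "classification"),
    "residues_to_value": ProtocolInfo("per-residue", "sequence", "regression"),
}


def expected_num_targets(protocol: str, sequence_length: int) -> int:
    """Number of targets for one protein: one per residue or one per sequence."""
    info = PROTOCOLS[protocol]
    return sequence_length if info.target_level == "residue" else 1
```

For a protein of 120 residues, `expected_num_targets("residue_to_class", 120)` gives 120, while `expected_num_targets("residues_to_class", 120)` gives 1, since the latter predicts a single label from per-residue embeddings.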
Built-in Capabilities
- Multiple embedding methods (ProtT5, ESM-2, ONNX, etc.)
- Various neural network architectures
- Cross-validation and model evaluation
- Performance metrics and visualization
- Sanity checks and automatic calculation of baselines (such as random or mean predictions)
- Docker support for reproducible environments
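To illustrate what the baselines mentioned in the list above can look like, here is a generic sketch of a mean baseline for regression and a random baseline for classification. It is not biotrainer's implementation; the function names and the choice of metrics (mean squared error and accuracy) are assumptions made for this example.

```python
import random
from collections import Counter
from statistics import mean


def mean_baseline_mse(train_targets, test_targets):
    """MSE obtained by always predicting the training-set mean."""
    prediction = mean(train_targets)
    return mean([(y - prediction) ** 2 for y in test_targets])


def random_baseline_accuracy(train_labels, test_labels, seed=42):
    """Accuracy of guessing labels at random, weighted by training class frequency."""
    rng = random.Random(seed)  # fixed seed keeps the baseline reproducible
    counts = Counter(train_labels)
    classes, weights = zip(*counts.items())
    guesses = rng.choices(classes, weights=weights, k=len(test_labels))
    hits = sum(guess == label for guess, label in zip(guesses, test_labels))
    return hits / len(test_labels)
```

A trained model that does not clearly beat such baselines on the test set has probably not learned anything useful from the embeddings.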
Autoeval
The biotrainer autoeval module automatically evaluates a protein language model on downstream tasks.
You can find public results (work in progress) on the autoeval dashboard and compare them
to your own. Learn more in the docs or in the autoeval examples.
Documentation
Tutorials
Detailed Guides
Example Configuration
protocol: residue_to_class
input_file: input.fasta
model_choice: CNN
optimizer_choice: adam
learning_rate: 1e-3
loss_choice: cross_entropy_loss
use_class_weights: True
num_epochs: 200
batch_size: 128
embedder_name: Rostlab/prot_t5_xl_uniref50
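When running many experiments, it can be convenient to generate such configuration files from a script. The following sketch writes the example configuration above to disk using only the standard library. `to_flat_yaml` and `write_config` are hypothetical helpers, not part of biotrainer, and the serializer only handles flat key/value pairs with simple scalar values.

```python
from pathlib import Path

# Same keys and values as the example configuration above.
CONFIG = {
    "protocol": "residue_to_class",
    "input_file": "input.fasta",
    "model_choice": "CNN",
    "optimizer_choice": "adam",
    "learning_rate": 1e-3,
    "loss_choice": "cross_entropy_loss",
    "use_class_weights": True,
    "num_epochs": 200,
    "batch_size": 128,
    "embedder_name": "Rostlab/prot_t5_xl_uniref50",
}


def to_flat_yaml(config: dict) -> str:
    """Serialize a flat dict of scalars as 'key: value' lines (no nesting, no quoting)."""
    return "".join(f"{key}: {value}\n" for key, value in config.items())


def write_config(config: dict, path: Path) -> Path:
    """Write the configuration to `path` and return the path."""
    path.write_text(to_flat_yaml(config), encoding="utf-8")
    return path
```

The resulting file can then be passed to `biotrainer train --config` as shown under Basic Usage. For nested or more complex configurations, a full YAML library would be the safer choice.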
Docker Support
# Run using pre-built image
docker run --gpus all --rm \
-v "$(pwd)/examples/docker":/mnt \
-u $(id -u ${USER}):$(id -g ${USER}) \
ghcr.io/sacdallago/biotrainer:latest /mnt/config.yml
More information on running Docker with GPUs: NVIDIA Container Toolkit
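The Docker invocation above can also be assembled from Python, for example to launch several runs from a script. This is a minimal sketch that mirrors the flags of the shell command. `build_docker_command` is a hypothetical helper, and the `use_gpu` switch is an assumption added for machines without a GPU.

```python
import os
from pathlib import Path

IMAGE = "ghcr.io/sacdallago/biotrainer:latest"


def build_docker_command(config_dir, config_name="config.yml", use_gpu=True):
    """Return the argument list for running the pre-built image on a mounted directory."""
    command = ["docker", "run", "--rm"]
    if use_gpu:
        command += ["--gpus", "all"]
    # Mount the directory containing the config file at /mnt inside the container.
    command += ["-v", f"{Path(config_dir).resolve()}:/mnt"]
    # Run as the calling user, as the shell command does with id -u / id -g (Unix only).
    if hasattr(os, "getuid"):
        command += ["-u", f"{os.getuid()}:{os.getgid()}"]
    command += [IMAGE, f"/mnt/{config_name}"]
    return command
```

The returned list can be passed to `subprocess.run(..., check=True)`, which avoids shell quoting issues with paths that contain spaces.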
Getting Help
- Check our Troubleshooting Guide
- Create an issue
- Visit biocentral.cloud
Citation
@inproceedings{
sanchez2022standards,
title={Standards, tooling and benchmarks to probe representation learning on proteins},
author={Joaquin Gomez Sanchez and Sebastian Franz and Michael Heinzinger and Burkhard Rost and Christian Dallago},
booktitle={NeurIPS 2022 Workshop on Learning Meaningful Representations of Life},
year={2022},
url={https://openreview.net/forum?id=adODyN-eeJ8}
}