Biomolecular Emulator (BioEmu)
Biomolecular Emulator (BioEmu for short) is a model that samples from the approximated equilibrium distribution of structures for a protein monomer, given its amino acid sequence.
For more information, see our paper (citation below).
This repository contains inference code and model weights.
Table of Contents
- Installation
- Sampling structures
- Steering to avoid chain breaks and clashes
- Azure AI Foundry
- Training data
- Get in touch
- Citation
Installation
bioemu is provided as a Linux-only pip-installable package. We support Python 3.10 and above:
pip install bioemu
To install with CUDA support:
pip install bioemu[cuda]
[!NOTE] BioEmu uses an inlined version of ColabFold and AlphaFold2 for MSA retrieval and embedding generation. These are bundled with the package, so no separate environment or installation is needed. On first use, AlphaFold2 model weights (~3.5 GB) will be automatically downloaded to ~/.cache/colabfold/.
Sampling structures
You can sample structures for a given protein sequence using the sample module. To run a tiny test using the default model parameters and denoising settings:
python -m bioemu.sample --sequence GYDPETGTWG --num_samples 10 --output_dir ~/test-chignolin
Alternatively, you can use the Python API:
from bioemu.sample import main as sample
sample(sequence='GYDPETGTWG', num_samples=10, output_dir='~/test_chignolin')
The model parameters will be downloaded automatically from Hugging Face. A path to a single-sequence FASTA file can also be passed to the sequence argument.
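As a minimal sketch of the FASTA route, the helper below writes a single-sequence FASTA file using only the standard library. The helper name `write_single_sequence_fasta`, the record name `query`, and the temporary file location are illustrative assumptions and are not part of BioEmu; the commented-out call only reuses the `sample` arguments shown above.

```python
import tempfile
from pathlib import Path


def write_single_sequence_fasta(sequence: str, path: Path, name: str = "query") -> Path:
    """Write one protein sequence to a FASTA file and return the resolved path."""
    sequence = sequence.strip().upper()
    if not sequence.isalpha():
        raise ValueError("sequence must contain only amino acid letters")
    path = Path(path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    # A single-sequence FASTA file: one header line, then the sequence.
    path.write_text(f">{name}\n{sequence}\n")
    return path


# Hypothetical file location, chosen only for this example.
fasta_path = write_single_sequence_fasta(
    "GYDPETGTWG", Path(tempfile.gettempdir()) / "chignolin.fasta"
)

# With bioemu installed, the path can be passed in place of the raw sequence:
# from bioemu.sample import main as sample
# sample(sequence=str(fasta_path), num_samples=10, output_dir="~/test-chignolin")
```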
Sampling times will depend on sequence length and available infrastructure. The following table gives times for collecting 1000 samples measured on an A100 GPU with 80 GB VRAM for sequences of different lengths (using a batch_size_100=20 setting in sample.py):
| sequence length | time / min |
|---|---|
| 100 | 4 |
| 300 | 40 |
| 600 | 150 |
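For rough planning, the benchmark points in the table can be interpolated. The sketch below is not part of BioEmu: it assumes power-law behaviour between neighbouring table entries (log-log interpolation) and linear scaling with the number of samples, neither of which is stated above, so treat the result as an order-of-magnitude guide for A100-class hardware only.

```python
import math

# (sequence length, minutes per 1000 samples), copied from the table above.
BENCHMARK = [(100, 4.0), (300, 40.0), (600, 150.0)]


def estimate_minutes(length: int, num_samples: int = 1000) -> float:
    """Rough runtime estimate by log-log interpolation between benchmark points."""
    if not BENCHMARK[0][0] <= length <= BENCHMARK[-1][0]:
        raise ValueError("length is outside the benchmarked range (100-600)")
    for (l0, t0), (l1, t1) in zip(BENCHMARK, BENCHMARK[1:]):
        if l0 <= length <= l1:
            frac = (math.log(length) - math.log(l0)) / (math.log(l1) - math.log(l0))
            per_1000 = math.exp(math.log(t0) + frac * (math.log(t1) - math.log(t0)))
            # Assumption: runtime scales linearly with the number of samples.
            return per_1000 * num_samples / 1000
    raise AssertionError("unreachable")
```

For example, `estimate_minutes(450)` returns a value between the 300- and 600-residue table entries.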
By default, unphysical structures (steric clashes or chain discontinuities) will be filtered out, so you will typically get fewer samples in the output than requested. The difference can be very large if your protein has large disordered regions which are very likely to produce clashes. If you want to get all generated samples in the output, irrespective of whether they are physically valid, use the --filter_samples=False argument.
[!NOTE] If you wish to use your own generated MSA instead of the ones retrieved via the ColabFold MMseqs2 server, you can pass an A3M file containing the query sequence as the first row to the sequence argument. Additionally, the msa_host_url argument can be used to override the default MSA query server. See sample.py for more options.
This code only supports sampling structures of monomers. You can try to sample multimers using the linker trick, but in our limited experiments, this has not worked well.
Steering to avoid chain breaks and clashes
BioEmu includes a steering system that uses Sequential Monte Carlo (SMC) to guide the diffusion process toward more physically plausible protein structures. Empirically, using three (or up to 10) steering particles per output sample greatly reduces the number of unphysical samples (steric clashes or chain breaks) produced by the model. Steering applies potential energy functions during denoising to favor conformations that satisfy physical constraints. Algorithmically, steering simulates multiple candidate samples per desired output sample and resamples between these particles according to the favorability of the provided potentials.
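To illustrate the resampling idea in isolation, here is a toy sketch. It is not BioEmu's implementation: the particle labels, the energy values, and the Boltzmann-style weighting `exp(-energy)` are assumptions made for this example only.

```python
import math
import random


def resample_particles(particles, energies, rng=random):
    """Resample particles with probability proportional to exp(-energy)."""
    lowest = min(energies)
    # Subtract the minimum energy before exponentiating for numerical stability.
    weights = [math.exp(-(e - lowest)) for e in energies]
    return rng.choices(particles, weights=weights, k=len(particles))


rng = random.Random(0)
particles = ["A", "B", "C"]
energies = [0.0, 0.5, 50.0]  # "C" strongly violates the toy constraint
survivors = resample_particles(particles, energies, rng)
```

Low-energy particles tend to be duplicated, while the high-energy particle is effectively never selected, so the population drifts toward conformations favored by the potentials.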
Quick start with steering
Enable steering with physical constraints using the CLI:
python -m bioemu.sample \
--sequence GYDPETGTWG \
--num_samples 100 \
--output_dir ~/steered-samples \
--steering_config src/bioemu/config/steering/physical_steering.yaml \
--denoiser_config src/bioemu/config/denoiser/stochastic_dpm.yaml
Or using the Python API:
from bioemu.sample import main as sample
sample(
    sequence='GYDPETGTWG',
    num_samples=100,
    output_dir='~/steered-samples',
    denoiser_config="../src/bioemu/config/denoiser/stochastic_dpm.yaml",  # Use stochastic DPM
    steering_config="../src/bioemu/config/steering/physical_steering.yaml",  # Use physical steering
)
Key steering parameters
- num_steering_particles: Number of particles per sample (1 = no steering, >1 enables steering)
- steering_start_time: When to start steering (0.0-1.0, default: 0.1) with reverse sampling 1 -> 0
- steering_end_time: When to stop steering (0.0-1.0, default: 0.0) with reverse sampling 1 -> 0
- resampling_interval: How often to resample particles (default: 1)
- steering_config: Path to potentials configuration file (required for steering)
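One way to read the timing parameters is sketched below. Because sampling runs in reverse from t = 1 to t = 0, steering with the defaults only becomes active once t has dropped to 0.1 and stays active until t = 0. The inclusive bounds and the step-modulo rule for `resampling_interval` are assumptions of this sketch, not confirmed BioEmu behavior, and the function names are hypothetical.

```python
def steering_active(t, steering_start_time=0.1, steering_end_time=0.0):
    """True when diffusion time t lies in [steering_end_time, steering_start_time]."""
    return steering_end_time <= t <= steering_start_time


def should_resample(step, t, resampling_interval=1,
                    steering_start_time=0.1, steering_end_time=0.0):
    """True when steering is active and the step index hits the resampling interval."""
    in_window = steering_active(t, steering_start_time, steering_end_time)
    return in_window and step % resampling_interval == 0


# Reverse-time schedule with 10 steps: 1.0, 0.9, ..., 0.0
times = [round(1 - i / 10, 1) for i in range(11)]
active_times = [t for t in times if steering_active(t)]  # [0.1, 0.0]
```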
Available potentials
The physical_steering.yaml configuration provides potentials for physical realism:
- ChainBreak: Prevents backbone discontinuities
- ChainClash: Avoids steric clashes between non-neighboring residues
To steer the system with your own potentials, create a custom steering_config.yaml file that instantiates them.
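To give a feel for what a chain-break style potential measures, here is a standalone toy energy function. It does not reproduce BioEmu's ChainBreak potential or its YAML schema: the flat-bottom quadratic form, the `tolerance` and `strength` values, and the function name are all assumptions. The only external fact used is that consecutive C-alpha atoms in a protein backbone are roughly 3.8 Å apart.

```python
import math

IDEAL_CA_CA = 3.8  # approximate distance in angstroms between consecutive C-alpha atoms


def chain_break_energy(ca_coords, tolerance=0.5, strength=1.0):
    """Flat-bottom quadratic penalty on consecutive C-alpha distances."""
    energy = 0.0
    for a, b in zip(ca_coords, ca_coords[1:]):
        # Deviations within the tolerance are not penalized.
        excess = abs(math.dist(a, b) - IDEAL_CA_CA) - tolerance
        if excess > 0:
            energy += strength * excess ** 2
    return energy


intact = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
broken = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (13.8, 0.0, 0.0)]
```

Here `chain_break_energy(intact)` is zero, while the 10 Å gap in `broken` yields a positive penalty, which is the kind of signal a resampling step can use to discard discontinuous chains.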
Azure AI Foundry
BioEmu is also available on Azure AI Foundry. See How to run BioEmu on Azure AI Foundry for more details.
Training data
The molecular dynamics training data used for BioEmu is available on Zenodo. For a full description of the datasets, see the paper.
Reproducing results from the paper
You can use this code together with code from bioemu-benchmarks to approximately reproduce results from our paper.
- The bioemu-v1.0 checkpoint contains the model weights used to produce the results in the preprint. Due to simplifications made in the embedding computation and a more efficient sampler, the results obtained with this code are not identical but consistent with the preprint statistics, i.e., mode coverage and free energy errors averaged over the proteins in a test set. Results for individual proteins may differ.
- [Default] The bioemu-v1.1 checkpoint contains the model weights used to produce the results in the published Science paper.
- The bioemu-v1.2 checkpoint contains the model weights trained from an extended set of MD simulations and experimental measurements of folding free energies.
For more details, please check the BIOEMU_RESULTS.md document on the bioemu-benchmarks repository.
To use a specific checkpoint, you can specify the model_name in the bioemu.sample args, for example, --model_name="bioemu-v1.1".
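When comparing checkpoints, it can help to assemble the command lines programmatically. The sketch below only builds argument lists from flags that appear in this README; the helper name `build_sample_command` and the output directory names are hypothetical, and nothing is executed unless you pass the lists to `subprocess.run` yourself.

```python
import sys


def build_sample_command(sequence, num_samples, output_dir,
                         model_name=None, filter_samples=None):
    """Build an argument list for `python -m bioemu.sample`."""
    cmd = [
        sys.executable, "-m", "bioemu.sample",
        "--sequence", sequence,
        "--num_samples", str(num_samples),
        "--output_dir", output_dir,
    ]
    if model_name is not None:
        cmd.append(f"--model_name={model_name}")
    if filter_samples is not None:
        cmd.append(f"--filter_samples={filter_samples}")
    return cmd


# One command per checkpoint, each writing to its own (hypothetical) directory.
commands = [
    build_sample_command("GYDPETGTWG", 10, f"samples-{name}", model_name=name)
    for name in ("bioemu-v1.0", "bioemu-v1.1", "bioemu-v1.2")
]

# To run them (requires bioemu to be installed):
# import subprocess
# for cmd in commands:
#     subprocess.run(cmd, check=True)
```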
Side-chain reconstruction and MD-relaxation
BioEmu outputs structures in backbone frame representation. To reconstruct the side-chains, several tools are available. As an example, we interface with HPacker to conduct side-chain reconstruction, and also provide basic tooling for running a short molecular dynamics (MD) equilibration.
[!WARNING] Side-chain reconstruction relies on HPacker, which requires a conda-based package manager. Make sure that conda is in your PATH and that you have CUDA12-compatible drivers before running the following code. Note that conda is not required for BioEmu's core sampling functionality.
Install optional dependencies:
pip install bioemu[md]
You can compute side-chain reconstructions via the bioemu.sidechain_relax module:
python -m bioemu.sidechain_relax --pdb-path path/to/topology.pdb --xtc-path path/to/samples.xtc
[!NOTE] The first time this module is invoked, it will attempt to install hpacker and its dependencies into a separate hpacker conda environment. If you wish for it to be installed in a different location, please set the HPACKER_ENV_NAME environment variable before using this module for the first time.
By default, side-chain reconstruction and local energy minimization are performed (no full MD integration for efficiency reasons). Note that the runtime of this code scales with the size of the system. We suggest running this code on a selection of samples rather than the full set.
There are two other options:
- To only run side-chain reconstruction without MD equilibration, add --no-md-equil.
- To run a short NVT equilibration (0.1 ns), add --md-protocol nvt_equil.
To see the full list of options, call python -m bioemu.sidechain_relax --help.
The script saves reconstructed all-heavy-atom structures in samples_sidechain_rec.{pdb,xtc} and MD-equilibrated structures in samples_md_equil.{pdb,xtc}. The base filename can be changed with --outname other_name.
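Since runtime grows with system size, it is usually worth thinning the trajectory before reconstruction. The sketch below picks evenly spaced frame indices with the standard library; the helper name is hypothetical. The commented-out part assumes the third-party MDTraj package is available for reading and writing the files, which is an assumption of this example rather than something this section prescribes, and the subset path is a placeholder.

```python
def evenly_spaced_indices(num_frames: int, num_selected: int) -> list[int]:
    """Return up to num_selected frame indices spread evenly over the trajectory."""
    if num_selected <= 0:
        raise ValueError("num_selected must be positive")
    if num_selected >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_selected
    return [int(i * step) for i in range(num_selected)]


# Optional, assuming MDTraj is installed:
# import mdtraj
# traj = mdtraj.load("path/to/samples.xtc", top="path/to/topology.pdb")
# subset = traj[evenly_spaced_indices(traj.n_frames, 20)]
# subset.save_xtc("path/to/subset.xtc")
```

The thinned file can then be passed to `--xtc-path` together with the original topology.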
Third-party code
- The code in src/bioemu/openfold/ is copied from OpenFold (Apache 2.0) with minor modifications described in the relevant source files.
- The code in src/_vendor/alphafold/ is a vendored, patched subset of AlphaFold2 v2.3.2 (Apache 2.0). See src/_vendor/alphafold/README.md for details on the modifications.
- The code in src/bioemu/colabfold_inline/ contains functions derived from ColabFold v1.5.4 (MIT). See the license headers in each file for details.
Get in touch
If you have any questions not covered here, please create an issue or contact the BioEmu team by writing to the corresponding author on our paper.
Citation
If you are using our code or model, please cite the following paper:
@article{bioemu2025,
title={Scalable emulation of protein equilibrium ensembles with generative deep learning},
author={Lewis, Sarah and Hempel, Tim and Jim{\'e}nez-Luna, Jos{\'e} and Gastegger, Michael and Xie, Yu and Foong, Andrew YK and Satorras, Victor Garc{\'\i}a and Abdin, Osama and Veeling, Bastiaan S and Zaporozhets, Iryna and Chen, Yaoyi and Yang, Soojung and Foster, Adam E. and Schneuing, Arne and Nigam, Jigyasa and Barbero, Federico and Stimper, Vincent and Campbell, Andrew and Yim, Jason and Lienen, Marten and Shi, Yu and Zheng, Shuxin and Schulz, Hannes and Munir, Usman and Sordillo, Roberto and Tomioka, Ryota and Clementi, Cecilia and No{\'e}, Frank},
journal={Science},
pages={eadv9817},
year={2025},
publisher={American Association for the Advancement of Science},
doi={10.1126/science.adv9817}
}