
Modeling bacterial genomes.


Bacformer

Bacformer is a prokaryotic foundation model that models whole bacterial genomes as a sequence of proteins ordered by their genomic coordinates on the chromosome and plasmid(s). It takes as input average protein embeddings from protein language models and computes contextualised protein embeddings conditioned on the other proteins present in the genome. Bacformer is trained on a diverse dataset of ~1.3M bacterial genomes and ~3B proteins.


Bacformer can be applied to a wide range of tasks, including strain clustering, essential gene prediction, operon identification, protein-protein interaction (PPI) prediction, protein function prediction, and more. We provide model checkpoints for pretrained models as well as for Bacformer finetuned on various tasks. We also provide tutorials and make Bacformer available via HuggingFace.

News

  • 2025-01-20: Released Bacformer Large models (complete genomes and MAGs), 300M parameter models with much improved performance on downstream tasks.
  • 2025-01-20: Released BacBench, a framework for embedding bacterial genomes with genomic language models and evaluating their performance on downstream tasks.
  • 2025-11-21: Bacformer won the AI x Bio hackathon organised by Evolved Technology 🎉.
  • 2025-07-21: The Bacformer preprint is now available on bioRxiv.
  • 2025-05-15: Bacformer is now available on HuggingFace.


Setup

Requirements

Bacformer is based on PyTorch and HuggingFace Transformers and was developed with Python 3.10.

Bacformer uses protein embeddings as input, leveraging pretrained protein language models:

  • Bacformer (26M parameters) uses ESM-2 (esm2_t12_35M_UR50D)
  • Bacformer Large (300M parameters) uses ESM-C (Synthyra/ESMplusplus_small)

We recommend using the faesm package to compute protein embeddings in a fast and efficient way.

Note: ESM++ is a faithful implementation of ESM-C (see its license).
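
As a rough illustration of what these average protein embeddings are (not the exact preprocessing Bacformer uses), the sketch below mean-pools per-residue ESM-2 representations using the plain HuggingFace transformers implementation; faesm exposes the same models with faster attention, and the bacformer helpers shown under Usage handle this step for you:

import torch
from transformers import AutoModel, AutoTokenizer

# load the ESM-2 model used by the base Bacformer (for illustration only)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
plm = AutoModel.from_pretrained("facebook/esm2_t12_35M_UR50D").eval()

protein_sequences = ["MGYDLVAGFQKNVRTI", "MKAILVVLLG"]
batch = tokenizer(protein_sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = plm(**batch).last_hidden_state  # (n_proteins, seq_len, dim)

# average over residue positions, ignoring padding
mask = batch["attention_mask"].unsqueeze(-1)
avg_protein_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (n_proteins, dim)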

Installation

We recommend first installing PyTorch 2.2 or above (pip install "torch>=2.2"), then installing flash attention (pip install flash-attn --no-build-isolation), and finally installing bacformer via pip:

pip install bacformer

or by cloning the repository and installing the dependencies:

git clone https://github.com/macwiatrak/Bacformer.git
cd Bacformer
# 1) install Bacformer **with its core dependencies**
pip install .
# 2) (optional but recommended) add the fast-attention extra ("faesm")
pip install ".[faesm]"
Having trouble installing?

Create a clean conda environment and install the CUDA toolkit 12.1.0 for compilation:

# Create new environment with Python 3.10
micromamba create -n bacformer python=3.10 -y

# Activate the environment
micromamba activate bacformer

# Install CUDA toolkit
micromamba install -c nvidia/label/cuda-12.1.0 cuda-toolkit -y

# Install PyTorch with CUDA support (using pip for latest version)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install flash-attention
pip install flash-attn --no-build-isolation --no-cache-dir

# Optional: verify installations
python -c "import torch; print(f'PyTorch version: {torch.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# install faesm package for fast ESM-2 embeddings (recommended)
pip install faesm[flash_attn]

# finally, install the bacformer package
pip install bacformer

Another workaround is to use a Docker container. The official NVIDIA PyTorch containers ship with all the dependencies needed for flash attention.
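
For example (a sketch only; the container tag is illustrative, pick a current one from the NGC catalogue):

# start an official NVIDIA PyTorch container with GPU access
docker run --gpus all -it nvcr.io/nvidia/pytorch:24.04-py3

# inside the container, the toolchain for building flash attention is already present
pip install flash-attn --no-build-isolation
pip install bacformer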

Usage

Below are examples of how to use Bacformer to compute contextualised protein embeddings.

Computing contextual protein embeddings on a set of toy protein sequences

import torch
from transformers import AutoModel
from bacformer.pp import protein_seqs_to_bacformer_inputs

device = "cuda:0"
model = AutoModel.from_pretrained(
    "macwiatrak/bacformer-large-masked-MAG", trust_remote_code=True
).to(device).eval().to(torch.bfloat16)

# Example input: a sequence of protein sequences
# (in this case, 4 toy protein sequences).
# Bacformer was trained with a maximum of 6000 proteins per genome.
protein_sequences = [
    "MGYDLVAGFQKNVRTI",
    "MKAILVVLLG",
    "MQLIESRFYKDPWGNVHATC",
    "MSTNPKPQRFAWL",
]
# embed the proteins with a protein language model to get average protein embeddings
inputs = protein_seqs_to_bacformer_inputs(
    protein_sequences,
    device=device,
    batch_size=128,  # the batch size for computing the protein embeddings
    max_n_proteins=6000,  # the maximum number of proteins Bacformer was trained with
    bacformer_model_type="large", # must be equal to "large" (Bacformer Large 300M) or "base" (Bacformer 26M)
)

# compute contextualised protein embeddings with Bacformer
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

# (batch_size, n_prots + special tokens or max_n_proteins, embedding_dim)
print('last hidden state shape:', outputs["last_hidden_state"].shape)
# (batch_size, embedding_dim)
print('genome embedding shape:', outputs.last_hidden_state.mean(dim=1).shape)
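
If you need a single vector per genome (e.g. for comparing strains), you can mean-pool the contextualised embeddings as above and compare genomes directly; a minimal sketch building on the outputs object from this example:

import torch.nn.functional as F

# mean-pool over the protein dimension to get one embedding per genome
genome_emb = outputs.last_hidden_state.mean(dim=1)  # (batch_size, embedding_dim)

# pairwise cosine similarity between the genome embeddings in the batch
sim = F.cosine_similarity(genome_emb.unsqueeze(1), genome_emb.unsqueeze(0), dim=-1)
print('genome-genome similarity matrix shape:', sim.shape)  # (batch_size, batch_size)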

Encoding contig/chromosome/plasmid information

Bacformer encodes contig/chromosome/plasmid information by adding a contig embedding to each protein. For the model to account for it, pass the contig ID of each protein as input. By default, Bacformer incorporates contig information whenever it is available.

NOTE: Every input protein sequence must belong to a contig; otherwise, all proteins are treated as belonging to a single contig.

import torch
from transformers import AutoModel
from bacformer.pp import protein_seqs_to_bacformer_inputs

device = "cuda:0"
model = AutoModel.from_pretrained(
    "macwiatrak/bacformer-large-masked-MAG", trust_remote_code=True
).to(device).eval().to(torch.bfloat16)

# Example input: a sequence of protein sequences
# (in this case, 4 toy protein sequences).
# Bacformer was trained with a maximum of 6000 proteins per genome.
protein_sequences = [
    "MGYDLVAGFQKNVRTI",
    "MKAILVVLLG",
    "MQLIESRFYKDPWGNVHATC",
    "MSTNPKPQRFAWL",
]
# contig IDs for each protein sequence
contig_ids = ["contig_1", "contig_1", "contig_2", "contig_3"]
# NOTE: equivalently, you can indicate the contig IDs by representing the protein_sequences list as a nested list
# i.e. protein_sequences = [["MGYDLVAGFQKNVRTI", "MKAILVVLLG"], ["MQLIESRFYKDPWGNVHATC"], ["MSTNPKPQRFAWL"]]
# and skipping the contig_ids argument below (defaults to None)

# embed the proteins with a protein language model to get average protein embeddings
inputs = protein_seqs_to_bacformer_inputs(
    protein_sequences,
    contig_ids=contig_ids,
    device=device,
    batch_size=128,  # the batch size for computing the protein embeddings
    max_n_proteins=6000,  # the maximum number of proteins Bacformer was trained with
    bacformer_model_type="large",
)

# contig_ids represent the contig information for each protein
print(inputs['contig_ids'])
# compute contextualised protein embeddings with Bacformer
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

Processing and embedding a whole bacterial genome

Process a whole bacterial genome assembly from GenBank (in this case, the Pseudomonas aeruginosa PAO1 genome) and compute contextualised protein embeddings with Bacformer.

import torch
from transformers import AutoModel
from bacformer.pp import preprocess_genome_assembly, protein_seqs_to_bacformer_inputs


# preprocess a bacterial genome assembly
genome_info = preprocess_genome_assembly(filepath="files/pao1.gbff")

# load the model
device = "cuda:0"
model = AutoModel.from_pretrained(
    "macwiatrak/bacformer-large-masked-complete-genomes", trust_remote_code=True
).to(device).eval().to(torch.bfloat16)


# embed the proteins with a protein language model to get average protein embeddings; takes <1 min on an A100 GPU
inputs = protein_seqs_to_bacformer_inputs(
    genome_info['protein_sequence'],
    device=device,
    batch_size=128,  # the batch size for computing the protein embeddings
    max_n_proteins=6000,  # the maximum number of proteins Bacformer was trained with
    bacformer_model_type="large", # must be equal to "large" (Bacformer Large 300M) or "base" (Bacformer 26M)
)

# compute contextualised protein embeddings with Bacformer
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

# the resulting contextualised protein embeddings can be used for analysis
print('last hidden state shape:', outputs["last_hidden_state"].shape)
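
If you want to reuse the embeddings later (e.g. in a separate analysis notebook), you can move them to CPU and save them to disk; a minimal sketch:

# detach from the graph, cast to float32 for portability and save to disk
embeddings = outputs["last_hidden_state"].detach().float().cpu()
torch.save(embeddings, "pao1_bacformer_embeddings.pt")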

Embedding a dataset column with Bacformer

Use Bacformer to embed a column of protein sequences from a HuggingFace dataset. The example below can easily be adapted to a pandas DataFrame or any other data structure containing protein sequences.

Below we show how to compute contextualised protein embeddings for all proteins in a genome (as required for operon prediction), and how to compute a single embedding per genome (as required for strain clustering).

from bacformer.pp import embed_dataset_col
from datasets import load_dataset


# load the operon dataset from long-read RNA sequencing
operon_dataset = load_dataset("macwiatrak/operon-identification-long-read-rna-sequencing", split="test")

# embed the protein sequences with Bacformer
# we compute contextualised protein embeddings for all proteins in the genome
operon_dataset = embed_dataset_col(
    dataset=operon_dataset,
    model_path="macwiatrak/bacformer-masked-complete-genomes",
    max_n_proteins=9000,
    genome_pooling_method=None,  # set to None to get embeddings for all proteins in the genome
    model_type="bacformer",  # for Bacformer 26M model, use "bacformer_large" for Bacformer Large 300M
)


# load the strain clustering toy dataset
strain_clustering_dataset = load_dataset("macwiatrak/strain-clustering-protein-sequences-sample", split="train")

# embed the protein sequences with Bacformer
# use mean genome pooling as we need a single genome embedding for each genome for clustering
strain_clustering_dataset = embed_dataset_col(
    dataset=strain_clustering_dataset,
    model_path="macwiatrak/bacformer-large-masked-MAG",
    max_n_proteins=9000,
    genome_pooling_method="mean",
    model_type="bacformer_large",  # for Bacformer 300M model, use "bacformer" for Bacformer Large 26M
)

# convert to pandas and print the first 5 rows
strain_clustering_df = strain_clustering_dataset.to_pandas()
strain_clustering_df.head()
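
From here the pooled genome embeddings can be clustered with any standard method. Below is a sketch using scikit-learn KMeans; note that the embedding column name ("embeddings") and the number of clusters are assumptions for illustration, so check which column embed_dataset_col actually adds to the dataset:

import numpy as np
from sklearn.cluster import KMeans

# stack the per-genome embeddings into a (n_genomes, embedding_dim) matrix
# NOTE: the column name "embeddings" is an assumption, inspect the dataframe to confirm
X = np.stack(strain_clustering_df["embeddings"].to_list())

# cluster into strain groups (k=5 chosen purely for illustration)
strain_clustering_df["cluster"] = KMeans(n_clusters=5, random_state=0, n_init="auto").fit_predict(X)
strain_clustering_df["cluster"].value_counts()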

Tutorials

We provide a set of tutorials to help you get started with Bacformer.

We are actively working on more tutorials, so stay tuned! If you have any suggestions for tutorials, please let us know by raising an issue in the issue tracker.

HuggingFace

Bacformer is integrated with HuggingFace.

import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoModelForCausalLM

device = "cuda:0"
# load the Bacformer Large model trained on complete genomes with a masked objective
masked_large_model = AutoModelForMaskedLM.from_pretrained("macwiatrak/bacformer-large-masked-complete-genomes", trust_remote_code=True).to(torch.bfloat16).eval().to(device)

# load the Bacformer model trained on MAGs with an autoregressive objective
causal_model = AutoModelForCausalLM.from_pretrained("macwiatrak/bacformer-causal-MAG", trust_remote_code=True).to(torch.bfloat16).eval().to(device)

# load the Bacformer model trained on MAGs with a masked objective
masked_model = AutoModelForMaskedLM.from_pretrained("macwiatrak/bacformer-masked-MAG", trust_remote_code=True).to(torch.bfloat16).eval().to(device)

# load the Bacformer encoder model finetuned on complete genomes (i.e. without the protein family classification head)
# we recommend this model as a starting point for finetuning on your own complete-genome dataset for all tasks except generation
encoder_model = AutoModel.from_pretrained("macwiatrak/bacformer-large-masked-complete-genomes", trust_remote_code=True).to(torch.bfloat16).eval().to(device)

Pretrained model checkpoints

We provide a range of pretrained model checkpoints for Bacformer which are available via HuggingFace.

Each entry lists the checkpoint name, the genome type it was trained on (MAG = metagenome-assembled genomes; complete = complete, i.e. uninterrupted, genomes), and a description.

  • bacformer-large-masked-MAG (MAG): A 300M parameter model pretrained on ~1.3M metagenome-assembled genomes (MAGs) with a masked objective, randomly masking 15% of the proteins.
  • bacformer-large-masked-complete-genomes (complete): A 300M parameter model pretrained on ~1.3M MAGs with a masked objective, randomly masking 15% of the proteins.
  • bacformer-causal-MAG (MAG): A 26M parameter model pretrained on ~1.3M MAGs with an autoregressive objective.
  • bacformer-masked-MAG (MAG): A 26M parameter model pretrained on ~1.3M MAGs with a masked objective, randomly masking 15% of the proteins.
  • bacformer-causal-complete-genomes (complete): bacformer-causal-MAG finetuned on a set of ~40k complete genomes with an autoregressive objective.
  • bacformer-masked-complete-genomes (complete): bacformer-masked-MAG finetuned on a set of ~40k complete genomes with a masked objective, randomly masking 15% of the proteins.
  • bacformer-causal-protein-family-modeling-complete-genomes (complete): bacformer-causal-MAG finetuned on a set of ~40k complete genomes with an autoregressive objective. In contrast to the other models, it takes protein-family tokens as input rather than protein sequences, allowing generation of sequences of protein families.
  • bacformer-for-essential-genes-prediction (complete): bacformer-masked-complete-genomes finetuned on the essential gene prediction task.

Contributing

We welcome contributions to Bacformer! If you would like to contribute, please follow these steps:

  1. Fork the repository.
  2. Install pre-commit and set up the pre-commit hooks (make sure to do it at the root of the repository).
pip install pre-commit
pre-commit install
  3. Create a new branch for your feature or bug fix.
  4. Make your changes and commit them.
  5. Push your changes to your forked repository.
  6. Create a pull request to the main repository.
  7. Make sure to add tests for your changes and run the tests to ensure everything is working correctly (see the sketch below).
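
Before opening the pull request, you can run the hooks and tests locally; a minimal sketch (assuming the test suite uses pytest):

# run all pre-commit hooks against the full repository
pre-commit run --all-files

# run the test suite (assumes pytest)
pytest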

Contact

For questions, bugs, and feature requests, please raise an issue in the repository.

Citation

@article{Wiatrak2025.07.20.665723,
	author = {Wiatrak, Maciej and Vi{\~n}as Torn{\'e}, Ramon and Ntemourtsidou, Maria and Dinan, Adam M. and Abelson, David C. and Arora, Divya and Brbi{\'c}, Maria and Weimann, Aaron and Floto, Rodrigo Andres},
	title = {A contextualised protein language model reveals the functional syntax of bacterial evolution},
	elocation-id = {2025.07.20.665723},
	year = {2025},
	doi = {10.1101/2025.07.20.665723},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/07/20/2025.07.20.665723},
	eprint = {https://www.biorxiv.org/content/early/2025/07/20/2025.07.20.665723.full.pdf},
	journal = {bioRxiv}
}
