
Modeling bacterial genomes.


Bacformer

Bacformer is a prokaryotic foundation model that models whole bacterial genomes as a sequence of proteins ordered by their genomic coordinates on the chromosome and plasmid(s). It takes as input average protein embeddings from protein language models and computes contextualised protein embeddings conditioned on the other proteins present in the genome. Bacformer is trained on a diverse dataset of ~1.3M bacterial genomes and ~3B proteins.


Bacformer can be applied to a wide range of tasks, including strain clustering, essential gene prediction, operon identification, protein-protein interaction (PPI) prediction, protein function prediction, and more. We provide model checkpoints for pretrained models as well as for Bacformer finetuned on various tasks. We also provide tutorials and make Bacformer available via HuggingFace.

News

  • 2025-01-20: Released Bacformer Large models (complete genomes and MAGs), 300M parameter models with much improved performance on downstream tasks.
  • 2025-01-20: Released BacBench, a framework for embedding bacterial genomes with genomic language models and evaluating their performance on downstream tasks.
  • 2025-11-21: Bacformer won the AI x Bio hackathon organised by Evolved Technology 🎉.
  • 2025-07-21: The Bacformer preprint is now available on bioRxiv.
  • 2025-05-15: Bacformer is now available on HuggingFace.


Setup

Requirements

Bacformer is based on PyTorch and HuggingFace Transformers and was developed with Python 3.10.

Bacformer uses protein embeddings as input, leveraging pretrained protein language models:

  • Bacformer (26M parameters) uses ESM-2 (esm2_t12_35M_UR50D)
  • Bacformer Large (300M parameters) uses ESM-C (Synthyra/ESMplusplus_small)

We recommend using the faesm package to compute protein embeddings in a fast and efficient way.

Note: ESM++ is a faithful implementation of ESM-C (see its license).
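
As a rough illustration of what these average protein embeddings are (not the exact preprocessing Bacformer uses), the sketch below mean-pools per-residue ESM-2 representations using the plain HuggingFace transformers implementation; faesm exposes the same models with faster attention, and the bacformer helpers shown under Usage handle this step for you:

import torch
from transformers import AutoModel, AutoTokenizer

# load the ESM-2 model used by the base Bacformer (for illustration only)
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
plm = AutoModel.from_pretrained("facebook/esm2_t12_35M_UR50D").eval()

protein_sequences = ["MGYDLVAGFQKNVRTI", "MKAILVVLLG"]
batch = tokenizer(protein_sequences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = plm(**batch).last_hidden_state  # (n_proteins, seq_len, dim)

# average over residue positions, ignoring padding
mask = batch["attention_mask"].unsqueeze(-1)
avg_protein_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (n_proteins, dim)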

Installation

We recommend first installing PyTorch 2.2 or above (pip install "torch>=2.2"), then installing flash attention (pip install flash-attn --no-build-isolation), and finally installing bacformer via pip:

pip install bacformer

or by cloning the repository and installing the dependencies:

git clone https://github.com/macwiatrak/Bacformer.git
cd Bacformer
# 1) install Bacformer **with its core dependencies**
pip install .
# 2) (optional but recommended) add the fast-attention extra ("faesm")
pip install ".[faesm]"
Having trouble installing?

Create a clean conda environment and install the CUDA toolkit 12.1.0 for compilation:

# Create new environment with Python 3.10
micromamba create -n bacformer python=3.10 -y

# Activate the environment
micromamba activate bacformer

# Install CUDA toolkit
micromamba install -c nvidia/label/cuda-12.1.0 cuda-toolkit -y

# Install PyTorch with CUDA support (using pip for latest version)
pip install torch --index-url https://download.pytorch.org/whl/cu121

# Install flash-attention
pip install flash-attn --no-build-isolation --no-cache-dir

# Optional: verify installations
python -c "import torch; print(f'PyTorch version: {torch.__version__}')"
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"

# install faesm package for fast ESM-2 embeddings (recommended)
pip install faesm[flash_attn]

# finally, install the bacformer package
pip install bacformer

Another workaround is to use a Docker container. The official NVIDIA PyTorch containers ship with all the dependencies needed for flash attention.
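
For example (a sketch only; the container tag is illustrative, pick a current one from the NGC catalogue):

# start an official NVIDIA PyTorch container with GPU access
docker run --gpus all -it nvcr.io/nvidia/pytorch:24.04-py3

# inside the container, the toolchain for building flash attention is already present
pip install flash-attn --no-build-isolation
pip install bacformer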

Usage

Below are examples of how to use Bacformer to compute contextualised protein embeddings.

Computing contextual protein embeddings on a set of toy protein sequences

import torch
from transformers import AutoModel
from bacformer.pp import protein_seqs_to_bacformer_inputs

device = "cuda:0"
model = AutoModel.from_pretrained(
    "macwiatrak/bacformer-large-masked-MAG", trust_remote_code=True
).to(device).eval().to(torch.bfloat16)

# Example input: a sequence of protein sequences
# (in this case, 4 toy protein sequences).
# Bacformer was trained with a maximum of 6000 proteins per genome.
protein_sequences = [
    "MGYDLVAGFQKNVRTI",
    "MKAILVVLLG",
    "MQLIESRFYKDPWGNVHATC",
    "MSTNPKPQRFAWL",
]
# embed the proteins with a protein language model to get average protein embeddings
inputs = protein_seqs_to_bacformer_inputs(
    protein_sequences,
    device=device,
    batch_size=128,  # the batch size for computing the protein embeddings
    max_n_proteins=6000,  # the maximum number of proteins Bacformer was trained with
    bacformer_model_type="large", # must be equal to "large" (Bacformer Large 300M) or "base" (Bacformer 26M)
)

# compute contextualised protein embeddings with Bacformer
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

# (batch_size, n_prots + special tokens or max_n_proteins, embedding_dim)
print('last hidden state shape:', outputs["last_hidden_state"].shape)
# (batch_size, embedding_dim)
print('genome embedding shape:', outputs.last_hidden_state.mean(dim=1).shape)
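
If you need a single vector per genome (e.g. for comparing strains), you can mean-pool the contextualised embeddings as above and compare genomes directly; a minimal sketch building on the outputs object from this example:

import torch.nn.functional as F

# mean-pool over the protein dimension to get one embedding per genome
genome_emb = outputs.last_hidden_state.mean(dim=1)  # (batch_size, embedding_dim)

# pairwise cosine similarity between the genome embeddings in the batch
sim = F.cosine_similarity(genome_emb.unsqueeze(1), genome_emb.unsqueeze(0), dim=-1)
print('genome-genome similarity matrix shape:', sim.shape)  # (batch_size, batch_size)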

Encoding contig/chromosome/plasmid information

Bacformer encodes contig/chromosome/plasmid information by adding a contig embedding to each protein. For the model to account for it, pass the contig ID of each protein as input. By default, Bacformer incorporates contig information whenever it is available.

NOTE: Every input protein sequence must belong to a contig; otherwise, all proteins are treated as belonging to a single contig.

import torch
from transformers import AutoModel
from bacformer.pp import protein_seqs_to_bacformer_inputs

device = "cuda:0"
model = AutoModel.from_pretrained(
    "macwiatrak/bacformer-large-masked-MAG", trust_remote_code=True
).to(device).eval().to(torch.bfloat16)

# Example input: a sequence of protein sequences
# (in this case, 4 toy protein sequences).
# Bacformer was trained with a maximum of 6000 proteins per genome.
protein_sequences = [
    "MGYDLVAGFQKNVRTI",
    "MKAILVVLLG",
    "MQLIESRFYKDPWGNVHATC",
    "MSTNPKPQRFAWL",
]
# contig IDs for each protein sequence
contig_ids = ["contig_1", "contig_1", "contig_2", "contig_3"]
# NOTE: equivalently, you can indicate the contig IDs by representing the protein_sequences list as a nested list
# i.e. protein_sequences = [["MGYDLVAGFQKNVRTI", "MKAILVVLLG"], ["MQLIESRFYKDPWGNVHATC"], ["MSTNPKPQRFAWL"]]
# and skipping the contig_ids argument below (defaults to None)

# embed the proteins with a protein language model to get average protein embeddings
inputs = protein_seqs_to_bacformer_inputs(
    protein_sequences,
    contig_ids=contig_ids,
    device=device,
    batch_size=128,  # the batch size for computing the protein embeddings
    max_n_proteins=6000,  # the maximum number of proteins Bacformer was trained with
    bacformer_model_type="large",
)

# contig_ids represent the contig information for each protein
print(inputs['contig_ids'])
# compute contextualised protein embeddings with Bacformer
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

Processing and embedding a whole bacterial genome

Process a whole bacterial genome assembly from GenBank (in this case, the Pseudomonas aeruginosa PAO1 genome) and compute contextualised protein embeddings with Bacformer.

import torch
from transformers import AutoModel
from bacformer.pp import preprocess_genome_assembly, protein_seqs_to_bacformer_inputs


# preprocess a bacterial genome assembly
genome_info = preprocess_genome_assembly(filepath="files/pao1.gbff")

# load the model
device = "cuda:0"
model = AutoModel.from_pretrained(
    "macwiatrak/bacformer-large-masked-complete-genomes", trust_remote_code=True
).to(device).eval().to(torch.bfloat16)


# embed the proteins with a protein language model to get average protein embeddings; takes <1 min on an A100 GPU
inputs = protein_seqs_to_bacformer_inputs(
    genome_info['protein_sequence'],
    device=device,
    batch_size=128,  # the batch size for computing the protein embeddings
    max_n_proteins=6000,  # the maximum number of proteins Bacformer was trained with
    bacformer_model_type="large", # must be equal to "large" (Bacformer Large 300M) or "base" (Bacformer 26M)
)

# compute contextualised protein embeddings with Bacformer
with torch.no_grad():
    outputs = model(**inputs, return_dict=True)

# the resulting contextualised protein embeddings can be used for analysis
print('last hidden state shape:', outputs["last_hidden_state"].shape)
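
If you want to reuse the embeddings later (e.g. in a separate analysis notebook), you can move them to CPU and save them to disk; a minimal sketch:

# detach from the graph, cast to float32 for portability and save to disk
embeddings = outputs["last_hidden_state"].detach().float().cpu()
torch.save(embeddings, "pao1_bacformer_embeddings.pt")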

Embedding a dataset column with Bacformer

Use Bacformer to embed a column of protein sequences from a HuggingFace dataset. The example below can easily be adapted to a pandas DataFrame or any other data structure containing protein sequences.

Below we show how to compute contextualised protein embeddings for all proteins in a genome (as required for operon prediction), and how to compute a single embedding per genome (as required for strain clustering).

from bacformer.pp import embed_dataset_col
from datasets import load_dataset


# load the operon dataset from long-read RNA sequencing
operon_dataset = load_dataset("macwiatrak/operon-identification-long-read-rna-sequencing", split="test")

# embed the protein sequences with Bacformer
# we compute contextualised protein embeddings for all proteins in the genome
operon_dataset = embed_dataset_col(
    dataset=operon_dataset,
    model_path="macwiatrak/bacformer-masked-complete-genomes",
    max_n_proteins=9000,
    genome_pooling_method=None,  # set to None to get embeddings for all proteins in the genome
    model_type="bacformer",  # for Bacformer 26M model, use "bacformer_large" for Bacformer Large 300M
)


# load the strain clustering toy dataset
strain_clustering_dataset = load_dataset("macwiatrak/strain-clustering-protein-sequences-sample", split="train")

# embed the protein sequences with Bacformer
# use mean genome pooling as we need a single genome embedding for each genome for clustering
strain_clustering_dataset = embed_dataset_col(
    dataset=strain_clustering_dataset,
    model_path="macwiatrak/bacformer-large-masked-MAG",
    max_n_proteins=9000,
    genome_pooling_method="mean",
    model_type="bacformer_large",  # for Bacformer 300M model, use "bacformer" for Bacformer Large 26M
)

# convert to pandas and print the first 5 rows
strain_clustering_df = strain_clustering_dataset.to_pandas()
strain_clustering_df.head()
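
From here the pooled genome embeddings can be clustered with any standard method. Below is a sketch using scikit-learn KMeans; note that the embedding column name ("embeddings") and the number of clusters are assumptions for illustration, so check which column embed_dataset_col actually adds to the dataset:

import numpy as np
from sklearn.cluster import KMeans

# stack the per-genome embeddings into a (n_genomes, embedding_dim) matrix
# NOTE: the column name "embeddings" is an assumption, inspect the dataframe to confirm
X = np.stack(strain_clustering_df["embeddings"].to_list())

# cluster into strain groups (k=5 chosen purely for illustration)
strain_clustering_df["cluster"] = KMeans(n_clusters=5, random_state=0, n_init="auto").fit_predict(X)
strain_clustering_df["cluster"].value_counts()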

Tutorials

We provide a set of tutorials to help you get started with Bacformer.

We are actively working on more tutorials, so stay tuned! If you have any suggestions for tutorials, please let us know by raising an issue in the issue tracker.

HuggingFace

Bacformer is integrated with HuggingFace.

import torch
from transformers import AutoModel, AutoModelForMaskedLM, AutoModelForCausalLM

device = "cuda:0"
# load the Bacformer Large model trained on complete genomes with a masked objective
masked_large_model = AutoModelForMaskedLM.from_pretrained("macwiatrak/bacformer-large-masked-complete-genomes", trust_remote_code=True).to(torch.bfloat16).eval().to(device)

# load the Bacformer model trained on MAGs with an autoregressive objective
causal_model = AutoModelForCausalLM.from_pretrained("macwiatrak/bacformer-causal-MAG", trust_remote_code=True).to(torch.bfloat16).eval().to(device)

# load the Bacformer model trained on MAGs with a masked objective
masked_model = AutoModelForMaskedLM.from_pretrained("macwiatrak/bacformer-masked-MAG", trust_remote_code=True).to(torch.bfloat16).eval().to(device)

# load the Bacformer encoder model finetuned on complete genomes (i.e. without the protein family classification head)
# we recommend this model as a starting point for finetuning on your own complete-genome dataset for all tasks except generation
encoder_model = AutoModel.from_pretrained("macwiatrak/bacformer-large-masked-complete-genomes", trust_remote_code=True).to(torch.bfloat16).eval().to(device)

Pretrained model checkpoints

We provide a range of pretrained model checkpoints for Bacformer which are available via HuggingFace.

Each entry lists the checkpoint name, the genome type it was trained on (MAG = metagenome-assembled genomes; complete = complete, i.e. uninterrupted, genomes), and a description.

  • bacformer-large-masked-MAG (MAG): A 300M parameter model pretrained on ~1.3M metagenome-assembled genomes (MAGs) with a masked objective, randomly masking 15% of the proteins.
  • bacformer-large-masked-complete-genomes (complete): A 300M parameter model pretrained on ~1.3M MAGs with a masked objective, randomly masking 15% of the proteins.
  • bacformer-causal-MAG (MAG): A 26M parameter model pretrained on ~1.3M MAGs with an autoregressive objective.
  • bacformer-masked-MAG (MAG): A 26M parameter model pretrained on ~1.3M MAGs with a masked objective, randomly masking 15% of the proteins.
  • bacformer-causal-complete-genomes (complete): bacformer-causal-MAG finetuned on a set of ~40k complete genomes with an autoregressive objective.
  • bacformer-masked-complete-genomes (complete): bacformer-masked-MAG finetuned on a set of ~40k complete genomes with a masked objective, randomly masking 15% of the proteins.
  • bacformer-causal-protein-family-modeling-complete-genomes (complete): bacformer-causal-MAG finetuned on a set of ~40k complete genomes with an autoregressive objective. In contrast to the other models, it takes protein-family tokens as input rather than protein sequences, allowing generation of sequences of protein families.
  • bacformer-for-essential-genes-prediction (complete): bacformer-masked-complete-genomes finetuned on the essential gene prediction task.

Contributing

We welcome contributions to Bacformer! If you would like to contribute, please follow these steps:

  1. Fork the repository.
  2. Install pre-commit and set up the pre-commit hooks (make sure to do it at the root of the repository).
pip install pre-commit
pre-commit install
  3. Create a new branch for your feature or bug fix.
  4. Make your changes and commit them.
  5. Push your changes to your forked repository.
  6. Create a pull request to the main repository.
  7. Make sure to add tests for your changes and run the tests to ensure everything is working correctly (see the sketch below).
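
Before opening the pull request, you can run the hooks and tests locally; a minimal sketch (assuming the test suite uses pytest):

# run all pre-commit hooks against the full repository
pre-commit run --all-files

# run the test suite (assumes pytest)
pytest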

Contact

For questions, bugs, and feature requests, please raise an issue in the repository.

Citation

@article{Wiatrak2025.07.20.665723,
	author = {Wiatrak, Maciej and Vi{\~n}as Torn{\'e}, Ramon and Ntemourtsidou, Maria and Dinan, Adam M. and Abelson, David C. and Arora, Divya and Brbi{\'c}, Maria and Weimann, Aaron and Floto, Rodrigo Andres},
	title = {A contextualised protein language model reveals the functional syntax of bacterial evolution},
	elocation-id = {2025.07.20.665723},
	year = {2025},
	doi = {10.1101/2025.07.20.665723},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/07/20/2025.07.20.665723},
	eprint = {https://www.biorxiv.org/content/early/2025/07/20/2025.07.20.665723.full.pdf},
	journal = {bioRxiv}
}
