Embedding-first deep learning multiple sequence alignment engine with affine-gap DP
BABAPPAlign
Overview
BABAPPAlign is an embedding-first progressive multiple sequence alignment (MSA) engine for protein sequences. It integrates pretrained protein language model embeddings with a learned neural residue–residue scoring function within a classical, exact affine-gap dynamic programming framework.
The method is designed to improve alignment accuracy while remaining fully functional on CPU-only systems. GPU acceleration is optional and affects performance only, not correctness.
Key features
- Progressive multiple sequence alignment (MSA)
- Learned residue–residue scoring model (BABAPPAScore)
- Uses pretrained ESM2 residue embeddings
- Data-driven guide tree construction using Neighbor Joining (NJ)
- Optional residue-level bootstrap with majority-rule consensus topology
- True affine-gap dynamic programming (Gotoh algorithm)
- Symmetric profile–profile alignment
- Fully functional on CPU-only systems
- Optional GPU acceleration for faster embedding generation and scoring
- Automatic caching of model weights
- Distributed via Bioconda
Installation
Install from Bioconda (recommended)
conda install -c bioconda babappalign
This installs a CPU-compatible version of BABAPPAlign. No GPU, CUDA, or special hardware is required.
Quick start
Basic usage
babappalign input.fasta -o output.aln.fasta
On first use, the pretrained scoring model is downloaded automatically.
How BABAPPAlign works
- Residue embedding: Each protein sequence is converted into residue-level embeddings using a pretrained ESM2 model.
- Guide tree construction: Sequence-level embeddings are obtained by pooling residue embeddings. Pairwise distances are defined using cosine dissimilarity, and a guide tree is inferred using the Neighbor Joining (NJ) algorithm. Optionally, residue-level bootstrapping can be used to construct a majority-rule consensus tree.
- Learned residue scoring: Residue compatibility is evaluated using a pretrained neural scoring model (BABAPPAScore), which replaces traditional substitution matrices.
- Progressive alignment: Sequences and profiles are progressively aligned following the guide tree using exact affine-gap dynamic programming (Gotoh), with symmetric profile–profile alignment.
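To make the affine-gap step concrete, here is a minimal sketch of the Gotoh recurrence for a single pair of sequences. It is not BABAPPAlign's implementation: it returns only the optimal global score, it does not handle profiles, and it takes a precomputed score matrix in place of the learned scorer. The function name `gotoh_score` and the convention that gap penalties are positive costs are assumptions made for this illustration.

```python
from typing import Sequence

NEG_INF = float("-inf")


def gotoh_score(scores: Sequence[Sequence[float]],
                gap_open: float, gap_extend: float) -> float:
    """Optimal global alignment score under affine gap penalties.

    scores[i][j] is the score for pairing residue i of sequence A with
    residue j of sequence B (hypothetical stand-in for a learned scorer).
    A gap of length k costs gap_open + (k - 1) * gap_extend (assumed convention).
    """
    n = len(scores)
    m = len(scores[0]) if n else 0

    # Three DP layers: M ends in a residue pair, X ends with a residue of A
    # against a gap, Y ends with a residue of B against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = scores[i - 1][j - 1] + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap pays gap_open; continuing one pays gap_extend.
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)

    return max(M[n][m], X[n][m], Y[n][m])
```

Keeping three layers is what makes the recurrence exact for affine gaps: the cost of a gap column depends on whether the previous column already had a gap in the same sequence, which a single-matrix recurrence cannot track.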
The guide tree is used as a computational heuristic and is not interpreted as a phylogeny.
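The distance computation that feeds the guide tree can be sketched in a few lines. This is an illustration under stated assumptions, not the package's code: the README says only that residue embeddings are "pooled", so mean pooling is assumed here, and the function names are hypothetical. Embeddings are plain lists of floats to keep the example dependency-free.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(residue_embeddings: List[Vector]) -> Vector:
    """Average per-residue vectors into one sequence-level vector (assumed pooling)."""
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(row[d] for row in residue_embeddings) / length for d in range(dim)]


def cosine_dissimilarity(u: Vector, v: Vector) -> float:
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(embeddings_per_sequence: List[List[Vector]]) -> List[List[float]]:
    """Symmetric pairwise distance matrix over pooled sequence embeddings."""
    pooled = [mean_pool(e) for e in embeddings_per_sequence]
    n = len(pooled)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_dissimilarity(pooled[i], pooled[j])
            dist[i][j] = dist[j][i] = d
    return dist
```

A matrix of this form is the standard input to Neighbor Joining; the tree-building step itself is omitted here.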
Model weights and automatic download
BABAPPAlign relies on a pretrained neural residue–residue scoring model (babappascore.pt).
Due to its size, the model weights are not bundled with the software package.
Automatic model retrieval
When BABAPPAlign is run for the first time, the pretrained scoring model is automatically downloaded from the official GitHub release corresponding to the installed version. The model file is cached locally and reused for subsequent runs.
No manual download or configuration is required.
Cache location
By default, the model is stored under the user cache directory:
~/.cache/babappalign/models/babappascore.pt
The cache location follows the XDG base directory specification where applicable.
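For scripts that need to check whether the model is already cached, the default path can be approximated as follows. This is a sketch of the usual XDG convention (`XDG_CACHE_HOME` if set, otherwise `~/.cache`), not the package's actual resolution logic; the function name `default_model_path` is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def default_model_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Approximate the cached model location using the XDG cache convention."""
    env = os.environ if env is None else env
    xdg_cache = env.get("XDG_CACHE_HOME")
    # Fall back to ~/.cache when XDG_CACHE_HOME is unset or empty.
    base = Path(xdg_cache) if xdg_cache else Path.home() / ".cache"
    return base / "babappalign" / "models" / "babappascore.pt"


if __name__ == "__main__":
    path = default_model_path()
    print(path, "(present)" if path.exists() else "(not downloaded yet)")
```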
Offline and custom models
Users may optionally supply a local model file:
babappalign input.fasta -o output.aln.fasta --model /path/to/babappascore.pt
This is useful for offline environments, custom-trained models, or reproducibility experiments.
CPU and GPU execution
BABAPPAlign produces identical alignments on CPU and GPU. GPU acceleration is used only to improve performance.
| Component | CPU | GPU |
|---|---|---|
| Guide tree construction | Yes | Yes |
| Progressive alignment (DP) | Yes | Yes |
| Learned scoring | Yes | Yes |
| Embedding generation | Slower | Faster |
Input requirements
- Protein sequences only
- FASTA format
- No hard limit on sequence length or number of sequences (runtime and memory use depend on input size and hardware)
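A quick pre-check can catch nucleotide or malformed input before a long run. The sketch below is an independent helper, not part of BABAPPAlign; the allowed alphabet (the 20 standard amino acids plus `X`) is an assumption, and the tool itself may accept a different set of symbols.

```python
from typing import Dict, List

# Assumed alphabet for illustration: 20 standard amino acids plus X (unknown).
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYX")


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first FASTA header")
        else:
            records[name].append(line.upper())
    return {header: "".join(parts) for header, parts in records.items()}


def suspicious_records(records: Dict[str, str]) -> List[str]:
    """Return headers whose sequences contain symbols outside ALLOWED."""
    return [name for name, seq in records.items() if set(seq) - ALLOWED]
```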
Command-line interface
babappalign --help
Key options include:
- -o, --output FILE: output alignment file
- --model FILE: use a local scoring model
- --bootstrap N: number of bootstrap replicates for guide tree construction
- --gap-open FLOAT: gap opening penalty
- --gap-extend FLOAT: gap extension penalty
- --device {cpu,cuda}: select execution device
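When calling the tool from a pipeline, the options listed above can be assembled programmatically. The wrapper below is a hypothetical convenience function, not an API shipped with the package; it only uses the flags named in this section, and any numeric values passed to it are the caller's choice rather than documented defaults.

```python
import subprocess
from typing import List, Optional


def build_command(input_fasta: str, output_fasta: str, *,
                  model: Optional[str] = None,
                  bootstrap: Optional[int] = None,
                  gap_open: Optional[float] = None,
                  gap_extend: Optional[float] = None,
                  device: Optional[str] = None) -> List[str]:
    """Build a babappalign argument list; unset options are left to the tool's defaults."""
    cmd = ["babappalign", input_fasta, "-o", output_fasta]
    if model is not None:
        cmd += ["--model", model]
    if bootstrap is not None:
        cmd += ["--bootstrap", str(bootstrap)]
    if gap_open is not None:
        cmd += ["--gap-open", str(gap_open)]
    if gap_extend is not None:
        cmd += ["--gap-extend", str(gap_extend)]
    if device is not None:
        cmd += ["--device", device]
    return cmd


def run_alignment(input_fasta: str, output_fasta: str, **options) -> None:
    """Run the aligner and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(build_command(input_fasta, output_fasta, **options), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.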
License
MIT License. See the LICENSE file for details.
Citation
Manuscript in preparation.
Author and repository
- Author: Krishnendu Sinha
- GitHub: https://github.com/sinhakrishnendu/BABAPPAlign
File details
Details for the file babappalign-1.1.0.tar.gz.
File metadata
- Download URL: babappalign-1.1.0.tar.gz
- Upload date:
- Size: 14.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6976330f34855390a338a71d94d34db74e98280109b4a27211c4bbd5cf117121 |
| MD5 | e592fef95c02fe70f59edfafa3f279f1 |
| BLAKE2b-256 | a24b7994c1d7a057ef8b4bc3dedd17ddef1863305d41827e70830493ed373900 |
File details
Details for the file babappalign-1.1.0-py3-none-any.whl.
File metadata
- Download URL: babappalign-1.1.0-py3-none-any.whl
- Upload date:
- Size: 13.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e19f92a7bba8e7b228b00ca01d414ebb0b84fbba11a82b08d71a0296e81bec46 |
| MD5 | 5de0cb46b23a18ab5fb33e0193bdcf58 |
| BLAKE2b-256 | 8f6dbac01da605a0bdd3b8f78c79dcf42c8bacc68b621eabc95159f0adbe6f78 |