Embedding-first deep learning multiple sequence alignment engine with affine-gap DP
Project description
BABAPPAlign
Overview
BABAPPAlign is an embedding-first progressive multiple sequence alignment (MSA) engine for protein sequences. It integrates pretrained protein language model embeddings with a learned neural residue–residue scoring function within a classical, exact affine-gap dynamic programming framework (Gotoh).
The method is designed to improve alignment accuracy while maintaining methodological transparency and full reproducibility. BABAPPAlign is fully functional on CPU-only systems; GPU acceleration is optional and affects performance only, not correctness.
Key features
- Progressive multiple sequence alignment (MSA)
- Strict learned residue–residue scoring model (BABAPPAScore)
- Uses pretrained protein language model residue embeddings
- Column-aware profile scoring
- True affine-gap dynamic programming (Gotoh algorithm)
- Exact dynamic programming (no heuristics inside DP)
- Embedding inference performed outside DP
- Fully functional on CPU-only systems
- Optional GPU acceleration for faster embedding and scoring
- Explicit model specification (no silent fallback)
- Reproducible and Bioconda-compliant design
Installation
Install from PyPI
pip install babappalign
Install from Bioconda
conda install -c bioconda babappalign
This installs a CPU-compatible version of BABAPPAlign. No GPU, CUDA, or special hardware is required.
Quick start
babappalign input.fasta -o output.aln.fasta --model babappascore
Important:
BABAPPAlign requires an external trained neural scoring model. The model is not downloaded automatically and must be obtained explicitly (see below).
How BABAPPAlign works
-
Residue embedding
Each protein sequence is converted into residue-level embeddings using a pretrained protein language model. -
Learned residue scoring
Residue compatibility is evaluated using a pretrained neural scoring model (BABAPPAScore), replacing traditional substitution matrices. -
Progressive alignment
Sequences are progressively aligned using exact affine-gap dynamic programming (Gotoh). Neural inference is performed outside the DP recursion to preserve correctness.
The progressive ordering is a computational heuristic and is not interpreted as a phylogeny.
Model weights (required)
BABAPPAlign requires a trained neural residue-level scoring model (BABAPPAScore), which is distributed separately via Zenodo.
Concept DOI (all versions):
https://doi.org/10.5281/zenodo.18053200
Version-specific DOIs are provided on Zenodo for exact reproducibility.
Download and use
# 1. Download the model (one-time)
mkdir -p ~/.cache/babappalign/models
wget https://zenodo.org/record/18053201/files/babappascore.pt -O ~/.cache/babappalign/models/babappascore.pt
# 2a. Run BABAPPAlign using the cached model name (recommended)
babappalign input.fasta -o aligned.fasta --model babappascore
# 2b. OR run BABAPPAlign using an explicit model path (equivalent)
babappalign input.fasta -o aligned.fasta \
--model ~/.cache/babappalign/models/babappascore.pt
At runtime, BABAPPAlign prints the resolved model path and a SHA-256 checksum to ensure transparent and reproducible model usage.
CPU and GPU execution
BABAPPAlign produces identical alignments on CPU and GPU. GPU acceleration affects performance only.
| Component | CPU | GPU |
|---|---|---|
| Progressive alignment (DP) | Yes | Yes |
| Learned scoring | Yes | Yes |
| Embedding generation | Slower | Faster |
Input requirements
- Protein sequences only
- FASTA format
- No strict limits on sequence length or number (runtime depends on hardware)
Command-line interface
babappalign --help
Key options include:
-o, --output FILE: output alignment file--model MODEL: scoring model name or path (mandatory)--gap-open FLOAT: gap opening penalty--gap-extend FLOAT: gap extension penalty--device {cpu,cuda}: select execution device
License
MIT License. See the LICENSE file for details.
Citation
Manuscript in preparation.
Author and repository
- Author: Krishnendu Sinha
- GitHub: https://github.com/sinhakrishnendu/BABAPPAlign
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file babappalign-1.1.3.tar.gz.
File metadata
- Download URL: babappalign-1.1.3.tar.gz
- Upload date:
- Size: 13.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
126b9f3573d4117eadbf7ec51609e83b654571b3d280c513c1b6fc07cd184833
|
|
| MD5 |
40397962e327699e04804a52f3da4c5e
|
|
| BLAKE2b-256 |
d9208a823f40506781ec915d66042c90bd11e4f6930f3f7a3da708817797acb7
|
File details
Details for the file babappalign-1.1.3-py3-none-any.whl.
File metadata
- Download URL: babappalign-1.1.3-py3-none-any.whl
- Upload date:
- Size: 13.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.19
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2a0cd451a8972bcc88ef94203e8d83b41d74ab74b622c3063dfcdd7c341bcfc5
|
|
| MD5 |
6a42820935d81c1df9f5a7cc05b97bb8
|
|
| BLAKE2b-256 |
0aeeab975d7422426946db202242b7b5d095ae36c3573647d147fedd50f9c0fd
|