Embedding-first deep learning multiple sequence alignment engine with affine-gap DP

Project description

BABAPPAlign

Overview

BABAPPAlign is an embedding-first progressive multiple sequence alignment (MSA) engine for protein sequences. It integrates pretrained protein language model embeddings with a learned neural residue–residue scoring function within a classical, exact affine-gap dynamic programming framework.

The method is designed to improve alignment accuracy while remaining fully functional on CPU-only systems. GPU acceleration is optional and affects performance only, not correctness.


Key features

  • Progressive multiple sequence alignment (MSA)
  • Learned residue–residue scoring model (BABAPPAScore)
  • Uses pretrained ESM2 residue embeddings
  • Data-driven guide tree construction using Neighbor Joining (NJ)
  • Optional residue-level bootstrap with majority-rule consensus topology
  • True affine-gap dynamic programming (Gotoh algorithm)
  • Symmetric profile–profile alignment
  • Fully functional on CPU-only systems
  • Optional GPU acceleration for faster embedding generation and scoring
  • Automatic caching of model weights
  • Distributed via Bioconda

Installation

Install from Bioconda (recommended)

conda install -c bioconda babappalign

This installs a CPU-compatible version of BABAPPAlign. No GPU, CUDA, or special hardware is required.


Quick start

Basic usage

babappalign input.fasta -o output.aln.fasta

On first use, the pretrained scoring model is downloaded automatically.


How BABAPPAlign works

  1. Residue embedding
    Each protein sequence is converted into residue-level embeddings using a pretrained ESM2 model.

  2. Guide tree construction
    Sequence-level embeddings are obtained by pooling residue embeddings. Pairwise distances are defined using cosine dissimilarity, and a guide tree is inferred using the Neighbor Joining (NJ) algorithm. Optionally, residue-level bootstrapping can be used to construct a majority-rule consensus tree.

  3. Learned residue scoring
    Residue compatibility is evaluated using a pretrained neural scoring model (BABAPPAScore), which replaces traditional substitution matrices.

  4. Progressive alignment
    Sequences and profiles are progressively aligned following the guide tree using exact affine-gap dynamic programming (Gotoh), with symmetric profile–profile alignment.
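
The affine-gap recurrence behind step 4 can be sketched for the simplest case of two single sequences. This is a minimal illustration, not BABAPPAlign's implementation: the score matrix `S` stands in for the output of the learned scoring model, the function name `gotoh_score` is hypothetical, and the convention that a gap of length k costs `gap_open + (k - 1) * gap_extend` is an assumption. Only the optimal score is returned; traceback and profile handling are omitted.

```python
import math

def gotoh_score(S, gap_open=10.0, gap_extend=1.0):
    """Optimal global alignment score with affine gaps (Gotoh), score only.

    S[i][j] is the match score of residue i in sequence A against residue j
    in sequence B. Gap penalties are positive numbers that are subtracted.
    Assumed convention: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n = len(S)
    m = len(S[0]) if n else 0
    neg = -math.inf
    # M: last column aligns A[i-1] with B[j-1]
    # X: last column aligns A[i-1] with a gap
    # Y: last column aligns a gap with B[j-1]
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = S[i - 1][j - 1] + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]
            )
            # Opening a gap pays gap_open; continuing one pays gap_extend.
            X[i][j] = max(
                M[i - 1][j] - gap_open,
                X[i - 1][j] - gap_extend,
                Y[i - 1][j] - gap_open,
            )
            Y[i][j] = max(
                M[i][j - 1] - gap_open,
                Y[i][j - 1] - gap_extend,
                X[i][j - 1] - gap_open,
            )
    return max(M[n][m], X[n][m], Y[n][m])
```

Keeping three matrices is what makes the gap model truly affine: the algorithm remembers whether the previous column already ended in a gap, so extending it is charged differently from opening a new one.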

The guide tree is used as a computational heuristic and is not interpreted as a phylogeny.
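
The distance computation in step 2 can be sketched as follows. This is an illustration under stated assumptions: the README says only that residue embeddings are "pooled", so mean pooling is assumed here, and the helper names (`mean_pool`, `cosine_dissimilarity`, `distance_matrix`) are hypothetical. The toy two-dimensional vectors stand in for real ESM2 embeddings, and the Neighbor Joining step that would consume the matrix is not shown.

```python
import math

def mean_pool(residue_embeddings):
    """Average per-residue vectors into one sequence-level vector (assumed pooling)."""
    length = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / length for d in range(dim)]

def cosine_dissimilarity(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(pooled):
    """Symmetric pairwise dissimilarity matrix with a zero diagonal."""
    n = len(pooled)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = cosine_dissimilarity(pooled[i], pooled[j])
    return D

# Toy residue embeddings for three sequences of different lengths.
toy = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 3.0]],
    [[1.0, 1.0]],
]
D = distance_matrix([mean_pool(seq) for seq in toy])
```

A matrix of this shape is the usual input to a Neighbor Joining routine.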


Model weights and automatic download

BABAPPAlign relies on a pretrained neural residue–residue scoring model (babappascore.pt). Due to its size, the model weights are not bundled with the software package.

Automatic model retrieval

When BABAPPAlign is run for the first time, the pretrained scoring model is automatically downloaded from the official GitHub release corresponding to the installed version. The model file is cached locally and reused for subsequent runs.

No manual download or configuration is required.

Cache location

By default, the model is stored under the user cache directory:

~/.cache/babappalign/models/babappascore.pt

The cache location follows the XDG base directory specification where applicable.
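
To locate the cached file from a script, the default path can be resolved as in the sketch below. The function `default_model_path` is hypothetical, and the assumption that BABAPPAlign honors `XDG_CACHE_HOME` is inferred from the sentence above rather than from the source code. The path components are the ones shown in this section.

```python
import os
from pathlib import Path

def default_model_path(env=None):
    """Resolve the expected cache path of babappascore.pt.

    Follows the XDG rule: use $XDG_CACHE_HOME when it is set and non-empty,
    otherwise fall back to ~/.cache. Whether BABAPPAlign applies exactly
    this rule is an assumption.
    """
    env = os.environ if env is None else env
    xdg = env.get("XDG_CACHE_HOME")
    base = Path(xdg) if xdg else Path.home() / ".cache"
    return base / "babappalign" / "models" / "babappascore.pt"

path = default_model_path()
print(path, "(present)" if path.exists() else "(not downloaded yet)")
```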

Offline and custom models

Users may optionally supply a local model file:

babappalign input.fasta -o output.aln.fasta --model /path/to/babappascore.pt

This is useful for offline environments, custom-trained models, or reproducibility experiments.


CPU and GPU execution

BABAPPAlign produces identical alignments on CPU and GPU. GPU acceleration is used only to improve performance.

Component                     CPU       GPU
Guide tree construction       Yes       Yes
Progressive alignment (DP)    Yes       Yes
Learned scoring               Yes       Yes
Embedding generation          Slower    Faster

Input requirements

  • Protein sequences only
  • FASTA format
  • No hard limit on sequence length or on the number of sequences (runtime depends on hardware)
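
A quick pre-flight check of an input file might look like the sketch below. It is not part of BABAPPAlign: the helper names are hypothetical, and the accepted alphabet (the 20 standard amino acids plus B, Z, X, U and O) is an assumption about what the tool tolerates.

```python
# Assumed alphabet: 20 standard amino acids plus common ambiguity/rare codes.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYBZXUO")

def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif records:
            records[-1][1] += line.upper()
        else:
            raise ValueError("sequence data before first FASTA header")
    return [(name, seq) for name, seq in records]

def non_protein_records(records):
    """Names of records that are empty or contain characters outside the alphabet.

    Note: a nucleotide sequence such as ACGT passes this check, because
    A, C, G and T are also valid amino-acid letters.
    """
    return [name for name, seq in records if not seq or set(seq) - PROTEIN_ALPHABET]

example = ">a\nMKV\nLLA\n>b\nMK1V\n"
records = parse_fasta(example)
print(non_protein_records(records))
```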

Command-line interface

babappalign --help

Key options include:

  • -o, --output FILE : output alignment file
  • --model FILE : use a local scoring model
  • --bootstrap N : number of bootstrap replicates for guide tree construction
  • --gap-open FLOAT : gap opening penalty
  • --gap-extend FLOAT : gap extension penalty
  • --device {cpu,cuda} : select execution device
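
When driving the tool from a pipeline, the options above can be assembled programmatically. The sketch below uses only the flags listed in this section; the wrapper `build_command` is hypothetical and the example values (100 replicates, CPU device) are placeholders, not recommended settings.

```python
import shlex

def build_command(input_fasta, output, model=None, bootstrap=None,
                  gap_open=None, gap_extend=None, device=None):
    """Build a babappalign argument list from the documented options."""
    cmd = ["babappalign", input_fasta, "-o", output]
    optional = [
        ("--model", model),
        ("--bootstrap", bootstrap),
        ("--gap-open", gap_open),
        ("--gap-extend", gap_extend),
        ("--device", device),
    ]
    for flag, value in optional:
        if value is not None:  # omit flags the caller did not set
            cmd += [flag, str(value)]
    return cmd

cmd = build_command("input.fasta", "output.aln.fasta", bootstrap=100, device="cpu")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with file paths that contain spaces.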

License

MIT License. See the LICENSE file for details.


Citation

Manuscript in preparation.


Author and repository
