Embedding-first deep learning multiple sequence alignment engine with affine-gap DP

BABAPPAlign

Overview

BABAPPAlign is an embedding-first progressive multiple sequence alignment (MSA) engine for protein sequences. It integrates pretrained protein language model embeddings with a learned neural residue–residue scoring function within a classical, exact affine-gap dynamic programming framework (Gotoh).

The method is designed to improve alignment accuracy while maintaining methodological transparency and full reproducibility. BABAPPAlign is fully functional on CPU-only systems; GPU acceleration is optional and affects performance only, not correctness.


Key features

  • Progressive multiple sequence alignment (MSA)
  • Strict learned residue–residue scoring model (BABAPPAScore)
  • Uses pretrained protein language model residue embeddings
  • Column-aware profile scoring
  • True affine-gap dynamic programming (Gotoh algorithm)
  • Exact dynamic programming (no heuristics inside DP)
  • Embedding inference performed outside DP
  • Fully functional on CPU-only systems
  • Optional GPU acceleration for faster embedding and scoring
  • Explicit model specification (no silent fallback)
  • Reproducible and Bioconda-compliant design
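
To make "column-aware profile scoring" concrete, the sketch below shows one common way to score two alignment columns: average a residue-residue score over every non-gap pair drawn from the two columns. The source does not say how BABAPPAlign combines scores across a column, so the averaging rule, the function name column_score, and the toy letter-based scorer (a stand-in for BABAPPAScore, which operates on embeddings) are all assumptions for illustration.

```python
from typing import Callable, Sequence


def column_score(
    col_a: Sequence[str],
    col_b: Sequence[str],
    pair_score: Callable[[str, str], float],
) -> float:
    """Average pair_score over all non-gap residue pairs from two columns.

    Assumption: plain averaging; a real profile scorer may weight sequences
    or penalise gaps inside the columns.
    """
    pairs = [(x, y) for x in col_a if x != "-" for y in col_b if y != "-"]
    if not pairs:
        return 0.0  # both columns contain only gaps
    return sum(pair_score(x, y) for x, y in pairs) / len(pairs)


def toy_score(x: str, y: str) -> float:
    """Hypothetical stand-in for a learned scorer: +2 match, -1 mismatch."""
    return 2.0 if x == y else -1.0


# One column from a 3-sequence profile against one from a 2-sequence profile.
print(column_score(["A", "A", "-"], ["A", "C"], toy_score))
```

Because each column score depends only on the two columns, all scores can be computed before dynamic programming starts.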

Installation

Install from PyPI

pip install babappalign

Install from Bioconda

conda install -c bioconda babappalign

Both methods install a CPU-compatible build of BABAPPAlign. No GPU, CUDA, or special hardware is required.


Quick start

babappalign input.fasta -o output.aln.fasta --model babappascore

Important:
BABAPPAlign requires an external trained neural scoring model. The model is not downloaded automatically and must be obtained explicitly (see "Model weights (required)" below).


How BABAPPAlign works

  1. Residue embedding
    Each protein sequence is converted into residue-level embeddings using a pretrained protein language model.

  2. Learned residue scoring
    Residue compatibility is evaluated using a pretrained neural scoring model (BABAPPAScore), replacing traditional substitution matrices.

  3. Progressive alignment
    Sequences are progressively aligned using exact affine-gap dynamic programming (Gotoh). Neural inference is performed outside the DP recursion to preserve correctness.

The progressive ordering is a computational heuristic and is not interpreted as a phylogeny.
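
The DP in step 3 can be illustrated with a minimal pairwise sketch of Gotoh's three-state recursion. This is not BABAPPAlign's implementation: the function name gotoh_align, the toy identity scorer, the gap values, and the convention that a gap of length k costs gap_open + (k - 1) * gap_extend are assumptions. The score matrix is filled before the recursion starts, mirroring the statement that neural inference happens outside the DP.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def _best_state(m: float, x: float, y: float) -> str:
    """Label of the largest value; ties prefer M, then X, then Y."""
    return max((("M", m), ("X", x), ("Y", y)), key=lambda item: item[1])[0]


def gotoh_align(
    a: str, b: str, score: List[List[float]], gap_open: float, gap_extend: float
) -> Tuple[float, str, str]:
    """Global alignment with affine gaps. score[i][j] is precomputed."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] against a gap; Y: b[j-1] against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = score[i - 1][j - 1] + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]
            )
            X[i][j] = max(
                M[i - 1][j] - gap_open, X[i - 1][j] - gap_extend, Y[i - 1][j] - gap_open
            )
            Y[i][j] = max(
                M[i][j - 1] - gap_open, X[i][j - 1] - gap_open, Y[i][j - 1] - gap_extend
            )

    # Traceback: re-derive the best predecessor state at each step.
    i, j = n, m
    best = max(M[n][m], X[n][m], Y[n][m])
    state = _best_state(M[n][m], X[n][m], Y[n][m])
    row_a: List[str] = []
    row_b: List[str] = []
    while i > 0 or j > 0:
        if state == "M":
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
            state = _best_state(M[i][j], X[i][j], Y[i][j])
        elif state == "X":
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
            state = _best_state(
                M[i][j] - gap_open, X[i][j] - gap_extend, Y[i][j] - gap_open
            )
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
            state = _best_state(
                M[i][j] - gap_open, X[i][j] - gap_open, Y[i][j] - gap_extend
            )
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# Toy scorer (hypothetical stand-in for BABAPPAScore): +2 match, -1 mismatch.
seq_a, seq_b = "ACDEFG", "ACFG"
toy = [[2.0 if x == y else -1.0 for y in seq_b] for x in seq_a]
best, row_a, row_b = gotoh_align(seq_a, seq_b, toy, gap_open=3.0, gap_extend=1.0)
print(best, row_a, row_b, sep="\n")
```

With this toy input the sketch opens a single two-residue gap (AC--FG under ACDEFG) instead of two separate gaps, which is the behaviour an affine penalty is meant to favour.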


Model weights (required)

BABAPPAlign requires a trained neural residue-level scoring model (BABAPPAScore), which is distributed separately via Zenodo.

Concept DOI (all versions):
https://doi.org/10.5281/zenodo.18053200

Version-specific DOIs are provided on Zenodo for exact reproducibility.

Download and use

# 1. Download the model (one-time)
mkdir -p ~/.cache/babappalign/models

wget https://zenodo.org/record/18053201/files/babappascore.pt \
  -O ~/.cache/babappalign/models/babappascore.pt

# 2a. Run BABAPPAlign using the cached model name (recommended)
babappalign input.fasta -o aligned.fasta --model babappascore

# 2b. OR run BABAPPAlign using an explicit model path (equivalent)
babappalign input.fasta -o aligned.fasta \
  --model ~/.cache/babappalign/models/babappascore.pt

At runtime, BABAPPAlign prints the resolved model path and a SHA-256 checksum to ensure transparent and reproducible model usage.
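
To check a downloaded model independently, the SHA-256 digest can be recomputed and compared with the value printed at runtime. The sketch below uses only the Python standard library. The lookup rule in resolve_model (an existing path is used as-is, otherwise "<name>.pt" is looked up in the cache directory) is inferred from the two equivalent commands above and is an assumption, as are both function names.

```python
import hashlib
from pathlib import Path

# Cache location taken from the download commands in this section.
CACHE_DIR = Path.home() / ".cache" / "babappalign" / "models"


def resolve_model(model: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Return the model file for a path or a cached model name.

    Assumed rule: raises instead of falling back to a default model.
    """
    candidate = Path(model).expanduser()
    if candidate.is_file():
        return candidate
    cached = cache_dir / f"{model}.pt"
    if cached.is_file():
        return cached
    raise FileNotFoundError(f"no model file found for {model!r}")


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A typical call would be sha256_of(resolve_model("babappascore")); the result can then be recorded next to the version-specific Zenodo DOI.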


CPU and GPU execution

BABAPPAlign produces identical alignments on CPU and GPU. GPU acceleration affects performance only.

Component                    CPU      GPU
Progressive alignment (DP)   Yes      Yes
Learned scoring              Yes      Yes
Embedding generation         Slower   Faster

Input requirements

  • Protein sequences only
  • FASTA format
  • No hard limit on sequence length or number of sequences; runtime and memory use grow with both and depend on the available hardware
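
Input can be checked against these requirements before a long run. The sketch below is a minimal FASTA reader that rejects characters outside a protein alphabet. The accepted alphabet (the 20 standard residues plus the ambiguity and rare codes B, J, O, U, X, Z) and the function name are assumptions; the source does not state which symbols BABAPPAlign accepts. An alphabet check cannot detect nucleotide input, because A, C, G, and T are also valid amino acid codes.

```python
from typing import Dict, Iterable, List

# Assumed alphabet: 20 standard residues plus B, J, O, U, X, Z.
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY" + "BJOUXZ")


def read_protein_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {identifier: sequence}, validating residues."""
    parts: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            name = fields[0]
            if name in parts:
                raise ValueError(f"duplicate identifier: {name}")
            parts[name] = []
        else:
            if name is None:
                raise ValueError("sequence data before the first header")
            parts[name].append(line.upper())

    records: Dict[str, str] = {}
    for ident, chunks in parts.items():
        seq = "".join(chunks)
        if not seq:
            raise ValueError(f"empty sequence: {ident}")
        bad = sorted(set(seq) - PROTEIN_ALPHABET)
        if bad:
            raise ValueError(f"non-protein characters in {ident}: {bad}")
        records[ident] = seq
    return records
```

For a file on disk, the function can be called as read_protein_fasta(open("input.fasta")), since a file object iterates over its lines.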

Command-line interface

babappalign --help

Key options include:

  • -o, --output FILE : output alignment file
  • --model MODEL : scoring model name or path (mandatory)
  • --gap-open FLOAT : gap opening penalty
  • --gap-extend FLOAT : gap extension penalty
  • --device {cpu,cuda} : select execution device
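
When BABAPPAlign is called from a pipeline, the options above can be assembled into an argument list and passed to subprocess. The sketch below uses only the flags listed in this section; the helper name build_command and the gap values in the example are illustrative and are not BABAPPAlign's defaults, which the source does not state.

```python
import shlex
from typing import List, Optional


def build_command(
    input_fasta: str,
    output: str,
    model: str,
    gap_open: Optional[float] = None,
    gap_extend: Optional[float] = None,
    device: Optional[str] = None,
) -> List[str]:
    """Build a babappalign argument list from the documented options."""
    if device is not None and device not in ("cpu", "cuda"):
        raise ValueError("device must be 'cpu' or 'cuda'")
    cmd = ["babappalign", input_fasta, "-o", output, "--model", model]
    # Optional flags are added only when set, so the tool's defaults apply otherwise.
    if gap_open is not None:
        cmd += ["--gap-open", str(gap_open)]
    if gap_extend is not None:
        cmd += ["--gap-extend", str(gap_extend)]
    if device is not None:
        cmd += ["--device", device]
    return cmd


cmd = build_command(
    "input.fasta", "aligned.fasta", "babappascore",
    gap_open=10.0, gap_extend=0.5, device="cpu",  # illustrative values
)
print(shlex.join(cmd))
```

The list can be executed with subprocess.run(cmd, check=True), which raises an exception if the aligner exits with a non-zero status.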

License

MIT License. See the LICENSE file for details.


Citation

Manuscript in preparation.

