
Embedding-first deep learning multiple sequence alignment engine with affine-gap DP


BABAPPAlign


Overview

BABAPPAlign is an embedding-first progressive multiple sequence alignment (MSA) engine for protein sequences. It integrates pretrained protein language model embeddings with a learned neural residue–residue scoring function within a classical, exact affine-gap dynamic programming framework (Gotoh).

The method is designed to improve alignment accuracy while maintaining methodological transparency and full reproducibility. BABAPPAlign is fully functional on CPU-only systems; GPU acceleration is optional and affects performance only, not correctness.


Key features

  • Progressive multiple sequence alignment (MSA)
  • Strict learned residue–residue scoring model (BABAPPAScore)
  • Uses pretrained protein language model residue embeddings
  • Column-aware profile scoring
  • True affine-gap dynamic programming (Gotoh algorithm)
  • Exact dynamic programming (no heuristics inside DP)
  • Embedding inference performed outside DP
  • Fully functional on CPU-only systems
  • Optional GPU acceleration for faster embedding and scoring
  • Explicit model specification (no silent fallback)
  • Reproducible and Bioconda-compliant design
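One way to read "column-aware profile scoring" is that two alignment columns are scored by averaging residue-pair scores over all non-gap pairs drawn from them. The sketch below illustrates that idea only. The function `column_score`, the toy `identity_score`, and the averaging rule are assumptions for illustration; BABAPPAlign scores embeddings with its learned model, and its actual aggregation rule is not described here.

```python
def column_score(col_a, col_b, pair_score, gap="-"):
    """Average pair_score over every non-gap residue pair from two columns.

    col_a, col_b: residues found in one column of each profile (hypothetical
    representation; the real tool works on embeddings, not letters).
    """
    total = 0.0
    count = 0
    for a in col_a:
        if a == gap:
            continue
        for b in col_b:
            if b == gap:
                continue
            total += pair_score(a, b)
            count += 1
    # A column pair with no residue-residue pairs contributes nothing.
    return total / count if count else 0.0


def identity_score(a, b):
    """Toy stand-in for a learned scorer: 1 for identical residues, else 0."""
    return 1.0 if a == b else 0.0


if __name__ == "__main__":
    print(column_score("AA-", "AG", identity_score))
```

Averaging keeps the column score on the same scale as a single pairwise score, so the same gap penalties remain meaningful as profiles grow.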

Installation

Install from PyPI

pip install babappalign

Install from Bioconda

conda install -c bioconda babappalign

This installs a CPU-compatible version of BABAPPAlign. No GPU, CUDA, or special hardware is required.


Quick start

babappalign input.fasta -o output.aln.fasta --model babappascore

Important:
BABAPPAlign requires an external trained neural scoring model. The model is not downloaded automatically and must be obtained explicitly (see below).


How BABAPPAlign works

  1. Residue embedding
    Each protein sequence is converted into residue-level embeddings using a pretrained protein language model.

  2. Learned residue scoring
    Residue compatibility is evaluated using a pretrained neural scoring model (BABAPPAScore), replacing traditional substitution matrices.

  3. Progressive alignment
    Sequences are progressively aligned using exact affine-gap dynamic programming (Gotoh). Neural inference is performed outside the DP recursion to preserve correctness.

The progressive ordering is a computational heuristic and should not be interpreted as a phylogeny.
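The separation described in step 3 can be sketched as follows: all residue–residue scores are computed first and stored in a matrix, and the Gotoh recursion then only reads from that matrix. This is a minimal illustration, not BABAPPAlign's code. The function name `gotoh_score`, the default penalties, and the gap cost convention (a gap of length k costs gap_open + (k - 1) * gap_extend) are assumptions.

```python
import math


def gotoh_score(S, gap_open=10.0, gap_extend=1.0):
    """Optimal global alignment score under affine gaps (Gotoh).

    S[i][j] is the precomputed score for aligning residue i of sequence A
    with residue j of sequence B. No model is called inside the recursion.
    """
    n = len(S)
    m = len(S[0]) if n else 0
    neg = -math.inf
    # M: last column aligns two residues.
    # X: last column aligns a residue of A with a gap.
    # Y: last column aligns a gap with a residue of B.
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = S[i - 1][j - 1] + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]
            )
            X[i][j] = max(
                M[i - 1][j] - gap_open,    # open a new gap
                X[i - 1][j] - gap_extend,  # extend the current gap
                Y[i - 1][j] - gap_open,    # switch gap direction
            )
            Y[i][j] = max(
                M[i][j - 1] - gap_open,
                Y[i][j - 1] - gap_extend,
                X[i][j - 1] - gap_open,
            )
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    toy_scores = [[2, -1], [-1, 2]]  # made-up values, not model output
    print(gotoh_score(toy_scores))
```

Because the matrix is fixed before the recursion starts, the DP result depends only on the scores and the two gap penalties, which is what makes the alignment step exact and repeatable.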


Model weights (required)

BABAPPAlign requires a trained neural residue-level scoring model (BABAPPAScore), which is distributed separately via Zenodo.

Concept DOI (all versions):
https://doi.org/10.5281/zenodo.18053200

Version-specific DOIs are provided on Zenodo for exact reproducibility.

Download and use

# 1. Download the model (one-time)
mkdir -p ~/.cache/babappalign/models

wget https://zenodo.org/record/18053201/files/babappascore.pt \
  -O ~/.cache/babappalign/models/babappascore.pt

# 2a. Run BABAPPAlign using the cached model name (recommended)
babappalign input.fasta -o aligned.fasta --model babappascore

# 2b. OR run BABAPPAlign using an explicit model path (equivalent)
babappalign input.fasta -o aligned.fasta \
  --model ~/.cache/babappalign/models/babappascore.pt

At runtime, BABAPPAlign prints the resolved model path and a SHA-256 checksum to ensure transparent and reproducible model usage.
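To compare the printed checksum against a local file independently, the SHA-256 digest can be computed with Python's standard library. The helper name `sha256_of_file` is illustrative, and the path is the cache location used in the download step above; the reference digest itself should be taken from the Zenodo record and is not reproduced here.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks avoids loading a large weights file into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    model_path = Path.home() / ".cache" / "babappalign" / "models" / "babappascore.pt"
    if model_path.exists():
        print(model_path, sha256_of_file(model_path))
    else:
        print(f"Model file not found: {model_path}")
```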


CPU and GPU execution

BABAPPAlign produces identical alignments on CPU and GPU. GPU acceleration affects performance only.

Component                    CPU      GPU
Progressive alignment (DP)   Yes      Yes
Learned scoring              Yes      Yes
Embedding generation         Slower   Faster

Input requirements

  • Protein sequences only
  • FASTA format
  • No hard limit on sequence length or sequence count (runtime depends on input size and hardware)
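A quick pre-check of an input file against these requirements might look like the sketch below. The names `parse_fasta` and `non_protein_residues` are hypothetical, and the accepted alphabet (the 20 standard amino acids plus the ambiguity and rare codes B, J, O, U, X, Z) is an assumption; BABAPPAlign's own validation rules are not documented here.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
AMBIGUOUS_CODES = set("BJOUXZ")  # assumed to be acceptable; adjust as needed
ALLOWED = STANDARD_AMINO_ACIDS | AMBIGUOUS_CODES


def parse_fasta(lines):
    """Parse FASTA lines into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def non_protein_residues(sequence):
    """Return the sorted characters in sequence that fall outside ALLOWED."""
    return sorted(set(sequence.upper()) - ALLOWED)


if __name__ == "__main__":
    example = ">seq1 demo\nMKVLA\nGHW\n>seq2\nMKV1A\n"
    for name, seq in parse_fasta(example.splitlines()):
        print(name, len(seq), non_protein_residues(seq))
```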

Command-line interface

babappalign --help

Key options include:

  • -o, --output FILE : output alignment file
  • --model MODEL : scoring model name or path (mandatory)
  • --gap-open FLOAT : gap opening penalty
  • --gap-extend FLOAT : gap extension penalty
  • --device {cpu,cuda} : select execution device
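When BABAPPAlign is called from a script, the options listed above can be assembled into an argument list and passed to `subprocess`. The sketch below uses only the flags documented in this section. The helper names `build_command` and `run_alignment` are hypothetical, and the numeric penalties in the usage example are placeholders rather than recommended or default values.

```python
import subprocess


def build_command(input_fasta, output_fasta, model,
                  gap_open=None, gap_extend=None, device=None):
    """Build a babappalign argument list from the documented options."""
    cmd = ["babappalign", str(input_fasta), "-o", str(output_fasta),
           "--model", str(model)]
    # Optional flags are added only when a value is given, so the tool's
    # own defaults apply otherwise.
    if gap_open is not None:
        cmd += ["--gap-open", str(gap_open)]
    if gap_extend is not None:
        cmd += ["--gap-extend", str(gap_extend)]
    if device is not None:
        if device not in ("cpu", "cuda"):
            raise ValueError("device must be 'cpu' or 'cuda'")
        cmd += ["--device", device]
    return cmd


def run_alignment(*args, **kwargs):
    """Run babappalign and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Placeholder penalty values, shown only to illustrate the flag syntax.
    print(build_command("input.fasta", "aligned.fasta", "babappascore",
                        gap_open=10.0, gap_extend=1.0, device="cpu"))
```

Passing a list instead of a shell string avoids quoting problems with file paths that contain spaces.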

License

MIT License. See the LICENSE file for details.


Citation

Manuscript in preparation.


Author and repository
