
Embedding-first deep learning multiple sequence alignment engine with affine-gap DP


BABAPPAlign


Overview

BABAPPAlign is an embedding-first progressive multiple sequence alignment (MSA) engine for protein sequences. It integrates pretrained protein language model embeddings with a learned neural residue–residue scoring function within a classical, exact affine-gap dynamic programming framework (Gotoh).

The method is designed to improve alignment accuracy while maintaining methodological transparency and full reproducibility. BABAPPAlign is fully functional on CPU-only systems; GPU acceleration is optional and affects performance only, not correctness.


Key features

  • Progressive multiple sequence alignment (MSA)
  • Strict learned residue–residue scoring model (BABAPPAScore)
  • Uses pretrained protein language model residue embeddings
  • Column-aware profile scoring
  • True affine-gap dynamic programming (Gotoh algorithm)
  • Exact dynamic programming (no heuristics inside DP)
  • Embedding inference performed outside DP
  • Fully functional on CPU-only systems
  • Optional GPU acceleration for faster embedding and scoring
  • Explicit model specification (no silent fallback)
  • Reproducible and Bioconda-compliant design

Installation

Install from PyPI

pip install babappalign

Install from Bioconda

conda install -c bioconda babappalign

This installs a CPU-compatible version of BABAPPAlign. No GPU, CUDA, or special hardware is required.


Quick start

babappalign input.fasta -o output.aln.fasta --model babappascore

Important:
BABAPPAlign requires an external trained neural scoring model. The model is not downloaded automatically and must be obtained explicitly (see below).


How BABAPPAlign works

  1. Residue embedding
    Each protein sequence is converted into residue-level embeddings using a pretrained protein language model.

  2. Learned residue scoring
    Residue compatibility is evaluated using a pretrained neural scoring model (BABAPPAScore), replacing traditional substitution matrices.

  3. Progressive alignment
    Sequences are progressively aligned using exact affine-gap dynamic programming (Gotoh). Neural inference is performed outside the DP recursion to preserve correctness.

The progressive ordering is a computational heuristic and is not interpreted as a phylogeny.
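
As an illustration of step 3, the following is a minimal sketch of pairwise global alignment with affine gaps (Gotoh) over a precomputed score matrix. It is not BABAPPAlign's actual implementation. The function names, the toy scorer standing in for BABAPPAScore, the penalty values, and the convention that a gap of length k costs gap_open + (k - 1) * gap_extend are all assumptions made for this example. The point it shows is that the score matrix is filled before the recursion starts, so the DP itself only reads numbers and stays exact.

```python
from typing import List, Optional, Tuple

NEG_INF = float("-inf")
Column = Tuple[Optional[int], Optional[int]]


def gotoh_align(score: List[List[float]], gap_open: float,
                gap_extend: float) -> Tuple[float, List[Column]]:
    """Global affine-gap alignment over a precomputed, non-empty score matrix.

    score[i][j] is the compatibility of residue i of A with residue j of B.
    Assumed convention: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    n, m = len(score), len(score[0])
    # Three DP layers: 0 = match, 1 = gap in B (A residue unpaired), 2 = gap in A.
    val = [[[NEG_INF] * (m + 1) for _ in range(n + 1)] for _ in range(3)]
    ptr = [[[0] * (m + 1) for _ in range(n + 1)] for _ in range(3)]
    val[0][0][0] = 0.0
    for i in range(1, n + 1):  # leading gap in B
        val[1][i][0] = -(gap_open + (i - 1) * gap_extend)
        ptr[1][i][0] = 1
    for j in range(1, m + 1):  # leading gap in A
        val[2][0][j] = -(gap_open + (j - 1) * gap_extend)
        ptr[2][0][j] = 2

    def best(cands: List[float]) -> Tuple[float, int]:
        # Return the best candidate value and the layer it came from.
        k = max(range(3), key=cands.__getitem__)
        return cands[k], k

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            v, k = best([val[s][i - 1][j - 1] for s in range(3)])
            val[0][i][j], ptr[0][i][j] = v + score[i - 1][j - 1], k
            val[1][i][j], ptr[1][i][j] = best([
                val[0][i - 1][j] - gap_open,
                val[1][i - 1][j] - gap_extend,
                val[2][i - 1][j] - gap_open])
            val[2][i][j], ptr[2][i][j] = best([
                val[0][i][j - 1] - gap_open,
                val[1][i][j - 1] - gap_open,
                val[2][i][j - 1] - gap_extend])

    # Traceback from the best final layer, following stored back-pointers.
    state = max(range(3), key=lambda s: val[s][n][m])
    total = val[state][n][m]
    i, j, cols = n, m, []
    while i > 0 or j > 0:
        prev = ptr[state][i][j]
        if state == 0:
            i, j = i - 1, j - 1
            cols.append((i, j))
        elif state == 1:
            i -= 1
            cols.append((i, None))
        else:
            j -= 1
            cols.append((None, j))
        state = prev
    return total, cols[::-1]


def toy_score_matrix(a: str, b: str) -> List[List[float]]:
    # Hypothetical stand-in for the neural scorer: +2 if identical, -1 otherwise.
    return [[2.0 if x == y else -1.0 for y in b] for x in a]


# Scores are computed once, outside the DP, then the DP runs on plain numbers.
score = toy_score_matrix("ACGT", "AGT")
total, columns = gotoh_align(score, gap_open=2.0, gap_extend=0.5)
```

Each entry of columns pairs an index into the first sequence with an index into the second, with None marking a gap. In a progressive aligner the same recursion would run on profile columns rather than single residues.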


Model weights (required)

BABAPPAlign requires a trained neural residue-level scoring model (BABAPPAScore), which is distributed separately via Zenodo.

Concept DOI (all versions):
https://doi.org/10.5281/zenodo.18053200

Version-specific DOIs are provided on Zenodo for exact reproducibility.

Download and use

# 1. Download the model (one-time)
mkdir -p ~/.cache/babappalign/models

wget https://zenodo.org/record/18053201/files/babappascore.pt \
  -O ~/.cache/babappalign/models/babappascore.pt

# 2a. Run BABAPPAlign using the cached model name (recommended)
babappalign input.fasta -o aligned.fasta --model babappascore

# 2b. OR run BABAPPAlign using an explicit model path (equivalent)
babappalign input.fasta -o aligned.fasta \
  --model ~/.cache/babappalign/models/babappascore.pt

At runtime, BABAPPAlign prints the resolved model path and a SHA-256 checksum to ensure transparent and reproducible model usage.
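
To cross-check the checksum printed at runtime against the file on disk, the digest can be recomputed independently. The snippet below is a small sketch using only the Python standard library. The helper name sha256_of_file is an assumption for this example and is not part of BABAPPAlign.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(Path(path).expanduser(), "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (path taken from the download step above):
# print(sha256_of_file("~/.cache/babappalign/models/babappascore.pt"))
```

If the value differs from the one BABAPPAlign prints, the run is using a different model file from the one inspected.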


CPU and GPU execution

BABAPPAlign produces identical alignments on CPU and GPU. GPU acceleration affects performance only.

Component                    CPU      GPU
Progressive alignment (DP)   Yes      Yes
Learned scoring              Yes      Yes
Embedding generation         Slower   Faster

Input requirements

  • Protein sequences only
  • FASTA format
  • No hard limit on sequence length or number of sequences (runtime depends on hardware)
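
A quick pre-flight check can catch nucleotide or malformed input before an alignment run. The following is a sketch only. The function name, the default alphabet of the 20 standard amino acids, and the rejection of duplicate identifiers are assumptions for this example. Which ambiguity codes (such as X, B, or Z) BABAPPAlign itself accepts is not stated here, so pass a wider allowed set if your data need it.

```python
from typing import Dict, FrozenSet, Iterable

# Assumption: only the 20 standard amino acids are accepted by default.
STANDARD_AA: FrozenSet[str] = frozenset("ACDEFGHIKLMNPQRSTVWY")


def read_protein_fasta(lines: Iterable[str],
                       allowed: FrozenSet[str] = STANDARD_AA) -> Dict[str, str]:
    """Parse FASTA lines into {identifier: sequence}, validating the alphabet."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            name = fields[0]  # identifier is the first token of the header
            if name in records:
                raise ValueError(f"duplicate identifier: {name}")
            records[name] = ""
        else:
            if name is None:
                raise ValueError("sequence data before the first header")
            seq = line.upper()
            bad = set(seq) - allowed
            if bad:
                raise ValueError(f"{name}: unexpected symbols {sorted(bad)}")
            records[name] += seq
    empty = [k for k, v in records.items() if not v]
    if empty:
        raise ValueError(f"records without sequence: {empty}")
    return records


# Usage: with open("input.fasta") as fh: records = read_protein_fasta(fh)
```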

Command-line interface

babappalign --help

Key options include:

  • -o, --output FILE : output alignment file
  • --model MODEL : scoring model name or path (mandatory)
  • --gap-open FLOAT : gap opening penalty
  • --gap-extend FLOAT : gap extension penalty
  • --device {cpu,cuda} : select execution device
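
For scripted or batch use, the options listed above can be assembled programmatically. The wrapper below is a sketch that uses only the flags documented in this section. The helper names build_command and run_alignment are assumptions for this example. Optional flags are passed only when a value is given, because the tool's default penalties and their sign convention are not stated here.

```python
import subprocess
from typing import List, Optional


def build_command(input_fasta: str, output: str, model: str,
                  gap_open: Optional[float] = None,
                  gap_extend: Optional[float] = None,
                  device: Optional[str] = None) -> List[str]:
    """Assemble a babappalign argument list from the documented options."""
    cmd = ["babappalign", input_fasta, "-o", output, "--model", model]
    if gap_open is not None:
        cmd += ["--gap-open", str(gap_open)]
    if gap_extend is not None:
        cmd += ["--gap-extend", str(gap_extend)]
    if device is not None:
        if device not in ("cpu", "cuda"):
            raise ValueError("device must be 'cpu' or 'cuda'")
        cmd += ["--device", device]
    return cmd


def run_alignment(**kwargs) -> None:
    # check=True raises CalledProcessError if babappalign exits with an error.
    subprocess.run(build_command(**kwargs), check=True)


# Usage (requires babappalign on PATH and a downloaded model):
# run_alignment(input_fasta="input.fasta", output="aligned.fasta",
#               model="babappascore", device="cpu")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.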

License

MIT License. See the LICENSE file for details.


Citation

Manuscript in preparation.

