Skip to main content

JIT-compiled string edit-distances including Needleman-Wunsch, Smith-Waterman, Wagner-Fisher, and Gotoh algorithms.

Project description

Affine Gaps Thumbnail

Affine Gaps is a less-wrong single-file Numba-accelerated Python implementation of Osamu Gotoh affine gap penalty extensions 1982 paper for the Needleman-Wunsch and Smith-Waterman algorithms often used for global and local sequence alignment in Bioinformatics. Thanks to the Numba JIT compiler, it's also competitive in terms of performance. But if you want to go even faster and need more hardware-accelerated string operations, check out StringZilla 🦖

Less Wrong

As reported in the "Are all global alignment algorithms and implementations correct?" paper by Tomas Flouri, Kassian Kobert, Torbjørn Rognes, and Alexandros Stamatakis:

In 1982 Gotoh presented an improved algorithm with lower time complexity. Gotoh’s algorithm is frequently cited... While implementing the algorithm, we discovered two mathematical mistakes in Gotoh’s paper that induce sub-optimal sequence alignments. First, there are minor indexing mistakes in the dynamic programming algorithm which become apparent immediately when implementing the procedure. Hence, we report on these for the sake of completeness. Second, there is a more profound problem with the dynamic programming matrix initialization. This initialization issue can easily be missed and find its way into actual implementations. This error is also present in standard text books. Namely, the widely used books by Gusfield and Waterman. To obtain an initial estimate of the extent to which this error has been propagated, we scrutinized freely available undergraduate lecture slides. We found that 8 out of 31 lecture slides contained the mistake, while 16 out of 31 simply omit parts of the initialization, thus giving an incomplete description of the algorithm. Finally, by inspecting ten source codes and running respective tests, we found that five implementations were incorrect.

During my exploration of exiting implementations, I've noticed several bugs:

  • several libraries initialize the header row/columns of penalty matrices with ±∞, causing overflows on the first iteration.
  • initialize matrices to zeros, ignoring the first gap opening cost.
  • combining opening and expansion costs where only the opening cost should be applied.
  • even the most correct needle from EMBOSS uses float representation, which would obviously be numerically unstable on very long sequences.

Installation

Numba is optional. Installing without it gives a pure-Python baseline; installing with the numba extra enables JIT acceleration when a compatible Numba is available.

uv pip install affine-gaps          # minimal
uv pip install 'affine-gaps[numba]' # with JIT

Even without installing Python or touching PyPi, you can just use uv to get the latest version of the library:

$ uv tool install git+https://github.com/ashvardanian/affine-gaps.git
$ affine-gaps --help

Using the Library

To obtain the alignment of two sequences, use the needleman_wunsch_gotoh_alignment function.

from affine_gaps import needleman_wunsch_gotoh_alignment

insulin = "GIVEQCCTSICSLYQLENYCN"
glucagon = "HSQGTFTSDYSKYLDSRAEQDFV"
aligned_insulin, aligned_glucagon, aligned_score = needleman_wunsch_gotoh_alignment(insulin, glucagon)

print("Alignment 1:", aligned_insulin)  # GI-V---EQCC-TSICSLY---QL-ENYCN-
print("Alignment 2:", aligned_glucagon) # --D-FVHSQGTFTSDYSKYLDSRAEQDF--V
print("Score:", aligned_score)          # 41

If you only need the alignment score, you can use the needleman_wunsch_gotoh_score function, which uses less memory and works faster.

from affine_gaps import needleman_wunsch_gotoh_score

score = needleman_wunsch_gotoh_score(insulin, glucagon)

print("Score:", score)

By default, a BLOSUM62 substitution matrix is used. You can specify a different substitution matrix by passing it as an argument.

from numpy import np

alphabet = "ARNDCQEGHILKMFPSTWYVBZX"
substitutions = np.zeros((len(alphabet), len(alphabet)), dtype=np.int8)
substitutions.fill(-1)
np.fill_diagonal(substitutions, 1)

aligned_insulin, aligned_glucagon, aligned_score = needleman_wunsch_gotoh_alignment(
    insulin, glucagon,
    substitution_alphabet=alphabet,
    substitution_matrix=substitutions,
    gap_opening=-2,
    gap_extension=-1,
)

That is similar to the following usage example of BioPython:

from Bio import Align
from Bio.Align import substitution_matrices

aligner = Align.PairwiseAligner(mode="global")
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = open_gap_score
aligner.extend_gap_score = extend_gap_score

Using the Command Line Interface

To compute the optimal global alignment of insulin and glucagon sequences with (5x-scaled) BLOSUM62 substitution matrix through CLI:

$ affine-gaps GIVEQCCTSICSLYQLENYCN HSQGTFTSDYSKYLDSRAEQDFV
>
> Sequence 1: GIVEQCCTSICSLYQLENYCN
> Sequence 2: HSQGTFTSDYSKYLDSRAEQDFV
>
> Alignment 1: GIVEQCCTSICSLY---QL-ENYCN-
> Alignment 2: GTF----TSDYSKYLDSRAEQDF--V
> Score:       22

To compute the local alignment of insulin and glucagon sequences through CLI:

$ affine-gaps GIVEQCCTSICSLYQLENYCN HSQGTFTSDYSKYLDSRAEQDFV --local
> 
> Sequence 1: GIVEQCCTSICSLYQLENYCN
> Sequence 2: HSQGTFTSDYSKYLDSRAEQDFV
> 
> Alignment 1: TSICSLYQLEN
> Alignment 2: TSDYSKY-LDS
> Score:       80

Testing & Development

To test, install the development dependencies and run the tests.

pip install -e ".[dev,jit]"
pytest test.py -s -x

Symmetry Test for Needleman-Wunsch

First, verify that the Needleman-Wunsch algorithm is symmetric with respect to the argument order, assuming the substitution matrix is symmetric.

pytest test.py -s -x -k symmetry

Needleman-Wunsch and Levenshtein Score Equivalence

The Needleman-Wunsch alignment score should be equal to the negated Levenshtein distance for specific match/mismatch costs.

pytest test.py -s -x -k levenshtein

Alignment vs Scoring Consistency

Check that the alignment score is consistent with the scoring function for specific sequences and scoring parameters.

pytest test.py -s -x -k scoring_vs_alignment

Gap Expansion Test

Check the effect of gap expansions on alignment scores. This test ensures that increasing the width of gaps in alignments with zero gap extension penalties does not change the alignment score.

pytest test.py -s -x -k gap_expansions

Comparison with BioPython Examples

Compare the affine gap alignment scores with BioPython for specific sequence pairs and scoring parameters. This test ensures that the Needleman-Wunsch-Gotoh alignment scores are at least as good as BioPython's PairwiseAligner scores.

pytest test.py -s -x -k biopython_examples

Fuzzy Comparison with BioPython

Perform a fuzzy comparison of affine gap alignment scores with BioPython for randomly generated sequences. This test verifies that the Needleman-Wunsch-Gotoh alignment scores are at least as good as BioPython's PairwiseAligner scores for various gap penalties.

pytest test.py -s -x -k biopython_fuzzy

EMBOSS and Other Tools

Seemingly the only correct known open-source implementation is located in nucleus/embaln.c file in the EMBOSS package in the embAlignPathCalcWithEndGapPenalties and embAlignGetScoreNWMatrix functions. That program was originally implemented in 1999 by Alan Bleasby and tweaked in 2000 for better scoring. That implementation has no SIMD optimizations, branchless-computing tricks, or other modern optimizations, but it's still widely recommended. If you want to compare the results, you can download the EMBOSS source code and compile it with following commands:

wget -m 'ftp://emboss.open-bio.org/pub/EMBOSS/'
cd emboss.open-bio.org/pub/EMBOSS/
gunzip EMBOSS-latest.tar.gz
tar xf EMBOSS-latest.tar
cd EMBOSS-latest
./configure

Or if you simply want to explore the source:

cat emboss.open-bio.org/pub/EMBOSS/EMBOSS-6.6.0/nucleus/embaln.c

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

affine_gaps-0.2.4.tar.gz (22.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

affine_gaps-0.2.4-py3-none-any.whl (15.6 kB view details)

Uploaded Python 3

File details

Details for the file affine_gaps-0.2.4.tar.gz.

File metadata

  • Download URL: affine_gaps-0.2.4.tar.gz
  • Upload date:
  • Size: 22.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for affine_gaps-0.2.4.tar.gz
Algorithm Hash digest
SHA256 2c85685fc82f3c8a610522e07e56e031f3f2985ab96f5b708ebf8db8cf342de8
MD5 ce0be33e53dc2432ef71291028060c16
BLAKE2b-256 6f9b7420c3a23d5dd04e24b63f4d73937c7679fc60848d1a8509f33e84e86b76

See more details on using hashes here.

Provenance

The following attestation bundles were made for affine_gaps-0.2.4.tar.gz:

Publisher: release.yml on gata-bio/affine-gaps

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file affine_gaps-0.2.4-py3-none-any.whl.

File metadata

  • Download URL: affine_gaps-0.2.4-py3-none-any.whl
  • Upload date:
  • Size: 15.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for affine_gaps-0.2.4-py3-none-any.whl
Algorithm Hash digest
SHA256 f2056a73e88b267721bf40f8f9c4ef2129bec37c75461ae650e7544f962aa55a
MD5 7a0d714ad38f571c1dd164650c566ab0
BLAKE2b-256 66edc02c31e97982b9fc57a1306dc379eafac506245f052cbbc92164368e5130

See more details on using hashes here.

Provenance

The following attestation bundles were made for affine_gaps-0.2.4-py3-none-any.whl:

Publisher: release.yml on gata-bio/affine-gaps

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page