Fast and Versatile Alignments for Python

Project description

pyalign

Fast and simple alignments in Python:

Example inside Jupyter

Alignments have been a staple algorithm in bioinformatics for decades now, but most packages implementing tend to be either easy to use and slow, or fast but very difficult to use and highly domain specific.

pyalign is a small and hopefully rather versatile Python package that aims to be fast and easy to use. At its core, it is an optimizer for finding "optimum correspondences between sequences" (Kruskal, 1983) - the main proponents of which are alignments and dynamic time warping.

General Features:

easy to install and easy to use
robust and efficient implementation of standard algorithms
very fast for smaller problem sizes (see below for details)
built-in visualization functionality for teaching purposes

In terms of alignment algorithms:

computes local, global and semiglobal alignments on pairs of sequences
supports different gap costs (commonly used ones as well as custom ones)
automatically selects best suitable algorithm (e.g. Gotoh)
no assumptions on matched items, i.e. not limited to characters
supports any given similarity or distance function (i.e. can maximize or minimize)
can return one as well as all optimal alignments and scores

The implementation should be rather fast due to highly optimized code paths for every special case. While it does not support GPUs, here are some facts:

optimized C++ core employing xtensor
supports SIMD via batching (i.e. simple SIMD parallelism as first suggested by Alpern et al. and more recently by Rudnicki et al.)
carefully designed to avoid dynamic memory allocation
extensive metaprogramming to provide different optimized code paths for different usage patterns - for example, computing "only single score" won't write tracebacks, whereas computing "all alignments" will track multiple traceback edges

Installation

via pip (recommended)

pyalign currently provides precompiled packages for Windows (Intel), Linux (Intel) and macOS (Intel).

pip install pyalign

locally

Installing pyalign locally will require a modern C++ compiler. It also requires various libraries from the xtensor stack which are best installed via cona; for the full list of required packages, see environment.yml.

Local installation via conda:

git clone https://github.com/poke1024/pyalign
cd pyalign
conda env create -f environment.yml
conda activate pyalign
python setup.py install

Example

Running

import pyalign
alignment = pyalign.global_alignment("INDUSTRY", "INTEREST", gap_cost=0, eq=1, ne=-1)
alignment

will compute an optimal global alignment between "INDUSTRY" and "INTEREST".

We instruct the optimizer to use of scores 1 and -1 (for matching and non-matching letters) and no (i.e. 0) gap costs.

In Jupyter, this will give

IN----DUSTRY
||      ||  
INTERE--ST--

Of course you can also extract the actual score:

alignment.score

4.0

It's also possible to extract the traceback matrix and path and generate visuals (and thus a detailed rationale for the obtained score and solution).

In contrast to the first example above, which used the simplified high level API, we now use the full, more detailed API, which gives much more detailed access to different gap costs, solvers and scoring configurations. To make things a bit more interesting, we switch from 0 gap cost to 0.2 (which will not change the result in this case, but shows in the traceback matrix):

import pyalign.problems
import pyalign.solve
import pyalign.gaps

pf = pyalign.problems.general(
    pyalign.problems.Equality(eq=1, ne=-1),
    direction="maximize")
solver = pyalign.solve.GlobalSolver(
    gap_cost=pyalign.gaps.LinearGapCost(0.2),
    codomain=pyalign.solve.Solution)
problem = pf.new_problem("INDUSTRY", "INTEREST")
solver.solve(problem)

traceback and path

As a final example, here is how we would modify the solver above to get a list over all optimal solutions of a problem:

from typing import Iterator

solver = pyalign.solve.GlobalSolver(
    gap_cost=pyalign.gaps.LinearGapCost(0.2),
    codomain=List[pyalign.solve.Solution])

This will now return a list of solutions, each with its own traceback, e.g.:

[<pyalign.solve.Solution at 0x10f0f15b0>,
 <pyalign.solve.Solution at 0x10f0f1580>]

To learn more about the API, take a look at

Performance

Here are a few benchmarks. The "pure python" implementation seen in this benchmark is found at https://github.com/eseraygun/python-alignment.

+alphabet means using pyalign.problems.alphabetic instead of the simpler pyalign.problems.general to construct a problem.

+SIMD means feeding groups of equally-structured aligment problems into one solve call by using pyalign.problem.ProblemBatch - doing this will internally make use of AVX2 on Intel and Neon on ARM processors.

The following benchmarks were done on an Apple M1 Max. SIMD-128 refers to the M1's 128-bit SIMD.

The benchmark code can be found under benchmark.py.

traceback and path

Other Alignment Libraries

Here is a short overview of other libraries.

great bio alignment libraries

similar to pyalign

for large scale problems

What you will not find in pyalign:

SIMD acceleration for single pairs of sequences as in e.g. (Farrar 2007)
GPU acceleration, see e.g. (Barnes, 2020)
approximate or randomized algorithms
advanced preprocessing or indexing

If you need any of the above, you might want to take a look at:

References

Original Works

Altschul, S. (1998). Generalized affine gap costs for protein sequence alignment. Proteins: Structure, 32.

Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162(3), 705–708. https://doi.org/10.1016/0022-2836(82)90398-9

Sankoff, D. (1972). Matching Sequences under Deletion/Insertion Constraints. Proceedings of the National Academy of Sciences, 69(1), 4–6. https://doi.org/10.1073/pnas.69.1.4

Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. Journal of Molecular Biology, 147(1), 195–197. https://doi.org/10.1016/0022-2836(81)90087-5

Miller, W., & Myers, E. W. (1988). Sequence comparison with concave weighting functions. Bulletin of Mathematical Biology, 50(2), 97–120. https://doi.org/10.1007/BF02459948

Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443–453. https://doi.org/10.1016/0022-2836(70)90057-4

Waterman, M. S., Smith, T. F., & Beyer, W. A. (1976). Some biological sequence metrics. Advances in Mathematics, 20(3), 367–387. https://doi.org/10.1016/0001-8708(76)90202-4

Waterman, M. S. (1984). Efficient sequence alignment algorithms. Journal of Theoretical Biology, 108(3), 333–337. https://doi.org/10.1016/S0022-5193(84)80037-5

Other Algorithms

Chakraborty, A., & Bandyopadhyay, S. (2013). FOGSAA: Fast Optimal Global Sequence Alignment Algorithm. Scientific Reports, 3(1), 1746. https://doi.org/10.1038/srep01746

Surveys

Aluru, S. (Ed.). (2005). Handbook of Computational Molecular Biology. Chapman and Hall/CRC. https://doi.org/10.1201/9781420036275

Stojmirović, A., & Yu, Y.-K. (2009). Geometric Aspects of Biological Sequence Comparison. Journal of Computational Biology, 16(4), 579–610. https://doi.org/10.1089/cmb.2008.0100

Kruskal, J. B. (1983). An Overview of Sequence Comparison: Time Warps, String Edits, and Macromolecules. SIAM Review, 25(2), 201–237. https://doi.org/10.1137/1025045

Müller, M. (2007). Information Retrieval for Music and Motion. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-74048-3

Implementations

Alpern, B., Carter, L., & Su Gatlin, K. (1995). Microparallelism and high-performance protein matching. Proceedings of the 1995 ACM/IEEE Conference on Supercomputing (CDROM) - Supercomputing ’95, 24-es. https://doi.org/10.1145/224170.224222

Barnes, R. (2020). A Review of the Smith-Waterman GPU Landscape. https://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-152.html

Farrar, M. (2007). Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics, 23(2), 156–161. https://doi.org/10.1093/bioinformatics/btl582

Flouri, T., Kobert, K., Rognes, T., & Stamatakis, A. (2015). Are all global alignment algorithms and implementations correct? [Preprint]. Bioinformatics. https://doi.org/10.1101/031500

Rognes, T. (2011). Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics, 12(1), 221. https://doi.org/10.1186/1471-2105-12-221

Rudnicki, W. R., Jankowski, A., Modzelewski, A., Piotrowski, A., & Zadrożny, A. (2009). The new SIMD Implementation of the Smith-Waterman Algorithm on Cell Microprocessor. Fundamenta Informaticae, 96(1–2), 181–194. https://doi.org/10.3233/FI-2009-173

Tran, T. T., Liu, Y., & Schmidt, B. (2016). Bit-parallel approximate pattern matching: Kepler GPU versus Xeon Phi. 26th International Symposium on Computer Architecture and High Performance Computing, 54, 128–138. https://doi.org/10.1016/j.parco.2015.11.001

Project details

Release history Release notifications | RSS feed

This version

0.4.4

Nov 30, 2022

0.4.3

Jan 21, 2022

0.4.2

Jan 20, 2022

0.4.1

Jan 18, 2022

0.4

Dec 20, 2021

0.3.9

Dec 9, 2021

0.3.8

Nov 25, 2021

0.3.7

Nov 24, 2021

0.3.6

Sep 21, 2021

0.3.5

Sep 20, 2021

0.3.3

Jul 29, 2021

0.3.1

Jul 29, 2021

0.3

Jul 20, 2021

0.2.2

Jul 10, 2021

0.2.1

Jul 10, 2021

0.2

Jul 10, 2021

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyalign-0.4.4.tar.gz (50.1 kB view details)

Uploaded Nov 30, 2022 Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

pyalign-0.4.4-cp39-cp39-manylinux2014_x86_64.whl (4.6 MB view details)

Uploaded Nov 30, 2022 CPython 3.9

pyalign-0.4.4-cp38-cp38-manylinux2014_x86_64.whl (4.6 MB view details)

Uploaded Nov 30, 2022 CPython 3.8

pyalign-0.4.4-cp37-cp37m-manylinux2014_x86_64.whl (4.6 MB view details)

Uploaded Nov 30, 2022 CPython 3.7m

File details

Details for the file pyalign-0.4.4.tar.gz.

File metadata

Download URL: pyalign-0.4.4.tar.gz
Upload date: Nov 30, 2022
Size: 50.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.9.15

File hashes

Hashes for pyalign-0.4.4.tar.gz
Algorithm	Hash digest
SHA256	`430b55c950a7d219ff922cd7d3a21f7b24a28279400cd87cd8f73c0948e88be6`
MD5	`dd8b796af57491c1c6a1929dcdbcaf82`
BLAKE2b-256	`e044507f5102ebb021e4c7d90b0b7fd76357385a361237f25e37f159610b03e7`

See more details on using hashes here.

File details

Details for the file pyalign-0.4.4-cp39-cp39-manylinux2014_x86_64.whl.

File metadata

Download URL: pyalign-0.4.4-cp39-cp39-manylinux2014_x86_64.whl
Upload date: Nov 30, 2022
Size: 4.6 MB
Tags: CPython 3.9
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.9.15

File hashes

Hashes for pyalign-0.4.4-cp39-cp39-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`ab10fad108ddf61190353ac432b0104a010f9982d65fd2f8741574e775afe21e`
MD5	`be0e32a717f99cc9489edd9902548eda`
BLAKE2b-256	`4ca70d1a0a0c615a76d95e3080e7ba402ec0cfbae5f12547cf56d23de19cfc75`

See more details on using hashes here.

File details

Details for the file pyalign-0.4.4-cp38-cp38-manylinux2014_x86_64.whl.

File metadata

Download URL: pyalign-0.4.4-cp38-cp38-manylinux2014_x86_64.whl
Upload date: Nov 30, 2022
Size: 4.6 MB
Tags: CPython 3.8
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.9.15

File hashes

Hashes for pyalign-0.4.4-cp38-cp38-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`de89eacd705df2149f5b64a60a6a437232bc9bc4eac6e2615b48a3e711c86e29`
MD5	`e8783360ce134fd6a5cc36745d46fb11`
BLAKE2b-256	`a8c7e3e82b5f20acc4e7ff8913c28ccc383b06931d4a1d759dec78d733448b1a`

See more details on using hashes here.

File details

Details for the file pyalign-0.4.4-cp37-cp37m-manylinux2014_x86_64.whl.

File metadata

Download URL: pyalign-0.4.4-cp37-cp37m-manylinux2014_x86_64.whl
Upload date: Nov 30, 2022
Size: 4.6 MB
Tags: CPython 3.7m
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.9.15

File hashes

Hashes for pyalign-0.4.4-cp37-cp37m-manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`887c44489a20a312fa929ca3f879cfd261c8b0a504861da021e9589d41fad56b`
MD5	`ef141574a2ae7a401a53e966ea472992`
BLAKE2b-256	`10e1baa849bfdf6f2ae44b551c5d25e61e9995ed8a4e96c689526c89bdc0bf71`

See more details on using hashes here.

pyalign 0.4.4

Navigation

Verified details

Maintainers

Unverified details

Meta

Project description

pyalign

Installation

via pip (recommended)

locally

Example

Performance

Other Alignment Libraries

great bio alignment libraries

similar to pyalign

for large scale problems

References

Original Works

Other Algorithms

Surveys

Implementations

Project details

Verified details

Maintainers

Unverified details

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distributions

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes

File details

File metadata

File hashes