Skip to main content

Align sequences using WFA2-lib

Project description

A python wrapper for wavefront alignment using WFA2-lib

Installation

To download from pypi:

pip install pywfa

From conda:

conda install -c bioconda pywfa

Build from source:

git clone https://github.com/kcleal/pywfa
cd pywfa
pip install .

Overview

Alignment of pattern and text strings can be performed by accessing WFA2-lib functions directly:

from pywfa import WavefrontAligner

pattern = "TCTTTACTCGCGCGTTGGAGAAATACAATAGT"
text =    "TCTATACTGCGCGTTTGGAGAAATAAAATAGT"
a = WavefrontAligner(pattern)
score = a.wavefront_align(text)
assert a.status == 0  # alignment was successful
assert a.cigarstring == "3M1X4M1D7M1I9M1X6M"
assert a.score == -24
a.cigartuples
>>> [(0, 3), (8, 1), (0, 4), (2, 1), (0, 7), (1, 1), (0, 9), (8, 1), (0, 6)]
a.cigar_print_pretty()
>>> 3M1X4M1D7M1I9M1X6M      ALIGNMENT
    1X1D1I1X      ALIGNMENT.COMPACT
    PATTERN    TCTTTACTCGCGCGTT-GGAGAAATACAATAGT
               ||| |||| ||||||| ||||||||| ||||||
    TEXT       TCTATACT-GCGCGTTTGGAGAAATAAAATAGT

The output of cigar_pretty_print can be directed to a file, rather than stdout using:

a.cigar_print_pretty("file.txt")

To obtain a python str of this print out, access the results object (see below).

Cigartuples follow the convention:

Operation

Code

M

0

I

1

D

2

N

3

S

4

H

5

=

7

X

8

B

9

For convenience, a results object can be obtained by calling the WavefrontAligner with a pattern and text:

pattern = "TCTTTACTCGCGCGTTGGAGAAATACAATAGT"
text =    "TCTATACTGCGCGTTTGGAGAAATAAAATAGT"
a = WavefrontAligner(pattern)
result = a(text)  # alignment result
result.__dict__
>>> {'pattern_length': 32, 'text_length': 32, 'pattern_start': 0, 'pattern_end': 32, 'text_start': 0, 'text_end': 32, 'cigartuples': [(0, 3), (8, 1), (0, 4), (2, 1), (0, 7), (1, 1), (0, 9), (8, 1), (0, 6)], 'score': -24, 'pattern': 'TCTTTACTCGCGCGTTGGAGAAATACAATAGT', 'text': 'TCTATACTGCGCGTTTGGAGAAATAAAATAGT', 'status': 0}

# Alignment can also be called with a pattern like this:
a(text, pattern)

# obtain a string in the same format as cigar_print_pretty
a.pretty
>>> 3M1X4M1D7M1I9M1X6M      ALIGNMENT
    1X1D1I1X      ALIGNMENT.COMPACT
          PATTERN    TCTTTACTCGCGCGTT-GGAGAAATACAATAGT
                     |||*|||| ||||||| |||||||||*||||||
          TEXT       TCTATACT-GCGCGTTTGGAGAAATAAAATAGT

Configure

To configure the WaveFrontAligner, options can be provided during initialization:

from pywfa import WavefrontAligner

a = WavefrontAligner(scope="score",
                     distance="affine2p",
                     span="end-to-end",
                     heuristic="adaptive")

Supported distance metrics are “affine” (default) and “affine2p”. Scope can be “full” (default) or “score”. Span can be “ends-free” (default) or “end-to-end”. Heuristic can be None (default), “adaptive” or “X-drop”.

When using heuristic functions it is recommended to check the status attribute:

pattern = "AAAAACCTTTTTAAAAAA"
text = "GGCCAAAAACCAAAAAA"
a = WavefrontAligner(heuristic="adaptive")
a(pattern, text)
a.status
>>> 0   # successful alignment, -1 indicates the alignment was stopped due to the heuristic

Default options

The WavefrontAligner will be initialized with the following default options:

Parameter

Default value

pattern

None

distance

“affine”

match

0

gap_opening

6

gep_extension

2

gap_opening2

24

gap_extension2

1

scope

“full”

span

“ends-free”

pattern_begin_free

0

pattern_end_free

0

text_begin_free

0

text_end_free

0

heuristic

None

min_wavefront_length

10

max_distance_threshold

50

steps_between_cutoffs

1

xdrop

20

Modifying the cigar

If desired the cigar can be modified so the end operation is either a soft-clip or a match, this makes the alignment cigar resemble those produced by bwa, for example:

pattern = "AAAAACCTTTTTAAAAAA"
text = "GGCCAAAAACCAAAAAA"
a = WavefrontAligner(pattern)

res = a(text, clip_cigar=False)
print(cigartuples_to_str(res.cigartuples))
>>> 4I7M5D6M

res(text, clip_cigar=True)
print(cigartuples_to_str(res.cigartuples))
>>> 4S7M5D6M

An experimental feature is to trim short matches at the end of alignments. This results in alignments that approximate local alignments:

pattern = "AAAAAAAAAAAACCTTTTAAAAAAGAAAAAAA"
text = "ACCCCCCCCCCCAAAAACCAAAAAAAAAAAAA"
a = WavefrontAligner(pattern)

# The unmodified cigar may have short matches at the end:
res = a(text, clip_cigar=False)
res.cigartuples
>>> [(0, 1), (1, 5), (8, 6), (0, 7), (2, 5), (0, 5), (8, 1), (0, 7)]
res.aligned_text
>>> ACCCCCCCCCCCAAAAACCAAAAAAAAAAAAA
res.text_start, res.text_end
>>> 0, 32

# The minimum allowed block of matches can be set at e.g. 5 bp, which will trim off short matches
res = a(text, clip_cigar=True, min_aligned_bases_left=5, min_aligned_bases_right=5)
res.cigartuples
>>> [(4, 12), (0, 7), (2, 5), (0, 5), (8, 1), (0, 7)]
res.aligned_text
>>> AAAAACCAAAAAAAAAAAAA
res.text_start, res.text_end
>>> 12, 32

# Mismatch operations X can also be elided, note this occurs after the clip_cigar stage
res = a(text, clip_cigar=True, min_aligned_bases_left=5, min_aligned_bases_right=5, elide_mismatches=True)
res.cigartuples
>>> [(4, 12), (0, 7), (2, 5), (0, 13)]
res.aligned_text
>>> AAAAACCAAAAAAAAAAAAA

Notes: The alignment score is not modified currently by trimming the cigar, however the pattern_start, pattern_end, test_start and text_end are modified when the cigar is modified.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pywfa-0.5.1.tar.gz (3.5 MB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page