Needleman-Wunsch Quality-Aware Sequence Alignment

## Project description

This pacakge implements a Needleman–Wunsch global alignment with affine gap penalties, taking into account the confidence in each base pair of the read sequence.

The classic Needleman-Wunsch algorithm finds the optimal global match assuming all read nucleotides have been identified with certainty. In practice, NGS (next generation sequencing) identifies nucleotides with varying levels of qualities, and popular formats, e.g., FASTQ are expressely built for describing these qualities. This package modifies the algorithm to take qualities into account (see mathematical_rationale below).

This package is designed to be used in conjunction with a read aligner such as Bowtie2 or GEM. Ideally, these tools would be configured to take basepair quality into account when searching for hits. Since this is currently not the case, one can use the allein_zu_haus package in order to realign reads to reference sequences at the positions reported by the read aligner, and extract more accurate alignment scores and CIGAR strings.

The package handles ambiguous base codes by weighing the different options according to the base priors provided. Read sequences must be drawn from the {A, C, G, T, N} alphabet, but reference sequences may contain any basepairs defined by the IUPAC nucleotide ambiguity code. Quality values are ignored for ambiguous read basepairs.

## Usage example:

### Minimal Example

import allein_zu_haus
import numpy as np

max_ref_len = 2 * max_read_len # Max length of reference subsequence to be globally aligned (where start position is determined by read aligner output)
mismatch_penalty = 4           # Penalty for mismatched nucleotides
gap_open = 6                   # Gep opening penalty
gep_extend = 3                 # Gap extension penalty
aligner = allein_zu_haus.Aligner(max_read_len, max_ref_len, mismatch_penalty, gap_open, gap_extend)

read = np.array(['A', 'C', 'G', 'T', 'A'], dtype=bytes) # Read sequence, given as numpy array, dtype=bytes
ref = np.array(['A', 'G', 'G', 'T', 'A'], dtype=bytes)  # Reference subsequence to be aligned, given as numpy array, dtype=bytes
read_bp_probs = np.array([0.9, 0.99, 0.8, 0.99, 0.99])  # Confidence in each read basepair. See below on how to extract such values from read quality strings
base_probs = np.array([0.25] * 4)                       # Basepair prior probabilities. Here, assuming uniform distribution on nucleotides.
# See below example for more biologically relevant priors
max_gaps = 2                                            # Maximal number of gaps allowed. Use small values to improve run time

score, cigar = aligner.match(read, read_bp_probs, base_probs, ref, max_gaps) #Aligner returns alignment score and CIGAR string

### Setting Up Priors & Qualities

One way of extracting more meaningful basepair priors is by using the nucleotide frequencies observed in reference sequences. Assuming _ref_seq is a string variable holding the reference genome of interest this can be done by:

def get_priors():
return [(k, v / float(len(_ref_seq))) for (k, v) in collections.Counter(_ref_seq).items()]

Naturally, more intricate priors can easily be designed.

Basepair probabilities cab be extracted from quality strings reported in FASTQ files / read aligner output using this formula:

\begin{equation*} 1 - 10^\frac{-q}{10} \end{equation*}

where q is the ascii value of the quality character - 33 (assuming qualities are goven in Phred+33)

def get_read_bp_probs(read_quality_string): # read quality_string: FASTQ quality string as reported in FASTQ file or read aligner output
return 1 - np.power(10, -np.array([ord(e) - 33 for e in a]) / 10.)

As you can see, read_p now holds per basepair probabilities (1 - P_error). For example, for quality string ‘??CII’ output will be [0.999, 0.999, 0.996, 0.9999, 0.9999]

This package is optimized for multiple alignments, e.g., when iterating over the results of GEM or a FASTQ file, and checking the score of each result relative to some corresponding reference subsequence. For this reason, to use it, first create an object with the parameters relevant to all matches. Since reads usually have similar lengths, and since read aligners provide the position in the reference sequence to which the read was aligned, both length of read and reference sequence can be easily bound.

import allein_zu_haus
import numpy as np

mismatch_penalty = 4
gap_open = 6
gep_extend = 3
aligner = allein_zu_haus.Aligner(max_read_len, max_ref_len, mismatch_penalty, gap_open, gap_extend)

Then use the aligner repeatedly for each match, providing only the match specific parameters:

for read, read_quality_string, ref, max_gaps in ...:
# read, ref should be of type np.array with dtype=bytes
base_probs = get_priors()                              # Function for computing priors from reference and/or read sequences. See example above

score, cigar = aligner.match(read, read_bp_probs, base_probs, ref, max_gaps) # multiple calls to aligner, with relevant sequences and priors

## Mathematical Rationale

Suppose we wish to find the optimal match between a read D and a reference F. Unfortunately, we cannot observe D directly, and instead only see D’, which is the sequence outputted by some imperfect sequencing process. Say that at some point in the alignment algorithm we consider whether a nucleotide from D’ matches a nucleotide from F. Define:

• b D’ is the nucleotide reported by the sequencing process.

• b D is the true (unknown) uncleotide.

• b R is the reference nucleotide.

The classic algorithm would assign the penalty

\begin{equation*} \mbox{penalty}(b_{D'}, b_R) \end{equation*}

whereas the correct penalty should be

\begin{equation*} \mbox{penalty}(b_{D}, b_R) \simeq \sum_b \left[ P\left( B_D = b | B_{D'} = b_{D'} \right) \cdot \mbox{penalty}(b, b_R) \right] \end{equation*}
\begin{equation*} P\left( B_D = b | B_{D'} = b_{D'} \right) = \frac { P\left( B_{D'} = b_{D'} | B_D = b \right) \cdot P\left( B_D = b \right) } { \sum_{k = 'A', 'C', 'G', 'T} P\left( B_{D'} = b_{D'} | B_D = k \right) \cdot P\left( B_D = k \right) } \end{equation*}

For evaluating these terms, note that

\begin{equation*} P\left( B_D = b \right) \end{equation*}

is the prior over the nucleotides (which must be given by the user), and

\begin{equation*} P\left( B_{D'} = b_{D'} | B_D = b \right) = \begin{cases} 1 - P_{\mbox{err}} ,& \text{if } b_{D'} = b, \\ \frac{P_{\mbox{err}}}{3}, & \text{otherwise} \end{cases} \end{equation*}

where P err is the probability for error determined by the reported quality for this nucleotide.

## Issues

Feel free to open tickets at https://bitbucket.org/taliraveh/allein_zu_haus/issues.