This is a pre-production deployment of Warehouse. Changes made here affect the production instance of PyPI (pypi.python.org).
Help us improve Python packaging - Donate today!

Needleman-Wunsch Quality-Aware Sequence Alignment

Project Description
=================================================================
allein_zu_haus: Needleman-Wunsch Quality-Aware Sequence Alignment
=================================================================

This pacakge implements a Needleman–Wunsch <https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm>_
global alignment with affine gap penalties <https://en.wikipedia.org/wiki/Gap_penalty#Affine>_, taking into account the confidence in each base pair of the read sequence.

The classic Needleman-Wunsch algorithm finds the optimal global match assuming all read nucleotides have been identified with certainty. In practice,
NGS (next generation sequencing) <http://www.illumina.com/technology/next-generation-sequencing.html>_ identifies nucleotides with varying levels of qualities, and popular formats, e.g., FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>_ are expressely built for describing these qualities. This package modifies the algorithm to take qualities into account (see mathematical_rationale_ below).

This package is designed to be used in conjunction with a read aligner such as Bowtie2 <http://bowtie-bio.sourceforge.net/bowtie2/index.shtml>_ or GEM <http://www.paoloribeca.net/software/GEM/>_. Ideally, these tools would be configured to take basepair quality into account when searching for hits. Since this is currently not the case, one can use the allein_zu_haus package in order to realign reads to reference sequences at the positions reported by the read aligner, and extract more accurate alignment scores and CIGAR <https://samtools.github.io/hts-specs/SAMv1.pdf>_ strings.

The package handles ambiguous base codes by weighing the different options according to the base priors provided. Read sequences must be drawn from the {A, C, G, T, N} alphabet, but reference sequences may contain any basepairs defined by the IUPAC nucleotide ambiguity code. Quality values are ignored for ambiguous read basepairs.

Usage example:
--------------

Minimal Example
~~~~~~~~~~~~~~~

.. code:: python

import allein_zu_haus
import numpy as np

max_ref_len = 2 * max_read_len # Max length of reference subsequence to be globally aligned (where start position is determined by read aligner output)
mismatch_penalty = 4 # Penalty for mismatched nucleotides
gap_open = 6 # Gep opening penalty
gep_extend = 3 # Gap extension penalty
aligner = allein_zu_haus.Aligner(max_read_len, max_ref_len, mismatch_penalty, gap_open, gap_extend)

read = np.array(['A', 'C', 'G', 'T', 'A'], dtype=bytes) # Read sequence, given as numpy array, dtype=bytes
ref = np.array(['A', 'G', 'G', 'T', 'A'], dtype=bytes) # Reference subsequence to be aligned, given as numpy array, dtype=bytes
read_bp_probs = np.array([0.9, 0.99, 0.8, 0.99, 0.99]) # Confidence in each read basepair. See below on how to extract such values from read quality strings
base_probs = np.array([0.25] * 4) # Basepair prior probabilities. Here, assuming uniform distribution on nucleotides.
# See below example for more biologically relevant priors
max_gaps = 2 # Maximal number of gaps allowed. Use small values to improve run time

score, cigar = aligner.match(read, read_bp_probs, base_probs, ref, max_gaps) #Aligner returns alignment score and CIGAR string

Setting Up Priors & Qualities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way of extracting more meaningful basepair priors is by using the nucleotide frequencies observed in reference sequences. Assuming _ref_seq is a string variable holding the reference genome of interest this can be done by:

.. code:: python

def get_priors():
return [(k, v / float(len(_ref_seq))) for (k, v) in collections.Counter(_ref_seq).items()]

Naturally, more intricate priors can easily be designed.

Basepair probabilities cab be extracted from quality strings reported in FASTQ files / read aligner output using this formula:

.. math::

1 - 10^\frac{-q}{10}

where q is the ascii value of the quality character - 33 (assuming qualities are goven in Phred+33)

.. code:: python

return 1 - np.power(10, -np.array([ord(e) - 33 for e in a]) / 10.)

As you can see, read_p now holds per basepair probabilities (1 - P_error). For example, for quality string '??CII' output will be [0.999, 0.999, 0.996, 0.9999, 0.9999]

~~~~~~~~~~~~~~~~~~~~~~~

This package is optimized for multiple alignments, e.g., when iterating over the results of GEM <http://www.paoloribeca.net/software/GEM/>_ or a FASTQ file, and checking the score of each result relative to some corresponding reference subsequence. For this reason, to use it, first create an object with the parameters relevant to all matches.
Since reads usually have similar lengths, and since read aligners provide the position in the reference sequence to which the read was aligned, both length of read and reference sequence can be easily bound.

.. code:: python

import allein_zu_haus
import numpy as np

mismatch_penalty = 4
gap_open = 6
gep_extend = 3
aligner = allein_zu_haus.Aligner(max_read_len, max_ref_len, mismatch_penalty, gap_open, gap_extend)

Then use the aligner repeatedly for each match, providing only the match specific parameters:

.. code:: python

# read, ref should be of type np.array with dtype=bytes
base_probs = get_priors() # Function for computing priors from reference and/or read sequences. See example above

score, cigar = aligner.match(read, read_bp_probs, base_probs, ref, max_gaps) # multiple calls to aligner, with relevant sequences and priors

.. _mathematical_rationale:

Mathematical Rationale
----------------------

Suppose we wish to find the optimal match between a read *D* and a reference *F*. Unfortunately, we cannot observe *D* directly, and instead only see *D'*, which is the sequence outputted by some imperfect sequencing process <https://en.wikipedia.org/wiki/DNA_sequencing>_. Say that at some point in the alignment algorithm we consider whether a nucleotide from *D'* matches a nucleotide from *F*. Define:

* *b* :sub:D' is the nucleotide reported by the sequencing process.
* *b* :sub:D is the true (unknown) uncleotide.
* *b* :sub:R is the reference nucleotide.

The classic algorithm would assign the penalty

.. math::

\mbox{penalty}(b_{D'}, b_R)

whereas the correct penalty should be

.. math::

\mbox{penalty}(b_{D}, b_R) \simeq
\sum_b \left[ P\left( B_D = b | B_{D'} = b_{D'} \right) \cdot \mbox{penalty}(b, b_R) \right]

By Bayes' Theorem <https://en.wikipedia.org/wiki/Bayes%27_theorem>_,

.. math::

P\left( B_D = b | B_{D'} = b_{D'} \right)
=
\frac
{
P\left( B_{D'} = b_{D'} | B_D = b \right)
\cdot
P\left( B_D = b \right)
}
{
\sum_{k = 'A', 'C', 'G', 'T}
P\left( B_{D'} = b_{D'} | B_D = k \right)
\cdot
P\left( B_D = k \right)
}

For evaluating these terms, note that

.. math::

P\left( B_D = b \right)

is the prior over the nucleotides (which must be given by the user), and

.. math::

P\left( B_{D'} = b_{D'} | B_D = b \right)
=
\begin{cases}
1 - P_{\mbox{err}} ,& \text{if } b_{D'} = b, \\
\frac{P_{\mbox{err}}}{3}, & \text{otherwise}
\end{cases}

where *P* :sub:err is the probability for error determined by the reported quality for this nucleotide.

Issues
------

Feel free to open tickets at https://bitbucket.org/taliraveh/allein_zu_haus/issues <https://bitbucket.org/taliraveh/allein_zu_haus/issues>_.
Release History

This version

0.1.4

0.1.3

0.1.2

0.1.1

0.1.0