Needleman-Wunsch Quality-Aware Sequence Alignment

## Project description

=================================================================

allein_zu_haus: Needleman-Wunsch Quality-Aware Sequence Alignment

=================================================================

This pacakge implements a `Needleman–Wunsch <https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm>`_

global alignment with `affine gap penalties <https://en.wikipedia.org/wiki/Gap_penalty#Affine>`_, taking into account the confidence in each base pair of the read sequence.

The classic Needleman-Wunsch algorithm finds the optimal global match assuming all read nucleotides have been identified with certainty. In practice,

`NGS (next generation sequencing) <http://www.illumina.com/technology/next-generation-sequencing.html>`_ identifies nucleotides with varying levels of qualities, and popular formats, e.g., `FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>`_ are expressely built for describing these qualities. This package modifies the algorithm to take qualities into account (see mathematical_rationale_ below).

This package is designed to be used in conjunction with a read aligner such as `Bowtie2 <http://bowtie-bio.sourceforge.net/bowtie2/index.shtml>`_ or `GEM <http://www.paoloribeca.net/software/GEM/>`_. Ideally, these tools would be configured to take basepair quality into account when searching for hits. Since this is currently not the case, one can use the allein_zu_haus package in order to realign reads to reference sequences at the positions reported by the read aligner, and extract more accurate alignment scores and `CIGAR <https://samtools.github.io/hts-specs/SAMv1.pdf>`_ strings.

The package handles ambiguous base codes by weighing the different options according to the base priors provided. Read sequences must be drawn from the {A, C, G, T, N} alphabet, but reference sequences may contain any basepairs defined by the IUPAC nucleotide ambiguity code. Quality values are ignored for ambiguous read basepairs.

Usage example:

--------------

Minimal Example

~~~~~~~~~~~~~~~

.. code:: python

import allein_zu_haus

import numpy as np

max_read_len = 100 # Max length of read to be aligned

max_ref_len = 2 * max_read_len # Max length of reference subsequence to be globally aligned (where start position is determined by read aligner output)

mismatch_penalty = 4 # Penalty for mismatched nucleotides

gap_open = 6 # Gep opening penalty

gep_extend = 3 # Gap extension penalty

aligner = allein_zu_haus.Aligner(max_read_len, max_ref_len, mismatch_penalty, gap_open, gap_extend)

read = np.array(['A', 'C', 'G', 'T', 'A'], dtype=bytes) # Read sequence, given as numpy array, dtype=bytes

ref = np.array(['A', 'G', 'G', 'T', 'A'], dtype=bytes) # Reference subsequence to be aligned, given as numpy array, dtype=bytes

read_bp_probs = np.array([0.9, 0.99, 0.8, 0.99, 0.99]) # Confidence in each read basepair. See below on how to extract such values from read quality strings

base_probs = np.array([0.25] * 4) # Basepair prior probabilities. Here, assuming uniform distribution on nucleotides.

# See below example for more biologically relevant priors

max_gaps = 2 # Maximal number of gaps allowed. Use small values to improve run time

score, cigar = aligner.match(read, read_bp_probs, base_probs, ref, max_gaps) #Aligner returns alignment score and CIGAR string

Setting Up Priors & Qualities

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way of extracting more meaningful basepair priors is by using the nucleotide frequencies observed in reference sequences. Assuming _ref_seq is a string variable holding the reference genome of interest this can be done by:

.. code:: python

def get_priors():

return [(k, v / float(len(_ref_seq))) for (k, v) in collections.Counter(_ref_seq).items()]

Naturally, more intricate priors can easily be designed.

Basepair probabilities cab be extracted from quality strings reported in FASTQ files / read aligner output using this formula:

.. math::

1 - 10^\frac{-q}{10}

where q is the ascii value of the quality character - 33 (assuming qualities are goven in Phred+33)

.. code:: python

def get_read_bp_probs(read_quality_string): # read quality_string: FASTQ quality string as reported in FASTQ file or read aligner output

return 1 - np.power(10, -np.array([ord(e) - 33 for e in a]) / 10.)

As you can see, read_p now holds per basepair probabilities (1 - P_error). For example, for quality string '??CII' output will be [0.999, 0.999, 0.996, 0.9999, 0.9999]

Aligning multiple reads

~~~~~~~~~~~~~~~~~~~~~~~

This package is optimized for multiple alignments, e.g., when iterating over the results of `GEM <http://www.paoloribeca.net/software/GEM/>`_ or a FASTQ file, and checking the score of each result relative to some corresponding reference subsequence. For this reason, to use it, first create an object with the parameters relevant to all matches.

Since reads usually have similar lengths, and since read aligners provide the position in the reference sequence to which the read was aligned, both length of read and reference sequence can be easily bound.

.. code:: python

import allein_zu_haus

import numpy as np

max_read_len = 100

max_ref_len = 2 * max_read_len

mismatch_penalty = 4

gap_open = 6

gep_extend = 3

aligner = allein_zu_haus.Aligner(max_read_len, max_ref_len, mismatch_penalty, gap_open, gap_extend)

Then use the aligner repeatedly for each match, providing only the match specific parameters:

.. code:: python

for read, read_quality_string, ref, max_gaps in ...:

# read, ref should be of type np.array with dtype=bytes

read_bp_probs = get_read_bp_probs(read_quality_string) # Function for exatracting basepair confidence from FASTQ quality strings. See example above.

base_probs = get_priors() # Function for computing priors from reference and/or read sequences. See example above

score, cigar = aligner.match(read, read_bp_probs, base_probs, ref, max_gaps) # multiple calls to aligner, with relevant sequences and priors

.. _mathematical_rationale:

Mathematical Rationale

----------------------

Suppose we wish to find the optimal match between a read *D* and a reference *F*. Unfortunately, we cannot observe *D* directly, and instead only see *D'*, which is the sequence outputted by some imperfect `sequencing process <https://en.wikipedia.org/wiki/DNA_sequencing>`_. Say that at some point in the alignment algorithm we consider whether a nucleotide from *D'* matches a nucleotide from *F*. Define:

* *b* :sub:`D'` is the nucleotide reported by the sequencing process.

* *b* :sub:`D` is the true (unknown) uncleotide.

* *b* :sub:`R` is the reference nucleotide.

The classic algorithm would assign the penalty

.. math::

\mbox{penalty}(b_{D'}, b_R)

whereas the correct penalty should be

.. math::

\mbox{penalty}(b_{D}, b_R) \simeq

\sum_b \left[ P\left( B_D = b | B_{D'} = b_{D'} \right) \cdot \mbox{penalty}(b, b_R) \right]

By `Bayes' Theorem <https://en.wikipedia.org/wiki/Bayes%27_theorem>`_,

.. math::

P\left( B_D = b | B_{D'} = b_{D'} \right)

=

\frac

{

P\left( B_{D'} = b_{D'} | B_D = b \right)

\cdot

P\left( B_D = b \right)

}

{

\sum_{k = 'A', 'C', 'G', 'T}

P\left( B_{D'} = b_{D'} | B_D = k \right)

\cdot

P\left( B_D = k \right)

}

For evaluating these terms, note that

.. math::

P\left( B_D = b \right)

is the prior over the nucleotides (which must be given by the user), and

.. math::

P\left( B_{D'} = b_{D'} | B_D = b \right)

=

\begin{cases}

1 - P_{\mbox{err}} ,& \text{if } b_{D'} = b, \\

\frac{P_{\mbox{err}}}{3}, & \text{otherwise}

\end{cases}

where *P* :sub:`err` is the probability for error determined by the reported quality for this nucleotide.

Issues

------

Feel free to open tickets at `https://bitbucket.org/taliraveh/allein_zu_haus/issues <https://bitbucket.org/taliraveh/allein_zu_haus/issues>`_.

allein_zu_haus: Needleman-Wunsch Quality-Aware Sequence Alignment

=================================================================

This pacakge implements a `Needleman–Wunsch <https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm>`_

global alignment with `affine gap penalties <https://en.wikipedia.org/wiki/Gap_penalty#Affine>`_, taking into account the confidence in each base pair of the read sequence.

The classic Needleman-Wunsch algorithm finds the optimal global match assuming all read nucleotides have been identified with certainty. In practice,

`NGS (next generation sequencing) <http://www.illumina.com/technology/next-generation-sequencing.html>`_ identifies nucleotides with varying levels of qualities, and popular formats, e.g., `FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>`_ are expressely built for describing these qualities. This package modifies the algorithm to take qualities into account (see mathematical_rationale_ below).

This package is designed to be used in conjunction with a read aligner such as `Bowtie2 <http://bowtie-bio.sourceforge.net/bowtie2/index.shtml>`_ or `GEM <http://www.paoloribeca.net/software/GEM/>`_. Ideally, these tools would be configured to take basepair quality into account when searching for hits. Since this is currently not the case, one can use the allein_zu_haus package in order to realign reads to reference sequences at the positions reported by the read aligner, and extract more accurate alignment scores and `CIGAR <https://samtools.github.io/hts-specs/SAMv1.pdf>`_ strings.

The package handles ambiguous base codes by weighing the different options according to the base priors provided. Read sequences must be drawn from the {A, C, G, T, N} alphabet, but reference sequences may contain any basepairs defined by the IUPAC nucleotide ambiguity code. Quality values are ignored for ambiguous read basepairs.

Usage example:

--------------

Minimal Example

~~~~~~~~~~~~~~~

.. code:: python

import allein_zu_haus

import numpy as np

max_read_len = 100 # Max length of read to be aligned

max_ref_len = 2 * max_read_len # Max length of reference subsequence to be globally aligned (where start position is determined by read aligner output)

mismatch_penalty = 4 # Penalty for mismatched nucleotides

gap_open = 6 # Gep opening penalty

gep_extend = 3 # Gap extension penalty

aligner = allein_zu_haus.Aligner(max_read_len, max_ref_len, mismatch_penalty, gap_open, gap_extend)

read = np.array(['A', 'C', 'G', 'T', 'A'], dtype=bytes) # Read sequence, given as numpy array, dtype=bytes

ref = np.array(['A', 'G', 'G', 'T', 'A'], dtype=bytes) # Reference subsequence to be aligned, given as numpy array, dtype=bytes

read_bp_probs = np.array([0.9, 0.99, 0.8, 0.99, 0.99]) # Confidence in each read basepair. See below on how to extract such values from read quality strings

base_probs = np.array([0.25] * 4) # Basepair prior probabilities. Here, assuming uniform distribution on nucleotides.

# See below example for more biologically relevant priors

max_gaps = 2 # Maximal number of gaps allowed. Use small values to improve run time

score, cigar = aligner.match(read, read_bp_probs, base_probs, ref, max_gaps) #Aligner returns alignment score and CIGAR string

Setting Up Priors & Qualities

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One way of extracting more meaningful basepair priors is by using the nucleotide frequencies observed in reference sequences. Assuming _ref_seq is a string variable holding the reference genome of interest this can be done by:

.. code:: python

def get_priors():

return [(k, v / float(len(_ref_seq))) for (k, v) in collections.Counter(_ref_seq).items()]

Naturally, more intricate priors can easily be designed.

Basepair probabilities cab be extracted from quality strings reported in FASTQ files / read aligner output using this formula:

.. math::

1 - 10^\frac{-q}{10}

where q is the ascii value of the quality character - 33 (assuming qualities are goven in Phred+33)

.. code:: python

def get_read_bp_probs(read_quality_string): # read quality_string: FASTQ quality string as reported in FASTQ file or read aligner output

return 1 - np.power(10, -np.array([ord(e) - 33 for e in a]) / 10.)

As you can see, read_p now holds per basepair probabilities (1 - P_error). For example, for quality string '??CII' output will be [0.999, 0.999, 0.996, 0.9999, 0.9999]

Aligning multiple reads

~~~~~~~~~~~~~~~~~~~~~~~

This package is optimized for multiple alignments, e.g., when iterating over the results of `GEM <http://www.paoloribeca.net/software/GEM/>`_ or a FASTQ file, and checking the score of each result relative to some corresponding reference subsequence. For this reason, to use it, first create an object with the parameters relevant to all matches.

Since reads usually have similar lengths, and since read aligners provide the position in the reference sequence to which the read was aligned, both length of read and reference sequence can be easily bound.

.. code:: python

import allein_zu_haus

import numpy as np

max_read_len = 100

max_ref_len = 2 * max_read_len

mismatch_penalty = 4

gap_open = 6

gep_extend = 3

aligner = allein_zu_haus.Aligner(max_read_len, max_ref_len, mismatch_penalty, gap_open, gap_extend)

Then use the aligner repeatedly for each match, providing only the match specific parameters:

.. code:: python

for read, read_quality_string, ref, max_gaps in ...:

# read, ref should be of type np.array with dtype=bytes

read_bp_probs = get_read_bp_probs(read_quality_string) # Function for exatracting basepair confidence from FASTQ quality strings. See example above.

base_probs = get_priors() # Function for computing priors from reference and/or read sequences. See example above

score, cigar = aligner.match(read, read_bp_probs, base_probs, ref, max_gaps) # multiple calls to aligner, with relevant sequences and priors

.. _mathematical_rationale:

Mathematical Rationale

----------------------

Suppose we wish to find the optimal match between a read *D* and a reference *F*. Unfortunately, we cannot observe *D* directly, and instead only see *D'*, which is the sequence outputted by some imperfect `sequencing process <https://en.wikipedia.org/wiki/DNA_sequencing>`_. Say that at some point in the alignment algorithm we consider whether a nucleotide from *D'* matches a nucleotide from *F*. Define:

* *b* :sub:`D'` is the nucleotide reported by the sequencing process.

* *b* :sub:`D` is the true (unknown) uncleotide.

* *b* :sub:`R` is the reference nucleotide.

The classic algorithm would assign the penalty

.. math::

\mbox{penalty}(b_{D'}, b_R)

whereas the correct penalty should be

.. math::

\mbox{penalty}(b_{D}, b_R) \simeq

\sum_b \left[ P\left( B_D = b | B_{D'} = b_{D'} \right) \cdot \mbox{penalty}(b, b_R) \right]

By `Bayes' Theorem <https://en.wikipedia.org/wiki/Bayes%27_theorem>`_,

.. math::

P\left( B_D = b | B_{D'} = b_{D'} \right)

=

\frac

{

P\left( B_{D'} = b_{D'} | B_D = b \right)

\cdot

P\left( B_D = b \right)

}

{

\sum_{k = 'A', 'C', 'G', 'T}

P\left( B_{D'} = b_{D'} | B_D = k \right)

\cdot

P\left( B_D = k \right)

}

For evaluating these terms, note that

.. math::

P\left( B_D = b \right)

is the prior over the nucleotides (which must be given by the user), and

.. math::

P\left( B_{D'} = b_{D'} | B_D = b \right)

=

\begin{cases}

1 - P_{\mbox{err}} ,& \text{if } b_{D'} = b, \\

\frac{P_{\mbox{err}}}{3}, & \text{otherwise}

\end{cases}

where *P* :sub:`err` is the probability for error determined by the reported quality for this nucleotide.

Issues

------

Feel free to open tickets at `https://bitbucket.org/taliraveh/allein_zu_haus/issues <https://bitbucket.org/taliraveh/allein_zu_haus/issues>`_.