Needleman-Wunsch quality-aware sequence alignmenti, primarily for use with a full aligner like GEM.
Project description
This pacakge implements a Needleman–Wunsch global alignment with affine gap penalties, taking into account the confidence in each base pair of the read sequence.
The classic Needleman-Wunsch algorithm finds the optimal global match assuming all read nucleotides have been identified with certainty. In practice, NGS (next generation sequencing) identifies nucleotides with varying levels of qualities, and popular formats, e.g., FASTQ are expressely built for describing these qualities. This package modifies the algorithm to take qualities into account (see mathematical_rationale below).
This package is designed to be used in conjunction with a read aligner such as Bowtie2 or GEM. Ideally, these tools would be configured to take nucleotide quality into account when searching for hits. Since this is currently not the case, one can use the allein_zu_haus package in order to realign reads to reference sequences at the positions reported by the read aligner, and extract more accurate alignment scores and CIGAR strings.
Usage
To use this package, you would typically do three things:
estimate, per read, the nucleotide priors and read probabilities from the qualities
use an Aligner object to match a read to a reference
use 1. and 2. repeatedly for many pairs of reads and references
These steps are outlined below
Setting Up Priors & Qualities
For using this package, you need to provide as input the apiori probabilities of nucleotides, and the probabilities of the read symbols.
Perhaps the simplest approximation of nucleotide priors is the empirical distribution in the reference sequence. If ref is the reference sequence, then this can be found with:
import collections
return [(k, v / float(ref)) for (k, v) in collections.Counter(ref).items()]
Of course, more intricate priors can easily be designed.
Nucleotide probabilities cab be extracted from quality strings reported in FASTQ files / read aligner output using the formula
where q is the ascii value of the quality character - 33 (assuming qualities are given in Phred+33). In Python, assuming the read is in read, then the following finds the probabilities:
1 - np.power(10, -np.array([ord(e) - 33 for e in read]) / 10.)
For example, for quality string ‘??CII’, this will be be [0.999, 0.999, 0.996, 0.9999, 0.9999].
Matching a read/ref pair
Once you have all the inputs, you can match a pair of read and ref sequences.
import allein_zu_haus
import numpy as np
# Max length of read to be aligned
max_read_len = 100
# Max length of reference subsequence to be globally aligned (where start position is determined by read aligner output)
max_ref_len = 2 * max_read_len
# Penalty for mismatched nucleotides
mismatch_penalty = 4
# Gap opening penalty
gap_open_penalty = 6
# Gap extension penalty
gep_extend_penalty = 3
# Set up an aligner object
aligner = allein_zu_haus.Aligner(
max_read_len,
max_ref_len,
mismatch_penalty,
gap_open_penalty,
gap_extend_penalty)
# Assume read, and ref are sequences, calculate:
# * read_probs is the numpy array described above
# * base_probs is the dictionary described above
# Aligner.match(...)` returns alignment score and CIGAR string
score, cigar = aligner.match(read, read_probs, base_probs, ref)
Aligning multiple reads
When working with many read-alignment pairs (e.g., when iterating over the results of GEM or a FASTQ file), you would typically set up a single Aligner object, then iterate over each pair and call match.
import allein_zu_haus
import numpy as np
# Max length of read to be aligned
max_read_len = 100
# Max length of reference subsequence to be globally aligned (where start position is determined by read aligner output)
max_ref_len = 2 * max_read_len
# Penalty for mismatched nucleotides
mismatch_penalty = 4
# Gep opening penalty
gap_open_penalty = 6
# Gap extension penalty
gep_extend_penalty = 3
# Set up an aligner object
aligner = allein_zu_haus.Aligner(
max_read_len,
max_ref_len,
mismatch_penalty,
gap_open_penalty,
gap_extend_penalty)
# Iterate over all matches.
for read, read_quality_string, ref in ...:
# Calculate all probabilities for this match.
score, cigar = aligner.match(read, read_probs, base_probs, ref)
Mathematical Rationale
Suppose we wish to find the optimal match between a read D and a reference F. Unfortunately, we cannot observe D directly, and instead only see D’, which is the sequence outputted by some imperfect sequencing process. Say that at some point in the alignment algorithm we consider whether a nucleotide from D’ matches a nucleotide from F. Define:
b D’ is the nucleotide reported by the sequencing process.
b D is the true (unknown) uncleotide.
b R is the reference nucleotide.
The classic algorithm would assign the penalty
whereas the correct penalty should be
By Bayes’ Theorem,
For evaluating these terms, note that
is the prior over the nucleotides (which must be given by the user), and
where P err is the probability for error determined by the reported quality for this nucleotide.
Issues
Feel free to open tickets at https://bitbucket.org/taliraveh/allein_zu_haus/issues.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.