An algebraic parallel SNV caller using OpenCL

## Project description

ALPACA is a single nucleotide variant caller for next-generation sequencing providing intuitive control over the false discovery rate with generic sample filtering scenarios, leveraging OpenCL on CPU, GPU or any coprocessor to speed up calculations and an using HDF5 based persistent storage for iterative refinement of analyses within seconds.

## Introduction

In the following, advantages of ALPACA are outlined in detail. It solves three important challenges with current NGS variant calling pipelines.

• Often, variant calling entails filtering different samples against each other, e.g. disease samples vs. healthy samples, tumor vs. normal or children vs. parents. Usually, the null hypothesis of state of the art variant callers does not reflect the filtering. Hence, accurately controlling the false discovery rate (FDR) based on the provided variant qualities becomes impossible. We call this the FDR problem.

• Sample based filtering suffers from insufficient evidence in the filter samples. This can lead to variants in the target sample to remain although they are present, but badly covered in a filter sample. We call this the insufficient evidence problem.

• Joint variant calling can help with the insufficient evidence problem. Here, a group of samples is called together. However, in studies where new samples shall be added from time to time, this entails redundant computations since the joint calling of probably tens of samples has to repeated with a single new one added. We call this the N+1 problem.

Some of these problems are solved for special cases (e.g. paired tumor-normal calling). For generic studies, no variant caller exists to solve all three of them. ALPACA offers the following solution.

• It uses a novel algebraic variant calling model, that allows to incorporate filtering scenarios (like disease samples vs. healthy samples, tumor vs. normal, children vs. parents) into the calling. When calling, a filtering scenario can be specified with an algebraic expression like A - (B + C) with A, B and C being samples. Algebraic calling allows ALPACA to report posterior probabilities for a variant to occur in the unknown set of true variants in A that are not in B or C here. Since the probabilities reflect the filtering, they can be directly used to intuitively control the false discovery rate.

• Since the sample based filtering is integrated into the calling, insufficient evidence in a filter sample is less an issue with ALPACA, even low coverage variants will still affect the resulting posterior probability.

• ALPACA splits variant calling into a preprocessing step and the actual calling. Preprocessed samples are stored in HDF5 index data structures. In a lightweight and massively parallel step, the sample indexes are merged into an optimized index. On the optimized index, variant calling becomes a matter of seconds. Upon the addition of a sample, merging and the calling have to be repeated. The sample indexes of the other samples remain untouched, avoiding redundant computations.

Algorithmic and mathematical details will be described in my thesis:

Parallelization, Scalability and Reproducibility in Next-Generation Sequencing Analysis, Johannes Köster, 2015 (work in progress)

## Prerequisites

ALPACA needs

• Linux

• Python >= 3.3

• Numpy >= 1.7

• PyOpenCL >= 2013.1

• a working OpenCL device (CPU, GPU, a coprocessor like Intel Xeon Phi or an FPGA)

Python 3 should be installed on most systems. On Debian and Ubuntu you can make it ready for ALPACA by issueing:

$sudo apt-get install python3-setuptools python3-numpy Without admin rights, we recommend to use userspace Python 3 distribution like https://store.continuum.io/cshop/anaconda. If you want to use ALPACA on the GPU, a decent NVIDIA or AMD GPU with the proprietary drivers installed should be enough. On Ubuntu and Debian, you can install them via: $ sudo apt-get install nvidia-current

or:

## Usage

Usage of ALPACA consist of three major steps.

• sample indexing

• index merging

• calling

Given mapped reads for a sample A in BAM format and a reference genome in FASTA format, a sample index can be created with:

$alpaca index reference.fasta A.bam A.hdf5 Here, various parameters like the expected ploidy of the sample can be adjusted. The resulting index A.hdf5 will be much smaller than the BAM file. Merging indexes for samples A and B is achieved with: $ alpaca merge A.hdf5 B.hdf5 all.hdf5

Finally, calling can be performed on the merged index. ALPACA allows to specify query expressions at the command line by representing the union operator with a plus sign and the difference operator with a minus sign. The variant calls are streamed out in VCF:

$alpaca call --fdr 0.05 all.hdf5 A-B > calls.vcf Here, we limit the desired rate of false discoveries to 5%. To assess the biological importance of a variant, it is useful to annotate it with additional information like the gene it may be contained in, its effect on a protein that is encoded by the gene or whether it is already known and maybe associated to some disease. ALPACA can annotate a VCF file with such information, using the Ensembl Variant Effect Predictor web service. Since the VCF format is rather technical, ALPACA can compose a human readable HTML file summarizing the calls. We can combine the two commands using Unix pipes: $ alpaca annotate < calls.vcf | alpaca show > calls.html

For further information on various parameters of all steps (e.g. how to select the compute device) can be obtained with:

\$ alpaca --help

## News

 30 Nov 2015 Release 0.2 of ALPACA. This initial release provides all functionality descibed in my thesis “Parallelization, Scalability and Reproducibility in Next-Generation Sequencing Analysis”.

## Project details

Uploaded source