
Check the accuracy of one VCF callset against another

Project description

A benchmarking toolkit for variant calling. For a full description of the SMaSH toolkit, including our datasets, results, and instructions for running SMaSH on EC2, see our website at smash.cs.berkeley.edu.

If you have any questions, please visit our support forum at https://groups.google.com/forum/#!forum/smash-benchmarking.

SMaSH Overview

The simplest SMaSH use case is to compare the VCF output of a variant calling algorithm against a validated truth set. We do this by calling the bench.py script with three main arguments: the ground truth VCF, the predicted VCF, and a reference FASTA.
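In the simplest case, the call might look something like the following (the file names here are placeholders, and the exact argument order and flags should be checked against the SMaSH documentation on the website above):

```shell
# Placeholder invocation: file names and argument order are illustrative only.
python bench.py true_variants.vcf predicted_variants.vcf reference.fasta
```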

## Evaluation

SMaSH evaluates three main types of variants: SNPs, indels, and structural variants. Indels are defined as variants shorter than 50 base pairs; longer variants are considered structural variants. Indels are further subdivided into insertions, deletions, inversions, and others; structural variants are likewise broken into insertions, deletions, and others.
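The length-based classification above can be sketched as follows (an illustration of the rule, not SMaSH's actual code; the boundary behavior at exactly 50 bp is an assumption):

```python
def classify(ref, alt, sv_threshold=50):
    """Classify a ref/alt allele pair as 'SNP', 'indel', or 'SV'."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    # Variants shorter than the threshold are indels; longer ones are SVs.
    if max(len(ref), len(alt)) < sv_threshold:
        return "indel"
    return "SV"

print(classify("A", "G"))             # SNP
print(classify("ATG", "A"))           # indel (2 bp deletion)
print(classify("A", "A" + "T" * 60))  # SV (60 bp insertion)
```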

First, SMaSH checks all locations where both the true VCF and predicted VCF have variants. SNPs and indels are considered matches if the reference allele and first alternative allele are the same. Structural variants are considered matches if they are the same type of variant (insertion, deletion, or many-to-many) and if their position is within a certain breakpoint tolerance of the ground truth variant. (The SV tolerance can be specified when calling SMaSH; the default value is 50 base pairs.)
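The two matching rules can be sketched like this (function and field names are hypothetical, not SMaSH's API):

```python
def snp_indel_match(true_var, pred_var):
    """SNPs/indels match when ref and first alt allele agree at a site."""
    return (true_var["ref"] == pred_var["ref"]
            and true_var["alts"][0] == pred_var["alts"][0])

def sv_match(true_var, pred_var, tolerance=50):
    """SVs match on type, with positions within the breakpoint tolerance."""
    return (true_var["type"] == pred_var["type"]
            and abs(true_var["pos"] - pred_var["pos"]) <= tolerance)

t = {"pos": 1000, "ref": "A", "alts": ["T"]}
p = {"pos": 1000, "ref": "A", "alts": ["T"]}
print(snp_indel_match(t, p))  # True

t_sv = {"pos": 5000, "type": "DEL"}
p_sv = {"pos": 5030, "type": "DEL"}
print(sv_match(t_sv, p_sv))   # True: 30 bp offset is within the 50 bp default
```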

However, the same underlying variants can be represented in different ways in the VCF format. SMaSH addresses this ambiguity through its rescue algorithm: for each variant in the truth VCF not matched in the predicted VCF, SMaSH expands the sequence described by the true and predicted callsets to a defined window size (which can be specified at runtime and defaults to 100 bp). If the sequences match, the true callset variants are marked as true positives, and the predicted callset variants are removed from the set of false positives.
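A toy illustration of the rescue idea: two different VCF representations of the same deletion yield identical sequence once applied to the reference window. This is a simplification of the rescue algorithm, not SMaSH's actual code:

```python
def apply_variant(window, pos, ref, alt):
    """Apply one variant to a reference window (pos is 0-based in window)."""
    assert window[pos:pos + len(ref)] == ref
    return window[:pos] + alt + window[pos + len(ref):]

window = "GGCACACAGG"  # reference with a CA repeat
# Representation 1: delete the first CA unit of the repeat.
hap1 = apply_variant(window, 2, "CAC", "C")
# Representation 2: delete the second CA unit.
hap2 = apply_variant(window, 4, "CAC", "C")
print(hap1 == hap2)  # True -> the unmatched true call would be "rescued"
```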

## Error rates

For non-synthetic datasets, the SMaSH output should express the likelihood that a given call in the ground truth VCF is incorrect. For example, if we bound the rate of incorrect calls for indels at 1 in 500, we pass that information to the benchmarking script with the option --indel_err 0.002 (SNP and SV error rates are passed similarly with --snp_err and --sv_err). These error rates are used to calculate error bars on the precision and recall rates in SMaSH output.
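To illustrate how a truth-set error rate could translate into bounds, here is one plausible arithmetic sketch: if up to a fraction of the truth calls may be wrong, recall is bracketed by assuming those doubtful calls were all matched or all missed. This is only an illustration of the idea; SMaSH's exact formula may differ.

```python
def recall_bounds(tp, fn, truth_err):
    """Bound recall when a fraction truth_err of truth calls may be wrong."""
    total = tp + fn                   # calls in the ground truth VCF
    recall = tp / total
    doubtful = truth_err * total      # truth calls that may be incorrect
    low = max(tp - doubtful, 0) / total       # worst case: they were matched
    high = min(tp / (total - doubtful), 1.0)  # best case: they were missed
    return low, recall, high

# 990 matched out of 1000 truth indels, with a 1-in-500 truth error rate:
low, point, high = recall_bounds(tp=990, fn=10, truth_err=0.002)
print(f"recall = {point:.3f} (bounds {low:.3f}-{high:.3f})")
```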

## Outputs

After evaluation, SMaSH reports results as human-readable text (default) or as tab-delimited output. The results include counts of true positives, false positives, and false negatives; as well as precision and recall calculations with bounded error based on the error rates passed in at runtime; and some genotyping accuracy metrics. SMaSH can also output a fully annotated VCF file containing all variants from both the true and predicted VCF input files, tagged as true positive, false positive, false negative, or rescued.

## Known False Positive Mode

For cases in which we do not have validated truth calls for the entire genome, SMaSH may be called with an additional VCF of known false positives. With this option, SMaSH will further mark predicted calls as conflicting with known false positives if they are at the same location, of the same variant type, and have the same ref sequence. Precision is also calculated using this false positive count.
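The three-part check described above can be sketched as follows (field names are hypothetical, not SMaSH's internal representation):

```python
def conflicts_with_known_fp(pred, known_fps):
    """True if a predicted call matches a known false positive on
    location, variant type, and ref sequence."""
    return any(pred["chrom"] == fp["chrom"]
               and pred["pos"] == fp["pos"]
               and pred["type"] == fp["type"]
               and pred["ref"] == fp["ref"]
               for fp in known_fps)

known = [{"chrom": "1", "pos": 12345, "type": "SNP", "ref": "G"}]
call = {"chrom": "1", "pos": 12345, "type": "SNP", "ref": "G"}
print(conflicts_with_known_fp(call, known))  # True
```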

If the ground truth data is not comprehensive, but we do not have a set of known false positive variants in VCF format, we can hide false positive-related stats in the output with the --hide_fp flag.

## Normalization

SMaSH also includes a normalization script which first cleans the VCF by upper-casing all alleles, removing homozygous reference and monomorphic allelic calls, and so on. The script then left-normalizes all calls as much as possible as a step toward standardizing ambiguous calls in a repetitive sequence. Although left-normalization is not strictly necessary to run SMaSH, it improves accuracy and reduces runtime by reducing the number of times the rescue algorithm needs to be called. Normalization can be performed as part of the evaluation script call by adding the --normalize option.
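A minimal left-normalization sketch, assuming the standard left-alignment procedure (trim shared trailing bases, extending left through the reference when an allele empties, then trim shared leading bases); it is illustrative, not SMaSH's script:

```python
def left_normalize(ref_seq, pos, ref, alt):
    """Left-align one variant; pos is 0-based into ref_seq."""
    ref, alt = ref.upper(), alt.upper()  # mirrors the cleaning step
    # While the alleles share their last base, trim it; if an allele empties,
    # prepend the previous reference base to both and shift left.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
        if not ref or not alt:
            pos -= 1
            base = ref_seq[pos].upper()
            ref, alt = base + ref, base + alt
    # Trim shared leading bases while both alleles stay non-empty.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

# Two representations of the same CA deletion normalize identically:
seq = "GGCACACAGG"
print(left_normalize(seq, 4, "CAC", "C"))  # (1, 'GCA', 'G')
print(left_normalize(seq, 2, "CAC", "C"))  # (1, 'GCA', 'G')
```

After normalization, the two ambiguous representations above become byte-identical, so the evaluator can match them directly without invoking the rescue algorithm.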

## Assumptions and Limitations

  • There is presently no way to limit parts of the genome evaluated (e.g. the exome, etc).

  • At present SMaSH does not evaluate phasing.

  • SMaSH does not handle compound heterozygous variants.
