Skip to main content

Structural variant comparison tool for VCFs

Project description

████████╗██████╗ ██╗   ██╗██╗   ██╗ █████╗ ██████╗ ██╗
╚══██╔══╝██╔══██╗██║   ██║██║   ██║██╔══██╗██╔══██╗██║
   ██║   ██████╔╝██║   ██║██║   ██║███████║██████╔╝██║
   ██║   ██╔══██╗██║   ██║╚██╗ ██╔╝██╔══██║██╔══██╗██║
   ██║   ██║  ██║╚██████╔╝ ╚████╔╝ ██║  ██║██║  ██║██║
   ╚═╝   ╚═╝  ╚═╝ ╚═════╝   ╚═══╝  ╚═╝  ╚═╝╚═╝  ╚═╝╚═╝

Structural variant comparison tool for VCFs

Given benchmark and comparsion sets of SVs, calculate the recall, precision, and f-measure.

Spiral Genetics

Motivation

UPDATES

Truvari has some big changes. In order to keep up with the retirement of Python 2.7 https://pythonclock.org/ We're now only supporting Python 3.

Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation below. This will enable us to refactor the code for easier maintenance and reusability.

Finally, we now automatically report genotype comparisons in the summary stats.

Installation

Truvari uses Python 3.7 and can be installed with pip:

$ pip install Truvari

Quick start

$ truvari -b base_calls.vcf -c compare_calls.vcf -o output_dir/

Outputs

  • tp-call.vcf -- annotated true positive calls from the COMP
  • tp-base.vcf -- anotated true positive calls form the BASE
  • fn.vcf -- false negative calls from BASE
  • fp.vcf -- false positive calls from COMP
  • base-filter.vcf -- size filtered calls from BASE
  • call-filter.vcf -- size filtered calls from COMP
  • summary.txt -- json output of performance stats
  • log.txt -- run log
  • giab_report.txt -- (optional) Summary of GIAB benchmark calls. See "Using the GIAB Report" below.

summary.txt

The following stats are generated for benchmarking your call set.

MetricDefinition
TP-baseNumber of matching calls from the base vcf
TP-callNumber of matching calls from the comp vcf
FPNumber of non-matching calls from the comp vcf
FNNumber of non-matching calls from the base vcf
precisionTP-call / (TP-call + FP)
recallTP-base / (TP-base + FN)
f1(recall * precision) / (recall + precision)
base cntNumber of calls in the base vcf
call cntNumber of calls in the comp vcf
base size filteredNumber of base vcf calls outside of (sizemin, sizemax)
call size filteredNumber of comp vcf calls outside of (sizemin, sizemax)
base gt filteredNumber of base calls not passing the no-ref parameter filter
call gt filteredNumber of comp calls not passing the no-ref parameter filter
TP-call_TP-gtTP-call's with genotype match
TP-call_FP-gtTP-call's without genotype match
TP-base_TP-gtTP-base's with genotype match
TP-base_FP-gtTP-base's without genotype match
gt_precisionTP-call_TP-gt / (TP-call_TP-gt + FP + TP-call_FP-gt)
gt_recallTP-base_TP-gt / (TP-base_TP-gt / FN)
gt_f1(gt_recall * gt_precision) / (gt_recall + gt_precision)

Methodology

Input:
    BaseCall - Benchmark TruthSet of SVs
    CompCalls - Comparison SVs from another program
Build IntervalTree of CompCalls
For each BaseCall:
  Fetch CompCalls overlapping within *refdist*. 
    If typematch and LevDistRatio >= *pctsim* \
    and SizeRatio >= *pctsize* and PctRecOvl >= *pctovl*: 
      Add CompCall to list of Neighbors
  Sort list of Neighbors by TruScore ((2*sim + 1*size + 1*ovl) / 3.0)
  Take CompCall with highest TruScore and BaseCall as TPs
  Only use a CompCall once if not --multimatch
  If no neighbors: BaseCall is FN
For each CompCall:
  If not used: mark as FP

Matching Parameters

ParameterDefaultDefinition
refdist500 Maximum distance comparison calls must be within from base call's start/end
pctsim0.7 Levenshtein distance ratio between the REF/ALT haplotype sequences of base and comparison call. See "Comparing Haplotype Sequences of Variants" below.
pctsize0.7 Ratio of min(base_size, comp_size)/max(base_size, comp_size)
pctovl0.0 Ratio of two calls' (overlapping bases)/(longest span)
typeignoreFalse Types don't need to match to compare calls.

Comparing VCFs without sequence resolved calls

If the base or comp vcfs do not have sequence resolved calls, simply set --pctsim=0 to turn off sequence comparison.

Difference between --sizemin and --sizefilt

--sizemin is the minimum size of a base call or comparison call to be considered.

--sizefilt is the minimum size of a call to be added into the IntervalTree for searching. It should be less than sizemin for edge case variants.

For example: sizemin is 50 and sizefilt is 30. A 50bp base call is 98% similar to a 49bp call at the same position.

These two calls should be considered matching. If we instead removed calls less than sizemin, we'd incorrectly classify the 50bp base call as a false negative.

This does have the side effect of artificially inflating specificity. If that same 49bp call in the above were below the similarity threshold, it would not be classified as a FP due to the sizemin threshold. So we're giving the call a better chance to be useful and less chance to be detrimental to final statistics.

Definition of annotations added to TP vcfs

Anno Definition
TruScore Truvari score for similarity of match. `((2*sim + 1*size + 1*ovl) / 3.0)`
PctSeqSimilarity Pct sequence similarity between this variant and its closest match
PctSizeSimilarity Pct size similarity between this variant and it's closest match
PctRecOverlap Percent reciprocal overlap of the two calls' coordinates
StartDistance Distance of this call's start from matching call's start
EndDistance Distance of this call's end from matching call's end
SizeDiff Difference in size(basecall) and size(compcall)
NumNeighbors Number of comparison calls that were in the neighborhood (REFDIST) of the base call
NumThresholdNeighbors Number of comparison calls that passed threshold matching of the base call

NumNeighbors and NumThresholdNeighbors are also added to the FN vcf.

Using the GIAB Report

When running against the GIAB SV benchmark (link below), you can create a detailed report of calls summarized by the GIAB VCF's SVTYPE, SVLEN, Technology, and Repeat annotations.

To create this report.

  1. Run truvari with the flag --giabreport.
  2. In your output directory, you will find a file named giab_report.txt.
  3. Next, make a copy of the Truvari Report Template Google Sheet.
  4. Finally, paste ALL of the information inside giab_report.txt into the "RawData" tab. Be careful not to alter the report text in any way. If successul, the "Formatted" tab you will have a fully formated report.

While Truvari can use other benchmark sets, this formatted report currently only works with GIAB SV v0.5 and v0.6. Work will need to be done to ensure Truvari can parse future GIAB SV releases.

GIAB v0.6 Download Link

Include Bed & VCF Header Contigs

If an --includebed is provided, only base and comp calls contained within the defined regions are used for comparison. This is similar to pre-filtering your base/comp calls with:

(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz

with the exception that Truvari requires the start and the end to be contained in the same includebed region whereas bedtools intersect does not.

If an --includebed is not provided, the comparison is restricted to only the contigs present in the base VCF header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary statistics and will not be present in any output vcfs.

Comparing Haplotype Sequences of Variants

To compare the sequence similarity, build the haplotypes over the range of min(call starts)-max(call ends) and build the sequence change from the variants. For example:

hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq

Where a1_seq1 is the longer of the REF or ALT allele.

More Information

Find more details and discussions about Truvari on the WIKI page.

Spiral Genetics

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Truvari-1.3.tar.gz (21.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

Truvari-1.3-py3-none-any.whl (23.6 kB view details)

Uploaded Python 3

File details

Details for the file Truvari-1.3.tar.gz.

File metadata

  • Download URL: Truvari-1.3.tar.gz
  • Upload date:
  • Size: 21.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.6.7

File hashes

Hashes for Truvari-1.3.tar.gz
Algorithm Hash digest
SHA256 9bbf6b98c632b4de4f97791cfaca9af14541d773f29a3b847dfd5edda678dc25
MD5 0e39ce8a073b4416094118f5bfb897e2
BLAKE2b-256 be97b8b8fcf4885828db1f632c57a7ee947445f155538692ad606a9273e870a0

See more details on using hashes here.

File details

Details for the file Truvari-1.3-py3-none-any.whl.

File metadata

  • Download URL: Truvari-1.3-py3-none-any.whl
  • Upload date:
  • Size: 23.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/2.0.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.6.7

File hashes

Hashes for Truvari-1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 ec7803f8b4578caedc7f2c7c94ff07fd8fb1b398fc8cd3847b71556070d3eb61
MD5 cf8d1e0fb447b0b930d4038862e3f3be
BLAKE2b-256 03d8a703bd1189689d185d4bcd1bf27fc824d2bd5e9c234d9de367a00b301728

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page