Ultima Genomics comparison scripts
Project description
ugbio_comparison
This module includes comparison python scripts and utilities for bioinformatics pipelines. It provides tools for comparing VCF callsets against ground truth datasets, with support for both small variant, structural variant (SV) and copy number (CNV) comparisons.
Overview
The comparison module is built on top of ugbio_core and provides command-line tools for variant benchmarking against ground truth datasets. For detailed usage of each tool, see the CLI Scripts section below.
The comparison module provides two main CLI scripts for variant comparison:
- run_comparison_pipeline - Compare small variant callsets to ground truth using VCFEVAL
- sv_comparison_pipeline - Compare structural variant (SV) callsets using Truvari
Installation
To install the comparison module with all dependencies:
pip install ugbio-comparison
The tool can also be run from the docker image in dockerhub ultimagenomics/ugbio_comparison.
CLI Scripts
1. run_comparison_pipeline
Compare VCF callsets to ground truth using VCFEVAL as the comparison engine. This pipeline supports, annotation with various genomic features, and detailed concordance analysis of specific variant types (downstream).
Purpose
- Compare variant calls against a ground truth dataset
- Generate concordance metrics (TP, FP, FN)
- Annotate variants with coverage, mappability, and other genomic features
- Annotate variants with properties like SNV/Indel/homopolymer Indel etc.
Usage
run_comparison_pipeline \
--input_prefix <input_vcf_prefix> \
--output_file <output_h5_file> \
--output_interval <output_bed_file> \
--gtr_vcf <ground_truth_vcf> \
--highconf_intervals <high_confidence_bed> \
--reference <reference_fasta> \
--call_sample_name <sample_name> \
--truth_sample_name <truth_sample_name> \
Key Parameters
--input_prefix: Prefix of the input VCF file(s)--output_file: Output HDF5 file containing concordance results--output_interval: Output BED file of intersected intervals--gtr_vcf: Ground truth VCF file for comparison (e.g. GIAB VCF)--cmp_intervals: Optional regions for comparison (BED/interval_list)--highconf_intervals: High confidence intervals (e.g. GIAB HCR BED)--reference: Reference genome FASTA file--reference_dict: Reference genome dictionary file--call_sample_name: Name of the call sample--truth_sample_name: Name of the truth sample
Optional Parameters
--coverage_bw_high_quality: Input BigWig file with high MAPQ coverage--coverage_bw_all_quality: Input BigWig file with all MAPQ coverage--annotate_intervals: Interval files for annotation (can be specified multiple times)--runs_intervals: Homopolymer runs intervals (BED file), used for annotation of closeness to homopolymer indel--ignore_filter_status: Ignore variant filter status--enable_reinterpretation: Enable variant reinterpretation (i.e. reinterpret variants using likely false hmer indel)--scoring_field: Alternative scoring field to use (copied to TREE_SCORE)--flow_order: Sequencing flow order (4 cycle, TGCA)--n_jobs: Number of parallel jobs for chromosome processing (default: -1 for all CPUs)--use_tmpdir: Store temporary files in temporary directory--verbosity: Logging level (ERROR, WARNING, INFO, DEBUG)
Output Files
- HDF5 file (
output_file): Contains concordance dataframes with classifications (TP, FP, FN)concordancekey: Main concordance resultsinput_argskey: Input parameters used- Per-chromosome keys (for whole-genome mode)
- BED files: Generated from concordance results for visualization
Example
run_comparison_pipeline \
--input_prefix /data/sample.filtered \
--output_file /results/sample.comp.h5 \
--output_interval /results/sample.comp.bed \
--gtr_vcf /reference/HG004_truth.vcf.gz \
--highconf_intervals /reference/HG004_highconf.bed \
--reference /reference/Homo_sapiens_assembly38.fasta \
--call_sample_name SAMPLE-001 \
--truth_sample_name HG004 \
--n_jobs 8 \
--verbosity INFO \
2. sv_comparison_pipeline
Compare structural variant (SV) callsets using Truvari for benchmarking. This pipeline collapses VCF files, runs Truvari bench, and generates concordance dataframes.
Purpose
- Compare SV calls against a ground truth dataset using Truvari
- We recommend using SV ground truth callsets from NIST as the source of truth
- Collapse overlapping variants before comparison
- Generate detailed concordance metrics for SVs
- Support for different SV types (DEL, INS, DUP, etc.)
- Output results in HDF5 format with base and calls concordance
Usage
sv_comparison_pipeline \
--calls <input_calls_vcf> \
--gt <ground_truth_vcf> \
--hcr_bed <high confidence bed> \
--output_filename <output_h5_file> \
--outdir <truvari_output_dir>
Key Parameters
--calls: Input calls VCF file--gt: Input ground truth VCF file--output_filename: Output HDF5 file with concordance results--outdir: Full path to output directory for Truvari results
Optional Parameters
--hcr_bed: High confidence region BED file--pctseq: Percentage of sequence identity (default: 0.0)--pctsize: Percentage of size identity (default: 0.0)--maxsize: Maximum size for SV comparison in bp (default: 50000, use -1 for unlimited)--custom_info_fields: Custom INFO fields to read from VCFs (can be specified multiple times)--ignore_filter: Ignore FILTER field in VCF (removes --passonly flag from Truvari)--skip_collapse: Skip VCF collapsing step for calls (ground truth is always collapsed)--verbosity: Logging level (default: INFO)
Output files
- HDF5 file (
output_filename): Contains two concordance dataframes:basekey: Ground truth concordance (TP, FN)callskey: Calls concordance (TP, FP)
- Truvari directory (
outdir): Contains Truvari bench results:tp-base.vcf.gz: True positive variants in ground truthtp-comp.vcf.gz: True positive variants in callsfn.vcf.gz: False negative variantsfp.vcf.gz: False positive variantssummary.json: Summary statistics
Example
sv_comparison_pipeline \
--calls /data/sample.sv.vcf.gz \
--gt /reference/HG004_sv_truth.vcf.gz \
--output_filename /results/sample.sv_comp.h5 \
--outdir /results/truvari_output \
--hcr_bed /reference/HG004_sv_highconf.bed \
--maxsize 100000 \
--pctseq 0.7 \
--pctsize 0.7 \
--verbosity INFO
CNV Comparison
For copy number variant (CNV) comparisons, consider using a larger --maxsize value or -1 for unlimited:
sv_comparison_pipeline \
--calls /data/sample.cnv.vcf.gz \
--gt /reference/truth.cnv.vcf.gz \
--output_filename /results/sample.cnv_comp.h5 \
--outdir /results/truvari_cnv \
--maxsize -1 \
--ignore_filter
Dependencies
The following binary tools are included in the Docker image and need to be installed for standalone running:
- bcftools 1.20 - VCF/BCF manipulation
- samtools 1.20 - SAM/BAM/CRAM manipulation
- bedtools 2.31.0 - Genome interval operations
- bedops - BED file operations
- GATK 4.6.0.0 - Genome Analysis Toolkit
- Picard 3.3.0 - Java-based command-line tools for manipulating high-throughput sequencing data
- RTG Tools 3.12.1 - Provides VCFEVAL for variant comparison
Notes
- For best performance with large genomes, use parallel processing (
--n_jobs) - The
run_comparison_pipelinesupports both single-interval and whole-genome modes - VCFEVAL requires an SDF index of the reference genome
- Truvari comparison includes automatic VCF collapsing and sorting
- Use
--ignore_filter_statusor--ignore_filterto compare all variants regardless of FILTER field
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ugbio_comparison-1.24.1.tar.gz.
File metadata
- Download URL: ugbio_comparison-1.24.1.tar.gz
- Upload date:
- Size: 21.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a8bf968fd7158db905f0b59a74b199701e9bd8b9ebab587c5e4acbe95d3e65de
|
|
| MD5 |
e971b0397a689a0a110ceca024b9fb50
|
|
| BLAKE2b-256 |
732eceb3892dc0b4b59bc58d81a1905d6e1c070d14de85582a3c5666a65804bc
|
File details
Details for the file ugbio_comparison-1.24.1-py3-none-any.whl.
File metadata
- Download URL: ugbio_comparison-1.24.1-py3-none-any.whl
- Upload date:
- Size: 21.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c32a6f8bef9bdb7d0fdf558b052d8e05c1acd7db73742a979d757231ea879111
|
|
| MD5 |
ba1df4a29e15e8420af34845bebf3742
|
|
| BLAKE2b-256 |
2fdc88c70770d88d56ce2deefc3020777b0071c89c98b8f694eab172a527307a
|