Ultima Genomics comparison scripts

ugbio_comparison

This module provides Python comparison scripts and utilities for bioinformatics pipelines. It includes tools for comparing VCF callsets against ground truth datasets, with support for small variant, structural variant (SV), and copy number variant (CNV) comparisons.

Overview

The comparison module is built on top of ugbio_core and provides command-line tools for benchmarking variant calls against ground truth datasets. For detailed usage of each tool, see the CLI Scripts section below.

There are two main CLI scripts:

run_comparison_pipeline - Compare small variant callsets to ground truth using VCFEVAL
sv_comparison_pipeline - Compare structural variant (SV) callsets using Truvari

Installation

To install the comparison module with all dependencies:

pip install ugbio-comparison

The tools can also be run from the Docker image on Docker Hub: ultimagenomics/ugbio_comparison.

CLI Scripts

1. run_comparison_pipeline

Compare VCF callsets to ground truth using VCFEVAL as the comparison engine. This pipeline supports annotation with various genomic features and detailed downstream concordance analysis of specific variant types.

Purpose

Compare variant calls against a ground truth dataset
Generate concordance metrics (TP, FP, FN)
Annotate variants with coverage, mappability, and other genomic features
Annotate variants with properties like SNV/Indel/homopolymer Indel etc.
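
The TP, FP and FN classes listed above are the inputs to the usual benchmarking metrics. A minimal sketch, assuming the three counts are already available as integers; the function name and the example counts are illustrative and are not part of ugbio_comparison:

```python
from typing import Dict


def benchmark_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Return precision, recall and F1 for the given counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only, not real benchmarking results.
metrics = benchmark_metrics(tp=90, fp=10, fn=30)
print(metrics)
```

Precision is the fraction of calls that match the truth set, recall is the fraction of truth variants that were called, and F1 is their harmonic mean.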

Usage

run_comparison_pipeline \
  --input_prefix <input_vcf_prefix> \
  --output_file <output_h5_file> \
  --output_interval <output_bed_file> \
  --gtr_vcf <ground_truth_vcf> \
  --highconf_intervals <high_confidence_bed> \
  --reference <reference_fasta> \
  --call_sample_name <sample_name> \
  --truth_sample_name <truth_sample_name>

Key Parameters

--input_prefix: Prefix of the input VCF file(s)
--output_file: Output HDF5 file containing concordance results
--output_interval: Output BED file of intersected intervals
--gtr_vcf: Ground truth VCF file for comparison (e.g. GIAB VCF)
--cmp_intervals: Optional regions for comparison (BED/interval_list)
--highconf_intervals: High confidence intervals (e.g. GIAB HCR BED)
--reference: Reference genome FASTA file
--reference_dict: Reference genome dictionary file
--call_sample_name: Name of the call sample
--truth_sample_name: Name of the truth sample
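
When the pipeline is launched from a Python workflow, the parameters above can be assembled into an argument list. The sketch below is one possible wrapper: the helper name `build_comparison_command` is hypothetical, it uses only flags documented in this section, and it only builds and prints the command rather than running it.

```python
import shlex
from typing import List, Optional


def build_comparison_command(
    input_prefix: str,
    output_file: str,
    output_interval: str,
    gtr_vcf: str,
    highconf_intervals: str,
    reference: str,
    call_sample_name: str,
    truth_sample_name: str,
    cmp_intervals: Optional[str] = None,
) -> List[str]:
    """Assemble a run_comparison_pipeline argument list (hypothetical helper)."""
    cmd = [
        "run_comparison_pipeline",
        "--input_prefix", input_prefix,
        "--output_file", output_file,
        "--output_interval", output_interval,
        "--gtr_vcf", gtr_vcf,
        "--highconf_intervals", highconf_intervals,
        "--reference", reference,
        "--call_sample_name", call_sample_name,
        "--truth_sample_name", truth_sample_name,
    ]
    # --cmp_intervals is optional, so it is appended only when provided.
    if cmp_intervals is not None:
        cmd += ["--cmp_intervals", cmp_intervals]
    return cmd


cmd = build_comparison_command(
    input_prefix="/data/sample.filtered",
    output_file="/results/sample.comp.h5",
    output_interval="/results/sample.comp.bed",
    gtr_vcf="/reference/HG004_truth.vcf.gz",
    highconf_intervals="/reference/HG004_highconf.bed",
    reference="/reference/Homo_sapiens_assembly38.fasta",
    call_sample_name="SAMPLE-001",
    truth_sample_name="HG004",
)
print(shlex.join(cmd))
```

To execute the command, pass the list to `subprocess.run(cmd, check=True)` in an environment where the tool is installed.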

Optional Parameters

--coverage_bw_high_quality: Input BigWig file with high MAPQ coverage
--coverage_bw_all_quality: Input BigWig file with all MAPQ coverage
--annotate_intervals: Interval files for annotation (can be specified multiple times)
--runs_intervals: Homopolymer runs intervals (BED file), used for annotation of closeness to homopolymer indel
--ignore_filter_status: Ignore variant filter status
--enable_reinterpretation: Enable variant reinterpretation (i.e. reinterpret variants using likely false hmer indel)
--scoring_field: Alternative scoring field to use (copied to TREE_SCORE)
--flow_order: Sequencing flow order (4 cycle, TGCA)
--n_jobs: Number of parallel jobs for chromosome processing (default: -1 for all CPUs)
--use_tmpdir: Store temporary files in temporary directory
--verbosity: Logging level (ERROR, WARNING, INFO, DEBUG)

Output Files

HDF5 file (output_file): Contains concordance dataframes with classifications (TP, FP, FN)
- concordance key: Main concordance results
- input_args key: Input parameters used
- Per-chromosome keys (for whole-genome mode)
BED files: Generated from concordance results for visualization

Example

run_comparison_pipeline \
  --input_prefix /data/sample.filtered \
  --output_file /results/sample.comp.h5 \
  --output_interval /results/sample.comp.bed \
  --gtr_vcf /reference/HG004_truth.vcf.gz \
  --highconf_intervals /reference/HG004_highconf.bed \
  --reference /reference/Homo_sapiens_assembly38.fasta \
  --call_sample_name SAMPLE-001 \
  --truth_sample_name HG004 \
  --n_jobs 8 \
  --verbosity INFO

2. sv_comparison_pipeline

Compare structural variant (SV) callsets using Truvari for benchmarking. This pipeline collapses VCF files, runs Truvari bench, and generates concordance dataframes.

Purpose

Compare SV calls against a ground truth dataset using Truvari
We recommend using SV ground truth callsets from NIST as the source of truth
Collapse overlapping variants before comparison
Generate detailed concordance metrics for SVs
Support for different SV types (DEL, INS, DUP, etc.)
Output results in HDF5 format with base and calls concordance

Usage

sv_comparison_pipeline \
  --calls <input_calls_vcf> \
  --gt <ground_truth_vcf> \
  --hcr_bed <high_confidence_bed> \
  --output_filename <output_h5_file> \
  --outdir <truvari_output_dir>

Key Parameters

--calls: Input calls VCF file
--gt: Input ground truth VCF file
--output_filename: Output HDF5 file with concordance results
--outdir: Full path to output directory for Truvari results

Optional Parameters

--hcr_bed: High confidence region BED file
--pctseq: Minimum sequence similarity, given as a fraction between 0 and 1 (default: 0.0)
--pctsize: Minimum size similarity, given as a fraction between 0 and 1 (default: 0.0)
--maxsize: Maximum size for SV comparison in bp (default: 50000, use -1 for unlimited)
--custom_info_fields: Custom INFO fields to read from VCFs (can be specified multiple times)
--ignore_filter: Ignore FILTER field in VCF (removes --passonly flag from Truvari)
--skip_collapse: Skip VCF collapsing step for calls (ground truth is always collapsed)
--verbosity: Logging level (default: INFO)
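
To give a feel for how the --pctsize and --maxsize thresholds above interact, here is a simplified sketch. It assumes size similarity is the ratio of the smaller to the larger SV length, and it treats -1 as "no upper bound". The function names are hypothetical, and this illustrates the idea only; it is not Truvari's matching code.

```python
def size_similarity(size_a: int, size_b: int) -> float:
    """Ratio of the smaller to the larger SV length, in the range (0, 1]."""
    if size_a <= 0 or size_b <= 0:
        raise ValueError("SV sizes must be positive")
    return min(size_a, size_b) / max(size_a, size_b)


def passes_size_filters(
    call_size: int,
    truth_size: int,
    pctsize: float = 0.0,   # default documented above
    maxsize: int = 50000,   # default documented above; -1 means unlimited
) -> bool:
    """Apply the maximum-size cut-off, then the size-similarity threshold."""
    if maxsize != -1 and max(call_size, truth_size) > maxsize:
        return False
    return size_similarity(call_size, truth_size) >= pctsize


# An 800 bp call against a 1000 bp truth event has similarity 0.8.
print(passes_size_filters(800, 1000, pctsize=0.7))
# A 500 bp call against the same event has similarity 0.5.
print(passes_size_filters(500, 1000, pctsize=0.7))
# Events above the default maxsize are excluded unless maxsize is -1.
print(passes_size_filters(60000, 60000), passes_size_filters(60000, 60000, maxsize=-1))
```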

Output Files

HDF5 file (output_filename): Contains two concordance dataframes:
- base key: Ground truth concordance (TP, FN)
- calls key: Calls concordance (TP, FP)
Truvari directory (outdir): Contains Truvari bench results:
- tp-base.vcf.gz: True positive variants in ground truth
- tp-comp.vcf.gz: True positive variants in calls
- fn.vcf.gz: False negative variants
- fp.vcf.gz: False positive variants
- summary.json: Summary statistics
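
The summary.json file listed above can be read with the Python standard library. The sketch below is an assumption-laden example: the key names in `ASSUMED_KEYS` follow what recent Truvari releases typically write and may differ in your version, so missing keys are returned as `None` rather than raising. The helper name and the demo values are illustrative only.

```python
import json
import tempfile
from pathlib import Path

# Assumed key names; check them against your own summary.json.
ASSUMED_KEYS = ("TP-base", "TP-comp", "FP", "FN", "precision", "recall", "f1")


def read_truvari_summary(outdir: str) -> dict:
    """Load <outdir>/summary.json and pick out the assumed headline keys."""
    summary_path = Path(outdir) / "summary.json"
    with summary_path.open() as handle:
        summary = json.load(handle)
    # .get() yields None for any key that is absent from the file.
    return {key: summary.get(key) for key in ASSUMED_KEYS}


# Demo on a made-up summary written to a temporary directory.
with tempfile.TemporaryDirectory() as demo_dir:
    fake_summary = {
        "TP-base": 90, "TP-comp": 90, "FP": 10, "FN": 30,
        "precision": 0.9, "recall": 0.75,
    }
    (Path(demo_dir) / "summary.json").write_text(json.dumps(fake_summary))
    demo_metrics = read_truvari_summary(demo_dir)

print(demo_metrics)
```

In real use, pass the directory given to --outdir, for example `read_truvari_summary("/results/truvari_output")`.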

Example

sv_comparison_pipeline \
  --calls /data/sample.sv.vcf.gz \
  --gt /reference/HG004_sv_truth.vcf.gz \
  --output_filename /results/sample.sv_comp.h5 \
  --outdir /results/truvari_output \
  --hcr_bed /reference/HG004_sv_highconf.bed \
  --maxsize 100000 \
  --pctseq 0.7 \
  --pctsize 0.7 \
  --verbosity INFO

CNV Comparison

For copy number variant (CNV) comparisons, consider using a larger --maxsize value or -1 for unlimited:

sv_comparison_pipeline \
  --calls /data/sample.cnv.vcf.gz \
  --gt /reference/truth.cnv.vcf.gz \
  --output_filename /results/sample.cnv_comp.h5 \
  --outdir /results/truvari_cnv \
  --maxsize -1 \
  --ignore_filter

Dependencies

The following binary tools are included in the Docker image and must be installed separately when running outside it:

bcftools 1.20 - VCF/BCF manipulation
samtools 1.20 - SAM/BAM/CRAM manipulation
bedtools 2.31.0 - Genome interval operations
bedops - BED file operations
GATK 4.6.0.0 - Genome Analysis Toolkit
Picard 3.3.0 - Java-based command-line tools for manipulating high-throughput sequencing data
RTG Tools 3.12.1 - Provides VCFEVAL for variant comparison

Notes

For best performance with large genomes, use parallel processing (--n_jobs)
run_comparison_pipeline supports both single-interval and whole-genome modes
VCFEVAL requires an SDF index of the reference genome
Truvari comparison includes automatic VCF collapsing and sorting
Use --ignore_filter_status or --ignore_filter to compare all variants regardless of FILTER field
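
Regarding the SDF note above: RTG Tools can build an SDF from a FASTA file with `rtg format`. The sketch below only assembles and prints that command. The helper name and the output directory are hypothetical, and how the pipeline locates the SDF (for example, whether it expects it next to the FASTA) is not stated here, so check that before relying on a particular path.

```python
import shlex
import shutil
from typing import List


def rtg_format_command(reference_fasta: str, sdf_dir: str) -> List[str]:
    """Return the `rtg format` invocation that writes an SDF directory."""
    return ["rtg", "format", "-o", sdf_dir, reference_fasta]


# The FASTA path matches the example above; the SDF location is hypothetical.
sdf_cmd = rtg_format_command(
    "/reference/Homo_sapiens_assembly38.fasta",
    "/reference/Homo_sapiens_assembly38.fasta.sdf",
)
print(shlex.join(sdf_cmd))

# Report whether the rtg executable is available before trying to run it.
print("rtg found on PATH:", shutil.which("rtg") is not None)
```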

ugbio-comparison 1.24.0 (released Apr 23, 2026)
