
Ultima Genomics CNV utils


ugbio_cnv

This module provides Python scripts and utilities for Copy Number Variation (CNV) analysis in bioinformatics pipelines.

Overview

The CNV module integrates multiple CNV calling algorithms and provides tools for processing, filtering, combining, annotating, and visualizing CNV calls. It supports both germline and somatic CNV detection workflows.

Note that the package itself does not implement variant calling; it provides facilities for preparing data, invoking the callers, converting formats, and combining callsets.

The package is designed to work with the following CNV callers:

  • cn.mops - Read depth-based CNV caller using a Bayesian approach
  • CNVpytor - Read depth analysis for CNV detection
  • ControlFREEC - Control-FREEC for somatic CNV detection

Installation

Using pip

Install the CNV module and its dependencies:

pip install ugbio-cnv

A pre-built Docker image can be downloaded from Docker Hub: ultimagenomics/ugbio_cnv

Available Tools

CNV Processing

process_cnmops_cnvs

Process CNV calls in BED format from cn.mops and ControlFREEC: filter by length and low-complexity regions, annotate, and convert to VCF format.

process_cnmops_cnvs \
  --input_bed_file cnmops_calls.bed \
  --cnv_lcr_file ug_cnv_lcr.bed \
  --min_cnv_length 10000 \
  --intersection_cutoff 0.5 \
  --out_directory ./output

Key Parameters:

  • --input_bed_file - Input BED file from cn.mops
  • --cnv_lcr_file - UG-CNV-LCR BED file for filtering low-complexity regions (the BED file is provided with the reference workflows)
  • --min_cnv_length - CNVs below this length will be flagged (default: 10000)
  • --intersection_cutoff - Overlap fraction threshold with the UG-CNV-LCR regions (default: 0.5)
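As a rough illustration of how these two parameters interact (this is a sketch, not the actual implementation), a call can be flagged when it is shorter than --min_cnv_length, and filtered by UG-CNV-LCR when the fraction of its span covered by LCR regions reaches --intersection_cutoff. The tuple layout and filter labels below are assumptions for the example:

```python
def overlap_fraction(call, region):
    """Fraction of the call's span covered by one region."""
    start = max(call[1], region[1])
    end = min(call[2], region[2])
    return max(0, end - start) / (call[2] - call[1])

def filter_call(call, lcr_regions, min_cnv_length=10000, intersection_cutoff=0.5):
    """Return illustrative filter labels for one (chrom, start, end) CNV call.

    Note: summing per-region fractions assumes the LCR regions do not
    overlap each other, which keeps the sketch simple.
    """
    filters = []
    if call[2] - call[1] < min_cnv_length:
        filters.append("LEN")
    covered = sum(overlap_fraction(call, r) for r in lcr_regions if r[0] == call[0])
    if covered >= intersection_cutoff:
        filters.append("UG-CNV-LCR")
    return filters or ["PASS"]
```

For example, a 5 kb call that is 60% covered by LCR regions would receive both the length flag and the LCR flag under the defaults above.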

Combining CNV Calls

Tools for combining and analyzing CNV calls (currently, combining calls from cn.mops and CNVpytor) are aggregated under the CLI combine_cnmops_cnvpytor_cnv_calls. This CLI contains the following subcommands, each of which can also be invoked as a standalone script:

    concat              Combine CNV VCFs from different callers (cn.mops and cnvpytor)
    filter_cnmops_dups  Filter short duplications from cn.mops calls in the combined CNV VCF
    annotate_gaps       Annotate CNV calls with percentage of gaps (Ns) from reference genome
    annotate_regions    Annotate CNV calls with region annotations from BED file
    merge_records       Merge adjacent or nearby CNV records in a VCF file

concat

Concatenate CNV VCF files from different callers (cn.mops and CNVpytor) into a single sorted and indexed VCF. The tool adds a "source" tag to each CNV record.

combine_cnv_vcfs \
  --cnmops_vcf cnmops1.vcf cnmops2.vcf \
  --cnvpytor_vcf cnvpytor1.vcf cnvpytor2.vcf \
  --output_vcf combined.vcf.gz \
  --fasta_index reference.fasta.fai \
  --out_directory ./output

annotate_regions

Annotate CNV calls with the custom genomic regions that they overlap. The BED file is expected to contain |-separated region names in the fourth column. The annotation is added to the INFO field under the tag REGION_ANNOTATION.

annotate_regions \
  --input_vcf calls.vcf.gz \
  --output_vcf annotated.vcf.gz \
  --annotation_bed regions.bed
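Conceptually, the annotation collects the names of all BED regions a call overlaps and joins them with |. The following sketch assumes simple in-memory tuples; the real tool operates on VCF and BED files:

```python
def region_annotation(cnv, bed_regions):
    """Return a '|'-joined string of BED region names overlapping the CNV.

    Illustrative only: `cnv` is (chrom, start, end) and each BED region
    is (chrom, start, end, name), half-open coordinates assumed.
    """
    chrom, start, end = cnv
    names = [name for c, s, e, name in bed_regions
             if c == chrom and s < end and e > start]
    return "|".join(names) if names else None
```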

annotate_gaps

Annotate CNV calls with the percentage of gap bases (Ns) that they span. Adds the INFO tag GAPS_PERCENTAGE.

annotate_gaps \
  --calls_vcf calls.vcf.gz \
  --output_vcf annotated.vcf.gz \
  --ref_fasta Homo_sapiens_assembly38.fasta
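The computed value is simply the fraction of N bases in the reference over the call's span, expressed as a percentage. A minimal sketch, assuming the reference sequence is available as a string rather than a FASTA file:

```python
def gaps_percentage(ref_seq, start, end):
    """Percentage of N bases in ref_seq over [start, end) (illustrative)."""
    window = ref_seq[start:end].upper()
    return 100.0 * window.count("N") / len(window)
```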

merge_records

Merge overlapping or nearby records in the VCF (records within --distance bp of each other are combined).

merge_records \
   --input_vcf calls.vcf.gz \
   --output_vcf calls.combined.vcf.gz \
   --distance 0
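The core idea behind merging can be sketched as a single pass over sorted intervals, extending the current record whenever the next one starts within --distance bp of its end. The real tool additionally reconciles the VCF INFO fields of merged records, which this sketch omits:

```python
def merge_intervals(intervals, distance=0):
    """Merge (chrom, start, end) records no more than `distance` bp apart."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= distance:
            # Extend the previous record to cover this one.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

With --distance 0, only touching or overlapping records are combined.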

analyze_cnv_breakpoint_reads

Analyze single-ended reads at CNV breakpoints to identify supporting evidence for duplications and deletions. Counts of supporting reads are added as INFO tags in the VCF.

analyze_cnv_breakpoint_reads \
  --vcf-file cnv_calls.vcf.gz \
  --bam-file sample.bam \
  --output-file annotated.vcf.gz \
  --cushion 100 \
  --reference-fasta Homo_sapiens_assembly38.fasta

Somatic CNV Tools (ControlFREEC)

annotate_FREEC_segments

Annotate segments from ControlFREEC output as gain/loss/neutral based on fold-change thresholds.

annotate_FREEC_segments \
  --input_segments_file segments.txt \
  --gain_cutoff 1.03 \
  --loss_cutoff 0.97 \
  --out_directory ./output
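The classification rule implied by the two cutoffs is straightforward: segments whose copy-ratio fold change is at or above the gain cutoff are gains, those at or below the loss cutoff are losses, and everything in between is neutral. A minimal sketch (the label strings and boundary handling are assumptions, not taken from the tool):

```python
def classify_segment(fold_change, gain_cutoff=1.03, loss_cutoff=0.97):
    """Label a ControlFREEC segment by its fold change (illustrative)."""
    if fold_change >= gain_cutoff:
        return "gain"
    if fold_change <= loss_cutoff:
        return "loss"
    return "neutral"
```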

Visualization

plot_cnv_results

Generate coverage plots along the genome for germline and tumor samples.

plot_cnv_results \
  --sample_name SAMPLE \
  --germline_cov_file germline_coverage.bed \
  --tumor_cov_file tumor_coverage.bed \
  --cnv_file cnv_calls.bed \
  --out_directory ./plots

plot_FREEC_neutral_AF

Generate histogram of allele frequencies at neutral (non-CNV) locations.

plot_FREEC_neutral_AF \
  --input_file neutral_regions.txt \
  --sample_name SAMPLE \
  --out_directory ./plots

Dependencies

The module depends on:

  • Python 3.11+
  • ugbio_core - Core utilities from this workspace
  • CNVpytor (1.3.1) - Python-based CNV caller
  • cn.mops (R package) - Bayesian CNV detection
  • Bioinformatics tools: samtools, bedtools, bcftools
  • R 4.3.1 with Bioconductor packages

Key R Scripts

The module includes R scripts in the cnmops/ directory. They are used by the cn.mops pipeline and are not intended for standalone usage.

  • cnv_calling_using_cnmops.R - Main cn.mops calling script
  • get_reads_count_from_bam.R - Extract read counts from BAM files
  • create_reads_count_cohort_matrix.R - Build cohort matrix for cn.mops
  • normalize_reads_count.R - Normalize read counts across samples
  • rebin_cohort_reads_count.R - Re-bin existing cohort to larger bin sizes

Re-binning CNmops Cohorts

rebin_cohort_reads_count.R

Re-bin an existing cn.mops cohort from smaller bins to larger bins by aggregating read counts. This allows you to adjust the resolution of existing cohorts without regenerating from BAM files, which is useful for:

  • Reducing computational memory requirements for large cohorts
  • Faster CNV calling with coarser resolution
  • Testing different bin sizes without re-processing BAM files

Usage:

Rscript cnmops/rebin_cohort_reads_count.R \
  -i cohort_1000bp.rds \
  -owl 1000 \
  -nwl 5000 \
  -o cohort_5000bp.rds \
  --save_csv

Parameters:

  • -i, --input_cohort_file - Input cohort RDS file (required)
  • -owl, --original_window_length - Original bin size in bp (optional), autodetected if not given
  • -nwl, --new_window_length - New bin size in bp (required, must be divisible by original)
  • -o, --output_file - Output RDS file (default: rebinned_cohort_reads_count.rds)
  • --save_csv - Also save as CSV format
  • --save_hdf - Also save as HDF5 format

Important Notes:

  • New window length must be larger than and divisible by the original window length
  • Genomic coordinates use 1-based, right-closed format (e.g., 1-1000, 1001-2000, ...)
  • Partial bins at chromosome ends are preserved without artificial extension
  • Read counts are summed from all original bins within each new bin
  • Total read counts per sample are preserved across the rebinning

Example:

# Re-bin HapMap2 cohort from 1000 bp to 5000 bp
Rscript cnmops/rebin_cohort_reads_count.R \
  -i HapMap2_65samples_cohort_v2.0.hg38.ReadsCount.rds \
  -nwl 5000 \
  -o HapMap2_65samples_cohort_v2.0.hg38.ReadsCount.5000bp.rds

# Result: 3,088,281 bins → 617,665 bins (5x reduction)
# Last bin on chr1: 248955001-248956422 (partial bin, not extended to 248960000)
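The aggregation itself amounts to summing each run of original bins into one new bin, keeping a shorter final group rather than padding it; this mirrors how partial bins at chromosome ends are preserved. A Python sketch of the idea (the actual script is R and operates per chromosome on the cohort RDS object):

```python
def rebin_counts(counts, factor):
    """Sum consecutive read-count bins into larger bins (illustrative).

    `factor` is new_window_length // original_window_length. The last
    group may cover fewer than `factor` bins and is kept as-is.
    """
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]
```

Because every original count is summed into exactly one new bin, the total read count per sample is preserved across rebinning.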

Notes

  • See the germline and somatic CNV calling workflows published in the GitHub repository Ultimagen/healthomics-workflows for reference implementations.
  • For optimal CNV calling, use cohort-based approaches when multiple samples are available
  • Filter CNV calls using the provided LCR (low-complexity region) files to reduce false positives
  • Consider minimum CNV length thresholds based on your sequencing depth and biological context
  • The module supports both GRCh37 and GRCh38 reference genomes

Key Components

process_cnvs

Process CNV calls from cn.mops or ControlFREEC in BED format: filter by length and UG-CNV-LCR, annotate with coverage statistics, and convert to VCF format.

Note: This module is called programmatically (not via CLI) from other pipeline scripts.

Programmatic Usage

The process_cnvs module is typically invoked from other pipeline components. Here are examples:

Basic usage (minimal filtering):

from ugbio_cnv import process_cnvs

process_cnvs.run([
    "process_cnvs",
    "--input_bed_file", "cnv_calls.bed",
    "--fasta_index_file", "reference.fasta.fai",
    "--sample_name", "sample_001"
])

With LCR filtering and length thresholds:

from ugbio_cnv import process_cnvs

process_cnvs.run([
    "process_cnvs",
    "--input_bed_file", "cnv_calls.bed",
    "--cnv_lcr_file", "ug_cnv_lcr.bed",
    "--min_cnv_length", "10000",
    "--intersection_cutoff", "0.5",
    "--fasta_index_file", "reference.fasta.fai",
    "--sample_name", "sample_001",
    "--out_directory", "/path/to/output/"
])

Full pipeline with coverage annotations:

from ugbio_cnv import process_cnvs

process_cnvs.run([
    "process_cnvs",
    "--input_bed_file", "cnv_calls.bed",
    "--cnv_lcr_file", "ug_cnv_lcr.bed",
    "--min_cnv_length", "10000",
    "--sample_norm_coverage_file", "sample.normalized_coverage.bed",
    "--cohort_avg_coverage_file", "cohort.average_coverage.bed",
    "--fasta_index_file", "reference.fasta.fai",
    "--sample_name", "sample_001",
    "--out_directory", "/path/to/output/",
    "--verbosity", "INFO"
])

Input: BED file with CNV calls from cn.mops or ControlFREEC

Output: Filtered and annotated VCF file with CNV calls (.vcf.gz and .vcf.gz.tbi)
