Ultima Genomics CNV utils

ugbio_cnv

This module provides Python scripts and utilities for Copy Number Variation (CNV) analysis in bioinformatics pipelines.

Overview

The CNV module integrates multiple CNV calling algorithms and provides tools for processing, filtering, combining, annotating, and visualizing CNV calls. It supports both germline and somatic CNV detection workflows.

Note that the package itself does not implement a CNV caller. It provides facilities for preparing data, running the supported callers, converting formats, and combining callsets.

The package is designed to work with the following CNV callers:

  • cn.mops - Read depth-based CNV caller using a Bayesian approach
  • CNVpytor - Read depth analysis for CNV detection
  • ControlFREEC - Control-FREEC for somatic CNV detection

Installation

Using UV (Recommended)

Install the CNV module and its dependencies from PyPI (with uv, the equivalent command is uv pip install ugbio-cnv):

pip install ugbio-cnv

A pre-built Docker image is available on Docker Hub: ultimagenomics/ugbio_cnv

Available Tools

CNV Processing

process_cnmops_cnvs

Process CNV calls in BED format from cn.mops and ControlFREEC: filter by length and low-complexity regions, annotate, and convert to VCF format.

process_cnmops_cnvs \
  --input_bed_file cnmops_calls.bed \
  --cnv_lcr_file ug_cnv_lcr.bed \
  --min_cnv_length 10000 \
  --intersection_cutoff 0.5 \
  --out_directory ./output

Key Parameters:

  • --input_bed_file - Input BED file from cn.mops or ControlFREEC
  • --cnv_lcr_file - UG-CNV-LCR BED file for filtering low-complexity regions (see workflows for the BED)
  • --min_cnv_length - CNVs shorter than this length will be marked (default: 10000)
  • --intersection_cutoff - Overlap threshold with the CNV LCR regions (default: 0.5)
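
The two checks behind --min_cnv_length and --intersection_cutoff can be pictured with a short sketch. This is not the package's implementation: the function names, the filter labels (LEN, UG-CNV-LCR), and the choice to flag a call when the overlapping fraction is at or above the cutoff are assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED


def overlap_fraction(cnv: Interval, lcr_regions: List[Interval]) -> float:
    """Fraction of the CNV covered by the union of the LCR intervals."""
    start, end = cnv
    if end <= start:
        return 0.0
    # Keep only intervals that touch the CNV, clipped to its boundaries.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in lcr_regions if s < end and e > start
    )
    covered = 0
    covered_until = start
    for s, e in clipped:
        s = max(s, covered_until)  # do not count overlapping LCRs twice
        if e > s:
            covered += e - s
            covered_until = e
    return covered / (end - start)


def filter_labels(
    cnv: Interval,
    lcr_regions: List[Interval],
    min_cnv_length: int = 10000,
    intersection_cutoff: float = 0.5,
) -> List[str]:
    """Return hypothetical filter labels for one CNV call."""
    labels = []
    if cnv[1] - cnv[0] < min_cnv_length:
        labels.append("LEN")  # hypothetical label for short calls
    if overlap_fraction(cnv, lcr_regions) >= intersection_cutoff:
        labels.append("UG-CNV-LCR")  # hypothetical label for LCR overlap
    return labels or ["PASS"]
```

Calls are marked rather than dropped in this sketch, matching the wording of the parameter description above.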

Combining CNV Calls

Tools for combining and analyzing CNV calls are aggregated under a single CLI, combine_cnmops_cnvpytor_cnv_calls. Currently, combining is implemented for calls from cn.mops and CNVpytor. The CLI provides the following tools, each of which can also be run as a standalone script:

    concat              Combine CNV VCFs from different callers (cn.mops and cnvpytor)
    filter_cnmops_dups  Filter short duplications from cn.mops calls in the combined CNV VCF
    annotate_gaps       Annotate CNV calls with percentage of gaps (Ns) from reference genome
    annotate_regions    Annotate CNV calls with region annotations from BED file
    merge_records       Merge adjacent or nearby CNV records in a VCF file

concat

Concatenate CNV VCF files from different callers (cn.mops and CNVpytor) into a single sorted and indexed VCF. The tool adds a "source" tag to each CNV record.

combine_cnv_vcfs \
  --cnmops_vcf cnmops1.vcf cnmops2.vcf \
  --cnvpytor_vcf cnvpytor1.vcf cnvpytor2.vcf \
  --output_vcf combined.vcf.gz \
  --fasta_index reference.fasta.fai \
  --out_directory ./output
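
As a rough picture of what concatenation involves, the sketch below tags each call with its caller and sorts the combined list in the contig order given by a FASTA index. It operates on plain tuples rather than VCF records, and the function names and the dictionary layout are assumptions for illustration only, not the tool's internals.

```python
from typing import Dict, Iterable, List, Tuple

Call = Tuple[str, int, int]  # (chrom, start, end)


def contig_order(fai_lines: Iterable[str]) -> Dict[str, int]:
    """Map contig name to its rank; the first column of a .fai file is the name."""
    names = [line.split("\t")[0] for line in fai_lines if line.strip()]
    return {name: rank for rank, name in enumerate(names)}


def concat_calls(
    calls_by_source: Dict[str, List[Call]], order: Dict[str, int]
) -> List[dict]:
    """Tag every call with its source and sort by reference order."""
    combined = [
        {"chrom": chrom, "start": start, "end": end, "source": source}
        for source, calls in calls_by_source.items()
        for chrom, start, end in calls
    ]
    combined.sort(key=lambda r: (order[r["chrom"]], r["start"], r["end"]))
    return combined


fai = ["chr1\t248956422\t112\t70\t71", "chr2\t242193529\t252513167\t70\t71"]
combined = concat_calls(
    {
        "cn.mops": [("chr2", 5000, 9000), ("chr1", 70000, 90000)],
        "cnvpytor": [("chr1", 1000, 30000)],
    },
    contig_order(fai),
)
```

Sorting by the index order rather than alphabetically keeps contigs such as chr2 ahead of chr10, which is what downstream indexing generally expects.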

annotate_regions

Annotate CNV calls with the custom genomic regions that they overlap. The BED file is expected to contain |-separated region names in the fourth column. The annotation is added to the INFO field under the tag REGION_ANNOTATION.

annotate_regions \
  --input_vcf calls.vcf.gz \
  --output_vcf annotated.vcf.gz \
  --annotation_bed regions.bed
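
The expected BED layout and the overlap lookup can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: the function names are hypothetical, and joining the collected names with | in the resulting value is a guess at the output format.

```python
from typing import Iterable, List, Tuple

Region = Tuple[str, int, int, List[str]]  # (chrom, start, end, names)


def parse_annotation_bed(lines: Iterable[str]) -> List[Region]:
    """Parse BED lines whose fourth column holds |-separated region names."""
    regions = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end, names = line.rstrip("\n").split("\t")[:4]
        regions.append((chrom, int(start), int(end), names.split("|")))
    return regions


def region_annotation(chrom: str, start: int, end: int, regions: List[Region]) -> str:
    """Collect the unique names of all regions overlapping [start, end)."""
    found: List[str] = []
    for r_chrom, r_start, r_end, names in regions:
        if r_chrom == chrom and r_start < end and r_end > start:
            for name in names:
                if name not in found:
                    found.append(name)
    return "|".join(found)  # assumed output format


bed = [
    "chr1\t100\t200\tLCR|segdup\n",
    "chr1\t150\t300\tsegdup|telomere\n",
    "chr2\t0\t50\tother\n",
]
regions = parse_annotation_bed(bed)
```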

annotate_gaps

Annotate CNV calls with the percentage of Ns that they cover. Adds the INFO tag GAPS_PERCENTAGE.

annotate_gaps \
  --calls_vcf calls.vcf.gz \
  --output_vcf annotated.vcf.gz \
  --ref_fasta Homo_sapiens_assembly38.fasta
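
The quantity being annotated can be illustrated on a plain string standing in for the reference sequence. The function name is hypothetical, and reporting the value on a 0-100 scale (rather than as a 0-1 fraction) is an assumption based on the word "percentage".

```python
def gaps_percentage(sequence: str, start: int, end: int) -> float:
    """Percentage of N bases in the half-open interval [start, end)."""
    segment = sequence[start:end]
    if not segment:
        return 0.0
    n_count = sum(1 for base in segment if base in "Nn")
    return 100.0 * n_count / len(segment)


# Toy reference: positions 4-7 are gap bases.
reference = "ACGTNNNNAC"
```

A call spanning positions 2 to 10 of this toy reference covers eight bases, four of which are N.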

merge_records

Merges overlapping, adjacent, or nearby CNV records in the VCF.

merge_records \
  --input_vcf calls.vcf.gz \
  --output_vcf calls.combined.vcf.gz \
  --distance 0
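
A minimal sketch of distance-based merging, assuming half-open coordinates and that records are merged when the gap between them is at most the given distance. The function name is hypothetical, and the sketch ignores everything a real VCF merge would also need to consider, such as the CNV type and the other INFO fields.

```python
from typing import List, Tuple

Record = Tuple[str, int, int]  # (chrom, start, end)


def merge_nearby(records: List[Record], distance: int = 0) -> List[Record]:
    """Merge records on the same contig whose gap is at most `distance`."""
    merged: List[Record] = []
    for chrom, start, end in sorted(records):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= distance:
            prev_chrom, prev_start, prev_end = merged[-1]
            # Extend the previous record instead of starting a new one.
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


calls = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr1", 400, 500),
    ("chr2", 100, 200),
]
```

Under these assumptions, a distance of 0 merges only overlapping or abutting records, while a larger value also bridges small gaps.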

analyze_cnv_breakpoint_reads

Analyze single-ended reads at CNV breakpoints to identify supporting evidence for duplications and deletions. Counts of supporting evidence are added as INFO tags in the VCF.

analyze_cnv_breakpoint_reads \
  --vcf-file cnv_calls.vcf.gz \
  --bam-file sample.bam \
  --output-file annotated.vcf.gz \
  --cushion 100 \
  --reference-fasta Homo_sapiens_assembly38.fasta

Somatic CNV Tools (ControlFREEC)

annotate_FREEC_segments

Annotate segments from ControlFREEC output as gain/loss/neutral based on fold-change thresholds.

annotate_FREEC_segments \
  --input_segments_file segments.txt \
  --gain_cutoff 1.03 \
  --loss_cutoff 0.97 \
  --out_directory ./output
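
The thresholding idea can be sketched in a few lines. The function name is hypothetical, the default values simply mirror the example command above rather than documented defaults, and treating a fold change exactly equal to a cutoff as gain or loss is an assumption.

```python
def classify_segment(
    fold_change: float, gain_cutoff: float = 1.03, loss_cutoff: float = 0.97
) -> str:
    """Label a segment as gain, loss, or neutral from its fold change."""
    if fold_change >= gain_cutoff:
        return "gain"
    if fold_change <= loss_cutoff:
        return "loss"
    return "neutral"


labels = [classify_segment(fc) for fc in (1.2, 1.0, 0.5)]
```

With cutoffs this close to 1.0, only segments within about 3% of the expected coverage stay neutral, so the thresholds are worth tuning to the noise level of the data.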

Visualization

plot_cnv_results

Generate coverage plots along the genome for germline and tumor samples.

plot_cnv_results \
  --sample_name SAMPLE \
  --germline_cov_file germline_coverage.bed \
  --tumor_cov_file tumor_coverage.bed \
  --cnv_file cnv_calls.bed \
  --out_directory ./plots

plot_FREEC_neutral_AF

Generate histogram of allele frequencies at neutral (non-CNV) locations.

plot_FREEC_neutral_AF \
  --input_file neutral_regions.txt \
  --sample_name SAMPLE \
  --out_directory ./plots

Dependencies

The module depends on:

  • Python 3.11+
  • ugbio_core - Core utilities from this workspace
  • CNVpytor (1.3.1) - Python-based CNV caller
  • cn.mops (R package) - Bayesian CNV detection
  • Bioinformatics tools: samtools, bedtools, bcftools
  • R 4.3.1 with Bioconductor packages

Key R Scripts

The module includes R scripts in the cnmops/ directory. They are used by the cn.mops pipeline and are not intended for standalone use.

  • cnv_calling_using_cnmops.R - Main cn.mops calling script
  • get_reads_count_from_bam.R - Extract read counts from BAM files
  • create_reads_count_cohort_matrix.R - Build cohort matrix for cn.mops
  • normalize_reads_count.R - Normalize read counts across samples

Notes

  • See the germline and somatic CNV calling workflows published in the GitHub repository Ultimagen/healthomics-workflows for reference implementations of the suggested workflows
  • For optimal CNV calling, use cohort-based approaches when multiple samples are available
  • Filter CNV calls using the provided LCR (low-complexity region) files to reduce false positives
  • Consider minimum CNV length thresholds based on your sequencing depth and biological context
  • The module supports both GRCh37 and GRCh38 reference genomes

Key Components

process_cnvs

Process CNV calls from cn.mops or ControlFREEC in BED format: filter by length and UG-CNV-LCR, annotate with coverage statistics, and convert to VCF format.

Note: This module is called programmatically (not via CLI) from other pipeline scripts.

Programmatic Usage

The process_cnvs module is typically invoked from other pipeline components. Here are examples:

Basic usage (minimal filtering):

from ugbio_cnv import process_cnvs

process_cnvs.run([
    "process_cnvs",
    "--input_bed_file", "cnv_calls.bed",
    "--fasta_index_file", "reference.fasta.fai",
    "--sample_name", "sample_001"
])

With LCR filtering and length thresholds:

from ugbio_cnv import process_cnvs

process_cnvs.run([
    "process_cnvs",
    "--input_bed_file", "cnv_calls.bed",
    "--cnv_lcr_file", "ug_cnv_lcr.bed",
    "--min_cnv_length", "10000",
    "--intersection_cutoff", "0.5",
    "--fasta_index_file", "reference.fasta.fai",
    "--sample_name", "sample_001",
    "--out_directory", "/path/to/output/"
])

Full pipeline with coverage annotations:

from ugbio_cnv import process_cnvs

process_cnvs.run([
    "process_cnvs",
    "--input_bed_file", "cnv_calls.bed",
    "--cnv_lcr_file", "ug_cnv_lcr.bed",
    "--min_cnv_length", "10000",
    "--sample_norm_coverage_file", "sample.normalized_coverage.bed",
    "--cohort_avg_coverage_file", "cohort.average_coverage.bed",
    "--fasta_index_file", "reference.fasta.fai",
    "--sample_name", "sample_001",
    "--out_directory", "/path/to/output/",
    "--verbosity", "INFO"
])

Input: BED file with CNV calls from cn.mops or ControlFREEC

Output: Filtered and annotated VCF file with CNV calls (.vcf.gz and .vcf.gz.tbi)
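
Because process_cnvs.run takes an argv-style list, callers that hold their settings in a dictionary may find a small adapter convenient. The helper below is hypothetical and not part of ugbio_cnv. It only builds the list, using option names taken from the examples above.

```python
from typing import Dict, List, Optional, Union

OptionValue = Optional[Union[str, int, float]]


def build_process_cnvs_argv(options: Dict[str, OptionValue]) -> List[str]:
    """Turn {"option_name": value} into an argv-style list, skipping None values."""
    argv = ["process_cnvs"]
    for name, value in options.items():
        if value is None:
            continue  # option not set, leave it to the tool's default
        argv += [f"--{name}", str(value)]
    return argv


argv = build_process_cnvs_argv(
    {
        "input_bed_file": "cnv_calls.bed",
        "cnv_lcr_file": None,
        "min_cnv_length": 10000,
        "fasta_index_file": "reference.fasta.fai",
        "sample_name": "sample_001",
    }
)
# With the package installed, the list could then be passed on as in the
# examples above: process_cnvs.run(argv)
```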
