
Ultima Genomics CNV utils


ugbio_cnv

This module provides Python scripts and utilities for Copy Number Variation (CNV) analysis in bioinformatics pipelines.

Overview

The CNV module integrates multiple CNV calling algorithms and provides tools for processing, filtering, combining, annotating, and visualizing CNV calls. It supports both germline and somatic CNV detection workflows.

Note that the package itself does not implement variant calling; it provides facilities for preparing data, invoking the callers, converting formats, and combining callsets.

The package is designed to work with the following CNV callers:

  • cn.mops - Read depth-based CNV caller using a Bayesian approach
  • CNVpytor - Read depth analysis for CNV detection
  • ControlFREEC - Control-FREEC for somatic CNV detection

Installation

Using pip

Install the CNV module and its dependencies:

pip install ugbio-cnv

A pre-built Docker image can be downloaded from Docker Hub: ultimagenomics/ugbio_cnv

Available Tools

CNV Processing

process_cnmops_cnvs

Process CNV calls in BED format from cn.mops and ControlFREEC: filter by length and low-complexity regions, annotate, and convert to VCF format.

process_cnmops_cnvs \
  --input_bed_file cnmops_calls.bed \
  --cnv_lcr_file ug_cnv_lcr.bed \
  --min_cnv_length 10000 \
  --intersection_cutoff 0.5 \
  --out_directory ./output

Key Parameters:

  • --input_bed_file - Input BED file from cn.mops
  • --cnv_lcr_file - UG-CNV-LCR BED file for filtering low-complexity regions (the BED file is provided with the reference workflows)
  • --min_cnv_length - CNVs below this length will be flagged (default: 10000)
  • --intersection_cutoff - Overlap fraction threshold with the UG-CNV-LCR regions (default: 0.5)
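As a rough illustration of how these two parameters interact (this is a sketch, not the actual implementation), a call can be flagged when it is shorter than --min_cnv_length, and filtered by UG-CNV-LCR when the fraction of its span covered by LCR regions reaches --intersection_cutoff. The tuple layout and filter labels below are assumptions for the example:

```python
def overlap_fraction(call, region):
    """Fraction of the call's span covered by one region."""
    start = max(call[1], region[1])
    end = min(call[2], region[2])
    return max(0, end - start) / (call[2] - call[1])

def filter_call(call, lcr_regions, min_cnv_length=10000, intersection_cutoff=0.5):
    """Return illustrative filter labels for one (chrom, start, end) CNV call.

    Note: summing per-region fractions assumes the LCR regions do not
    overlap each other, which keeps the sketch simple.
    """
    filters = []
    if call[2] - call[1] < min_cnv_length:
        filters.append("LEN")
    covered = sum(overlap_fraction(call, r) for r in lcr_regions if r[0] == call[0])
    if covered >= intersection_cutoff:
        filters.append("UG-CNV-LCR")
    return filters or ["PASS"]
```

For example, a 5 kb call that is 60% covered by LCR regions would receive both the length flag and the LCR flag under the defaults above.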

Combining CNV Calls

Tools for combining and analyzing CNV calls (currently, combining calls from cn.mops and CNVpytor) are aggregated under the CLI combine_cnmops_cnvpytor_cnv_calls. This CLI contains the following subcommands, each of which can also be invoked as a standalone script:

    concat              Combine CNV VCFs from different callers (cn.mops and cnvpytor)
    filter_cnmops_dups  Filter short duplications from cn.mops calls in the combined CNV VCF
    annotate_gaps       Annotate CNV calls with percentage of gaps (Ns) from reference genome
    annotate_regions    Annotate CNV calls with region annotations from BED file
    merge_records       Merge adjacent or nearby CNV records in a VCF file

concat

Concatenate CNV VCF files from different callers (cn.mops and CNVpytor) into a single sorted and indexed VCF. The tool adds a "source" tag to each CNV record.

combine_cnv_vcfs \
  --cnmops_vcf cnmops1.vcf cnmops2.vcf \
  --cnvpytor_vcf cnvpytor1.vcf cnvpytor2.vcf \
  --output_vcf combined.vcf.gz \
  --fasta_index reference.fasta.fai \
  --out_directory ./output

annotate_regions

Annotate CNV calls with the custom genomic regions that they overlap. The BED file is expected to contain |-separated region names in the fourth column. The annotation is added to the INFO field under the tag REGION_ANNOTATION.

annotate_regions \
  --input_vcf calls.vcf.gz \
  --output_vcf annotated.vcf.gz \
  --annotation_bed regions.bed
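Conceptually, the annotation collects the names of all BED regions a call overlaps and joins them with |. The following sketch assumes simple in-memory tuples; the real tool operates on VCF and BED files:

```python
def region_annotation(cnv, bed_regions):
    """Return a '|'-joined string of BED region names overlapping the CNV.

    Illustrative only: `cnv` is (chrom, start, end) and each BED region
    is (chrom, start, end, name), half-open coordinates assumed.
    """
    chrom, start, end = cnv
    names = [name for c, s, e, name in bed_regions
             if c == chrom and s < end and e > start]
    return "|".join(names) if names else None
```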

annotate_gaps

Annotate CNV calls with the percentage of gap bases (Ns) that they span. Adds the INFO tag GAPS_PERCENTAGE.

annotate_gaps \
  --calls_vcf calls.vcf.gz \
  --output_vcf annotated.vcf.gz \
  --ref_fasta Homo_sapiens_assembly38.fasta
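The computed value is simply the fraction of N bases in the reference over the call's span, expressed as a percentage. A minimal sketch, assuming the reference sequence is available as a string rather than a FASTA file:

```python
def gaps_percentage(ref_seq, start, end):
    """Percentage of N bases in ref_seq over [start, end) (illustrative)."""
    window = ref_seq[start:end].upper()
    return 100.0 * window.count("N") / len(window)
```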

merge_records

Merge overlapping or nearby records in the VCF (records within --distance bp of each other are combined).

merge_records \
   --input_vcf calls.vcf.gz \
   --output_vcf calls.combined.vcf.gz \
   --distance 0
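The core idea behind merging can be sketched as a single pass over sorted intervals, extending the current record whenever the next one starts within --distance bp of its end. The real tool additionally reconciles the VCF INFO fields of merged records, which this sketch omits:

```python
def merge_intervals(intervals, distance=0):
    """Merge (chrom, start, end) records no more than `distance` bp apart."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= distance:
            # Extend the previous record to cover this one.
            merged[-1] = (chrom, merged[-1][1], max(merged[-1][2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

With --distance 0, only touching or overlapping records are combined.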

analyze_cnv_breakpoint_reads

Analyze single-ended reads at CNV breakpoints to identify supporting evidence for duplications and deletions. Counts of supporting reads are added as INFO tags in the VCF.

analyze_cnv_breakpoint_reads \
  --vcf-file cnv_calls.vcf.gz \
  --bam-file sample.bam \
  --output-file annotated.vcf.gz \
  --cushion 100 \
  --reference-fasta Homo_sapiens_assembly38.fasta

Somatic CNV Tools (ControlFREEC)

annotate_FREEC_segments

Annotate segments from ControlFREEC output as gain/loss/neutral based on fold-change thresholds.

annotate_FREEC_segments \
  --input_segments_file segments.txt \
  --gain_cutoff 1.03 \
  --loss_cutoff 0.97 \
  --out_directory ./output
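The classification rule implied by the two cutoffs is straightforward: segments whose copy-ratio fold change is at or above the gain cutoff are gains, those at or below the loss cutoff are losses, and everything in between is neutral. A minimal sketch (the label strings and boundary handling are assumptions, not taken from the tool):

```python
def classify_segment(fold_change, gain_cutoff=1.03, loss_cutoff=0.97):
    """Label a ControlFREEC segment by its fold change (illustrative)."""
    if fold_change >= gain_cutoff:
        return "gain"
    if fold_change <= loss_cutoff:
        return "loss"
    return "neutral"
```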

Visualization

plot_cnv_results

Generate coverage plots along the genome for germline and tumor samples.

plot_cnv_results \
  --sample_name SAMPLE \
  --germline_cov_file germline_coverage.bed \
  --tumor_cov_file tumor_coverage.bed \
  --cnv_file cnv_calls.bed \
  --out_directory ./plots

plot_FREEC_neutral_AF

Generate histogram of allele frequencies at neutral (non-CNV) locations.

plot_FREEC_neutral_AF \
  --input_file neutral_regions.txt \
  --sample_name SAMPLE \
  --out_directory ./plots

Dependencies

The module depends on:

  • Python 3.11+
  • ugbio_core - Core utilities from this workspace
  • CNVpytor (1.3.1) - Python-based CNV caller
  • cn.mops (R package) - Bayesian CNV detection
  • Bioinformatics tools: samtools, bedtools, bcftools
  • R 4.3.1 with Bioconductor packages

Key R Scripts

The module includes R scripts in the cnmops/ directory. They are used by the cn.mops pipeline and are not intended for standalone usage.

  • cnv_calling_using_cnmops.R - Main cn.mops calling script
  • get_reads_count_from_bam.R - Extract read counts from BAM files
  • create_reads_count_cohort_matrix.R - Build cohort matrix for cn.mops
  • normalize_reads_count.R - Normalize read counts across samples
  • rebin_cohort_reads_count.R - Re-bin existing cohort to larger bin sizes

Re-binning CNmops Cohorts

rebin_cohort_reads_count.R

Re-bin an existing cn.mops cohort from smaller bins to larger bins by aggregating read counts. This allows you to adjust the resolution of existing cohorts without regenerating from BAM files, which is useful for:

  • Reducing computational memory requirements for large cohorts
  • Faster CNV calling with coarser resolution
  • Testing different bin sizes without re-processing BAM files

Usage:

Rscript cnmops/rebin_cohort_reads_count.R \
  -i cohort_1000bp.rds \
  -owl 1000 \
  -nwl 5000 \
  -o cohort_5000bp.rds \
  --save_csv

Parameters:

  • -i, --input_cohort_file - Input cohort RDS file (required)
  • -owl, --original_window_length - Original bin size in bp (optional), autodetected if not given
  • -nwl, --new_window_length - New bin size in bp (required, must be divisible by original)
  • -o, --output_file - Output RDS file (default: rebinned_cohort_reads_count.rds)
  • --save_csv - Also save as CSV format
  • --save_hdf - Also save as HDF5 format

Important Notes:

  • New window length must be larger than and divisible by the original window length
  • Genomic coordinates use 1-based, right-closed format (e.g., 1-1000, 1001-2000, ...)
  • Partial bins at chromosome ends are preserved without artificial extension
  • Read counts are summed from all original bins within each new bin
  • Total read counts per sample are preserved across the rebinning

Example:

# Re-bin HapMap2 cohort from 1000 bp to 5000 bp
Rscript cnmops/rebin_cohort_reads_count.R \
  -i HapMap2_65samples_cohort_v2.0.hg38.ReadsCount.rds \
  -nwl 5000 \
  -o HapMap2_65samples_cohort_v2.0.hg38.ReadsCount.5000bp.rds

# Result: 3,088,281 bins → 617,665 bins (5x reduction)
# Last bin on chr1: 248955001-248956422 (partial bin, not extended to 248960000)
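The aggregation itself amounts to summing each run of original bins into one new bin, keeping a shorter final group rather than padding it; this mirrors how partial bins at chromosome ends are preserved. A Python sketch of the idea (the actual script is R and operates per chromosome on the cohort RDS object):

```python
def rebin_counts(counts, factor):
    """Sum consecutive read-count bins into larger bins (illustrative).

    `factor` is new_window_length // original_window_length. The last
    group may cover fewer than `factor` bins and is kept as-is.
    """
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]
```

Because every original count is summed into exactly one new bin, the total read count per sample is preserved across rebinning.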

Notes

  • See the germline and somatic CNV calling workflows published in the GitHub repository Ultimagen/healthomics-workflows for reference implementations.
  • For optimal CNV calling, use cohort-based approaches when multiple samples are available
  • Filter CNV calls using the provided LCR (low-complexity region) files to reduce false positives
  • Consider minimum CNV length thresholds based on your sequencing depth and biological context
  • The module supports both GRCh37 and GRCh38 reference genomes

Key Components

process_cnvs

Process CNV calls from cn.mops or ControlFREEC in BED format: filter by length and UG-CNV-LCR, annotate with coverage statistics, and convert to VCF format.

Note: This module is called programmatically (not via CLI) from other pipeline scripts.

Programmatic Usage

The process_cnvs module is typically invoked from other pipeline components. Here are examples:

Basic usage (minimal filtering):

from ugbio_cnv import process_cnvs

process_cnvs.run([
    "process_cnvs",
    "--input_bed_file", "cnv_calls.bed",
    "--fasta_index_file", "reference.fasta.fai",
    "--sample_name", "sample_001"
])

With LCR filtering and length thresholds:

from ugbio_cnv import process_cnvs

process_cnvs.run([
    "process_cnvs",
    "--input_bed_file", "cnv_calls.bed",
    "--cnv_lcr_file", "ug_cnv_lcr.bed",
    "--min_cnv_length", "10000",
    "--intersection_cutoff", "0.5",
    "--fasta_index_file", "reference.fasta.fai",
    "--sample_name", "sample_001",
    "--out_directory", "/path/to/output/"
])

Full pipeline with coverage annotations:

from ugbio_cnv import process_cnvs

process_cnvs.run([
    "process_cnvs",
    "--input_bed_file", "cnv_calls.bed",
    "--cnv_lcr_file", "ug_cnv_lcr.bed",
    "--min_cnv_length", "10000",
    "--sample_norm_coverage_file", "sample.normalized_coverage.bed",
    "--cohort_avg_coverage_file", "cohort.average_coverage.bed",
    "--fasta_index_file", "reference.fasta.fai",
    "--sample_name", "sample_001",
    "--out_directory", "/path/to/output/",
    "--verbosity", "INFO"
])

Input: BED file with CNV calls from cn.mops or ControlFREEC

Output: Filtered and annotated VCF file with CNV calls (.vcf.gz and .vcf.gz.tbi)
