A Python toolkit for copy number visualization and multi-sample comparison

CNVis

A lightweight Python toolkit for Copy Number Visualization and multi-sample comparison. Designed for publication-quality genome-wide plots with minimal dependencies.

Features

  • Binned coverage analysis from BedGraph, CSV, or BigWig files
  • Multi-sample coverage matrices at gene, chromosome arm, or fixed-bin resolution
  • Publication-quality genome-wide plots with chromosome-proportional layouts
  • Segment-based smoothing using ASCAT or other segmentation results
  • Built-in segmentation using PELT or CBS algorithms for quick exploration
  • Gap filtering with multiple methods (constant fill, neighbor interpolation, removal)
  • Bundled reference data including hg38 gap regions for easy filtering

Requirements

  • Python 3.7 or later (the bundled-data examples use importlib.resources.files, which requires Python 3.9+)
  • pandas, numpy, matplotlib, seaborn
  • bioframe, pyBigWig
  • ruptures (optional, for PELT segmentation)

Installation

pip install git+https://github.com/yelingqun/cnvis.git

Or download the zip from GitHub:

pip install cnvis-main.zip

Quick Start

import cnvis as cv
import pandas as pd

# Build a coverage matrix from multiple samples
matrix = cv.coverage_matrix_bins(
    input_files=['sample1.bedgraph', 'sample2.bedgraph'],
    names=['sample1', 'sample2'],
    bins_size=2_000_000
)

# Plot genome-wide coverage
genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')
cv.plot_coverage(matrix, genome_size, y_column='sample1')

Workflow Guide

Workflow 1: Multi-Sample Coverage Matrix Analysis

This workflow creates binned coverage matrices for comparing multiple samples.

Step 1: Prepare Input Files

CNVis accepts coverage files in these formats:

  • BedGraph: chrom start end value (tab-separated)
  • CSV: Must contain chrom, start, end, and a value column
  • BigWig: Standard bigWig format (.bw)
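
To make the BedGraph layout concrete, here is a minimal standard-library sketch that parses a few tab-separated records. The sample data and the helper name parse_bedgraph are made up for illustration; in practice cv.load_coverage_file does the loading, and its internals may differ.

```python
import io

# Hypothetical BedGraph content: chrom, start, end, value (tab-separated).
# BedGraph coordinates are 0-based and half-open, so length = end - start.
BEDGRAPH_TEXT = (
    "chr1\t0\t1000\t1.8\n"
    "chr1\t1000\t2000\t2.1\n"
    "chr2\t0\t1000\t3.9\n"
)

def parse_bedgraph(handle):
    """Parse BedGraph lines into a list of dicts, skipping header lines."""
    records = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split("\t")[:4]
        records.append({
            "chrom": chrom,
            "start": int(start),
            "end": int(end),
            "value": float(value),
        })
    return records

records = parse_bedgraph(io.StringIO(BEDGRAPH_TEXT))
for rec in records:
    print(rec["chrom"], rec["end"] - rec["start"], rec["value"])
```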

Step 2: Build Coverage Matrix

import cnvis as cv
import pandas as pd

# List your coverage files and sample names
input_files = [
    'sample1.bedgraph',
    'sample2.bedgraph',
    'sample3.bedgraph'
]
names = ['sample1', 'sample2', 'sample3']

# Create coverage matrix with 2Mb bins
matrix = cv.coverage_matrix_bins(
    input_files=input_files,
    names=names,
    bins_size=2_000_000,      # 2Mb bins
    max_value=8,               # Clip outliers above 8
    normalize_median=True      # Normalize each sample to median=2
)

Alternative binning options:

# By chromosome arms (p/q arms)
matrix_arms = cv.coverage_matrix_arms(input_files, names, genome='hg38')

# By gene regions
genes_df = pd.read_csv('genes.bed', sep='\t')  # chrom, start, end, name
matrix_genes = cv.coverage_matrix_genes(input_files, names, genes=genes_df)
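
For intuition about what bins_size, max_value, and normalize_median control, the following dependency-free sketch bins a handful of values. It is not CNVis's implementation: assigning each record to a bin by its start coordinate, clipping before normalizing, and the helper name bin_coverage are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, median

def bin_coverage(records, bin_size, max_value=8.0, target_median=2.0):
    """Average clipped values per (chrom, bin_start), then rescale the median.

    records: iterable of (chrom, start, value) tuples.
    Assumption: each record is assigned to the bin containing its start.
    """
    grouped = defaultdict(list)
    for chrom, start, value in records:
        bin_start = (start // bin_size) * bin_size
        grouped[(chrom, bin_start)].append(min(value, max_value))
    binned = {key: mean(vals) for key, vals in grouped.items()}
    # Rescale so the median bin equals target_median (assumes a nonzero median).
    scale = target_median / median(binned.values())
    return {key: val * scale for key, val in sorted(binned.items())}

# Hypothetical raw depth values, clipped at 30 for this toy example
toy_records = [
    ("chr1", 0, 10.0),
    ("chr1", 500_000, 12.0),
    ("chr1", 2_000_000, 20.0),
    ("chr1", 2_500_000, 24.0),
    ("chr1", 4_000_000, 100.0),
]
binned = bin_coverage(toy_records, bin_size=2_000_000, max_value=30.0)
for key, value in binned.items():
    print(key, round(value, 3))
```

The median bin ends up at the target value of 2, so the other bins can be read directly as approximate copy number relative to a diploid baseline.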

Step 3: Filter Genomic Gaps

Remove or interpolate coverage in problematic regions (centromeres, gaps, etc.):

# Load bundled hg38 gap regions (included with cnvis)
from importlib.resources import files
gap_file = files('cnvis.data').joinpath('hgTables_gap_hg38.tsv')
gap_df = pd.read_csv(gap_file, sep='\t')[['chrom', 'chromStart', 'chromEnd']]

# Or load your own gap file
# gap_df = pd.read_csv('hg38_gaps.tsv', sep='\t')[['chrom', 'chromStart', 'chromEnd']]

# Filter gaps with 100kb buffer
matrix_filtered = cv.filter_gaps(
    matrix,
    gap_df,
    buffer=100_000,        # Extend gap regions by 100kb
    method='neighbor',     # 'neighbor', 'constant', or 'remove'
    gap_value=2,           # Value to use if method='constant'
    window=3               # Window size for neighbor interpolation
)

Gap filtering methods:

  • 'neighbor': Interpolate using neighboring bin values (recommended)
  • 'constant': Fill with a fixed value (default: 2)
  • 'remove': Drop gap bins entirely from the DataFrame
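
The three methods can be sketched on a plain list of bin values. This is an illustration under stated assumptions, not the library's code: the helper names overlaps_gap and fill_gaps are hypothetical, and 'neighbor' is modeled here as the mean of non-gap bins within the window, which may differ from what CNVis does.

```python
from statistics import mean

def overlaps_gap(bin_start, bin_end, gap_start, gap_end, buffer=0):
    """True if a bin overlaps a gap region extended by `buffer` on both sides."""
    return bin_start < gap_end + buffer and bin_end > gap_start - buffer

def fill_gaps(values, is_gap, method="neighbor", gap_value=2.0, window=3):
    """Handle gap bins in a 1-D list of bin values (illustrative only)."""
    if method == "remove":
        return [v for v, gap in zip(values, is_gap) if not gap]
    if method == "constant":
        return [gap_value if gap else v for v, gap in zip(values, is_gap)]
    if method == "neighbor":
        filled = []
        for i, (v, gap) in enumerate(zip(values, is_gap)):
            if not gap:
                filled.append(v)
                continue
            lo, hi = max(0, i - window), min(len(values), i + window + 1)
            neighbors = [values[j] for j in range(lo, hi) if not is_gap[j]]
            # Fall back to the constant if no clean neighbor is in range
            filled.append(mean(neighbors) if neighbors else gap_value)
        return filled
    raise ValueError(f"unknown method: {method}")

values = [2.0, 2.2, 9.0, 0.0, 1.8, 2.0]
is_gap = [False, False, True, True, False, False]

by_neighbor = fill_gaps(values, is_gap, method="neighbor", window=1)
by_constant = fill_gaps(values, is_gap, method="constant", gap_value=2.0)
by_removal = fill_gaps(values, is_gap, method="remove")
```

Note that 'remove' changes the number of rows, which matters if several samples must stay aligned bin by bin.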

Step 4: Visualize Coverage

Single sample plot:

import pandas as pd

# Load genome size file (chrom, size columns)
genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')

# Simple plot without color mapping
cv.plot_coverage_points(
    matrix_filtered,
    genome_size,
    y_column='sample1',
    ylim=(0, 4.5),
    alpha=0.3,
    ylabel='Copy Number'
)

# Or with copy number color mapping
palette = cv.get_cn_palette()  # Get default CN color palette
matrix_filtered['color'] = matrix_filtered['sample1'].apply(cv.categorize_cn_color)

cv.plot_coverage_points(
    matrix_filtered,
    genome_size,
    y_column='sample1',
    hue_column='color',
    palette=palette,
    ylim=(0, 4.5),
    alpha=0.3,
    ylabel='Copy Number'
)

Multi-sample comparison:

# Note: as written, 'color' was derived from sample1 above, so all panels share that coloring
cv.plot_coverage_multi(
    matrix_filtered,
    genome_size,
    y_columns=['sample1', 'sample2', 'sample3'],
    ylabels=['Sample 1', 'Sample 2', 'Sample 3'],
    chrom_column='chrom',
    x1_column='start',
    hue_column='color',
    palette=palette,
    ylim=(0, 4.5),
    alpha=0.3,
    showX=False
)

Plot specific chromosomes:

cv.plot_coverage_multi(
    matrix_filtered,
    genome_size,
    y_columns=['sample1', 'sample2'],
    chrom=['chr1', 'chr2', 'chr3'],  # Only these chromosomes
    chrom_column='chrom',
    x1_column='start',
    hue_column='color',
    palette=palette
)
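
Genome-wide plots of this kind generally place each chromosome at a cumulative offset so that its width is proportional to its length. The sketch below shows that coordinate arithmetic with toy chromosome sizes; the helper names and sizes are hypothetical, and CNVis's own layout code may work differently.

```python
from itertools import accumulate

def chromosome_offsets(genome_size):
    """Map chrom -> cumulative start offset, given ordered (chrom, size) pairs."""
    chroms = [chrom for chrom, _ in genome_size]
    sizes = [size for _, size in genome_size]
    starts = [0] + list(accumulate(sizes))[:-1]
    return dict(zip(chroms, starts))

def genome_x(chrom, pos, offsets):
    """Convert a (chrom, position) pair to a single genome-wide x coordinate."""
    return offsets[chrom] + pos

# Toy sizes for illustration only (not real chromosome lengths)
toy_genome = [("chrA", 1000), ("chrB", 600), ("chrC", 400)]
offsets = chromosome_offsets(toy_genome)
total = sum(size for _, size in toy_genome)
fractions = [size / total for _, size in toy_genome]

print(offsets)
print(genome_x("chrB", 50, offsets))
print(fractions)
```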

Workflow 2: Segment-Based Smoothing with ASCAT

This workflow integrates ASCAT segmentation results to smooth coverage data.

Step 1: Load and Smooth Coverage

import cnvis as cv
import pandas as pd

# Load coverage data
cov = cv.load_coverage_file('sample.csv')

# Load segment data
segment = pd.read_csv('sample.segments.txt', sep='\t')

# Smooth toward segment medians
# smooth=0.9 means 90% toward segment median, 10% original value
cov = cv.smooth_with_segments(
    cov,                         # Coverage DataFrame
    segment,                     # Segment DataFrame
    column='value',              # Input column name
    result_column='value_smoothed',  # Output column name
    smooth=0.9                   # Smoothing factor (0-1)
)
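
Read literally, the smoothing comment above corresponds to smoothed = smooth * segment_median + (1 - smooth) * value. The sketch below applies that formula to a toy list; the helper name and the way bins are labeled with segment IDs are assumptions for illustration, not the signature of cv.smooth_with_segments.

```python
from statistics import median

def smooth_toward_median(values, segment_ids, smooth=0.9):
    """Pull each value toward the median of its segment by a factor `smooth`."""
    by_segment = {}
    for value, seg in zip(values, segment_ids):
        by_segment.setdefault(seg, []).append(value)
    medians = {seg: median(vals) for seg, vals in by_segment.items()}
    return [
        smooth * medians[seg] + (1 - smooth) * value
        for value, seg in zip(values, segment_ids)
    ]

values = [1.0, 1.2, 0.8, 1.5, 1.7, 1.3]
segment_ids = [0, 0, 0, 1, 1, 1]

smoothed = smooth_toward_median(values, segment_ids, smooth=0.9)
unchanged = smooth_toward_median(values, segment_ids, smooth=0.0)
flattened = smooth_toward_median(values, segment_ids, smooth=1.0)
```

With smooth=0 the values are returned unchanged, and with smooth=1 every bin collapses onto its segment median, so intermediate values trade noise reduction against preserving within-segment detail.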

Step 2: Filter with Blood/Normal Control (Optional)

# Load blood/normal coverage for filtering
blood_cov = cv.load_coverage_file('blood_sample.csv')

# Filter out bins with abnormal blood coverage
cov_filtered = cv.filter_cov(
    cov,
    blood_cov,
    value_column='value',
    chrom_column='chrom'
)
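
One plausible reading of "abnormal blood coverage" is that a bin is dropped when the control sample's coverage there deviates strongly from the control's own median. The sketch below implements that idea; the helper name and the 0.5x to 1.5x thresholds are hypothetical and are not taken from cv.filter_cov.

```python
from statistics import median

def filter_by_control(sample_bins, control_bins, low=0.5, high=1.5):
    """Keep sample bins whose matched control bin is near the control median.

    Both inputs map (chrom, start) -> coverage value.
    Bins missing from the control are dropped.
    """
    control_median = median(control_bins.values())
    kept = {}
    for key, value in sample_bins.items():
        control_value = control_bins.get(key)
        if control_value is None:
            continue
        ratio = control_value / control_median
        if low <= ratio <= high:
            kept[key] = value
    return kept

# Toy data: the control has an artifact at position 2000
control = {("chr1", 0): 30, ("chr1", 1000): 31, ("chr1", 2000): 90, ("chr1", 3000): 29}
sample = {("chr1", 0): 1.0, ("chr1", 1000): 1.5, ("chr1", 2000): 4.0, ("chr1", 3000): 0.9}

kept = filter_by_control(sample, control)
print(sorted(kept))
```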

Step 3: Convert to Copy Number and Assign Colors

# Convert normalized coverage to copy number (diploid = 2)
cov_filtered['cn'] = (cov_filtered['value_smoothed'] * 2).clip(upper=8)

# Calculate segment median for color assignment
cov_filtered['segment_median'] = cov_filtered.groupby('segment')['cn'].transform('median')

# Assign colors based on copy number state
cov_filtered['color'] = cov_filtered['segment_median'].apply(cv.categorize_cn_color)
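
A small worked example of the arithmetic in this step, using plain Python. The categorize helper and its category names are hypothetical stand-ins for cv.categorize_cn_color, and the 1.25 and 2.75 cutoffs are borrowed from the matrix2comut defaults listed in the API reference purely for illustration.

```python
from statistics import median

def to_copy_number(normalized, ploidy=2, cap=8.0):
    """Scale median-normalized coverage (diploid = 1.0) to copy number, then clip."""
    return min(normalized * ploidy, cap)

def categorize(cn, low=1.25, high=2.75):
    """Hypothetical three-state call; the real categories may be finer-grained."""
    if cn < low:
        return "loss"
    if cn > high:
        return "gain"
    return "neutral"

normalized = [1.0, 1.1, 0.9, 0.5, 0.45, 0.55, 5.0]
segments = ["s1", "s1", "s1", "s2", "s2", "s2", "s3"]

copy_number = [to_copy_number(v) for v in normalized]

# The per-segment median drives the color, so noisy bins in a segment agree
by_segment = {}
for cn, seg in zip(copy_number, segments):
    by_segment.setdefault(seg, []).append(cn)
segment_median = {seg: median(vals) for seg, vals in by_segment.items()}
segment_state = {seg: categorize(m) for seg, m in segment_median.items()}
print(segment_state)
```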

Step 4: Plot Smoothed Coverage

genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')  # chrom, size columns
palette = cv.get_cn_palette()  # Get default CN color palette

cv.plot_coverage(
    cov_filtered,
    genome_size,
    y_column='cn',
    s=1,                        # Point size
    ylim=(0, 4.5),
    alpha=0.3,
    hue_column='color',
    palette=palette,
    figsize=(5, 0.8),
    ylabel='Copy Number'
)

Workflow 3: Quick Segmentation with Built-in Algorithms

For quick exploration without external tools like ASCAT, CNVis provides built-in segmentation.

Step 1: Load and Normalize Coverage

import cnvis as cv

# Load coverage data
cov = cv.load_coverage_file('sample.bedgraph')

# Normalize (clip outliers, normalize to median=1)
cov = cv.normalize_coverage(cov, max_value=8, normalize_median=True)

Step 2: Run Segmentation

# PELT algorithm (fast, recommended for exploration)
segments = cv.segment_coverage(cov, method='pelt', penalty=3)

# Or CBS algorithm (classic CNV method, slower but well-established)
segments = cv.segment_coverage(cov, method='cbs', alpha=0.01)

Method comparison:

  • 'pelt': Fast change-point detection using the ruptures library. Good for quick exploration.
  • 'cbs': Circular Binary Segmentation, the classic algorithm for array CGH data (Olshen et al., 2004). Uses permutation tests for significance.

Common parameters:

  • penalty: For PELT, higher values = fewer breakpoints (default: 3)
  • alpha: For CBS, significance level (default: 0.01)
  • min_size: Minimum segment size in bins (default: 5)
  • merge_segments: Merge adjacent segments that aren't statistically different (default: True)
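
To see why a higher penalty yields fewer breakpoints, here is a small pure-Python sketch of penalized change-point detection by exhaustive optimal partitioning with a squared-error cost. PELT minimizes the same kind of penalized objective but prunes candidates for speed. This is not the code used by CNVis or ruptures, the function name is made up, and the penalty scale here is not guaranteed to match the penalty argument of cv.segment_coverage.

```python
def optimal_partition(values, penalty, min_size=1):
    """Return breakpoint indices minimizing squared error + penalty per breakpoint."""
    n = len(values)
    # Prefix sums give any segment's squared-error cost in O(1).
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def cost(start, end):
        total = prefix[end] - prefix[start]
        return (prefix_sq[end] - prefix_sq[start]) - total * total / (end - start)

    inf = float("inf")
    best = [inf] * (n + 1)  # best[t]: minimal penalized cost of values[:t]
    best[0] = -penalty      # so the first segment is not charged a penalty
    prev = [0] * (n + 1)    # prev[t]: start index of the last segment
    for end in range(min_size, n + 1):
        for start in range(0, end - min_size + 1):
            if best[start] == inf:
                continue
            candidate = best[start] + cost(start, end) + penalty
            if candidate < best[end]:
                best[end], prev[end] = candidate, start

    # Walk back through the stored segment starts to recover breakpoints
    breakpoints, t = [], n
    while t > 0:
        t = prev[t]
        if t > 0:
            breakpoints.append(t)
    return sorted(breakpoints)

# Toy signal with one level shift from about 2 to about 3 at index 6
signal = [2.0, 2.1, 1.9, 2.0, 2.1, 1.9, 3.0, 3.1, 2.9, 3.0, 3.1, 2.9]
moderate = optimal_partition(signal, penalty=0.5)
strict = optimal_partition(signal, penalty=10.0)
print(moderate, strict)
```

With the moderate penalty the single level shift is recovered; with the strict penalty the cost of adding a breakpoint outweighs the improvement in fit, so no breakpoint is reported.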

Step 3: Visualize Segments

import pandas as pd

genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')

# Plot segments as horizontal lines
cv.plot_segments(segments, genome_size, y_column='cn', ylim=(0, 4.5))

API Reference

Coverage Processing Functions

| Function | Description |
|---|---|
| normalize_coverage(track, max_value=8, normalize_median=True, target_median=1.0) | Clip and/or normalize coverage values |
| filter_gaps(df, gap, buffer=500_000, method='constant') | Filter genomic gap regions (methods: 'constant', 'neighbor', 'remove') |
| filter_cov(cov, blood_cov) | Filter using control sample |
| smooth_with_segments(cov, segment, smooth=0.9) | Segment-based smoothing |
| segment_coverage(cov, method='pelt') | Segment coverage using PELT or CBS algorithm |
| merge_similar_segments(segments, p_threshold=0.05) | Merge adjacent segments that aren't statistically different |

Coverage Matrix Functions

| Function | Description |
|---|---|
| coverage_matrix_bins(input_files, names, bins_size=2_000_000) | Create matrix with fixed-size bins |
| coverage_matrix_arms(input_files, names, genome='hg38') | Create matrix by chromosome arms |
| coverage_matrix_genes(input_files, names, genes) | Create matrix by gene regions |
| coverage_by_bins(input_file, name, bins) | Process single sample |
| matrix2comut(matrix, low=1.25, high=2.75) | Convert to CoMut format |

Plotting Functions

| Function | Description |
|---|---|
| plot_coverage(df, genome_size, y_column, ...) | Single-sample genome-wide plot (main function) |
| plot_coverage_points(df, genome_size, y_column, ...) | Scatter plot wrapper (simplified API) |
| plot_coverage_lines(df, genome_size, y_column, ...) | Line segment wrapper (simplified API) |
| plot_segments(segments, genome_size, y_column='cn', ...) | Plot segmentation results as horizontal lines |
| plot_coverage_multi(df, genome_size, y_columns, ...) | Multi-sample stacked plots |
| categorize_cn_color(value) | Map CN value to color category |
| get_cn_palette() | Get default CN color palette |
| extract_highlighted_coverage(df, highlight_df, ...) | Extract coverage from highlighted regions |

Utility Functions

| Function | Description |
|---|---|
| load_coverage_file(input_file, chrom_col, start_col, end_col, value_col) | Load BedGraph/CSV/TSV/BigWig file |
| genome_range(version='GRCh38') | Get chromosome ranges |
| genome_bins(coord_df, bin_size) | Generate genomic bins |
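
As a rough picture of what generating genomic bins involves, the sketch below tiles each chromosome with fixed-size windows. The helper name make_bins, the toy chromosome sizes, and the choice to truncate the last bin at the chromosome end are assumptions; genome_bins may return a different structure.

```python
def make_bins(genome_size, bin_size):
    """Return (chrom, start, end) tuples tiling each chromosome.

    genome_size: ordered (chrom, size) pairs.
    The last bin of each chromosome is truncated at the chromosome end.
    """
    bins = []
    for chrom, size in genome_size:
        for start in range(0, size, bin_size):
            bins.append((chrom, start, min(start + bin_size, size)))
    return bins

# Toy sizes for illustration only (not real chromosome lengths)
toy_genome = [("chrA", 5_000_000), ("chrB", 3_000_000)]
bins = make_bins(toy_genome, bin_size=2_000_000)
for chrom, start, end in bins:
    print(chrom, start, end)
```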

Bundled Reference Data

CNVis includes reference data files for hg38:

from importlib.resources import files

# hg38 gap regions (centromeres, telomeres, scaffold gaps)
gap_file = files('cnvis.data').joinpath('hgTables_gap_hg38.tsv')

# GRCh38 chromosome sizes
genome_file = files('cnvis.data').joinpath('GRCh38.genome.size.tsv')

Plot Customization

Styling Options

# Style: spine separators between chromosomes
cv.plot_coverage_multi(df, genome_size, y_columns, style='spine')

# Style: alternating background colors
cv.plot_coverage_multi(
    df, genome_size, y_columns,
    style='facecolor',
    facecolor_odd='#e6f2ff',
    facecolor_even='#ffffff'
)

Common Parameters

| Parameter | Description |
|---|---|
| ylim | Y-axis limits, e.g., (0, 4.5) |
| alpha | Point transparency (0-1) |
| s | Point size |
| figsize | Figure size as (width, height) |
| showX | Show x-axis labels |
| ylabel | Y-axis label |
| highlight_df | DataFrame of regions to highlight |
| highlight_color | Color for highlighted regions |

Plot Type Selection

CNVis provides wrapper functions for common plot types:

# Scatter plot (points) - best for binned coverage data
cv.plot_coverage_points(df, genome_size, y_column='value', alpha=0.3)

# Line segments - best for segment-level data with start/end coordinates
cv.plot_coverage_lines(df, genome_size, y_column='value', x2_column='end')

# Full control - use the main function directly (further keyword options omitted here)
cv.plot_coverage(df, genome_size, y_column='value', x2_column='end')

Example Notebooks

See the notebooks/ directory for complete examples:

  • segmentation_algorithms_explained.ipynb - In-depth guide to PELT and CBS algorithms
  • test_segments.ipynb - Quick segmentation usage examples
  • test_coverage_matrix_2m.ipynb - Multi-sample coverage analysis
  • test_coverage_plot_smoothed.ipynb - Segment-based smoothing
  • test_coverage_matrix_plot_hic_vs_wgs.ipynb - HiC vs WGS comparison
  • test_pacbio_coverage_plot.ipynb - Long-read coverage plotting
