A Python toolkit for copy number visualization and multi-sample comparison
CNVis
A lightweight Python toolkit for Copy Number Visualization and multi-sample comparison. Designed for publication-quality genome-wide plots with minimal dependencies.
Features
- Binned coverage analysis from BedGraph, CSV, or BigWig files
- Multi-sample coverage matrices at gene, chromosome arm, or fixed-bin resolution
- Publication-quality genome-wide plots with chromosome-proportional layouts
- Segment-based smoothing using ASCAT or other segmentation results
- Built-in segmentation using PELT or CBS algorithms for quick exploration
- Gap filtering with multiple methods (constant fill, neighbor interpolation, removal)
- Bundled reference data including hg38 gap regions for easy filtering
Requirements
- Python 3.7 or later
- pandas, numpy, matplotlib, seaborn
- bioframe, pyBigWig
- ruptures (optional, for PELT segmentation)
Installation
pip install git+https://github.com/yelingqun/cnvis.git
Or download the zip from GitHub:
pip install cnvis-main.zip
Quick Start
import cnvis as cv
import pandas as pd
# Build a coverage matrix from multiple samples
matrix = cv.coverage_matrix_bins(
input_files=['sample1.bedgraph', 'sample2.bedgraph'],
names=['sample1', 'sample2'],
bins_size=2_000_000
)
# Plot genome-wide coverage
genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')
cv.plot_coverage(matrix, genome_size, y_column='sample1')
Workflow Guide
Workflow 1: Multi-Sample Coverage Matrix Analysis
This workflow creates binned coverage matrices for comparing multiple samples.
Step 1: Prepare Input Files
CNVis accepts coverage files in these formats:
- BedGraph: chrom start end value (tab-separated)
- CSV: Must contain chrom, start, end, and a value column
- BigWig: Standard bigWig format (.bw)
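To make the expected BedGraph layout concrete, here is a minimal sketch that parses a toy BedGraph snippet with plain pandas. It does not use `load_coverage_file`; the column names simply follow the list above, and the toy values are invented for illustration. Real BedGraph files may also carry `track` or `browser` header lines, which this sketch does not handle.

```python
import io
import pandas as pd

# Toy BedGraph content (invented values): tab-separated, no header row.
bedgraph_text = (
    "chr1\t0\t1000\t1.8\n"
    "chr1\t1000\t2000\t2.1\n"
    "chr2\t0\t1000\t3.9\n"
)

# Name the four columns the same way the format list above does.
cov = pd.read_csv(
    io.StringIO(bedgraph_text),
    sep="\t",
    header=None,
    names=["chrom", "start", "end", "value"],
)
print(cov)
```

A CSV input would look the same after loading, except that the column names come from the file's own header row.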
Step 2: Build Coverage Matrix
import cnvis as cv
import pandas as pd # Needed below for genes_df and gap_df
# List your coverage files and sample names
input_files = [
'sample1.bedgraph',
'sample2.bedgraph',
'sample3.bedgraph'
]
names = ['sample1', 'sample2', 'sample3']
# Create coverage matrix with 2Mb bins
matrix = cv.coverage_matrix_bins(
input_files=input_files,
names=names,
bins_size=2_000_000, # 2Mb bins
max_value=8, # Clip outliers above 8
normalize_median=True # Normalize each sample to median=2
)
Alternative binning options:
# By chromosome arms (p/q arms)
matrix_arms = cv.coverage_matrix_arms(input_files, names, genome='hg38')
# By gene regions
genes_df = pd.read_csv('genes.bed', sep='\t') # chrom, start, end, name
matrix_genes = cv.coverage_matrix_genes(input_files, names, genes=genes_df)
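As an illustration of what fixed-size binning means, the following sketch aggregates toy coverage intervals into 2 Mb bins with a length-weighted mean. This is an assumption about the general idea, not the implementation of `coverage_matrix_bins`; it also assumes that no interval straddles a bin boundary, and the `bin_start`, `length`, and `weighted` columns are names chosen for this example only.

```python
import pandas as pd

# Toy coverage intervals for one sample (invented values).
cov = pd.DataFrame({
    "chrom": ["chr1"] * 4,
    "start": [0, 500_000, 2_000_000, 3_000_000],
    "end":   [500_000, 2_000_000, 3_000_000, 4_000_000],
    "value": [2.0, 4.0, 1.0, 3.0],
})
bin_size = 2_000_000

# Assign each interval to the bin containing its start coordinate.
cov["bin_start"] = (cov["start"] // bin_size) * bin_size
cov["length"] = cov["end"] - cov["start"]
cov["weighted"] = cov["value"] * cov["length"]

# Length-weighted mean per bin: sum(value * length) / sum(length).
grouped = cov.groupby(["chrom", "bin_start"], as_index=False)[["weighted", "length"]].sum()
grouped["sample1"] = grouped["weighted"] / grouped["length"]
binned = grouped[["chrom", "bin_start", "sample1"]]
print(binned)
```

Repeating this per sample and joining on the bin coordinates would give one column per sample, which is the shape the plotting examples below expect.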
Step 3: Filter Genomic Gaps
Remove or interpolate coverage in problematic regions (centromeres, gaps, etc.):
# Load bundled hg38 gap regions (included with cnvis)
# Note: importlib.resources.files is available from Python 3.9 onward
from importlib.resources import files
gap_file = files('cnvis.data').joinpath('hgTables_gap_hg38.tsv')
gap_df = pd.read_csv(gap_file, sep='\t')[['chrom', 'chromStart', 'chromEnd']]
# Or load your own gap file
# gap_df = pd.read_csv('hg38_gaps.tsv', sep='\t')[['chrom', 'chromStart', 'chromEnd']]
# Filter gaps with 100kb buffer
matrix_filtered = cv.filter_gaps(
matrix,
gap_df,
buffer=100_000, # Extend gap regions by 100kb
method='neighbor', # 'neighbor', 'constant', or 'remove'
gap_value=2, # Value to use if method='constant'
window=3 # Window size for neighbor interpolation
)
Gap filtering methods:
- 'neighbor': Interpolate using neighboring bin values (recommended)
- 'constant': Fill with a fixed value (default: 2)
- 'remove': Drop gap bins entirely from the DataFrame
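The three methods can be illustrated on a toy table with plain pandas. This is a sketch under stated assumptions, not the code behind `filter_gaps`: the overlap test is an assumption, and plain linear interpolation stands in for the windowed neighbor interpolation, whose exact behavior is not described here.

```python
import numpy as np
import pandas as pd

# Toy 2 Mb bins on one chromosome; the third bin sits on a gap (invented values).
bins = pd.DataFrame({
    "chrom": ["chr1"] * 5,
    "start": [0, 2_000_000, 4_000_000, 6_000_000, 8_000_000],
    "end":   [2_000_000, 4_000_000, 6_000_000, 8_000_000, 10_000_000],
    "sample1": [2.0, 3.0, 0.1, 2.0, 2.0],
})
gaps = pd.DataFrame({"chrom": ["chr1"], "chromStart": [4_500_000], "chromEnd": [5_000_000]})
buffer = 100_000

# Mark every bin that overlaps a gap extended by `buffer` on both sides.
in_gap = np.zeros(len(bins), dtype=bool)
for gap in gaps.itertuples(index=False):
    in_gap |= (
        (bins["chrom"] == gap.chrom)
        & (bins["start"] < gap.chromEnd + buffer)
        & (bins["end"] > gap.chromStart - buffer)
    ).to_numpy()

# 'constant': overwrite gap bins with a fixed value.
constant = bins["sample1"].mask(in_gap, 2.0)
# 'neighbor' (stand-in): blank gap bins, then interpolate from adjacent bins.
neighbor = bins["sample1"].mask(in_gap).interpolate(limit_direction="both")
# 'remove': drop gap bins from the table.
removed = bins.loc[~in_gap].reset_index(drop=True)

print(constant.tolist(), neighbor.tolist(), len(removed))
```

With several chromosomes, the interpolation would need to run per chromosome so that values are not carried across chromosome boundaries.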
Step 4: Visualize Coverage
Single sample plot:
import pandas as pd
# Load genome size file (chrom, size columns)
genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')
# Simple plot without color mapping
cv.plot_coverage_points(
matrix_filtered,
genome_size,
y_column='sample1',
ylim=(0, 4.5),
alpha=0.3,
ylabel='Copy Number'
)
# Or with copy number color mapping
palette = cv.get_cn_palette() # Get default CN color palette
matrix_filtered['color'] = matrix_filtered['sample1'].apply(cv.categorize_cn_color)
cv.plot_coverage_points(
matrix_filtered,
genome_size,
y_column='sample1',
hue_column='color',
palette=palette,
ylim=(0, 4.5),
alpha=0.3,
ylabel='Copy Number'
)
Multi-sample comparison:
cv.plot_coverage_multi(
matrix_filtered,
genome_size,
y_columns=['sample1', 'sample2', 'sample3'],
ylabels=['Sample 1', 'Sample 2', 'Sample 3'],
chrom_column='chrom',
x1_column='start',
hue_column='color',
palette=palette,
ylim=(0, 4.5),
alpha=0.3,
showX=False
)
Plot specific chromosomes:
cv.plot_coverage_multi(
matrix_filtered,
genome_size,
y_columns=['sample1', 'sample2'],
chrom=['chr1', 'chr2', 'chr3'], # Only these chromosomes
chrom_column='chrom',
x1_column='start',
hue_column='color',
palette=palette
)
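The genome-wide plots in this workflow place chromosomes side by side in proportion to their sizes. A minimal sketch of how a genome-wide x coordinate could be derived from a `chrom`/`size` table is shown below; the toy sizes and the `offset` and `genome_x` columns are assumptions for illustration, and the plotting functions may compute their layout differently.

```python
import pandas as pd

# Toy genome size table with the chrom/size columns described above.
genome_size = pd.DataFrame({
    "chrom": ["chr1", "chr2", "chr3"],
    "size": [1000, 800, 600],
})

# Offset of a chromosome = total length of all chromosomes before it.
genome_size["offset"] = genome_size["size"].cumsum() - genome_size["size"]
offsets = dict(zip(genome_size["chrom"], genome_size["offset"]))

# Shift each bin's start by its chromosome offset to get one shared x axis.
bins = pd.DataFrame({
    "chrom": ["chr1", "chr2", "chr3"],
    "start": [100, 100, 100],
})
bins["genome_x"] = bins["start"] + bins["chrom"].map(offsets)
print(bins)
```

Restricting a plot to a few chromosomes amounts to computing the same offsets on the filtered size table.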
Workflow 2: Segment-Based Smoothing with ASCAT
This workflow integrates ASCAT segmentation results to smooth coverage data.
Step 1: Load and Smooth Coverage
import cnvis as cv
import pandas as pd
# Load coverage data
cov = cv.load_coverage_file('sample.csv')
# Load segment data
segment = pd.read_csv('sample.segments.txt', sep='\t')
# Smooth toward segment medians
# smooth=0.9 means 90% toward segment median, 10% original value
cov = cv.smooth_with_segments(
cov, # Coverage DataFrame
segment, # Segment DataFrame
column='value', # Input column name
result_column='value_smoothed', # Output column name
smooth=0.9 # Smoothing factor (0-1)
)
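The comment above describes the smoothing factor as a weighted blend: `smooth=0.9` moves each value 90% of the way toward its segment median. A minimal sketch of that blend in plain pandas follows. It assumes each bin already carries a `segment` label, which skips the step of matching bins to segment coordinates, and it is not the implementation of `smooth_with_segments`.

```python
import pandas as pd

# Toy coverage bins that already carry a segment label (invented values).
cov = pd.DataFrame({
    "segment": ["A", "A", "A", "B", "B"],
    "value": [0.9, 1.0, 1.2, 1.5, 1.7],
})
smooth = 0.9

# Median of each segment, broadcast back onto its bins.
seg_median = cov.groupby("segment")["value"].transform("median")

# Blend: 90% segment median, 10% original value.
cov["value_smoothed"] = smooth * seg_median + (1 - smooth) * cov["value"]
print(cov)
```

With `smooth=0` the values are unchanged, and with `smooth=1` every bin collapses onto its segment median.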
Step 2: Filter with Blood/Normal Control (Optional)
# Load blood/normal coverage for filtering
blood_cov = cv.load_coverage_file('blood_sample.csv')
# Filter out bins with abnormal blood coverage
cov_filtered = cv.filter_cov(
cov,
blood_cov,
value_column='value',
chrom_column='chrom'
)
Step 3: Convert to Copy Number and Assign Colors
# Convert normalized coverage to copy number (diploid = 2)
cov_filtered['cn'] = (cov_filtered['value_smoothed'] * 2).clip(upper=8)
# Calculate segment median for color assignment
cov_filtered['segment_median'] = cov_filtered.groupby('segment')['cn'].transform('median')
# Assign colors based on copy number state
cov_filtered['color'] = cov_filtered['segment_median'].apply(cv.categorize_cn_color)
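The same three steps can be traced on a toy table. In this sketch `toy_cn_category` is a hypothetical stand-in for `categorize_cn_color`: its category names and its thresholds (1.25 and 2.75, borrowed from the `matrix2comut` defaults listed in the API reference) are assumptions, not the library's actual mapping.

```python
import pandas as pd

def toy_cn_category(cn, low=1.25, high=2.75):
    """Hypothetical three-way split; thresholds are assumed, not taken from cnvis."""
    if cn < low:
        return "loss"
    if cn > high:
        return "gain"
    return "neutral"

# Toy smoothed coverage, normalized so that diploid regions sit near 1.0.
cov = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "C"],
    "value_smoothed": [1.0, 1.02, 0.5, 0.52, 5.0],
})

# Coverage of 1.0 corresponds to two copies; cap extreme values at 8.
cov["cn"] = (cov["value_smoothed"] * 2).clip(upper=8)

# Color by segment median so that every bin in a segment shares one category.
cov["segment_median"] = cov.groupby("segment")["cn"].transform("median")
cov["category"] = cov["segment_median"].apply(toy_cn_category)
print(cov)
```

Categorizing on the segment median rather than on each bin keeps noisy individual bins from switching color within a segment.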
Step 4: Plot Smoothed Coverage
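# Load genome size file (chrom, size columns), as in Workflow 1
genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')
palette = cv.get_cn_palette() # Get default CN color palette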
cv.plot_coverage(
cov_filtered,
genome_size,
y_column='cn',
s=1, # Point size
ylim=(0, 4.5),
alpha=0.3,
hue_column='color',
palette=palette,
figsize=(5, 0.8),
ylabel='Copy Number'
)
Workflow 3: Quick Segmentation with Built-in Algorithms
For quick exploration without external tools like ASCAT, CNVis provides built-in segmentation.
Step 1: Load and Normalize Coverage
import cnvis as cv
# Load coverage data
cov = cv.load_coverage_file('sample.bedgraph')
# Normalize (clip outliers, normalize to median=1)
cov = cv.normalize_coverage(cov, max_value=8, normalize_median=True)
Step 2: Run Segmentation
# PELT algorithm (fast, recommended for exploration)
segments = cv.segment_coverage(cov, method='pelt', penalty=3)
# Or CBS algorithm (classic CNV method, slower but well-established)
segments = cv.segment_coverage(cov, method='cbs', alpha=0.01)
Method comparison:
- 'pelt': Fast change-point detection using the ruptures library. Good for quick exploration.
- 'cbs': Circular Binary Segmentation, the classic algorithm for array CGH data (Olshen et al., 2004). Uses permutation tests for significance.
Common parameters:
- penalty: For PELT, higher values = fewer breakpoints (default: 3)
- alpha: For CBS, significance level (default: 0.01)
- min_size: Minimum segment size in bins (default: 5)
- merge_segments: Merge adjacent segments that aren't statistically different (default: True)
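To show why a higher penalty yields fewer breakpoints, here is a small self-contained sketch of penalized least-squares change-point detection. It is the plain O(n²) dynamic program that PELT accelerates with pruning, written in NumPy only; it is not the ruptures implementation and not the code behind `segment_coverage`. The function name `optimal_partition` is hypothetical, and the penalty here is on the squared-error scale, which may differ from the scale used by the `penalty` argument above.

```python
import numpy as np

def optimal_partition(signal, penalty, min_size=1):
    """Return segment end indices minimizing squared error + penalty per segment."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    # Prefix sums give the squared-error cost of any segment in O(1).
    s1 = np.concatenate([[0.0], np.cumsum(x)])
    s2 = np.concatenate([[0.0], np.cumsum(x ** 2)])

    def cost(i, j):
        # Sum of squared deviations from the mean over x[i:j].
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[j] = minimal penalized cost of segmenting x[:j].
    best = np.full(n + 1, np.inf)
    best[0] = -penalty  # so that a single segment pays no penalty overall
    last = np.zeros(n + 1, dtype=int)
    for j in range(min_size, n + 1):
        for i in range(0, j - min_size + 1):
            if not np.isfinite(best[i]):
                continue
            candidate = best[i] + cost(i, j) + penalty
            if candidate < best[j]:
                best[j] = candidate
                last[j] = i

    # Walk back through the stored split points.
    ends = []
    j = n
    while j > 0:
        ends.append(int(j))
        j = int(last[j])
    return sorted(ends)

# Toy signal: 20 bins at copy number 2 followed by 20 bins at copy number 3.
signal = [2.0] * 20 + [3.0] * 20
low_penalty = optimal_partition(signal, penalty=3, min_size=5)
high_penalty = optimal_partition(signal, penalty=50, min_size=5)
print(low_penalty, high_penalty)
```

With the low penalty the step at bin 20 is kept as a breakpoint; with the high penalty the cost of an extra segment outweighs the improved fit and the whole signal stays one segment.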
Step 3: Visualize Segments
import pandas as pd
genome_size = pd.read_csv('hg38_genome_size.tsv', sep='\t')
# Plot segments as horizontal lines
cv.plot_segments(segments, genome_size, y_column='cn', ylim=(0, 4.5))
API Reference
Coverage Processing Functions
| Function | Description |
|---|---|
| normalize_coverage(track, max_value=8, normalize_median=True, target_median=1.0) | Clip and/or normalize coverage values |
| filter_gaps(df, gap, buffer=500_000, method='constant') | Filter genomic gap regions (methods: 'constant', 'neighbor', 'remove') |
| filter_cov(cov, blood_cov) | Filter using control sample |
| smooth_with_segments(cov, segment, smooth=0.9) | Segment-based smoothing |
| segment_coverage(cov, method='pelt') | Segment coverage using PELT or CBS algorithm |
| merge_similar_segments(segments, p_threshold=0.05) | Merge adjacent segments that aren't statistically different |
Coverage Matrix Functions
| Function | Description |
|---|---|
| coverage_matrix_bins(input_files, names, bins_size=2_000_000) | Create matrix with fixed-size bins |
| coverage_matrix_arms(input_files, names, genome='hg38') | Create matrix by chromosome arms |
| coverage_matrix_genes(input_files, names, genes) | Create matrix by gene regions |
| coverage_by_bins(input_file, name, bins) | Process single sample |
| matrix2comut(matrix, low=1.25, high=2.75) | Convert to CoMut format |
Plotting Functions
| Function | Description |
|---|---|
| plot_coverage(df, genome_size, y_column, ...) | Single-sample genome-wide plot (main function) |
| plot_coverage_points(df, genome_size, y_column, ...) | Scatter plot wrapper (simplified API) |
| plot_coverage_lines(df, genome_size, y_column, ...) | Line segment wrapper (simplified API) |
| plot_segments(segments, genome_size, y_column='cn', ...) | Plot segmentation results as horizontal lines |
| plot_coverage_multi(df, genome_size, y_columns, ...) | Multi-sample stacked plots |
| categorize_cn_color(value) | Map CN value to color category |
| get_cn_palette() | Get default CN color palette |
| extract_highlighted_coverage(df, highlight_df, ...) | Extract coverage from highlighted regions |
Utility Functions
| Function | Description |
|---|---|
| load_coverage_file(input_file, chrom_col, start_col, end_col, value_col) | Load BedGraph/CSV/TSV/BigWig file |
| genome_range(version='GRCh38') | Get chromosome ranges |
| genome_bins(coord_df, bin_size) | Generate genomic bins |
Bundled Reference Data
CNVis includes reference data files for hg38:
from importlib.resources import files
# hg38 gap regions (centromeres, telomeres, scaffold gaps)
gap_file = files('cnvis.data').joinpath('hgTables_gap_hg38.tsv')
# GRCh38 chromosome sizes
genome_file = files('cnvis.data').joinpath('GRCh38.genome.size.tsv')
Plot Customization
Styling Options
# Style: spine separators between chromosomes
cv.plot_coverage_multi(df, genome_size, y_columns, style='spine')
# Style: alternating background colors
cv.plot_coverage_multi(
df, genome_size, y_columns,
style='facecolor',
facecolor_odd='#e6f2ff',
facecolor_even='#ffffff'
)
Common Parameters
| Parameter | Description |
|---|---|
| ylim | Y-axis limits, e.g., (0, 4.5) |
| alpha | Point transparency (0-1) |
| s | Point size |
| figsize | Figure size as (width, height) |
| showX | Show x-axis labels |
| ylabel | Y-axis label |
| highlight_df | DataFrame of regions to highlight |
| highlight_color | Color for highlighted regions |
Plot Type Selection
CNVis provides wrapper functions for common plot types:
# Scatter plot (points) - best for binned coverage data
cv.plot_coverage_points(df, genome_size, y_column='value', alpha=0.3)
# Line segments - best for segment-level data with start/end coordinates
cv.plot_coverage_lines(df, genome_size, y_column='value', x2_column='end')
# Full control - use the main function directly (it accepts further keyword arguments)
cv.plot_coverage(df, genome_size, y_column='value', x2_column='end')
Example Notebooks
See the notebooks/ directory for complete examples:
- segmentation_algorithms_explained.ipynb - In-depth guide to PELT and CBS algorithms
- test_segments.ipynb - Quick segmentation usage examples
- test_coverage_matrix_2m.ipynb - Multi-sample coverage analysis
- test_coverage_plot_smoothed.ipynb - Segment-based smoothing
- test_coverage_matrix_plot_hic_vs_wgs.ipynb - HiC vs WGS comparison
- test_pacbio_coverage_plot.ipynb - Long-read coverage plotting