
cnvturbo


cnvturbo — A Python re-implementation of R inferCNV for single-cell RNA-seq copy-number variation analysis. Algorithmically faithful to R inferCNV's HMM i6 pipeline, ~100× faster, and fully integrated with the Scanpy / AnnData ecosystem.

Rewritten in pure Python with R-exact algorithm alignment (hspike emission calibration, gene-level Viterbi in copy-ratio space, R-equivalent denoise + subcluster Tumor calling), plus Numba/CUDA-accelerated kernels.


Why cnvturbo?

| Feature | R inferCNV | infercnvpy | cnvturbo |
| --- | --- | --- | --- |
| Cell-level Tumor/Normal HMM | ✓ | ✗ (cluster score only) | ✓ |
| HMM i6 + hspike emission | ✓ | ✗ | ✓ (analytic + MAD-robust) |
| Per-chromosome Viterbi (copy-ratio) | ✓ | ✗ | ✓ |
| Denoise (segment-length filter) | ✓ | ✗ | ✓ |
| Reference subcluster handling | ✓ | partial | ✓ |
| GPU / Numba acceleration | ✗ | ✗ | ✓ |
| Runtime (P12, 7,269 cells) | ~5 hr | ~9 min | ~86 s |
| Cell-level concordance with R | 1.000 (ref) | 0.81 | 1.000 |

Verified on 3 PDAC samples (15,135 cells total): cell-level Tumor/Normal classification 100% identical to R inferCNV's HMM output, while running 140–230× faster. See Benchmark below.


Installation

From PyPI (recommended)

pip install cnvturbo

With acceleration backends

# CPU acceleration (Numba)
pip install "cnvturbo[hmm-cpu]"

# GPU acceleration (PyTorch)
pip install "cnvturbo[hmm-gpu]"

# All accelerators + EM fitting
pip install "cnvturbo[hmm]"

Development install

git clone https://github.com/LogicByteCraft/cnvturbo.git
cd cnvturbo
pip install -e ".[dev,test]"

Requirements

  • Python ≥ 3.10
  • scanpy ≥ 1.10, anndata ≥ 0.7.3, numpy ≥ 1.20, pandas ≥ 1
  • Optional: numba ≥ 0.57 (CPU), torch ≥ 2.0 (GPU), hmmlearn ≥ 0.3 (EM)

Quick start

import scanpy as sc
import cnvturbo
from cnvturbo import tl as cnv_tl, pl as cnv_pl

adata = sc.read_h5ad("my_sample.h5ad")
adata.layers["counts"] = adata.X.copy()

cnv_tl.infercnv_r_compat(
    adata,
    raw_layer="counts",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial", "Fibroblast"],
    window_size=101,
    apply_2x_transform=True,
    n_jobs=16,
)

emit_means, emit_stds = cnv_tl.compute_hspike_emission_params(
    adata,
    raw_layer="counts",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial", "Fibroblast"],
    output_space="copy_ratio",
)

cnv_tl.hmm_call_subclusters(
    adata,
    use_rep="cnv",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial", "Fibroblast"],
    precomputed_emit_means=emit_means,
    precomputed_emit_stds=emit_stds,
    leiden_resolution="auto",
    cluster_by_groups=True,
    min_segment_length=5,
    min_segments_for_tumor=1,
    key_added="cnv_call",
    n_jobs=16,
)

print(adata.obs["cnv_call"].value_counts())

After this, adata.obs["cnv_call"] contains "Tumor" / "Normal" per cell, and adata.obs["cnv_call_score"] carries a continuous CNV burden score (mean(|X_cnv − 1.0|) in copy-ratio space).
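The burden score can be reproduced by hand. A minimal sketch (`cnv_burden` is an illustrative helper, not part of the cnvturbo API):

```python
import numpy as np

def cnv_burden(X_cnv: np.ndarray) -> np.ndarray:
    """Mean absolute deviation from the neutral copy ratio (1.0), per cell."""
    return np.abs(X_cnv - 1.0).mean(axis=1)

# Toy example: 2 cells x 4 genes in copy-ratio space.
X = np.array([[1.0, 1.0, 1.5, 0.5],   # cell with one gain and one loss
              [1.0, 1.0, 1.0, 1.0]])  # perfectly neutral cell
print(cnv_burden(X))  # -> [0.25 0.  ]
```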


Detailed usage

1. Prepare AnnData

cnvturbo requires:

  • Raw integer counts in adata.X or adata.layers["counts"].
  • Gene coordinates in adata.var: columns chromosome, start, end.
  • A reference annotation in adata.obs: a column identifying normal cells (e.g., NK / Endothelial / Fibroblast).

Add gene coordinates from a GTF:

from cnvturbo.io import genomic_position_from_gtf

genomic_position_from_gtf(
    gtf_file="Homo_sapiens.GRCh38.110.gtf.gz",
    adata=adata,
)

2. R-compatible preprocessing (infercnv_r_compat)

Reproduces R inferCNV's 8-step pipeline exactly:

  1. Library-size normalization → median depth
  2. log2(x + 1)
  3. First reference subtraction (gene-space, "bounds" mode)
  4. Clip to ±3 (default)
  5. Per-chromosome same-length pyramid smoothing (window=101)
  6. Per-cell median centering
  7. Second reference subtraction (gene-space)
  8. 2^x → copy-ratio (neutral ≈ 1.0)

cnv_tl.infercnv_r_compat(
    adata,
    raw_layer="counts",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial"],
    max_ref_threshold=3.0,
    window_size=101,
    exclude_chromosomes=("chrX", "chrY"),
    apply_2x_transform=True,
    n_jobs=16,
    key_added="cnv",
)

Output:

  • adata.obsm["X_cnv"] — (n_cells × n_genes_filtered) copy-ratio matrix
  • adata.uns["cnv"]["chr_pos"] — gene-level chromosome offsets
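The eight steps can be outlined in plain NumPy. This is an illustrative simplification for a single chromosome (`r_compat_sketch` is not the actual cnvturbo implementation; the pyramid smoothing is approximated here with a triangular kernel):

```python
import numpy as np

def r_compat_sketch(counts, ref_mask, window=101, clip=3.0):
    """Illustrative outline of the 8-step pipeline for one chromosome.

    counts: (n_cells, n_genes) raw counts; ref_mask: boolean reference cells.
    """
    # 1. library-size normalization to the median depth
    depth = counts.sum(axis=1, keepdims=True)
    x = counts / depth * np.median(depth)
    # 2. log transform
    x = np.log2(x + 1.0)
    # 3. first reference subtraction (per gene)
    x = x - x[ref_mask].mean(axis=0)
    # 4. clip extreme values
    x = np.clip(x, -clip, clip)
    # 5. same-length pyramid smoothing (triangular kernel along the gene axis)
    half = window // 2
    k = np.concatenate([np.arange(1.0, half + 2), np.arange(half, 0.0, -1)])
    k /= k.sum()
    x = np.stack([np.convolve(np.pad(row, half, mode="edge"), k, mode="valid")
                  for row in x])
    # 6. per-cell median centering
    x -= np.median(x, axis=1, keepdims=True)
    # 7. second reference subtraction (per gene)
    x -= x[ref_mask].mean(axis=0)
    # 8. 2^x -> copy-ratio space, neutral ~ 1.0
    return 2.0 ** x

counts = np.random.default_rng(0).poisson(10, size=(6, 20)).astype(float)
ref = np.array([True] * 3 + [False] * 3)
ratios = r_compat_sketch(counts, ref, window=5)
print(ratios.shape)  # -> (6, 20)
```

By construction, the reference cells end up with a per-gene mean of exactly 0 in log space after step 7, i.e. a neutral copy ratio of ~1.0.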

3. hspike emission calibration (compute_hspike_emission_params)

Mirrors R's hidden_spike simulation: builds a synthetic genome (50% CNV / 50% neutral chromosomes), samples the simulation base from real reference cells, runs the full pipeline, and extracts emission parameters per CNV state.

emit_means, emit_stds = cnv_tl.compute_hspike_emission_params(
    adata,
    raw_layer="counts",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial"],
    n_sim_cells=100,
    n_genes_per_chr=400,
    output_space="copy_ratio",
)

4. HMM cell-level Tumor calling (hmm_call_subclusters)

R-equivalent decoder: per-group Leiden subclustering (cluster_by_groups=True, auto resolution), per-chromosome Viterbi with R's pnorm-based emission, segment-length denoise, "subcluster contains ≥1 CNV segment ⇒ Tumor" rule.

cnv_tl.hmm_call_subclusters(
    adata,
    use_rep="cnv",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial"],
    precomputed_emit_means=emit_means,
    precomputed_emit_stds=emit_stds,
    leiden_resolution="auto",
    cluster_by_groups=True,
    n_neighbors=20,
    n_pcs=10,
    min_segment_length=5,
    min_segments_for_tumor=1,
    use_r_viterbi=True,
    key_added="cnv_call",
    backend="auto",
    n_jobs=16,
)

Output (added to adata.obs):

  • cnv_call — "Tumor" / "Normal" per cell
  • cnv_call_score — continuous CNV burden (mean(|X_cnv − 1.0|))
  • cnv_call_subcluster — Leiden subcluster id used for HMM
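The decoding logic can be illustrated with a simplified 3-state (loss / neutral / gain) sketch. `viterbi_gaussian` and `denoise` below are illustrative only, not the full six-state i6 implementation:

```python
import numpy as np

def viterbi_gaussian(obs, means, stds, p_stay=0.99):
    """Viterbi decoding with Gaussian emissions (simplified 3-state sketch)."""
    means, stds = np.asarray(means, float), np.asarray(stds, float)
    n = len(means)
    # Sticky transition matrix: strong prior for staying in the current state.
    logA = np.full((n, n), np.log((1.0 - p_stay) / (n - 1)))
    np.fill_diagonal(logA, np.log(p_stay))
    # Gaussian log-density of each observation under each state.
    logB = (-0.5 * ((obs[:, None] - means) / stds) ** 2
            - np.log(stds * np.sqrt(2.0 * np.pi)))
    T = len(obs)
    delta = np.empty((T, n))
    psi = np.zeros((T, n), dtype=int)
    delta[0] = -np.log(n) + logB[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA  # (from_state, to_state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

def denoise(path, neutral=1, min_len=5):
    """Revert CNV segments shorter than min_len genes to the neutral state."""
    out, t = path.copy(), 0
    while t < len(out):
        j = t
        while j < len(out) and out[j] == out[t]:
            j += 1
        if out[t] != neutral and j - t < min_len:
            out[t:j] = neutral
        t = j
    return out

# Toy chromosome in copy-ratio space: neutral run, a 10-gene gain, neutral again.
obs = np.r_[np.full(20, 1.0), np.full(10, 1.5), np.full(20, 1.0)]
states = denoise(viterbi_gaussian(obs, means=[0.5, 1.0, 1.5], stds=[0.1, 0.1, 0.1]))
print(np.any(states != 1))  # -> True: the gain segment survives the length filter
```

Under the "subcluster contains ≥1 CNV segment ⇒ Tumor" rule, this toy subcluster would be called Tumor, since the 10-gene gain exceeds `min_segment_length`.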

5. Visualization

cnv_tl.pca(adata, use_rep="cnv")
cnv_tl.umap(adata)
cnv_pl.chromosome_heatmap(adata, groupby="cnv_call")

import scanpy as sc
sc.pl.embedding(adata, basis="cnv_umap", color=["cnv_call", "cnv_call_score"])

Benchmark

Three pancreatic adenocarcinoma samples (P07 = 3,659 cells, P12 = 7,269 cells, P30 = 4,207 cells); reference group = NK + Endothelial + Fibroblast (~50% of all cells).

| Sample | R inferCNV runtime | cnvturbo runtime | Speed-up | Cell-level accuracy vs R |
| --- | --- | --- | --- | --- |
| P07CRX_T (3,659 cells) | 2.5 h | 64 s | 140× | 1.000 |
| P12HWZ_T (7,269 cells) | 5.0 h | 86 s | 210× | 1.000 |
| P30WJJ_T (4,207 cells) | 3.5 h | 54 s | 230× | 1.000 |

cnvturbo's per-cell Tumor / Normal classification is identical to R inferCNV's HMM output across all 15,135 cells.

The "ground truth" was reconstructed directly from R's pred_cnv_regions.dat + cell_groupings to bypass a known fuzzy-match bug in some user post-processing scripts.


API overview

cnvturbo
├── tl                              # tools
│   ├── infercnv                    # original sliding-window scoring
│   ├── infercnv_r_compat           # R-exact 8-step pipeline (recommended)
│   ├── compute_hspike_emission_params  # hspike-based HMM emission calibration
│   ├── hmm_call_subclusters        # subcluster-level R-equivalent HMM caller
│   ├── hmm_call_cells              # cell-level HMM caller (no subclustering)
│   ├── cnv_score, cnv_score_cell   # CNV burden scores
│   ├── ithcna, ithgex              # intra-tumor heterogeneity
│   ├── pca, umap, tsne, leiden     # CNV-space embeddings (Scanpy wrappers)
│   └── copykat                     # CopyKAT integration (optional, requires R)
├── pp                              # preprocessing utilities
├── pl                              # plotting
├── io                              # GTF / genomic-position helpers
└── datasets                        # bundled tutorial data

Design highlights

  • R-exact pipeline: infercnv_r_compat reproduces all eight R inferCNV steps in gene-space copy-ratio (vs. the window-space log2 used by older Python ports).
  • HMM i6 cell-level calling: hmm_call_subclusters reproduces R's HMM Viterbi decoder, denoising, and per-subcluster Tumor classification — typically absent from existing Python implementations.
  • Performance kernels: Numba parallel CPU / PyTorch GPU back-ends for sliding-window convolution and batched Viterbi (backend="auto" | "cpu" | "cuda").
  • Robust to reference contamination: emission std uses MAD (median absolute deviation) × 1.4826 instead of plain std, so reference cells contaminated by tumor cells don't inflate state widths.
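The MAD-based scale estimate in the last point can be illustrated directly (`robust_std` is an illustrative helper; the 1.4826 factor makes the MAD consistent with the standard deviation of normally distributed data):

```python
import numpy as np

def robust_std(x):
    """MAD-based scale estimate; 1.4826 * MAD matches sigma for normal data."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

rng = np.random.default_rng(0)
ref = rng.normal(loc=1.0, scale=0.05, size=1000)   # clean reference cells
mixed = np.r_[ref, np.full(100, 1.5)]              # ~9% tumor contamination
# The plain std is inflated by the contaminating cells; the MAD estimate is not.
print(round(np.std(mixed), 3), round(robust_std(mixed), 3))
```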

A high-level infercnv / cnv_score / chromosome_heatmap API, following the de facto Python convention, is also exposed to ease migration.


Citation

If you use cnvturbo in your research, please cite this implementation:

@software{cnvturbo,
  title  = {cnvturbo: GPU/Numba-accelerated scRNA-seq CNV inference with R inferCNV-compatible HMM i6},
  url    = {https://github.com/LogicByteCraft/cnvturbo},
  year   = {2026}
}

cnvturbo's algorithm is a faithful port of R inferCNV; please cite the upstream methodology as well when relevant.


License

BSD 3-Clause License — see LICENSE.

Acknowledgements

cnvturbo is inspired by and stays algorithmically aligned with:

  • inferCNV — reference R implementation of the HMM i6 pipeline.
  • Scanpy / AnnData — single-cell analysis ecosystem.

Contributing

Issues and pull requests are welcome at https://github.com/LogicByteCraft/cnvturbo. Before contributing:

pip install -e ".[dev,test]"
pre-commit install
pytest
