
cnvturbo: A high-performance scRNA-seq CNV inference toolkit with R inferCNV-compatible HMM i6 cell-level tumor calling. The R-exact main pipeline runs on CPU + joblib (100-200x faster than R inferCNV); optional Numba/PyTorch CUDA back-ends accelerate the legacy infercnv + hmm_call_cells paths. Fully compatible with Scanpy/AnnData.


cnvturbo


cnvturbo — A Python re-implementation of R inferCNV for single-cell RNA-seq copy-number variation analysis. Algorithmically faithful to R inferCNV's HMM i6 pipeline, ~100× faster, and fully integrated with the Scanpy / AnnData ecosystem.

Rewritten in pure Python with R-exact algorithm alignment (hspike emission calibration, gene-level Viterbi in copy-ratio space, R-equivalent denoise + subcluster Tumor calling). The R-exact pipeline runs on CPU + joblib; optional Numba CPU / PyTorch CUDA kernels accelerate the legacy tl.infercnv and tl.hmm_call_cells paths.


Why cnvturbo?

| Feature | R inferCNV | infercnvpy | cnvturbo |
| --- | --- | --- | --- |
| Cell-level Tumor/Normal HMM | ✓ | ✗ (cluster score only) | ✓ |
| HMM i6 + hspike emission | ✓ | ✗ | ✓ (analytic + MAD-robust) |
| Per-chromosome Viterbi (copy-ratio) | ✓ | ✗ | ✓ |
| Denoise (segment-length filter) | ✓ | ✗ | ✓ |
| Reference subcluster handling | ✓ | partial | ✓ |
| GPU / Numba acceleration | ✗ | ✗ | ✓ (legacy tl.infercnv + tl.hmm_call_cells; R-exact path is CPU + joblib) |
| Runtime (P12, 7,269 cells) | ~5 hr | ~9 min | ~86 s |
| Strict Tumor/Normal concordance with R | 1.000 (ref) | N/A (no cell-level HMM) | F1 0.980 |

Verified on 40 PDAC samples (99,679 observation cells): region-level CNV calls are 100% identical to R inferCNV, strict cell-level Tumor/Normal calls reach overall F1 = 0.980, and per-cell continuous cnv_score matches R cnv_signal_R with mean Pearson 0.99997. See Benchmark below.

Speed-up attribution: the R-exact main pipeline (infercnv_r_compat + compute_hspike_emission_params + hmm_call_subclusters) is CPU + joblib only. All speed-up numbers in this README come from algorithmic rewrite + multi-core parallelism, not GPU. The optional GPU back-end currently only accelerates the legacy tl.infercnv (sliding-window scoring) and tl.hmm_call_cells (no-subcluster HMM) paths.


Installation

From PyPI (recommended)

pip install cnvturbo

With acceleration backends

These extras are only used by the legacy tl.infercnv and tl.hmm_call_cells paths (see Backend coverage). The R-exact main pipeline runs on stock CPU + joblib regardless of which extra you install.

# Numba CPU kernels (legacy `tl.infercnv` sliding-window + `tl.hmm_call_cells` Viterbi)
pip install "cnvturbo[hmm-cpu]"

# PyTorch CUDA back-end (same scope as above; falls back to CPU if no GPU)
pip install "cnvturbo[hmm-gpu]"

# Everything above + Baum-Welch EM emission fitting (`hmmlearn`)
pip install "cnvturbo[hmm]"

Development install

git clone https://github.com/LogicByteCraft/cnvturbo.git
cd cnvturbo
pip install -e ".[dev,test]"

Requirements

  • Python ≥ 3.10
  • scanpy ≥ 1.10, anndata ≥ 0.7.3, numpy ≥ 1.20, pandas ≥ 1
  • Optional accelerators (only effective for tl.infercnv + tl.hmm_call_cells — the R-exact pipeline does not use them):
    • numba ≥ 0.57 — Numba parallel CPU kernels for sliding-window convolution
    • torch ≥ 2.0 — PyTorch CUDA back-end for sliding-window conv1d + batched Viterbi
    • hmmlearn ≥ 0.3 — Baum-Welch EM emission fitting (fit_method="em")

Quick start

import scanpy as sc
import cnvturbo
from cnvturbo import tl as cnv_tl, pl as cnv_pl

adata = sc.read_h5ad("my_sample.h5ad")
adata.layers["counts"] = adata.X.copy()

cnv_tl.infercnv_r_compat(
    adata,
    raw_layer="counts",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial", "Fibroblast"],
    window_size=101,
    min_mean_expr_cutoff=0.1,    # R inferCNV default for 10x; use 1.0 for Smart-seq2
    apply_2x_transform=True,
    n_jobs=16,
)

emit_means, emit_stds, emit_sd_intercepts, emit_sd_slopes = cnv_tl.compute_hspike_emission_params(
    adata,
    raw_layer="counts",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial", "Fibroblast"],
    min_mean_expr_cutoff=0.1,    # must match the value passed to infercnv_r_compat
    output_space="copy_ratio",
    return_sd_trend=True,
)

cnv_tl.hmm_call_subclusters(
    adata,
    use_rep="cnv",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial", "Fibroblast"],
    precomputed_emit_means=emit_means,
    precomputed_emit_stds=emit_stds,
    precomputed_emit_sd_intercepts=emit_sd_intercepts,
    precomputed_emit_sd_slopes=emit_sd_slopes,
    leiden_resolution="auto",
    cluster_by_groups=True,
    min_segment_length=5,
    min_segments_for_tumor=1,
    key_added="cnv_call",
    n_jobs=16,
)

print(adata.obs["cnv_call"].value_counts())

After this, adata.obs["cnv_call"] contains "Tumor" / "Normal" per cell, and adata.obs["cnv_call_score"] stores the HMM non-neutral state fraction (proportion_cnv).
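The quantity behind cnv_call_score can be illustrated with a toy example (the state path and indexing below are hypothetical, assuming R inferCNV's i6 convention where state 3 is copy-neutral):

```python
import numpy as np

# Toy illustration of `cnv_call_score` (proportion_cnv): given one cell's
# decoded i6 state path (states 1..6, state 3 = copy-neutral), the score is
# the fraction of genes decoded to a non-neutral state.
states = np.array([3, 3, 3, 4, 4, 4, 3, 3, 2, 2])  # hypothetical path for one cell
proportion_cnv = float(np.mean(states != 3))
```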

For strict R-equivalent cell-level calls, combine the HMM burden with a continuous denoised CNV signal:

import numpy as np

ref_mask = adata.obs["cell_type"].isin(["NK", "Endothelial", "Fibroblast"]).to_numpy()
x_denoise = cnv_tl.denoise_r_compat(adata.obsm["X_cnv"], ref_mask)
adata.obs["cnv_score"] = np.mean(np.abs(x_denoise - 1.0), axis=1)
adata.obs["proportion_cnv"] = adata.obs["cnv_call_score"].astype(float)
adata.obs["is_obs_tumor"] = (
    (~ref_mask)
    & (adata.obs["cnv_score"] > np.percentile(adata.obs.loc[ref_mask, "cnv_score"], 95))
    & (adata.obs["proportion_cnv"] > np.percentile(adata.obs.loc[ref_mask, "proportion_cnv"], 95))
)

End-to-end reusable scripts are available in template/.


Detailed usage

1. Prepare AnnData

cnvturbo requires:

  • Raw integer counts in adata.X or adata.layers["counts"].
  • Gene coordinates in adata.var: columns chromosome, start, end.
  • A reference annotation in adata.obs: a column identifying normal cells (e.g., NK / Endothelial / Fibroblast).
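A quick pre-flight check of these three prerequisites can look like the following sketch. The helper name and error messages are chosen here for illustration; it is not part of the cnvturbo API:

```python
import numpy as np

def check_cnvturbo_inputs(adata, reference_key="cell_type", counts_layer="counts"):
    """Illustrative pre-flight check for the three requirements above."""
    # 1. Raw integer counts in .layers["counts"] (fall back to .X)
    X = adata.layers[counts_layer] if counts_layer in adata.layers else adata.X
    head = np.asarray(X[:100].todense()) if hasattr(X, "todense") else np.asarray(X[:100])
    if not np.allclose(head, np.round(head)):
        raise ValueError("expected raw integer counts, got non-integer values")
    # 2. Gene coordinates in .var
    missing = {"chromosome", "start", "end"} - set(adata.var.columns)
    if missing:
        raise ValueError(f"adata.var is missing columns: {sorted(missing)}")
    # 3. Reference annotation in .obs
    if reference_key not in adata.obs.columns:
        raise ValueError(f"adata.obs has no column {reference_key!r}")
    return True
```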

Add gene coordinates from a GTF:

from cnvturbo.io import genomic_position_from_gtf

genomic_position_from_gtf(
    gtf_file="Homo_sapiens.GRCh38.110.gtf.gz",
    adata=adata,
)

2. R-compatible preprocessing (infercnv_r_compat)

Reproduces R inferCNV's pipeline exactly:

  1. Low-expression gene filter: mean(raw_count) < min_mean_expr_cutoff (R require_above_min_mean_expr_cutoff; 10x default 0.1, Smart-seq2 1.0)
  2. Library-size normalization → median depth
  3. log2(x + 1)
  4. First reference subtraction (gene-space, "bounds" mode)
  5. Clip to ±3 (default)
  6. Per-chromosome same-length pyramid smoothing (window=101)
  7. Per-cell median centering
  8. Second reference subtraction (gene-space)
  9. 2^x → copy-ratio (neutral ≈ 1.0)

cnv_tl.infercnv_r_compat(
    adata,
    raw_layer="counts",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial"],
    max_ref_threshold=3.0,
    window_size=101,
    exclude_chromosomes=("chrX", "chrY"),
    min_mean_expr_cutoff=0.1,    # R inferCNV default for 10x; set 1.0 for Smart-seq2; 0 to disable
    apply_2x_transform=True,
    n_jobs=16,
    key_added="cnv",
)

Output:

  • adata.obsm["X_cnv"] — (n_cells × n_genes_filtered) copy-ratio matrix
  • adata.uns["cnv"]["chr_pos"] — gene-level chromosome offsets
  • adata.uns["cnv"]["kept_var_names"] — original var_names that survived min_mean_expr_cutoff + chrX/chrY exclusion (matches obsm["X_cnv"] columns)
  • adata.uns["cnv"]["min_mean_expr_cutoff"] — actual cutoff applied (provenance)
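The nine steps above can be sketched in plain NumPy. This is a toy walk-through, not the cnvturbo implementation: in particular, step 6 here uses a simple moving average where the real pipeline uses R's pyramid (triangular) weights, and the "bounds"-mode reference subtraction is simplified to a per-gene reference mean:

```python
import numpy as np

def infercnv_like_preprocess(counts, ref_mask, chrom_of_gene,
                             min_mean=0.1, clip=3.0, window=101):
    """Toy NumPy walk-through of steps 1-9 (illustrative, not R-exact)."""
    # 1. drop low-expression genes
    keep = counts.mean(axis=0) >= min_mean
    counts, chrom_of_gene = counts[:, keep], chrom_of_gene[keep]
    # 2. normalize each cell to the median library size
    depth = counts.sum(axis=1, keepdims=True)
    x = counts / depth * np.median(depth)
    # 3. log transform
    x = np.log2(x + 1.0)
    # 4. first reference subtraction (simplified: per-gene reference mean)
    x = x - x[ref_mask].mean(axis=0)
    # 5. clip extreme values
    x = np.clip(x, -clip, clip)
    # 6. per-chromosome same-length smoothing (moving-average stand-in
    #    for R's pyramid weights)
    for c in np.unique(chrom_of_gene):
        idx = np.where(chrom_of_gene == c)[0]
        k = min(window, len(idx))
        kernel = np.ones(k) / k
        for i in range(x.shape[0]):
            x[i, idx] = np.convolve(x[i, idx], kernel, mode="same")
    # 7. per-cell median centering
    x = x - np.median(x, axis=1, keepdims=True)
    # 8. second reference subtraction
    x = x - x[ref_mask].mean(axis=0)
    # 9. 2^x back to copy-ratio space (neutral ~= 1.0)
    return 2.0 ** x
```

On data without real CNVs, reference cells end up near the neutral copy-ratio of 1.0, which is the property the downstream HMM relies on.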

3. hspike emission calibration (compute_hspike_emission_params)

Mirrors R's hidden_spike simulation: builds a synthetic genome (50% CNV / 50% neutral chromosomes), samples the simulation base from real reference cells, runs the full pipeline, and extracts emission parameters per CNV state.

emit_means, emit_stds, emit_sd_intercepts, emit_sd_slopes = cnv_tl.compute_hspike_emission_params(
    adata,
    raw_layer="counts",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial"],
    min_mean_expr_cutoff=0.1,    # must match the value passed to infercnv_r_compat
    n_sim_cells=100,
    n_genes_per_chr=400,
    output_space="copy_ratio",
    return_sd_trend=True,
)

4. HMM cell-level Tumor calling (hmm_call_subclusters)

R-equivalent decoder: per-group Leiden subclustering (cluster_by_groups=True, auto resolution), per-chromosome Viterbi with R's pnorm-based emission, segment-length denoise, "subcluster contains ≥1 CNV segment ⇒ Tumor" rule.

cnv_tl.hmm_call_subclusters(
    adata,
    use_rep="cnv",
    reference_key="cell_type",
    reference_cat=["NK", "Endothelial"],
    precomputed_emit_means=emit_means,
    precomputed_emit_stds=emit_stds,
    precomputed_emit_sd_intercepts=emit_sd_intercepts,
    precomputed_emit_sd_slopes=emit_sd_slopes,
    leiden_resolution="auto",
    cluster_by_groups=True,
    z_score_filter=0.8,
    leiden_function="CPM",
    leiden_graph_method="seurat_snn",
    n_neighbors=20,
    n_pcs=10,
    min_segment_length=5,
    min_segments_for_tumor=1,
    use_r_viterbi=True,
    key_added="cnv_call",
    backend="auto",
    n_jobs=16,
)

Output (added to adata.obs):

  • cnv_call — "Tumor" / "Normal" per cell
  • cnv_call_score — HMM non-neutral state fraction (proportion_cnv)
  • cnv_call_expr_deviation — raw expression deviation (mean(|X_cnv − 1.0|))
  • cnv_call_subcluster — Leiden subcluster id used for HMM
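The core of the decoder can be sketched as a standard Viterbi pass over the six i6 copy-ratio states. This is a minimal illustration, not cnvturbo's code: Gaussian log-densities stand in for R's pnorm-based emission, and the sticky-diagonal transition matrix is an assumption, not R's exact values:

```python
import numpy as np

def viterbi_gaussian(obs, means, stds, p_stay=0.9999):
    """Minimal Viterbi decoder with Gaussian emissions (illustrative only)."""
    n_states = len(means)
    # log emission matrix: (n_obs, n_states)
    log_em = (-0.5 * ((obs[:, None] - means[None, :]) / stds[None, :]) ** 2
              - np.log(stds[None, :] * np.sqrt(2 * np.pi)))
    # sticky-diagonal transitions: stay with p_stay, switch uniformly otherwise
    log_trans = np.full((n_states, n_states),
                        np.log((1 - p_stay) / (n_states - 1)))
    np.fill_diagonal(log_trans, np.log(p_stay))
    delta = log_em[0] + np.log(1.0 / n_states)  # uniform prior
    back = np.zeros((len(obs), n_states), dtype=int)
    for t in range(1, len(obs)):
        scores = delta[:, None] + log_trans     # from-state x to-state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_em[t]
    path = np.empty(len(obs), dtype=int)
    path[-1] = delta.argmax()
    for t in range(len(obs) - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# i6 copy-ratio state means: 0x, 0.5x, 1x (neutral), 1.5x, 2x, 3x;
# the shared std below is a placeholder for the hspike-calibrated values.
i6_means = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0])
i6_stds = np.full(6, 0.15)
```

A run of genes sitting at copy-ratio ~1.5 is decoded as a gain segment even with a strong stay probability, which is what makes the subsequent segment-length denoise meaningful.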

5. Visualization

cnv_tl.pca(adata, use_rep="cnv")
cnv_tl.umap(adata)
cnv_pl.chromosome_heatmap(adata, groupby="cnv_call")

import scanpy as sc
sc.pl.embedding(adata, basis="cnv_umap", color=["cnv_call", "cnv_call_score"])

Benchmark

Pancreatic adenocarcinoma benchmark, 40 samples, 99,679 observation cells; reference group = NK / T-like normal cells depending on sample annotation. R inferCNV outputs were used only for validation, not as cnvturbo inputs.

| Metric | Result |
| --- | --- |
| Region-level CNV call accuracy vs R | 1.000 |
| Region-level CNV call F1 vs R | 1.000 |
| Strict cell-level Tumor/Normal accuracy vs R | 0.986 |
| Strict cell-level Tumor/Normal precision vs R | 0.976 |
| Strict cell-level Tumor/Normal recall vs R | 0.984 |
| Strict cell-level Tumor/Normal F1 vs R | 0.980 |
| Per-cell cnv_score mean Pearson vs R cnv_signal_R | 0.99997 |
| Per-cell cnv_score max RMSE vs R cnv_signal_R | 1.24e-4 |

The strict call is the dual-gate rule used by the templates: cnv_score > P95(reference) and proportion_cnv > P95(reference).


API overview

cnvturbo
├── tl                              # tools
│   ├── infercnv                    # original sliding-window scoring
│   ├── infercnv_r_compat           # R-exact 8-step pipeline (recommended)
│   ├── compute_hspike_emission_params  # hspike-based HMM emission calibration
│   ├── hmm_call_subclusters        # subcluster-level R-equivalent HMM caller
│   ├── hmm_call_cells              # cell-level HMM caller (no subclustering)
│   ├── cnv_score, cnv_score_cell   # CNV burden scores
│   ├── ithcna, ithgex              # intra-tumor heterogeneity
│   ├── pca, umap, tsne, leiden     # CNV-space embeddings (Scanpy wrappers)
│   └── copykat                     # CopyKAT integration (optional, requires R)
├── pp                              # preprocessing utilities
├── pl                              # plotting
├── io                              # GTF / genomic-position helpers
└── datasets                        # bundled tutorial data

Design highlights

  • R-exact pipeline: infercnv_r_compat reproduces the full 8 R inferCNV steps in gene-space copy-ratio (vs. window-space log2 used by older Python ports).
  • HMM i6 cell-level calling: hmm_call_subclusters reproduces R's HMM Viterbi decoder, denoising, and per-subcluster Tumor classification — typically absent from existing Python implementations.
  • Performance kernels: Numba parallel CPU + PyTorch CUDA back-ends for the legacy tl.infercnv (sliding-window conv1d) and tl.hmm_call_cells (batched Viterbi) paths (backend="auto" | "cpu" | "cuda"). The R-exact path (infercnv_r_compat + compute_hspike_emission_params + hmm_call_subclusters) currently runs on CPU + joblib only — see Backend coverage below.
  • Robust to reference contamination: emission std uses MAD (median absolute deviation) × 1.4826 instead of plain std, so reference cells contaminated by tumor cells don't inflate state widths.
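The MAD-based estimator is a one-liner; the small example below (with made-up numbers) shows why it resists contamination where the plain standard deviation does not:

```python
import numpy as np

def mad_std(x):
    """Robust std: median(|x - median(x)|) * 1.4826.

    The 1.4826 factor makes the estimator consistent with the Gaussian sigma.
    """
    return 1.4826 * np.median(np.abs(x - np.median(x)))

# Synthetic reference cells (copy-ratio ~1.0) contaminated by ~9% "tumor"
# cells sitting at 3.0: the plain std is inflated, the MAD-based std is not.
rng = np.random.default_rng(1)
clean = rng.normal(1.0, 0.1, 5000)
contaminated = np.concatenate([clean, np.full(500, 3.0)])
```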

A high-level infercnv / cnv_score / chromosome_heatmap API following the de facto Python single-cell convention is also exposed to ease migration.

Backend coverage

| Function | Numba CPU | PyTorch CUDA | Notes |
| --- | --- | --- | --- |
| tl.infercnv (legacy sliding-window scoring) | ✓ | ✓ | backend="auto" picks GPU when available |
| tl.hmm_call_cells (cell-level HMM, no subcluster) | ✓ | ✓ | same |
| tl.infercnv_r_compat (R-exact 8-step pipeline) | ✗ | ✗ | CPU + joblib (n_jobs); no GPU code path |
| tl.compute_hspike_emission_params | ✗ | ✗ | same |
| tl.hmm_call_subclusters (R-exact subcluster HMM) | ✗ | ✗ | use_r_viterbi=True (default) is hard-wired to the R-pnorm CPU Viterbi; backend argument is currently a no-op on this path |

Practical implication. If you follow the recommended infercnv_r_compat → hmm_call_subclusters workflow, install cnvturbo without any accelerator extra and tune n_jobs / OMP_NUM_THREADS for CPU throughput. GPU extras only help if you use the legacy tl.infercnv / tl.hmm_call_cells paths. Wiring the R-exact subcluster Viterbi onto GPU is on the roadmap.
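When combining a large n_jobs with multithreaded BLAS, a common pattern (not a cnvturbo-specific requirement) is to cap per-worker thread counts before the heavy imports so the joblib processes do not oversubscribe cores:

```python
import os

# Cap BLAS/OpenMP threads per worker; must run before numpy/scanpy are imported.
# The value 4 is an example: total threads ~= n_jobs * OMP_NUM_THREADS.
os.environ.setdefault("OMP_NUM_THREADS", "4")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")
os.environ.setdefault("MKL_NUM_THREADS", "4")
```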

Citation

If you use cnvturbo in your research, please cite this implementation:

@software{cnvturbo,
  title  = {cnvturbo: A high-performance scRNA-seq CNV inference toolkit with R inferCNV-compatible HMM i6 (CPU + optional GPU back-ends)},
  url    = {https://github.com/LogicByteCraft/cnvturbo},
  year   = {2026}
}

cnvturbo's algorithm is a faithful port of R inferCNV; please cite the upstream methodology as well when relevant.


License

BSD 3-Clause License — see LICENSE.

Acknowledgements

cnvturbo is inspired by and stays algorithmically aligned with:

  • inferCNV — reference R implementation of the HMM i6 pipeline.
  • Scanpy / AnnData — single-cell analysis ecosystem.

Contributing

Issues and pull requests are welcome at https://github.com/LogicByteCraft/cnvturbo. Before contributing:

pip install -e ".[dev,test]"
pre-commit install
pytest
