Skip to main content

Pure-Python re-implementation of Seurat's CCA — canonical correlation analysis for single-cell integration, AnnData-native.

Project description

pyccasc

PyPI Python License: GPL v3

A pure-Python re-implementation of Seurat's RunCCA (Stuart, Butler, Hoffman, Hafemeister et al., Cell 2019) — canonical correlation analysis for single-cell integration. Drop-in for the scanpy / AnnData ecosystem.

The PyPI distribution is pyccasc (CCA for single-cell); the Python import name is cca_py (so from cca_py import run_cca). The GitHub repo lives at omicverse/py-cca.

  • AnnData-native — feeds directly into Scanpy / OmicVerse pipelines
  • No rpy2, no R install, no Rcpp toolchain
  • Numerical parity with Seurat::RunCCA validated across 9 (size × num_cc) configurations: singular values match to ~1e-7, subspaces match to ~1e-3 (rotation within near-degenerate eigenspaces is the only source of difference)

Same upstream-mirror pattern as pymclustR, monocle2-py, milor-py: the canonical implementation lives in omicverse; this repo is the standalone slice for users who want CCA without the full omicverse stack.

Install

pip install pyccasc

Quick-start

import numpy as np
from cca_py import run_cca

# X, Y are (n_features, n_cells) matrices with matched genes
X = np.random.randn(2000, 500)   # batch 1: 2000 genes × 500 cells
Y = np.random.randn(2000, 700)   # batch 2: 2000 genes × 700 cells

result = run_cca(X, Y, num_cc=30)
print(result.ccv.shape)          # (1200, 30) — shared CC embedding
print(result.d.shape)            # (30,)      — singular values

u, v = result.split()             # per-batch halves: (500, 30) and (700, 30)

AnnData adapter

from cca_py import run_cca_anndata

# adata1, adata2 are scanpy AnnData objects (cells × genes)
result = run_cca_anndata(adata1, adata2, num_cc=30, layer="log1p")

# adata1.obsm['X_cca'] now holds the (n_obs_1, 30) shared embedding
# adata2.obsm['X_cca'] holds the (n_obs_2, 30) embedding for the second batch
# adata.uns['cca'] carries the singular-value diagnostics

Algorithm

Direct port of Seurat::RunCCA.default (Seurat R/dimensional_reduction.R, lines 506–541):

object1 <- Standardize(object1)        # z-score per cell (column)
object2 <- Standardize(object2)
mat3    <- crossprod(object1, object2) # cells_1 × cells_2 cross-cov
cca.svd <- irlba(mat3, nv = num.cc)    # truncated SVD
ccv     <- rbind(cca.svd$u, cca.svd$v) # (n1 + n2) × num.cc
# sign-flip each column so its first entry is non-negative
return(list(ccv = ccv, d = cca.svd$d))

We use scipy.sparse.linalg.svds (ARPACK) in place of irlba. Both are Lanczos-based and produce numerically equivalent top-k SVD truncations.

⚠️ Standardize gotcha: Seurat's Standardize (in src/data_manipulation.cpp) z-scores per column (per cell), not per row (per gene) — a non-obvious choice that's load-bearing for CCA correctness. We replicated it.

Module map

Module What it covers
cca_py.cca core run_cca() + standardize() + l2_normalize()
cca_py.anndata_adapter run_cca_anndata() for the scanpy / AnnData ecosystem

Seurat parity

tests/r_parity_dump.R runs Seurat::RunCCA on three synthetic dataset sizes (small / medium / large) at three num_cc values (5 / 10 / 20). tests/test_r_parity.py then runs py-CCA's run_cca on the same inputs and asserts:

Quantity Tolerance
singular values (per-component relative error) < 1e-5
per-component embedding correlation > 0.999
Frobenius distance between the two column-span projectors < 5e-3

All 9 configurations × 2 assertion families = 18 parity tests pass. To reproduce:

# in CMAP env (R + Seurat)
Rscript tests/r_parity_dump.R

# then in omicdev env
pytest tests/ -v

Roadmap

This first release covers the core SVD step of RunCCA. The full Seurat integration workflow uses CCA as the first step in FindIntegrationAnchors:

  1. RunCCA — shared CC embedding (this release)
  2. L2CCA — provided as cca_py.l2_normalize; integration with the result struct pending
  3. FindIntegrationAnchors — k-NN in CCA space → mutual nearest neighbours → anchor scoring
  4. IntegrateData — anchor-weighted correction of the expression matrix

PRs welcome.

Citation

If you use this package, please cite the original Seurat integration paper:

Stuart, T., Butler, A., Hoffman, P., Hafemeister, C. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019). https://doi.org/10.1016/j.cell.2019.05.031

and acknowledge omicverse / this repo for the Python port.

License

GNU GPLv3 — matches both upstream omicverse and Seurat.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyccasc-0.1.0.tar.gz (52.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyccasc-0.1.0-py3-none-any.whl (36.3 kB view details)

Uploaded Python 3

File details

Details for the file pyccasc-0.1.0.tar.gz.

File metadata

  • Download URL: pyccasc-0.1.0.tar.gz
  • Upload date:
  • Size: 52.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for pyccasc-0.1.0.tar.gz
Algorithm Hash digest
SHA256 64a950402ca611980e755c7b4429dfd2298fc93f7a59f2ba97cc19324522ffbe
MD5 5b9612d047f16ec19d23efca3c9898bc
BLAKE2b-256 d5c3b6bb4e8df17186896c0a441191ffa899797fe8fadda8c6e12fce9ec1f4c3

See more details on using hashes here.

File details

Details for the file pyccasc-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pyccasc-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 36.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for pyccasc-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f81d3a4d8c91d32342983d4129f340dfbca47aa8bbc371bd1be3706192bc49c8
MD5 b02fd0cdb5e2f5f206eed80a4b380522
BLAKE2b-256 03170666b8dcca19a96ea5a0d3b2494f4deccf02bdf3a27d4c7cb85166ce5e3b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page