Pure-Python re-implementation of Seurat's CCA — canonical correlation analysis for single-cell integration, AnnData-native.

# pyccasc

A pure-Python re-implementation of Seurat's RunCCA (Stuart, Butler, Hoffman, Hafemeister et al., Cell 2019) — canonical correlation analysis for single-cell integration. Drop-in for the scanpy / AnnData ecosystem.

The PyPI distribution is `pyccasc` (CCA for single-cell); the Python import name is `cca_py` (so `from cca_py import run_cca`). The GitHub repo lives at omicverse/py-cca.
- AnnData-native — feeds directly into Scanpy / OmicVerse pipelines
- No `rpy2`, no R install, no Rcpp toolchain
- Numerical parity with `Seurat::RunCCA` validated across 9 (size × num_cc) configurations: singular values match to ~1e-7, subspaces match to ~1e-3 (rotation within near-degenerate eigenspaces is the only source of difference)
- Same upstream-mirror pattern as `pymclustR`, `monocle2-py`, `milor-py`: the canonical implementation lives in `omicverse`; this repo is the standalone slice for users who want CCA without the full omicverse stack
## Install

```shell
pip install pyccasc
```
## Quick-start

```python
import numpy as np
from cca_py import run_cca

# X, Y are (n_features, n_cells) matrices with matched genes
X = np.random.randn(2000, 500)  # batch 1: 2000 genes × 500 cells
Y = np.random.randn(2000, 700)  # batch 2: 2000 genes × 700 cells

result = run_cca(X, Y, num_cc=30)
print(result.ccv.shape)  # (1200, 30) — shared CC embedding
print(result.d.shape)    # (30,) — singular values
u, v = result.split()    # per-batch halves: (500, 30) and (700, 30)
```
## AnnData adapter

```python
from cca_py import run_cca_anndata

# adata1, adata2 are scanpy AnnData objects (cells × genes)
result = run_cca_anndata(adata1, adata2, num_cc=30, layer="log1p")

# adata1.obsm['X_cca'] now holds the (n_obs_1, 30) shared embedding
# adata2.obsm['X_cca'] holds the (n_obs_2, 30) embedding for the second batch
# adata.uns['cca'] carries the singular-value diagnostics
```
## Algorithm

Direct port of Seurat::RunCCA.default (Seurat R/dimensional_reduction.R, lines 506–541):

```r
object1 <- Standardize(object1)        # z-score per cell (column)
object2 <- Standardize(object2)
mat3    <- crossprod(object1, object2) # cells_1 × cells_2 cross-cov
cca.svd <- irlba(mat3, nv = num.cc)    # truncated SVD
ccv     <- rbind(cca.svd$u, cca.svd$v) # (n1 + n2) × num.cc
# sign-flip each column so its first entry is non-negative
return(list(ccv = ccv, d = cca.svd$d))
```

We use `scipy.sparse.linalg.svds` (ARPACK) in place of irlba. Both are Lanczos-based and produce numerically equivalent top-k SVD truncations.
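For orientation, the same pipeline can be sketched in NumPy/SciPy terms. This is a minimal re-derivation of the R snippet above, not the package's actual source; the function name `cca_sketch` is hypothetical, and it assumes dense inputs with no constant columns:

```python
import numpy as np
from scipy.sparse.linalg import svds

def cca_sketch(X, Y, num_cc=5):
    # Hypothetical re-derivation of the R steps above, not package code.
    # 1. z-score each column (per cell), mirroring Seurat's Standardize
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    # 2. cells_1 × cells_2 cross-product
    mat3 = Xs.T @ Ys
    # 3. truncated SVD; ARPACK does not guarantee ordering, so sort descending
    u, d, vt = svds(mat3, k=num_cc)
    order = np.argsort(d)[::-1]
    u, d, v = u[:, order], d[order], vt[order].T
    # 4. stack the per-batch halves and sign-flip each column so its
    #    first entry is non-negative
    ccv = np.vstack([u, v])
    signs = np.sign(ccv[0])
    signs[signs == 0] = 1.0
    return ccv * signs, d
```

Note the explicit descending sort: `svds` does not promise the ordering that `irlba` returns.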
⚠️ Standardize gotcha: Seurat's `Standardize` (in `src/data_manipulation.cpp`) z-scores per column (per cell), not per row (per gene) — a non-obvious choice that's load-bearing for CCA correctness. We replicated it.
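To make the direction concrete, here is a toy illustration (not package code): per-cell standardization operates on columns, so each cell ends up zero-mean and unit-variance across genes, while the per-gene variant acts on rows.

```python
import numpy as np

M = np.array([[1., 2.],
              [3., 4.],
              [5., 9.]])  # 3 genes (rows) × 2 cells (columns)

# per-cell (column-wise) z-score — the direction Seurat's Standardize uses
per_cell = (M - M.mean(axis=0)) / M.std(axis=0)

# per-gene (row-wise) z-score — the direction it does NOT use
per_gene = (M - M.mean(axis=1, keepdims=True)) / M.std(axis=1, keepdims=True)

print(per_cell.mean(axis=0))  # each cell (column) is now centred: ~[0, 0]
print(per_cell.std(axis=0))   # and unit-variance: ~[1, 1]
```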
## Module map

| Module | What it covers |
|---|---|
| `cca_py.cca` | core `run_cca()` + `standardize()` + `l2_normalize()` |
| `cca_py.anndata_adapter` | `run_cca_anndata()` for the scanpy / AnnData ecosystem |
## Seurat parity

`tests/r_parity_dump.R` runs `Seurat::RunCCA` on three synthetic dataset sizes (small / medium / large) at three num_cc values (5 / 10 / 20). `tests/test_r_parity.py` then runs py-CCA's `run_cca` on the same inputs and asserts:
| Quantity | Tolerance |
|---|---|
| singular values (per-component relative error) | < 1e-5 |
| per-component embedding correlation | > 0.999 |
| Frobenius distance between the two column-span projectors | < 5e-3 |
All 9 configurations × 2 assertion families = 18 parity tests pass. To reproduce:

```shell
# in CMAP env (R + Seurat)
Rscript tests/r_parity_dump.R
# then in omicdev env
pytest tests/ -v
```
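The projector-based tolerance in the table above compares column spans rather than raw embeddings, which makes it blind to rotations within the span. A short sketch of how such a check can be written (the helper name `projector_distance` is hypothetical; it assumes full-column-rank dense embeddings):

```python
import numpy as np

def projector_distance(A, B):
    """Frobenius distance between orthogonal projectors onto col(A) and col(B).

    Invariant to any rotation within the column spans, which is why a parity
    test can use it instead of comparing embeddings entry-wise.
    """
    Qa, _ = np.linalg.qr(A)  # orthonormal basis for col(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis for col(B)
    return np.linalg.norm(Qa @ Qa.T - Qb @ Qb.T)

# rotating within the span leaves the distance at ~0
rng = np.random.default_rng(1)
A = rng.standard_normal((40, 5))
R, _ = np.linalg.qr(rng.standard_normal((5, 5)))  # random 5×5 orthogonal matrix
print(projector_distance(A, A @ R))  # ~0
```

This invariance is exactly why sign flips and near-degenerate eigenspace rotations do not trip the parity suite.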
## Roadmap

This first release covers the core SVD step of RunCCA. The full Seurat integration workflow uses CCA as the first step in FindIntegrationAnchors:

- ✅ `RunCCA` — shared CC embedding (this release)
- ⏳ `L2CCA` — provided as `cca_py.l2_normalize`; integration with the result struct pending
- ⏳ `FindIntegrationAnchors` — k-NN in CCA space → mutual nearest neighbours → anchor scoring
- ⏳ `IntegrateData` — anchor-weighted correction of the expression matrix
PRs welcome.
## Citation
If you use this package, please cite the original Seurat integration paper:
Stuart, T., Butler, A., Hoffman, P., Hafemeister, C. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019). https://doi.org/10.1016/j.cell.2019.05.031
and acknowledge omicverse / this repo for the Python port.
## License
GNU GPLv3 — matches both upstream omicverse and Seurat.
## File details

Details for the file pyccasc-0.1.0.tar.gz.

- Download URL: pyccasc-0.1.0.tar.gz
- Upload date:
- Size: 52.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20

| Algorithm | Hash digest |
|---|---|
| SHA256 | `64a950402ca611980e755c7b4429dfd2298fc93f7a59f2ba97cc19324522ffbe` |
| MD5 | `5b9612d047f16ec19d23efca3c9898bc` |
| BLAKE2b-256 | `d5c3b6bb4e8df17186896c0a441191ffa899797fe8fadda8c6e12fce9ec1f4c3` |
## File details

Details for the file pyccasc-0.1.0-py3-none-any.whl.

- Download URL: pyccasc-0.1.0-py3-none-any.whl
- Upload date:
- Size: 36.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.20

| Algorithm | Hash digest |
|---|---|
| SHA256 | `f81d3a4d8c91d32342983d4129f340dfbca47aa8bbc371bd1be3706192bc49c8` |
| MD5 | `b02fd0cdb5e2f5f206eed80a4b380522` |
| BLAKE2b-256 | `03170666b8dcca19a96ea5a0d3b2494f4deccf02bdf3a27d4c7cb85166ce5e3b` |