Secreted Protein Activity Inference using Ridge Regression
Project description
SecActPy
Secreted Protein Activity Inference using Ridge Regression
Python implementation of SecAct for inferring secreted protein activities from gene expression data.
Key Features:
- 🎯 SecAct Compatible: Produces identical results to the R SecAct/RidgeR package
- 🚀 GPU Acceleration: Optional CuPy backend for large-scale analysis
- 📊 Million-Sample Scale: Batch processing with streaming output for massive datasets
- 🔬 Built-in Signatures: Includes SecAct and CytoSig signature matrices
- 🧬 Multi-Platform Support: Bulk RNA-seq, scRNA-seq, and Spatial Transcriptomics (Visium, CosMx)
- 💾 Smart Caching: Optional permutation table caching for faster repeated analyses
- 🧮 Sparse-Aware: Automatic memory-efficient processing for sparse single-cell data
Installation
Recommended: Create a virtual environment before installing to avoid dependency conflicts with other packages.
python -m venv secactpy-env source secactpy-env/bin/activate # Linux/macOS # secactpy-env\Scripts\activate # Windows
From PyPI (Recommended)
# CPU Only
pip install secactpy
# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu]"
# With GPU Support (CUDA 12.x)
pip install secactpy
pip install cupy-cuda12x
From GitHub
# CPU Only
pip install git+https://github.com/data2intelligence/SecActpy.git
# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu] @ git+https://github.com/data2intelligence/SecActpy.git"
# With GPU Support (CUDA 12.x)
pip install git+https://github.com/data2intelligence/SecActpy.git
pip install cupy-cuda12x
Important (CUDA 12.x users): Do not use the
[gpu]extra on CUDA 12.x systems — it installscupy-cuda11x, which conflicts withcupy-cuda12x. If you already installed with[gpu], remove the conflicting package first:pip uninstall cupy-cuda11x pip install cupy-cuda12x
Development Installation
git clone https://github.com/data2intelligence/SecActpy.git
cd SecActpy
pip install -e ".[dev]"
Quick Start
Example Data
Example datasets for all Quick Start tutorials are available on Zenodo:
| Example | Input File | Output File | Size |
|---|---|---|---|
| Bulk RNA-seq | Ly86-Fc_vs_Vehicle_logFC.txt |
Ly86-Fc_vs_Vehicle_logFC_output.h5ad |
0.5 MB |
| scRNA-seq (OV CD4 T cells) | OV_scRNAseq_CD4.h5ad |
OV_scRNAseq_ct_CD4_output.h5ad, OV_scRNAseq_sc_CD4_output.h5ad |
34 MB |
| Visium ST (HCC) | Visium_HCC_data.h5ad |
Visium_HCC_output.h5ad |
255 MB |
| CosMx (LIHC) | LIHC_CosMx_data.h5ad |
LIHC_CosMx_output.h5ad |
3.0 GB |
Download all example files:
# Download individual files from Zenodo
wget https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
wget https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
wget https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
wget https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad
Example 1: Bulk RNA-seq
import pandas as pd
from secactpy import secact_activity_inference
# Load differential expression data (genes × samples)
# Download: https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
diff_expr = pd.read_csv("Ly86-Fc_vs_Vehicle_logFC.txt", sep=r"\s+", index_col=0)
# Run inference
result = secact_activity_inference(
diff_expr,
is_differential=True,
sig_matrix="secact", # or "cytosig"
verbose=True
)
# Access results
activity = result['zscore'] # Activity z-scores
pvalues = result['pvalue'] # P-values
coefficients = result['beta'] # Regression coefficients
Note: Set
is_differential=Truewhen the input is already log fold-change data. For single-column input with no control, row-mean centering is automatically skipped (it would produce all zeros).
Example 2: scRNA-seq Analysis
import anndata as ad
from secactpy import secact_activity_inference_scrnaseq
# Load scRNA-seq data (788 OV CD4 T cells, 3 subtypes)
# Download: https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
adata = ad.read_h5ad("OV_scRNAseq_CD4.h5ad")
# Pseudo-bulk by cell type
result = secact_activity_inference_scrnaseq(
adata,
cell_type_col="Annotation",
is_single_cell_level=False,
verbose=True
)
# Single-cell level
result_sc = secact_activity_inference_scrnaseq(
adata,
cell_type_col="Annotation",
is_single_cell_level=True,
verbose=True
)
Example 3: Spatial Transcriptomics
Visium (spot-level)
from secactpy import secact_activity_inference_st
# Load Visium HCC data (3,415 spots)
# Download: https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
result = secact_activity_inference_st(
"Visium_HCC_data.h5ad",
min_genes=1000,
verbose=True
)
activity = result['zscore'] # (proteins × spots)
CosMx (single-cell spatial)
import anndata as ad
from secactpy import secact_activity_inference_st
# Load CosMx LIHC data (443,515 cells, 1,000 genes, 12 cell types)
# Download: https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad
adata = ad.read_h5ad("LIHC_CosMx_data.h5ad")
# Cell-type resolution (pseudo-bulk by cell type)
result = secact_activity_inference_st(
adata,
cell_type_col="cellType", # Column in adata.obs
is_spot_level=False, # Aggregate by cell type
verbose=True
)
activity = result['zscore'] # (proteins × cell_types)
Large-Scale Batch Processing
What is batch processing?
By default, SecActPy loads the entire expression matrix into memory and runs ridge regression on all samples at once. This works well for most datasets, but for large-scale analyses (e.g., 50,000+ single cells or spatial spots) the memory required for permutation testing can exceed available RAM or GPU memory.
Batch processing splits the work into smaller pieces. The expensive projection
matrix T = (X'X + λI)^{-1} X' is computed once from the signature, then
samples are processed in chunks of batch_size at a time. Each chunk goes
through the full permutation-testing pipeline independently, and partial results
are concatenated at the end. The final output is mathematically identical to
processing all samples at once — only peak memory usage is reduced.
All three high-level functions support batch_size and output_path:
secact_activity_inference()— bulk RNA-seqsecact_activity_inference_scrnaseq()— scRNA-seqsecact_activity_inference_st()— spatial transcriptomics
Set batch_size to enable it:
# Without batch processing: all samples at once (default)
result = secact_activity_inference(expr_df, ...)
# With batch processing: 5000 samples per chunk
result = secact_activity_inference(expr_df, ..., batch_size=5000)
# Works the same way for scRNA-seq and ST:
result = secact_activity_inference_scrnaseq(adata, ..., batch_size=5000)
result = secact_activity_inference_st(adata, ..., batch_size=5000)
In-memory vs streaming output
By default, batch results are accumulated in memory and returned as a dictionary
of DataFrames — this is the in-memory mode. You get back a dict with
result['zscore'], result['pvalue'], etc., just like the non-batched case.
For very large datasets, even the output matrices (beta, zscore, pvalue,
se — each of shape n_proteins × n_samples) may not fit in memory. Streaming
output solves this: set output_path to write each batch's results directly
to an HDF5 file on disk as it completes. The function returns None in this
mode — no results are held in memory. You load them back from the file when
needed. All three high-level functions support this.
| Mode | Parameter | Return value | Memory for output |
|---|---|---|---|
| In-memory (default) | output_path=None |
dict of DataFrames |
All results in RAM |
| Streaming | output_path="results.h5ad" |
None |
Only one batch at a time |
# Streaming works with any high-level function:
secact_activity_inference(..., batch_size=5000, output_path="bulk_results.h5ad")
secact_activity_inference_scrnaseq(..., batch_size=5000, output_path="sc_results.h5ad")
secact_activity_inference_st(..., batch_size=5000, output_path="st_results.h5ad")
Example: batch processing with secact_activity_inference
secact_activity_inference handles gene subsetting, z-score normalization,
signature grouping, and row expansion automatically — you just pass your
expression data and set batch_size.
# Download example data (788 OV CD4 T cells, 34 MB)
wget https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
from secactpy import secact_activity_inference
import anndata as ad
# Load multi-sample expression data
adata = ad.read_h5ad("OV_scRNAseq_CD4.h5ad")
# --- In-memory mode (default) ---
# Results are returned as a dict of DataFrames
result = secact_activity_inference(
adata.to_df().T, # genes × cells DataFrame
is_differential=False, # center by row means across samples
batch_size=200, # process 200 cells per batch
verbose=True
)
print(result['zscore'].head()) # (proteins × cells) DataFrame
# --- Streaming mode ---
# Results are written to disk; function returns None
secact_activity_inference(
adata.to_df().T,
is_differential=False,
batch_size=200,
output_path="results.h5ad", # write here instead of returning
output_compression="gzip", # compress on disk (default)
verbose=True
)
# Load results back when needed:
import h5py
with h5py.File("results.h5ad", "r") as f:
zscore = f['zscore'][:] # NumPy array (proteins × cells)
Advanced: ridge_batch for full control
The high-level secact_activity_inference handles gene subsetting, scaling,
centering, and streaming output automatically. If you need more control — for
example, to pass a sparse matrix directly or skip normalization — use the
lower-level ridge_batch function.
Why dense and sparse inputs are handled differently. ridge_batch processes
Y in chunks and needs whole-column statistics (mean and standard deviation) for
z-score normalization. How it gets those statistics depends on the input format:
- Dense (NumPy array): The function cannot compute whole-column statistics because it only sees one chunk at a time, and the full array may be too large to scan upfront. You must z-score normalize Y yourself before calling.
- Sparse (
scipy.sparsematrix): Computing column means and standard deviations from a sparse matrix is cheap (no dense conversion needed), so the function does this automatically upfront, then applies z-score normalization on-the-fly within each batch. This is done because sparse matrices cannot be z-scored in place without losing sparsity — the result would be fully dense.
If you do not want automatic sparse scaling, convert to dense first and normalize however you like (or not at all):
from secactpy import ridge_batch
# Opt out of auto-scaling: convert sparse to dense, apply your own processing
Y_dense = Y_sparse.toarray().astype(np.float64)
# ... apply your own normalization (or skip it) ...
result = ridge_batch(X, Y_dense, batch_size=5000)
API Reference
High-Level Functions
All three inference functions support batch_size, output_path, and
output_compression for large-scale and streaming workflows.
| Function | Description |
|---|---|
secact_activity_inference() |
Bulk RNA-seq inference |
secact_activity_inference_scrnaseq() |
scRNA-seq inference |
secact_activity_inference_st() |
Spatial transcriptomics inference |
load_signature(name='secact') |
Load built-in signature matrix |
Core Functions
| Function | Description |
|---|---|
ridge() |
Single-call ridge regression with permutation testing |
ridge_batch() |
Batch processing for large datasets (dense or sparse) |
estimate_batch_size() |
Estimate optimal batch size for available memory |
estimate_memory() |
Estimate memory requirements |
Key Parameters
| Parameter | Default | Description |
|---|---|---|
sig_matrix |
"secact" |
Signature: "secact", "cytosig", or DataFrame |
lambda_ |
5e5 |
Ridge regularization parameter |
n_rand |
1000 |
Number of permutations |
seed |
0 |
Random seed for reproducibility |
backend |
'auto' |
'auto', 'numpy', or 'cupy' |
use_cache |
False |
Cache permutation tables to disk |
ST-Specific Parameters
| Parameter | Default | Description |
|---|---|---|
cell_type_col |
None |
Column in AnnData.obs for cell type |
is_spot_level |
True |
If False, aggregate by cell type |
scale_factor |
1e5 |
Normalization scale factor |
Batch Processing Parameters
Supported by all three high-level inference functions and ridge_batch().
| Parameter | Default | Description |
|---|---|---|
batch_size |
None |
Samples per batch (None = all at once) |
output_path |
None |
Stream results to H5AD file (requires batch_size) |
output_compression |
"gzip" |
Compression: "gzip", "lzf", or None |
GPU Acceleration
from secactpy import secact_activity_inference, CUPY_AVAILABLE
print(f"GPU available: {CUPY_AVAILABLE}")
# Auto-detect GPU
result = secact_activity_inference(expression, backend='auto')
# Force GPU
result = secact_activity_inference(expression, backend='cupy')
Performance
| Dataset | R (Mac M1) | R (Linux) | Py (CPU) | Py (GPU) | Speedup |
|---|---|---|---|---|---|
| Bulk (1,170 sp × 1,000 samples) | 74.4s | 141.6s | 128.8s | 6.7s | 11–19x |
| scRNA-seq (1,170 sp × 788 cells) | 54.9s | 117.4s | 104.8s | 6.8s | 8–15x |
| Visium (1,170 sp × 3,404 spots) | 141.7s | 379.8s | 381.4s | 11.2s | 13–34x |
| CosMx (151 sp × 443,515 cells) | 936.9s | 976.1s | 1226.7s | 99.9s | 9–12x |
Benchmark Environment
- Mac CPU: M1 Pro with VECLIB (8 cores)
- Linux CPU: AMD EPYC 7543P (4 cores)
- Linux GPU: NVIDIA A100-SXM4-80GB
Command Line Interface
SecActPy provides a command line interface for common workflows:
# Bulk RNA-seq (differential expression)
secactpy bulk -i diff_expr.tsv -o results.h5ad --differential -v
# Bulk RNA-seq (raw counts)
secactpy bulk -i counts.tsv -o results.h5ad -v
# scRNA-seq with cell type aggregation
secactpy scrnaseq -i data.h5ad -o results.h5ad --cell-type-col celltype -v
# scRNA-seq at single cell level
secactpy scrnaseq -i data.h5ad -o results.h5ad --single-cell -v
# Visium spatial transcriptomics
secactpy visium -i /path/to/visium/ -o results.h5ad -v
# CosMx (single-cell spatial)
secactpy cosmx -i cosmx.h5ad -o results.h5ad --batch-size 50000 -v
# Use GPU acceleration
secactpy bulk -i data.tsv -o results.h5ad --backend cupy -v
# Use CytoSig signature
secactpy bulk -i data.tsv -o results.h5ad --signature cytosig -v
CLI Options
| Option | Description |
|---|---|
-i, --input |
Input file or directory |
-o, --output |
Output H5AD file |
-s, --signature |
Signature matrix (secact, cytosig) |
--lambda |
Ridge regularization (default: 5e5) |
-n, --n-rand |
Number of permutations (default: 1000) |
--backend |
Computation backend (auto, numpy, cupy) |
--batch-size |
Batch size for large datasets |
-v, --verbose |
Verbose output |
Docker
Pre-built Docker images are available:
# CPU version
docker pull psychemistz/secactpy:latest
# GPU version
docker pull psychemistz/secactpy:gpu
# With R SecAct/RidgeR for cross-validation
docker pull psychemistz/secactpy:with-r
See DOCKER.md for detailed usage instructions.
Reproducibility
SecActPy produces identical results to R SecAct/RidgeR:
result = secact_activity_inference(
expression,
is_differential=True,
sig_matrix="secact",
lambda_=5e5,
n_rand=1000,
seed=0,
use_gsl_rng=True # Default: R-compatible RNG
)
For faster analysis (when R compatibility is not required):
result = secact_activity_inference(
expression,
use_gsl_rng=False, # ~70x faster permutation generation
)
Requirements
- Python ≥ 3.9
- NumPy ≥ 1.20
- Pandas ≥ 1.3
- SciPy ≥ 1.7
- h5py ≥ 3.0
- anndata ≥ 0.8
- scanpy ≥ 1.9
Optional: CuPy ≥ 10.0 (GPU acceleration)
Citation
If you use SecActPy in your research, please cite:
Beibei Ru, Lanqi Gong, Emily Yang, Seongyong Park, George Zaki, Kenneth Aldape, Lalage Wakefield, Peng Jiang. Inference of secreted protein activities in intercellular communication. GitHub: data2intelligence/SecAct
Related Projects
- SecAct - Original R implementation
- RidgeR - R ridge regression package
- SpaCET - Spatial transcriptomics cell type analysis
- CytoSig - Cytokine signaling inference
License
MIT License - see LICENSE for details.
Changelog
v0.2.1
- Streaming output (
output_path) support in all high-level inference functions use_gsl_rngsupport inridge_batch()for ~70x faster permutation generation- Fixed
use_gsl_rngbeing silently ignored in batch processing - Expanded batch processing documentation with examples and downloadable data
v0.2.0 (Official Release)
- Official release under data2intelligence organization
- PyPI package available (
pip install secactpy) - Comprehensive test suite and CI/CD pipeline
- Docker images with GPU and R support
v0.1.2 (Initial Development)
- Ridge regression with permutation-based significance testing
- GPU acceleration via CuPy backend (9–34x speedup)
- Batch processing with streaming H5AD output for million-sample datasets
- Automatic sparse matrix handling in
ridge_batch() - Built-in SecAct and CytoSig signature matrices
- GSL-compatible RNG for R/RidgeR reproducibility
- Support for Bulk RNA-seq, scRNA-seq, and Spatial Transcriptomics
- Cell type resolution for ST data (
cell_type_col,is_spot_level) - Optional permutation table caching (
use_cache)
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file secactpy-0.2.1.tar.gz.
File metadata
- Download URL: secactpy-0.2.1.tar.gz
- Upload date:
- Size: 81.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9ec20dd3715002d1b702c29d091ff24681a04922557fddc478f496129bc27a92
|
|
| MD5 |
68f4e68390bec4da41d600d18b47654f
|
|
| BLAKE2b-256 |
0ac7e908e5132b042235419c9a5d95b05907e3b4a29d442de050d3b83c06fb0d
|
File details
Details for the file secactpy-0.2.1-py3-none-any.whl.
File metadata
- Download URL: secactpy-0.2.1-py3-none-any.whl
- Upload date:
- Size: 81.7 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
537e7d8cd6ddb23716000b61f3fe6f31b04186bc44ef7c726487820edbb201a5
|
|
| MD5 |
700216eb2fd8497f685c92ebd24795d4
|
|
| BLAKE2b-256 |
f6320adb0e1a24856128f12fcd530131c82292bd39f7a67a28bc0a0870b5111f
|