Secreted Protein Activity Inference using Ridge Regression
SecActPy
Python implementation of SecAct for inferring secreted protein activities from gene expression data.
Key Features:
- SecAct Compatible: Matches R SecAct/RidgeR results on the same platform (rng_method='srand')
- GPU Acceleration: Optional CuPy backend for large-scale analysis
- Million-Sample Scale: Batch processing with streaming output for massive datasets
- Streaming H5AD: Two-pass chunk reading for >5M-cell datasets without loading the full matrix (~3 GB peak vs ~200 GB)
- Built-in Signatures: Includes SecAct and CytoSig signature matrices
- Multi-Platform Support: Bulk RNA-seq, scRNA-seq, and Spatial Transcriptomics (Visium, CosMx)
- Smart Caching: Optional permutation table caching for faster repeated analyses
- Sparse-Aware: Automatic memory-efficient processing for sparse single-cell data
Installation
Recommended: Create a virtual environment before installing to avoid dependency conflicts with other packages.
python -m venv secactpy-env
source secactpy-env/bin/activate # Linux/macOS
# secactpy-env\Scripts\activate # Windows
From PyPI (Recommended)
# CPU Only
pip install secactpy
# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu]"
# With GPU Support (CUDA 12.x)
pip install secactpy
pip install cupy-cuda12x
From GitHub
# CPU Only
pip install git+https://github.com/data2intelligence/SecActpy.git
# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu] @ git+https://github.com/data2intelligence/SecActpy.git"
# With GPU Support (CUDA 12.x)
pip install git+https://github.com/data2intelligence/SecActpy.git
pip install cupy-cuda12x
Development Installation
git clone https://github.com/data2intelligence/SecActpy.git
cd SecActpy
pip install -e ".[dev]"
Quick Start
Example Data
Example datasets for all Quick Start tutorials are available on Zenodo:
| Example | Input File | Output File | Size |
|---|---|---|---|
| Bulk RNA-seq | Ly86-Fc_vs_Vehicle_logFC.txt | Ly86-Fc_vs_Vehicle_logFC_output.h5ad | 0.5 MB |
| scRNA-seq (OV CD4 T cells) | OV_scRNAseq_CD4.h5ad | OV_scRNAseq_ct_CD4_output.h5ad, OV_scRNAseq_sc_CD4_output.h5ad | 34 MB |
| Visium ST (HCC) | Visium_HCC_data.h5ad | Visium_HCC_output.h5ad | 255 MB |
| CosMx (LIHC) | LIHC_CosMx_data.h5ad | LIHC_CosMx_output.h5ad | 3.0 GB |
Download all example files:
# Download individual files from Zenodo
wget https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
wget https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
wget https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
wget https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad
Example 1: Bulk RNA-seq
import pandas as pd
from secactpy import secact_activity_inference
# Load differential expression data (genes × samples)
# Download: https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
diff_expr = pd.read_csv("Ly86-Fc_vs_Vehicle_logFC.txt", sep=r"\s+", index_col=0)
# Run inference
result = secact_activity_inference(
    diff_expr,
    is_differential=True,
    sig_matrix="secact",  # or "cytosig"
    verbose=True
)
# Access results
activity = result['zscore'] # Activity z-scores
pvalues = result['pvalue'] # P-values
coefficients = result['beta'] # Regression coefficients
Note: Set is_differential=True when the input is already log fold-change data. For single-column input with no control, row-mean centering is automatically skipped (it would produce all zeros).
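The reason centering is skipped for single-column input can be seen with a small pandas sketch. The gene names and values below are made up, and the helper row_center is hypothetical; it only illustrates the arithmetic of row-mean centering, not SecActPy's internal code.

```python
import pandas as pd

# Hypothetical log fold-change table: genes x samples (values are made up).
multi = pd.DataFrame(
    {"s1": [1.0, -2.0, 0.5], "s2": [3.0, 0.0, 0.5]},
    index=["GeneA", "GeneB", "GeneC"],
)
single = multi[["s1"]]  # the same data reduced to one column


def row_center(df: pd.DataFrame) -> pd.DataFrame:
    """Subtract each gene's mean across samples (row-mean centering)."""
    return df.sub(df.mean(axis=1), axis=0)


centered_multi = row_center(multi)    # keeps between-sample differences
centered_single = row_center(single)  # each value minus itself

print(centered_multi)
print(centered_single)
```

With one column, each row mean equals the only value in that row, so the centered table carries no signal.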
Example 2: scRNA-seq Analysis
import anndata as ad
from secactpy import secact_activity_inference_scrnaseq
# Load scRNA-seq data (788 OV CD4 T cells, 3 subtypes)
# Download: https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
adata = ad.read_h5ad("OV_scRNAseq_CD4.h5ad")
# Pseudo-bulk by cell type
result = secact_activity_inference_scrnaseq(
    adata,
    cell_type_col="Annotation",
    is_single_cell_level=False,
    verbose=True
)
# Single-cell level
result_sc = secact_activity_inference_scrnaseq(
    adata,
    cell_type_col="Annotation",
    is_single_cell_level=True,
    verbose=True
)
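Pseudo-bulk aggregation combines all cells that share an annotation label into one expression profile per cell type. Below is a minimal sketch with pandas on a made-up count matrix. The function name pseudobulk_by_label and the labels are hypothetical, and summing raw counts is an assumption; SecActPy's own aggregation and normalization may differ.

```python
import numpy as np
import pandas as pd


def pseudobulk_by_label(counts: pd.DataFrame, labels: pd.Series) -> pd.DataFrame:
    """Sum counts over cells that share a label.

    counts: cells x genes; labels: one label per cell (same index).
    Returns genes x cell types, i.e. a genes x samples layout.
    """
    return counts.groupby(labels).sum().T


# Toy data: 3 cells, 2 genes, 2 made-up cell-type labels.
counts = pd.DataFrame(
    np.array([[1, 0], [2, 1], [0, 4]]),
    index=["cell1", "cell2", "cell3"],
    columns=["GeneA", "GeneB"],
)
labels = pd.Series(["Treg", "Treg", "Th1"], index=counts.index, name="Annotation")

bulk = pseudobulk_by_label(counts, labels)
print(bulk)
```

The result has one column per label, so downstream inference sees a handful of samples instead of hundreds of cells.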
Example 3: Spatial Transcriptomics
Visium (spot-level)
from secactpy import secact_activity_inference_st
# Load Visium HCC data (3,415 spots)
# Download: https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
result = secact_activity_inference_st(
    "Visium_HCC_data.h5ad",
    min_genes=1000,
    verbose=True
)
activity = result['zscore'] # (proteins × spots)
CosMx (single-cell spatial)
import anndata as ad
from secactpy import secact_activity_inference_st
# Load CosMx LIHC data (443,515 cells, 1,000 genes, 12 cell types)
# Download: https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad
adata = ad.read_h5ad("LIHC_CosMx_data.h5ad")
# Single-cell resolution (one score per cell)
result = secact_activity_inference_st(
    adata,
    is_spot_level=True,                   # Score each cell individually (default)
    batch_size=5000,                      # Process in chunks to limit memory
    output_path="cosmx_sc_results.h5ad",  # Stream to disk
    verbose=True
)
# result is None when output_path is set; load with ad.read_h5ad()
# Cell-type resolution (pseudo-bulk by cell type)
result = secact_activity_inference_st(
    adata,
    cell_type_col="cellType",  # Column in adata.obs
    is_spot_level=False,       # Aggregate by cell type
    verbose=True
)
activity = result['zscore'] # (proteins × cell_types)
Batch Processing
For large datasets (50,000+ samples), batch processing splits computation into
memory-efficient chunks while producing mathematically identical results.
The projection matrix is computed once, then samples are processed in chunks.
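The claim that chunking does not change the result follows from the linearity of the ridge solution. Below is a minimal NumPy sketch on random data. The penalty value lam, the matrix sizes, and the helper ridge_in_batches are assumptions for illustration, not SecActPy's actual defaults or API.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_proteins, n_samples = 200, 10, 1000
X = rng.standard_normal((n_genes, n_proteins))  # signature matrix (genes x proteins)
Y = rng.standard_normal((n_genes, n_samples))   # expression (genes x samples)
lam = 1.0  # ridge penalty; arbitrary value chosen for this sketch

# Projection matrix T = (X'X + lam*I)^-1 X', computed once.
T = np.linalg.solve(X.T @ X + lam * np.eye(n_proteins), X.T)


def ridge_in_batches(T, Y, batch_size):
    """Apply T to column blocks of Y and stitch the results together."""
    out = np.empty((T.shape[0], Y.shape[1]))
    for start in range(0, Y.shape[1], batch_size):
        stop = min(start + batch_size, Y.shape[1])
        out[:, start:stop] = T @ Y[:, start:stop]
    return out


beta_full = T @ Y                                   # all samples at once
beta_batched = ridge_in_batches(T, Y, batch_size=128)
assert np.allclose(beta_full, beta_batched)
```

Each sample's coefficients depend only on its own column of Y, so only one block of samples has to be in memory at a time.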
Set batch_size on any high-level function:
result = secact_activity_inference(expr_df, ..., batch_size=5000)
result = secact_activity_inference_scrnaseq(adata, ..., batch_size=5000)
result = secact_activity_inference_st(adata, ..., batch_size=5000)
| Mode | Parameter | Return value | Memory for output |
|---|---|---|---|
| In-memory (default) | output_path=None | dict of DataFrames | All results in RAM |
| Streaming | output_path="results.h5ad" | None | Only one batch at a time |
Setting sparse_mode=True keeps sparse Y matrices in sparse format end-to-end,
avoiding densification. For highly sparse single-cell data (<5% density) this
reduces memory by orders of magnitude and runs ~1.8x faster, with identical results.
See Batch Processing for worked examples and streaming output details.
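One way column normalization can be combined with a sparse Y without densifying it is to apply the centering and scaling after the matrix product. The sketch below demonstrates the identity with SciPy on random data. It is an illustration under that assumption, not a description of what sparse_mode does internally, and T is a random stand-in for the projection matrix.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_genes, n_proteins, n_cells = 300, 8, 500
T = rng.standard_normal((n_proteins, n_genes))  # stand-in projection matrix
Y = sparse.random(n_genes, n_cells, density=0.05, format="csc", random_state=0)

# Per-cell (column) mean and standard deviation, computed from sparse data only.
mu = np.asarray(Y.mean(axis=0)).ravel()
mean_sq = np.asarray(Y.multiply(Y).mean(axis=0)).ravel()
sigma = np.sqrt(mean_sq - mu**2)  # population std (ddof=0)

# Identity: T @ ((Y - 1 mu') / sigma) == (T @ Y - (T @ 1) mu') / sigma
TY = (Y.T @ T.T).T                       # sparse-dense product, dense result
row_sums = T.sum(axis=1, keepdims=True)  # T @ 1, shape (proteins, 1)
beta_sparse = (TY - row_sums * mu) / sigma

# Reference: densify, normalize, then project.
Y_dense = Y.toarray()
beta_dense = T @ ((Y_dense - mu) / sigma)
assert np.allclose(beta_sparse, beta_dense)
```

Only the small proteins x cells result is ever dense; the genes x cells matrix stays sparse throughout.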
Streaming H5AD (>5M Cells)
For very large single-cell datasets (>5M cells) that exceed available RAM even
with batch processing, streaming=True bypasses full-matrix loading entirely.
The H5AD file is read in chunks via h5py using a two-pass algorithm:
- Pass 1: Read cell chunks, normalize (CPM + log2), accumulate row/column statistics
- Pass 2: Re-read chunks, compute per-chunk cross terms, run inference in sub-batches
Peak memory drops from ~200 GB to ~3 GB for a 5M-cell dataset. Results are numerically identical to the non-streaming path.
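The two-pass idea can be sketched in NumPy with an in-memory array standing in for the H5AD file. The helpers iter_chunks and normalize, the +1 pseudo-count, and the random projection matrix T are assumptions for illustration; the real reader works on sparse chunks via h5py.

```python
import numpy as np


def iter_chunks(counts, chunk_size):
    """Stand-in for a chunked file reader: yields blocks of cells (rows)."""
    for start in range(0, counts.shape[0], chunk_size):
        yield counts[start:start + chunk_size]


def normalize(chunk):
    """Per-cell CPM followed by log2(x + 1); the pseudo-count is an assumption."""
    totals = chunk.sum(axis=1, keepdims=True)
    return np.log2(chunk / totals * 1e6 + 1.0)


rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(1000, 50)).astype(float)  # cells x genes
T = rng.standard_normal((5, 50))                           # proteins x genes

# Pass 1: accumulate per-gene sums without keeping the normalized matrix.
gene_sum, n_cells = np.zeros(counts.shape[1]), 0
for chunk in iter_chunks(counts, 128):
    norm = normalize(chunk)
    gene_sum += norm.sum(axis=0)
    n_cells += norm.shape[0]
gene_mean = gene_sum / n_cells

# Pass 2: re-read, center each chunk with the global gene means, then project.
blocks = [T @ (normalize(chunk) - gene_mean).T for chunk in iter_chunks(counts, 128)]
beta_streamed = np.hstack(blocks)

# Reference: everything in memory at once.
full = normalize(counts)
beta_full = T @ (full - full.mean(axis=0)).T
assert np.allclose(beta_streamed, beta_full)
```

In a real streaming run each block would be written to disk as it is produced rather than collected in a list.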
# scRNA-seq: 6.5M cells, ~3 GB peak memory
result = secact_activity_inference_scrnaseq(
    "large_atlas.h5ad",           # file path (not AnnData object)
    cell_type_col="cell_type",
    is_single_cell_level=True,
    streaming=True,               # enable two-pass chunk reading
    streaming_chunk_size=50_000,  # cells per chunk (default)
    output_path="results.h5ad",   # stream results to disk
    verbose=True,
)
# Spatial transcriptomics: same interface
result = secact_activity_inference_st(
    "large_spatial.h5ad",
    streaming=True,
    output_path="st_results.h5ad",
    verbose=True,
)
Requirements: streaming=True requires adata to be a file path (not an in-memory AnnData), is_single_cell_level=True (scRNA-seq) or is_spot_level=True (ST), and the H5AD must store X in sparse (CSR/CSC) format.
See Batch Processing for full details.
API Reference
See API Reference for full function signatures, parameters, and options. For low-level ridge() / ridge_batch() usage, see Advanced API.
GPU Acceleration
from secactpy import secact_activity_inference, CUPY_AVAILABLE
print(f"GPU available: {CUPY_AVAILABLE}")
result = secact_activity_inference(expression, backend='auto')
| Dataset | Py (CPU) | Py (GPU) | Speedup |
|---|---|---|---|
| Bulk (1,170 sp × 1,000 samples) | 128.8s | 6.7s | 11–19x |
| scRNA-seq (1,170 sp × 788 cells) | 104.8s | 6.8s | 8–15x |
| Visium (1,170 sp × 3,404 spots) | 381.4s | 11.2s | 13–34x |
| CosMx (151 sp × 443,515 cells) | 1226.7s | 99.9s | 9–12x |
See GPU Acceleration for full benchmarks and CUDA setup. See DOCKER.md for Docker vs native performance benchmarks.
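A backend='auto' style fallback is commonly written as a guarded import. The sketch below shows that general pattern; pick_array_module is a hypothetical helper and is not how SecActPy necessarily selects its backend.

```python
import numpy as np


def pick_array_module(backend: str = "auto"):
    """Return (module, name): CuPy when requested and usable, otherwise NumPy."""
    if backend not in {"auto", "numpy", "cupy"}:
        raise ValueError(f"unknown backend: {backend!r}")
    if backend in {"auto", "cupy"}:
        try:
            import cupy as cp
            cp.cuda.runtime.getDeviceCount()  # raises if no usable GPU
            return cp, "cupy"
        except Exception:
            if backend == "cupy":
                raise  # an explicit GPU request should fail loudly
    return np, "numpy"


xp, name = pick_array_module("auto")
a = xp.arange(6, dtype=xp.float64).reshape(2, 3)
gram = a @ a.T  # the same array code runs on either backend
print(name, gram.shape)
```

Writing the numerical code against the returned module keeps a single code path for CPU and GPU.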
Command Line Interface
secactpy bulk -i diff_expr.tsv -o results.h5ad --differential -v
secactpy scrnaseq -i data.h5ad -o results.h5ad --cell-type-col celltype -v
secactpy visium -i /path/to/visium/ -o results.h5ad -v
secactpy cosmx -i cosmx.h5ad -o results.h5ad --batch-size 50000 -v
| Option | Description |
|---|---|
| -i, --input | Input file or directory |
| -o, --output | Output H5AD file |
| -s, --signature | Signature matrix (secact, cytosig) |
| --backend | Computation backend (auto, numpy, cupy) |
| --batch-size | Batch size for large datasets |
| -v, --verbose | Verbose output |
See CLI Reference for all commands and options.
Docker
docker pull psychemistz/secactpy:latest # CPU
docker pull psychemistz/secactpy:gpu # GPU
docker pull psychemistz/secactpy:with-r # With R SecAct/RidgeR
See DOCKER.md for detailed usage instructions.
Reproducibility
SecActPy supports three RNG backends for different reproducibility needs:
| rng_method | Description | Use case |
|---|---|---|
| 'srand' | C stdlib srand()/rand() via ctypes | Match R SecAct/RidgeR results on the same platform |
| 'gsl' | Mersenne Twister (GSL-compatible) | Cross-platform reproducibility within SecActPy |
| 'numpy' | Native NumPy RNG (~70x faster) | Fast analysis when reproducibility with R is not needed |
# Match R SecAct on same platform (default)
result = secact_activity_inference(expr, rng_method="srand")
# Cross-platform reproducible
result = secact_activity_inference(expr, rng_method="gsl")
# Fastest (~70x faster permutations)
result = secact_activity_inference(expr, rng_method="numpy")
See Reproducibility for detailed examples.
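The RNG matters because z-scores and p-values come from random permutations, while the coefficients themselves do not. The sketch below is a generic permutation test for ridge coefficients using a seeded NumPy generator. The function ridge_permutation_test, the choice to shuffle gene rows of Y, the +1 smoothing of the p-value, and the values of lam and n_perm are all assumptions for illustration, not taken from SecActPy's source.

```python
import numpy as np


def ridge_permutation_test(X, Y, lam=1.0, n_perm=200, seed=0):
    """Ridge coefficients with permutation-based z-scores and p-values."""
    rng = np.random.default_rng(seed)
    T = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    beta = T @ Y
    perm_sum = np.zeros_like(beta)
    perm_sq = np.zeros_like(beta)
    exceed = np.zeros_like(beta)
    for _ in range(n_perm):
        idx = rng.permutation(Y.shape[0])  # shuffle gene labels
        b = T @ Y[idx]
        perm_sum += b
        perm_sq += b**2
        exceed += np.abs(b) >= np.abs(beta)
    mean = perm_sum / n_perm
    std = np.sqrt(perm_sq / n_perm - mean**2)
    return {
        "beta": beta,
        "zscore": (beta - mean) / std,
        "pvalue": (exceed + 1) / (n_perm + 1),
    }


data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((100, 4))  # genes x proteins
Y = data_rng.standard_normal((100, 3))  # genes x samples
r1 = ridge_permutation_test(X, Y, seed=42)
r2 = ridge_permutation_test(X, Y, seed=42)
r3 = ridge_permutation_test(X, Y, seed=7)
assert np.array_equal(r1["zscore"], r2["zscore"])  # same seed, same z-scores
assert np.array_equal(r1["beta"], r3["beta"])      # beta does not use the RNG
```

Two runs agree on z-scores only when they draw the same permutation sequence, which is why matching R output requires the same generator and not just the same seed.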
Requirements
- Python ≥ 3.9
- NumPy ≥ 1.20
- Pandas ≥ 1.3
- SciPy ≥ 1.7
- h5py ≥ 3.0
- anndata ≥ 0.8
- scanpy ≥ 1.9
Optional: CuPy ≥ 10.0 (GPU acceleration)
Citation
If you use SecActPy in your research, please cite:
Beibei Ru, Lanqi Gong, Emily Yang, Seongyong Park, George Zaki, Kenneth Aldape, Lalage Wakefield, Peng Jiang. Inference of secreted protein activities in intercellular communication. Nature Methods, 2026 (In press)
Related Projects
- SecAct - Original R implementation
- RidgeR - R ridge regression package
- SpaCET - Spatial transcriptomics cell type analysis
- CytoSig - Cytokine signaling inference
License
MIT License - see LICENSE for details.
Changelog
See CHANGELOG.md for full version history.
v0.2.5
- Streaming H5AD: streaming=True for two-pass chunk reading of >5M-cell datasets (~3 GB peak)
- H5ADChunkReader for memory-efficient H5AD reading via h5py
- Fixed H5AD index column detection for obs.attrs['_index'] convention
v0.2.4
- col_center and col_scale parameters for independent control of sparse in-flight normalization
v0.2.3
- rng_method parameter for explicit RNG selection
- is_group_sig=True by default