Secreted Protein Activity Inference using Ridge Regression

SecActPy

Python implementation of SecAct for inferring secreted protein activities from gene expression data.

Key Features:

🎯 SecAct Compatible: Produces identical results to the R SecAct/RidgeR package
🚀 GPU Acceleration: Optional CuPy backend for large-scale analysis
📊 Million-Sample Scale: Batch processing with streaming output for massive datasets
🔬 Built-in Signatures: Includes SecAct and CytoSig signature matrices
🧬 Multi-Platform Support: Bulk RNA-seq, scRNA-seq, and Spatial Transcriptomics (Visium, CosMx)
💾 Smart Caching: Optional permutation table caching for faster repeated analyses
🧮 Sparse-Aware: Automatic memory-efficient processing for sparse single-cell data

Installation

Recommended: Create a virtual environment before installing to avoid dependency conflicts with other packages.
python -m venv secactpy-env
source secactpy-env/bin/activate   # Linux/macOS
# secactpy-env\Scripts\activate    # Windows

From PyPI (Recommended)

# CPU Only
pip install secactpy

# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu]"

# With GPU Support (CUDA 12.x)
pip install secactpy
pip install cupy-cuda12x

From GitHub

# CPU Only
pip install git+https://github.com/data2intelligence/SecActpy.git

# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu] @ git+https://github.com/data2intelligence/SecActpy.git"

# With GPU Support (CUDA 12.x)
pip install git+https://github.com/data2intelligence/SecActpy.git
pip install cupy-cuda12x

Important (CUDA 12.x users): Do not use the [gpu] extra on CUDA 12.x systems — it installs cupy-cuda11x, which conflicts with cupy-cuda12x. If you already installed with [gpu], remove the conflicting package first:
pip uninstall cupy-cuda11x
pip install cupy-cuda12x
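
To check for the conflict described above before reinstalling, the following minimal sketch lists which CuPy wheels are present in the current environment. It relies only on the standard library; the helper names `installed_cupy_wheels` and `has_conflict` are illustrative and not part of SecActPy.

```python
from importlib.metadata import PackageNotFoundError, version

# Wheel names mentioned in the install notes above.
CUPY_WHEELS = ("cupy-cuda11x", "cupy-cuda12x")


def installed_cupy_wheels(names=CUPY_WHEELS):
    """Return {distribution name: version} for each listed wheel that is installed."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            pass  # this wheel is not installed
    return found


def has_conflict(found):
    """True when more than one CuPy wheel is installed side by side."""
    return len(found) > 1


if __name__ == "__main__":
    wheels = installed_cupy_wheels()
    print("Installed CuPy wheels:", wheels or "none")
    if has_conflict(wheels):
        print("Conflict: keep only the wheel that matches your CUDA version.")
```

Run it inside the same virtual environment that SecActPy is installed in, since the check only sees packages visible to the running interpreter.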

Development Installation

git clone https://github.com/data2intelligence/SecActpy.git
cd SecActpy
pip install -e ".[dev]"

Quick Start

Example Data

Example datasets for all Quick Start tutorials are available on Zenodo:

Example	Input File	Output File	Size
Bulk RNA-seq	`Ly86-Fc_vs_Vehicle_logFC.txt`	`Ly86-Fc_vs_Vehicle_logFC_output.h5ad`	0.5 MB
scRNA-seq (OV CD4 T cells)	`OV_scRNAseq_CD4.h5ad`	`OV_scRNAseq_ct_CD4_output.h5ad`, `OV_scRNAseq_sc_CD4_output.h5ad`	34 MB
Visium ST (HCC)	`Visium_HCC_data.h5ad`	`Visium_HCC_output.h5ad`	255 MB
CosMx (LIHC)	`LIHC_CosMx_data.h5ad`	`LIHC_CosMx_output.h5ad`	3.0 GB

Download all example files:

# Download individual files from Zenodo
wget https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
wget https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
wget https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
wget https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad

Example 1: Bulk RNA-seq

import pandas as pd
from secactpy import secact_activity_inference

# Load differential expression data (genes × samples)
# Download: https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
diff_expr = pd.read_csv("Ly86-Fc_vs_Vehicle_logFC.txt", sep=r"\s+", index_col=0)

# Run inference
result = secact_activity_inference(
    diff_expr,
    is_differential=True,
    sig_matrix="secact",  # or "cytosig"
    verbose=True
)

# Access results
activity = result['zscore']    # Activity z-scores
pvalues = result['pvalue']     # P-values
coefficients = result['beta']  # Regression coefficients

Note: Set is_differential=True when the input is already log fold-change data. For single-column input with no control, row-mean centering is automatically skipped (it would produce all zeros).
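
The single-column case in the note can be seen with a few lines of NumPy. This is a toy illustration of row-mean centering in general, not SecActPy's internal code; `center_rows` is a hypothetical helper.

```python
import numpy as np


def center_rows(expr):
    """Subtract each gene's mean across samples (rows = genes, columns = samples)."""
    return expr - expr.mean(axis=1, keepdims=True)


# Three genes, two samples: centering keeps the contrast between samples.
two_samples = np.array([[1.0, 3.0], [2.0, 2.0], [0.0, 4.0]])
centered = center_rows(two_samples)

# Three genes, one sample: every value equals its own row mean, so all entries become zero.
one_sample = np.array([[1.5], [-0.7], [2.0]])
collapsed = center_rows(one_sample)

print(centered)
print(collapsed)
```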

Example 2: scRNA-seq Analysis

import anndata as ad
from secactpy import secact_activity_inference_scrnaseq

# Load scRNA-seq data (788 OV CD4 T cells, 3 subtypes)
# Download: https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
adata = ad.read_h5ad("OV_scRNAseq_CD4.h5ad")

# Pseudo-bulk by cell type
result = secact_activity_inference_scrnaseq(
    adata,
    cell_type_col="Annotation",
    is_single_cell_level=False,
    verbose=True
)

# Single-cell level
result_sc = secact_activity_inference_scrnaseq(
    adata,
    cell_type_col="Annotation",
    is_single_cell_level=True,
    verbose=True
)
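
For intuition about the pseudo-bulk mode, the sketch below sums a cells × genes count matrix within each cell-type label. Summing raw counts is one common way to build pseudo-bulk profiles; whether SecActPy aggregates exactly this way, and how it normalizes afterwards, is not stated here, so treat this as an assumption. The labels and the `pseudo_bulk` helper are made up for the example.

```python
import numpy as np


def pseudo_bulk(counts, labels):
    """Sum a cells x genes count matrix within each label.

    Returns (sorted unique labels, groups x genes matrix of summed counts).
    """
    labels = np.asarray(labels)
    groups, inverse = np.unique(labels, return_inverse=True)
    summed = np.zeros((groups.size, counts.shape[1]), dtype=counts.dtype)
    # np.add.at accumulates every row that maps to the same group index.
    np.add.at(summed, inverse, counts)
    return groups, summed


# Four cells, three genes, two illustrative cell types.
counts = np.array([[1, 0, 2], [3, 1, 0], [0, 5, 1], [2, 2, 2]])
labels = ["typeB", "typeA", "typeB", "typeA"]
groups, profile = pseudo_bulk(counts, labels)

print(groups)
print(profile)
```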

Example 3: Spatial Transcriptomics

Visium (spot-level)

from secactpy import secact_activity_inference_st

# Load Visium HCC data (3,415 spots)
# Download: https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
result = secact_activity_inference_st(
    "Visium_HCC_data.h5ad",
    min_genes=1000,
    verbose=True
)

activity = result['zscore']  # (proteins × spots)

CosMx (single-cell spatial)

import anndata as ad
from secactpy import secact_activity_inference_st

# Load CosMx LIHC data (443,515 cells, 1,000 genes, 12 cell types)
# Download: https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad
adata = ad.read_h5ad("LIHC_CosMx_data.h5ad")

# Cell-type resolution (pseudo-bulk by cell type)
result = secact_activity_inference_st(
    adata,
    cell_type_col="cellType",  # Column in adata.obs
    is_spot_level=False,        # Aggregate by cell type
    verbose=True
)

activity = result['zscore']  # (proteins × cell_types)

Large-Scale Batch Processing

What is batch processing?

By default, SecActPy loads the entire expression matrix into memory and runs ridge regression on all samples at once. This works well for most datasets, but for large-scale analyses (e.g., 50,000+ single cells or spatial spots) the memory required for permutation testing can exceed available RAM or GPU memory.

Batch processing splits the work into smaller pieces. The expensive projection matrix T = (X'X + λI)^{-1} X' is computed once from the signature, then samples are processed in chunks of batch_size at a time. Each chunk goes through the full permutation-testing pipeline independently, and partial results are concatenated at the end. The final output is mathematically identical to processing all samples at once — only peak memory usage is reduced.
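
The equivalence claimed above follows from the fact that the coefficients are a matrix product, so each block of sample columns can be projected independently. The sketch below checks this on random toy data with plain NumPy. It covers only the coefficient step, not the permutation testing, and the function names and the value of λ are illustrative assumptions rather than SecActPy internals.

```python
import numpy as np


def projection_matrix(X, lam):
    """T = (X'X + lam*I)^{-1} X' for a genes x proteins signature X."""
    n_features = X.shape[1]
    gram = X.T @ X + lam * np.eye(n_features)
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(gram, X.T)


def ridge_beta_batched(T, Y, batch_size):
    """Apply T to Y (genes x samples) one block of columns at a time."""
    chunks = [T @ Y[:, start:start + batch_size]
              for start in range(0, Y.shape[1], batch_size)]
    return np.hstack(chunks)


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))     # toy signature: 200 genes x 5 proteins
Y = rng.standard_normal((200, 23))    # toy expression: 200 genes x 23 samples

T = projection_matrix(X, lam=10.0)    # computed once
beta_all = T @ Y                      # all samples at once
beta_batched = ridge_beta_batched(T, Y, batch_size=7)

print(np.allclose(beta_all, beta_batched))
```

Because `T` depends only on the signature and λ, its cost is paid once regardless of how many batches follow.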

All three high-level functions support batch_size and output_path:

secact_activity_inference() — bulk RNA-seq
secact_activity_inference_scrnaseq() — scRNA-seq
secact_activity_inference_st() — spatial transcriptomics

Set batch_size to enable it:

# Without batch processing: all samples at once (default)
result = secact_activity_inference(expr_df, ...)

# With batch processing: 5000 samples per chunk
result = secact_activity_inference(expr_df, ..., batch_size=5000)

# Works the same way for scRNA-seq and ST:
result = secact_activity_inference_scrnaseq(adata, ..., batch_size=5000)
result = secact_activity_inference_st(adata, ..., batch_size=5000)

In-memory vs streaming output

By default, batch results are accumulated in memory and returned as a dictionary of DataFrames — this is the in-memory mode. You get back a dict with result['zscore'], result['pvalue'], etc., just like the non-batched case.

For very large datasets, even the output matrices (beta, zscore, pvalue, se — each of shape n_proteins × n_samples) may not fit in memory. Streaming output solves this: set output_path to write each batch's results directly to an HDF5 file on disk as it completes. The function returns None in this mode — no results are held in memory. You load them back from the file when needed. All three high-level functions support this.

Mode	Parameter	Return value	Memory for output
In-memory (default)	`output_path=None`	`dict` of DataFrames	All results in RAM
Streaming	`output_path="results.h5ad"`	`None`	Only one batch at a time
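
To judge whether in-memory output is feasible, a back-of-envelope estimate of the four output matrices is often enough. The sketch below assumes dense float64 storage (8 bytes per value), which is an assumption about the in-memory representation. The helper name is hypothetical and is not the library's `estimate_memory()`.

```python
def output_matrix_bytes(n_proteins, n_samples, n_matrices=4, bytes_per_value=8):
    """Bytes needed to hold the dense output matrices (beta, se, zscore, pvalue)."""
    return n_proteins * n_samples * n_matrices * bytes_per_value


# Shape taken from the CosMx row of the Performance table (151 x 443,515),
# reading "sp" as the number of proteins, which is an assumption.
cosmx = output_matrix_bytes(151, 443_515)
# A hypothetical million-sample run with 1,170 proteins.
million = output_matrix_bytes(1_170, 1_000_000)

print(f"CosMx outputs:          {cosmx / 1e9:.2f} GB")
print(f"Million-sample outputs: {million / 1e9:.2f} GB")
```

Under these assumptions the CosMx outputs need about 2.14 GB, while a million-sample run needs about 37.44 GB, which is the regime where streaming becomes attractive.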

# Streaming works with any high-level function:
secact_activity_inference(..., batch_size=5000, output_path="bulk_results.h5ad")
secact_activity_inference_scrnaseq(..., batch_size=5000, output_path="sc_results.h5ad")
secact_activity_inference_st(..., batch_size=5000, output_path="st_results.h5ad")
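
The streaming pattern itself is simple: preallocate the output on disk, write each batch into its slice, and keep nothing else in memory. The sketch below illustrates that pattern with a NumPy `.npy` memory map instead of HDF5/H5AD, purely to stay dependency-light. It does not reproduce SecActPy's file layout, and `stream_batches` is a hypothetical helper.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap


def stream_batches(T, Y, batch_size, path):
    """Write T @ Y to a .npy file block by block; one block is in RAM at a time."""
    n_out, n_samples = T.shape[0], Y.shape[1]
    out = open_memmap(path, mode="w+", dtype=np.float64, shape=(n_out, n_samples))
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        out[:, start:stop] = T @ Y[:, start:stop]  # goes to the file-backed array
    out.flush()
    del out  # release the memory map
    return None  # mirrors the "returns None" behaviour described above


rng = np.random.default_rng(0)
T = rng.standard_normal((5, 200))    # toy projection matrix
Y = rng.standard_normal((200, 23))   # toy expression matrix

path = os.path.join(tempfile.mkdtemp(), "beta.npy")
stream_batches(T, Y, batch_size=7, path=path)

beta = np.load(path, mmap_mode="r")  # read back lazily when needed
print(beta.shape)
```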

Example: batch processing with `secact_activity_inference`

secact_activity_inference handles gene subsetting, z-score normalization, signature grouping, and row expansion automatically — you just pass your expression data and set batch_size.

# Download example data (788 OV CD4 T cells, 34 MB)
wget https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad

from secactpy import secact_activity_inference
import anndata as ad

# Load multi-sample expression data
adata = ad.read_h5ad("OV_scRNAseq_CD4.h5ad")

# --- In-memory mode (default) ---
# Results are returned as a dict of DataFrames
result = secact_activity_inference(
    adata.to_df().T,         # genes × cells DataFrame
    is_differential=False,   # center by row means across samples
    batch_size=200,          # process 200 cells per batch
    verbose=True
)
print(result['zscore'].head())  # (proteins × cells) DataFrame

# --- Streaming mode ---
# Results are written to disk; function returns None
secact_activity_inference(
    adata.to_df().T,
    is_differential=False,
    batch_size=200,
    output_path="results.h5ad",       # write here instead of returning
    output_compression="gzip",        # compress on disk (default)
    verbose=True
)
# Load results back when needed:
import h5py
with h5py.File("results.h5ad", "r") as f:
    zscore = f['zscore'][:]           # NumPy array (proteins × cells)

Advanced: `ridge_batch` for full control

The high-level secact_activity_inference handles gene subsetting, scaling, centering, and streaming output automatically. If you need more control — for example, to pass a sparse matrix directly or skip normalization — use the lower-level ridge_batch function.

Why dense and sparse inputs are handled differently. ridge_batch processes Y in chunks and needs whole-column statistics (mean and standard deviation) for z-score normalization. How it gets those statistics depends on the input format:

Dense (NumPy array): The function cannot compute whole-column statistics because it only sees one chunk at a time, and the full array may be too large to scan upfront. You must z-score normalize Y yourself before calling.
Sparse (scipy.sparse matrix): Computing column means and standard deviations from a sparse matrix is cheap (no dense conversion needed), so the function does this automatically upfront, then applies z-score normalization on-the-fly within each batch. This is done because sparse matrices cannot be z-scored in place without losing sparsity — the result would be fully dense.
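
The sparse case can be sketched without any dense conversion: column sums and sums of squares over the stored non-zeros are enough, because implicit zeros add nothing to either sum. The example below works on COO-style triplets (the same fields that `scipy.sparse.coo_matrix` exposes as `.col` and `.data`) using NumPy only. The use of the sample standard deviation (`ddof=1`) is an assumption, since the convention used by `ridge_batch` is not stated here, and both helper names are hypothetical.

```python
import numpy as np


def sparse_column_stats(cols, vals, shape, ddof=1):
    """Column mean and standard deviation from COO triplets, without densifying.

    Implicit zeros contribute nothing to the sums but still count in n_rows.
    """
    n_rows, n_cols = shape
    col_sum = np.bincount(cols, weights=vals, minlength=n_cols)
    col_sumsq = np.bincount(cols, weights=vals * vals, minlength=n_cols)
    mean = col_sum / n_rows
    # Sum-of-squares form: simple, though less stable than a two-pass formula.
    var = (col_sumsq - n_rows * mean**2) / (n_rows - ddof)
    return mean, np.sqrt(np.maximum(var, 0.0))


def zscore_chunk(dense_chunk, mean, std, start):
    """Z-score a dense block of columns using precomputed whole-column statistics."""
    stop = start + dense_chunk.shape[1]
    return (dense_chunk - mean[start:stop]) / std[start:stop]


# Toy 4 x 3 matrix stored as (row, col, value) triplets.
rows = np.array([0, 1, 3, 0, 2])
cols = np.array([0, 0, 1, 2, 2])
vals = np.array([2.0, 4.0, 3.0, 1.0, 5.0])
shape = (4, 3)

mean, std = sparse_column_stats(cols, vals, shape)

# Dense copy, built here only to check the result and to stand in for one batch.
dense = np.zeros(shape)
dense[rows, cols] = vals
z_batch = zscore_chunk(dense[:, 1:3], mean, std, start=1)
```

Only the batch currently being processed is ever dense; the statistics themselves are two vectors of length n_columns.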

If you do not want automatic sparse scaling, convert to dense first and normalize however you like (or not at all):

import numpy as np
from secactpy import ridge_batch

# X (signature matrix) and Y_sparse (scipy.sparse expression matrix) are assumed to be defined already.
# Opt out of auto-scaling: convert sparse to dense, apply your own processing
Y_dense = Y_sparse.toarray().astype(np.float64)
# ... apply your own normalization (or skip it) ...
result = ridge_batch(X, Y_dense, batch_size=5000)

API Reference

High-Level Functions

All three inference functions support batch_size, output_path, and output_compression for large-scale and streaming workflows.

Function	Description
`secact_activity_inference()`	Bulk RNA-seq inference
`secact_activity_inference_scrnaseq()`	scRNA-seq inference
`secact_activity_inference_st()`	Spatial transcriptomics inference
`load_signature(name='secact')`	Load built-in signature matrix

Core Functions

Function	Description
`ridge()`	Single-call ridge regression with permutation testing
`ridge_batch()`	Batch processing for large datasets (dense or sparse)
`estimate_batch_size()`	Estimate optimal batch size for available memory
`estimate_memory()`	Estimate memory requirements

Key Parameters

Parameter	Default	Description
`sig_matrix`	`"secact"`	Signature: "secact", "cytosig", or DataFrame
`lambda_`	`5e5`	Ridge regularization parameter
`n_rand`	`1000`	Number of permutations
`seed`	`0`	Random seed for reproducibility
`backend`	`'auto'`	'auto', 'numpy', or 'cupy'
`use_cache`	`False`	Cache permutation tables to disk
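
To show what `n_rand` and `seed` control, here is a generic permutation test for ridge coefficients: gene labels are shuffled `n_rand` times to build a null distribution, from which a z-score and an empirical p-value are derived. This is a textbook-style sketch under stated assumptions. The exact statistic SecActPy computes is not spelled out in this README, and the sketch uses NumPy's generator rather than the GSL-compatible RNG, so its numbers will not match the package's output.

```python
import numpy as np


def permutation_test(T, y, n_rand=1000, seed=0):
    """Generic permutation test for ridge coefficients beta = T @ y.

    y is one sample (length n_genes). Its entries are shuffled n_rand times
    to build a null distribution for every coefficient.
    """
    rng = np.random.default_rng(seed)  # NumPy RNG, not the GSL-compatible one
    beta = T @ y
    null = np.empty((n_rand, beta.size))
    for i in range(n_rand):
        null[i] = T @ rng.permutation(y)
    zscore = (beta - null.mean(axis=0)) / null.std(axis=0)
    # Add-one correction keeps the empirical p-value strictly positive.
    exceed = (np.abs(null) >= np.abs(beta)).sum(axis=0)
    pvalue = (exceed + 1) / (n_rand + 1)
    return beta, zscore, pvalue


rng = np.random.default_rng(1)
X = rng.standard_normal((300, 4))                      # toy signature
T = np.linalg.solve(X.T @ X + 1.0 * np.eye(4), X.T)    # ridge projection
y = 2.0 * X[:, 0] + rng.standard_normal(300)           # sample driven by feature 0

beta, zscore, pvalue = permutation_test(T, y, n_rand=200, seed=0)
again = permutation_test(T, y, n_rand=200, seed=0)     # same seed, same result
```

Fixing the seed makes the permutations, and therefore the z-scores and p-values, repeatable from run to run.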

ST-Specific Parameters

Parameter	Default	Description
`cell_type_col`	`None`	Column in AnnData.obs for cell type
`is_spot_level`	`True`	If False, aggregate by cell type
`scale_factor`	`1e5`	Normalization scale factor

Batch Processing Parameters

Supported by all three high-level inference functions and ridge_batch().

Parameter	Default	Description
`batch_size`	`None`	Samples per batch (`None` = all at once)
`output_path`	`None`	Stream results to H5AD file (requires `batch_size`)
`output_compression`	`"gzip"`	Compression: "gzip", "lzf", or None

GPU Acceleration

from secactpy import secact_activity_inference, CUPY_AVAILABLE

print(f"GPU available: {CUPY_AVAILABLE}")

# Auto-detect GPU
result = secact_activity_inference(expression, backend='auto')

# Force GPU
result = secact_activity_inference(expression, backend='cupy')
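
The 'auto' option presumably falls back to NumPy when no GPU is usable. A minimal sketch of that kind of selection logic is shown below. It is an assumption about the general pattern, not SecActPy's actual implementation, and `pick_backend` is a hypothetical helper.

```python
import numpy as np


def pick_backend(requested="auto"):
    """Return (array module, backend name) for 'auto', 'numpy', or 'cupy'."""
    if requested not in ("auto", "numpy", "cupy"):
        raise ValueError(f"unknown backend: {requested!r}")
    if requested == "numpy":
        return np, "numpy"
    try:
        import cupy as cp
        # Importing can succeed on a machine with no usable GPU, so check for a device.
        if cp.cuda.runtime.getDeviceCount() > 0:
            return cp, "cupy"
    except Exception:
        pass  # CuPy missing, or the CUDA runtime is unavailable
    if requested == "cupy":
        raise RuntimeError("backend='cupy' requested but no usable CuPy/GPU was found")
    return np, "numpy"


xp, name = pick_backend("auto")
total = float(xp.arange(5).sum())  # the same array code runs on either backend
print(name, total)
```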

Performance

Dataset	R (Mac M1)	R (Linux)	Py (CPU)	Py (GPU)	Speedup
Bulk (1,170 sp × 1,000 samples)	74.4s	141.6s	128.8s	6.7s	11–19x
scRNA-seq (1,170 sp × 788 cells)	54.9s	117.4s	104.8s	6.8s	8–15x
Visium (1,170 sp × 3,404 spots)	141.7s	379.8s	381.4s	11.2s	13–34x
CosMx (151 sp × 443,515 cells)	936.9s	976.1s	1226.7s	99.9s	9–12x
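
The Speedup column is not defined in the table. The ranges are consistent with dividing the R (Mac M1) time (lower bound) and the Py (CPU) time (upper bound) by the Py (GPU) time. That reading is inferred from the numbers, not stated by the authors; the short check below reproduces it.

```python
# Timings in seconds, copied from the table above: (R Mac M1, Py CPU, Py GPU).
timings = {
    "Bulk":      (74.4, 128.8, 6.7),
    "scRNA-seq": (54.9, 104.8, 6.8),
    "Visium":    (141.7, 381.4, 11.2),
    "CosMx":     (936.9, 1226.7, 99.9),
}

# Ratio of each CPU timing to the GPU timing, rounded to the nearest integer.
speedups = {
    name: (round(r_mac / gpu), round(py_cpu / gpu))
    for name, (r_mac, py_cpu, gpu) in timings.items()
}

for name, (low, high) in speedups.items():
    print(f"{name}: {low}-{high}x")
```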

Benchmark Environment

Mac CPU: M1 Pro with VECLIB (8 cores)
Linux CPU: AMD EPYC 7543P (4 cores)
Linux GPU: NVIDIA A100-SXM4-80GB

Command Line Interface

SecActPy provides a command line interface for common workflows:

# Bulk RNA-seq (differential expression)
secactpy bulk -i diff_expr.tsv -o results.h5ad --differential -v

# Bulk RNA-seq (raw counts)
secactpy bulk -i counts.tsv -o results.h5ad -v

# scRNA-seq with cell type aggregation
secactpy scrnaseq -i data.h5ad -o results.h5ad --cell-type-col celltype -v

# scRNA-seq at single cell level
secactpy scrnaseq -i data.h5ad -o results.h5ad --single-cell -v

# Visium spatial transcriptomics
secactpy visium -i /path/to/visium/ -o results.h5ad -v

# CosMx (single-cell spatial)
secactpy cosmx -i cosmx.h5ad -o results.h5ad --batch-size 50000 -v

# Use GPU acceleration
secactpy bulk -i data.tsv -o results.h5ad --backend cupy -v

# Use CytoSig signature
secactpy bulk -i data.tsv -o results.h5ad --signature cytosig -v

CLI Options

Option	Description
`-i, --input`	Input file or directory
`-o, --output`	Output H5AD file
`-s, --signature`	Signature matrix (secact, cytosig)
`--lambda`	Ridge regularization (default: 5e5)
`-n, --n-rand`	Number of permutations (default: 1000)
`--backend`	Computation backend (auto, numpy, cupy)
`--batch-size`	Batch size for large datasets
`-v, --verbose`	Verbose output

Docker

Pre-built Docker images are available:

# CPU version
docker pull psychemistz/secactpy:latest

# GPU version
docker pull psychemistz/secactpy:gpu

# With R SecAct/RidgeR for cross-validation
docker pull psychemistz/secactpy:with-r

See DOCKER.md for detailed usage instructions.

Reproducibility

With matching parameters and the default R-compatible RNG (use_gsl_rng=True), SecActPy produces results identical to R SecAct/RidgeR:

result = secact_activity_inference(
    expression,
    is_differential=True,
    sig_matrix="secact",
    lambda_=5e5,
    n_rand=1000,
    seed=0,
    use_gsl_rng=True  # Default: R-compatible RNG
)

For faster analysis (when R compatibility is not required):

result = secact_activity_inference(
    expression,
    use_gsl_rng=False,  # ~70x faster permutation generation
)

Requirements

Python ≥ 3.9
NumPy ≥ 1.20
Pandas ≥ 1.3
SciPy ≥ 1.7
h5py ≥ 3.0
anndata ≥ 0.8
scanpy ≥ 1.9

Optional: CuPy ≥ 10.0 (GPU acceleration)

Citation

If you use SecActPy in your research, please cite:

Beibei Ru, Lanqi Gong, Emily Yang, Seongyong Park, George Zaki, Kenneth Aldape, Lalage Wakefield, Peng Jiang. Inference of secreted protein activities in intercellular communication. GitHub: data2intelligence/SecAct

Related Projects

SecAct - Original R implementation
RidgeR - R ridge regression package
SpaCET - Spatial transcriptomics cell type analysis
CytoSig - Cytokine signaling inference

License

MIT License - see LICENSE for details.

Changelog

v0.2.1

Streaming output (output_path) support in all high-level inference functions
use_gsl_rng support in ridge_batch() for ~70x faster permutation generation
Fixed use_gsl_rng being silently ignored in batch processing
Expanded batch processing documentation with examples and downloadable data

v0.2.0 (Official Release)

Official release under data2intelligence organization
PyPI package available (pip install secactpy)
Comprehensive test suite and CI/CD pipeline
Docker images with GPU and R support

v0.1.2 (Initial Development)

Ridge regression with permutation-based significance testing
GPU acceleration via CuPy backend (9–34x speedup)
Batch processing with streaming H5AD output for million-sample datasets
Automatic sparse matrix handling in ridge_batch()
Built-in SecAct and CytoSig signature matrices
GSL-compatible RNG for R/RidgeR reproducibility
Support for Bulk RNA-seq, scRNA-seq, and Spatial Transcriptomics
Cell type resolution for ST data (cell_type_col, is_spot_level)
Optional permutation table caching (use_cache)
