SecActPy

Secreted Protein Activity Inference using Ridge Regression

Python implementation of SecAct for inferring secreted protein activities from gene expression data.

Key Features:

  • SecAct Compatible: Reproduces R SecAct results (with the RidgeFast/RidgeCuda accelerators) when run on the same platform with rng_method='srand'
  • GPU Acceleration: Optional CuPy backend for large-scale analysis
  • Million-Sample Scale: Batch processing with streaming output for massive datasets
  • Streaming H5AD: Two-pass chunk reading for >5M-cell datasets without loading the full matrix (~3 GB peak vs ~200 GB)
  • Built-in Signatures: Includes SecAct and CytoSig signature matrices
  • Multi-Platform Support: Bulk RNA-seq, scRNA-seq, and Spatial Transcriptomics (Visium, CosMx)
  • Smart Caching: Optional permutation table caching for faster repeated analyses
  • Sparse-Aware: Automatic memory-efficient processing for sparse single-cell data

Installation

Recommended: Create a virtual environment before installing to avoid dependency conflicts with other packages.

python -m venv secactpy-env
source secactpy-env/bin/activate   # Linux/macOS
# secactpy-env\Scripts\activate    # Windows

From PyPI (Recommended)

# CPU Only
pip install secactpy

# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu]"

# With GPU Support (CUDA 12.x)
pip install secactpy
pip install cupy-cuda12x

From GitHub

# CPU Only
pip install git+https://github.com/data2intelligence/SecActpy.git

# With GPU Support (CUDA 11.x)
pip install "secactpy[gpu] @ git+https://github.com/data2intelligence/SecActpy.git"

# With GPU Support (CUDA 12.x)
pip install git+https://github.com/data2intelligence/SecActpy.git
pip install cupy-cuda12x

Development Installation

git clone https://github.com/data2intelligence/SecActpy.git
cd SecActpy
pip install -e ".[dev]"

Quick Start

Example Data

Example datasets for all Quick Start tutorials are available on Zenodo:

Zenodo record: https://zenodo.org/records/18520356

| Example | Input File | Output File | Size |
| --- | --- | --- | --- |
| Bulk RNA-seq | Ly86-Fc_vs_Vehicle_logFC.txt | Ly86-Fc_vs_Vehicle_logFC_output.h5ad | 0.5 MB |
| scRNA-seq (OV CD4 T cells) | OV_scRNAseq_CD4.h5ad | OV_scRNAseq_ct_CD4_output.h5ad, OV_scRNAseq_sc_CD4_output.h5ad | 34 MB |
| Visium ST (HCC) | Visium_HCC_data.h5ad | Visium_HCC_output.h5ad | 255 MB |
| CosMx (LIHC) | LIHC_CosMx_data.h5ad | LIHC_CosMx_output.h5ad | 3.0 GB |

Download the example input files:

# Download individual files from Zenodo
wget https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
wget https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
wget https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
wget https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad

Example 1: Bulk RNA-seq

import pandas as pd
from secactpy import secact_activity_inference

# Load differential expression data (genes × samples)
# Download: https://zenodo.org/records/18520356/files/Ly86-Fc_vs_Vehicle_logFC.txt
diff_expr = pd.read_csv("Ly86-Fc_vs_Vehicle_logFC.txt", sep=r"\s+", index_col=0)

# Run inference
result = secact_activity_inference(
    diff_expr,
    is_differential=True,
    sig_matrix="secact",  # or "cytosig"
    verbose=True
)

# Access results
activity = result['zscore']    # Activity z-scores
pvalues = result['pvalue']     # P-values
coefficients = result['beta']  # Regression coefficients

Note: Set is_differential=True when the input is already log fold-change data. For single-column input with no control, row-mean centering is automatically skipped (it would produce all zeros).
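
The reason centering is skipped for single-column input can be seen in a small NumPy sketch. The values below are made up for illustration, and the centering shown is a plain row-mean subtraction, which may not match SecActPy's exact code path.

```python
import numpy as np

# Hypothetical log fold-change values: 3 genes x 1 sample (a single column).
logfc = np.array([[1.5], [-0.25], [2.0]])

# Row-mean centering subtracts each gene's mean across samples.
row_means = logfc.mean(axis=1, keepdims=True)
centered = logfc - row_means

# With one column, each row mean equals the value itself, so every entry
# becomes zero and no signal would be left for the regression.
print(centered.ravel())
```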

Example 2: scRNA-seq Analysis

import anndata as ad
from secactpy import secact_activity_inference_scrnaseq

# Load scRNA-seq data (788 OV CD4 T cells, 3 subtypes)
# Download: https://zenodo.org/records/18520356/files/OV_scRNAseq_CD4.h5ad
adata = ad.read_h5ad("OV_scRNAseq_CD4.h5ad")

# Pseudo-bulk by cell type
result = secact_activity_inference_scrnaseq(
    adata,
    cell_type_col="Annotation",
    is_single_cell_level=False,
    verbose=True
)

# Single-cell level
result_sc = secact_activity_inference_scrnaseq(
    adata,
    cell_type_col="Annotation",
    is_single_cell_level=True,
    verbose=True
)
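
Pseudo-bulk aggregation collapses the cells of each type into one profile before inference. The sketch below illustrates the idea on a toy count matrix with NumPy. The summing rule, the labels, and the counts are assumptions for illustration; SecActPy's internal aggregation and normalization may differ.

```python
import numpy as np

# Toy counts: 5 cells x 3 genes, with a hypothetical cell-type label per cell.
counts = np.array([
    [1, 0, 3],
    [2, 1, 0],
    [0, 4, 1],
    [1, 1, 1],
    [3, 0, 2],
])
labels = np.array(["Treg", "Th1", "Treg", "Th1", "Treg"])

# Map each label to a group index, then sum the rows belonging to each group.
cell_types, group_idx = np.unique(labels, return_inverse=True)
pseudo_bulk = np.zeros((len(cell_types), counts.shape[1]), dtype=counts.dtype)
np.add.at(pseudo_bulk, group_idx, counts)

# One aggregated profile per cell type (cell types x genes).
for name, profile in zip(cell_types, pseudo_bulk):
    print(name, profile)
```

With is_single_cell_level=True no such aggregation takes place, and each cell is scored on its own.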

Example 3: Spatial Transcriptomics

Visium (spot-level)

from secactpy import secact_activity_inference_st

# Load Visium HCC data (3,415 spots)
# Download: https://zenodo.org/records/18520356/files/Visium_HCC_data.h5ad
result = secact_activity_inference_st(
    "Visium_HCC_data.h5ad",
    min_genes=1000,
    verbose=True
)

activity = result['zscore']  # (proteins × spots)

CosMx (single-cell spatial)

import anndata as ad
from secactpy import secact_activity_inference_st

# Load CosMx LIHC data (443,515 cells, 1,000 genes, 12 cell types)
# Download: https://zenodo.org/records/18520356/files/LIHC_CosMx_data.h5ad
adata = ad.read_h5ad("LIHC_CosMx_data.h5ad")

# Single-cell resolution (one score per cell)
result = secact_activity_inference_st(
    adata,
    is_spot_level=True,         # Score each cell individually (default)
    batch_size=5000,            # Process in chunks to limit memory
    output_path="cosmx_sc_results.h5ad",  # Stream to disk
    verbose=True
)
# result is None when output_path is set; load with ad.read_h5ad()

# Cell-type resolution (pseudo-bulk by cell type)
result = secact_activity_inference_st(
    adata,
    cell_type_col="cellType",  # Column in adata.obs
    is_spot_level=False,        # Aggregate by cell type
    verbose=True
)

activity = result['zscore']  # (proteins × cell_types)
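
Because the in-memory return value is a dict of DataFrames, ordinary pandas operations can be used to explore it. The sketch below uses a small hand-made table in place of result['zscore']; the protein names, column labels, and numbers are placeholders, not real output.

```python
import pandas as pd

# Stand-in for result['zscore']: proteins as rows, cell types as columns.
# All names and numbers here are placeholders for illustration only.
zscore = pd.DataFrame(
    {"Hepatocyte": [2.1, -0.4, 0.9], "T cell": [-1.3, 3.2, 0.1]},
    index=["ProteinA", "ProteinB", "ProteinC"],
)

# Protein with the largest z-score in each column.
top_per_column = zscore.idxmax(axis=0)

# Proteins ranked by z-score within a single column.
ranked = zscore["T cell"].sort_values(ascending=False)

print(top_per_column)
print(ranked.head(2))
```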

Batch Processing

For large datasets (50,000+ samples), batch processing splits computation into memory-efficient chunks while producing mathematically identical results. The projection matrix is computed once, then samples are processed in chunks. Set batch_size on any high-level function:

result = secact_activity_inference(expr_df, ..., batch_size=5000)
result = secact_activity_inference_scrnaseq(adata, ..., batch_size=5000)
result = secact_activity_inference_st(adata, ..., batch_size=5000)

| Mode | Parameter | Return value | Memory for output |
| --- | --- | --- | --- |
| In-memory (default) | output_path=None | dict of DataFrames | All results in RAM |
| Streaming | output_path="results.h5ad" | None | Only one batch at a time |

Setting sparse_mode=True keeps sparse Y matrices in sparse format end to end and avoids densification. For highly sparse single-cell data (<5% density), this reduces memory by orders of magnitude and runs ~1.8x faster, with identical results.

See Batch Processing for worked examples and streaming output details.
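
Why chunking does not change the result can be checked with a plain NumPy ridge regression. This is a minimal sketch under stated assumptions: the random matrices, the penalty value lam, and the chunk size are all made up, and it is not SecActPy's implementation. The projection matrix depends only on the signature matrix, so each sample column can be projected independently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_proteins, n_samples = 200, 10, 1000
X = rng.standard_normal((n_genes, n_proteins))  # signature matrix (genes x proteins)
Y = rng.standard_normal((n_genes, n_samples))   # expression (genes x samples)
lam = 1.0  # ridge penalty; an assumed value for this sketch

# Projection matrix T = (X^T X + lam * I)^-1 X^T, computed once.
T = np.linalg.solve(X.T @ X + lam * np.eye(n_proteins), X.T)

# All samples at once.
beta_full = T @ Y

# The same projection applied to column chunks of 300 samples.
batch_size = 300
beta_batched = np.hstack([
    T @ Y[:, start:start + batch_size]
    for start in range(0, n_samples, batch_size)
])

# The two agree up to floating-point rounding.
print(np.allclose(beta_full, beta_batched))
```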

Streaming H5AD (>5M Cells)

For very large single-cell datasets (>5M cells) that exceed available RAM even with batch processing, streaming=True bypasses full-matrix loading entirely. The H5AD file is read in chunks via h5py using a two-pass algorithm:

  1. Pass 1: Read cell chunks, normalize (CPM + log2), accumulate row/column statistics
  2. Pass 2: Re-read chunks, compute per-chunk cross terms, run inference in sub-batches

Peak memory drops from ~200 GB to ~3 GB for a 5M-cell dataset. Results are numerically identical to the non-streaming path.
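
The two-pass idea can be sketched in NumPy with an in-memory array standing in for the H5AD file. This is a simplified illustration, not the library's algorithm: it assumes a log2(CPM + 1) normalization and per-gene standardization as the statistic of interest, and the helper names read_chunks and normalize are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.poisson(2.0, size=(1000, 50)).astype(float)  # cells x genes (toy data)

def read_chunks(matrix, chunk_size):
    """Yield row blocks; stands in for reading cell chunks from a file."""
    for start in range(0, matrix.shape[0], chunk_size):
        yield matrix[start:start + chunk_size]

def normalize(chunk):
    """Per-cell CPM followed by log2(x + 1); the pseudocount is an assumption."""
    cpm = chunk / chunk.sum(axis=1, keepdims=True) * 1e6
    return np.log2(cpm + 1.0)

# Pass 1: accumulate per-gene sums and sums of squares, discarding each chunk.
n_cells, total, total_sq = 0, 0.0, 0.0
for chunk in read_chunks(counts, 128):
    norm = normalize(chunk)
    n_cells += norm.shape[0]
    total = total + norm.sum(axis=0)
    total_sq = total_sq + (norm ** 2).sum(axis=0)
gene_mean = total / n_cells
gene_std = np.sqrt(total_sq / n_cells - gene_mean ** 2)

# Pass 2: re-read each chunk and apply the global statistics.
# The chunks are stacked here only so they can be compared with the reference;
# a real streaming run would process and discard them one at a time.
streamed = np.vstack([
    (normalize(chunk) - gene_mean) / gene_std
    for chunk in read_chunks(counts, 128)
])

# Reference: the same scaling computed on the full matrix in memory.
full = normalize(counts)
reference = (full - full.mean(axis=0)) / full.std(axis=0)
print(np.allclose(streamed, reference))
```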

# scRNA-seq: 6.5M cells, ~3 GB peak memory
result = secact_activity_inference_scrnaseq(
    "large_atlas.h5ad",               # file path (not AnnData object)
    cell_type_col="cell_type",
    is_single_cell_level=True,
    streaming=True,                    # enable two-pass chunk reading
    streaming_chunk_size=50_000,       # cells per chunk (default)
    output_path="results.h5ad",        # stream results to disk
    verbose=True,
)

# Spatial transcriptomics: same interface
result = secact_activity_inference_st(
    "large_spatial.h5ad",
    streaming=True,
    output_path="st_results.h5ad",
    verbose=True,
)

Requirements: streaming=True requires the adata argument to be a file path (not an in-memory AnnData object), is_single_cell_level=True (scRNA-seq) or is_spot_level=True (ST), and an H5AD file that stores X in sparse (CSR/CSC) format.

See Batch Processing for full details.

API Reference

See API Reference for full function signatures, parameters, and options. For low-level ridge() / ridge_batch() usage, see Advanced API.

GPU Acceleration

from secactpy import secact_activity_inference, CUPY_AVAILABLE

print(f"GPU available: {CUPY_AVAILABLE}")
# expression: a genes × samples DataFrame, such as diff_expr in Example 1
result = secact_activity_inference(expression, backend='auto')

| Dataset | Py (CPU) | Py (GPU) | Speedup |
| --- | --- | --- | --- |
| Bulk (1,170 sp × 1,000 samples) | 128.8s | 6.7s | 11–19x |
| scRNA-seq (1,170 sp × 788 cells) | 104.8s | 6.8s | 8–15x |
| Visium (1,170 sp × 3,404 spots) | 381.4s | 11.2s | 13–34x |
| CosMx (151 sp × 443,515 cells) | 1226.7s | 99.9s | 9–12x |

See GPU Acceleration for full benchmarks and CUDA setup. See DOCKER.md for Docker vs native performance benchmarks.
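
An optional GPU backend is commonly built on the fact that CuPy mirrors much of the NumPy array API, so the same code can run on either module. The sketch below shows one way such a fallback might look. The helper pick_array_module is hypothetical and is not part of SecActPy; how backend='auto' is resolved internally is not described here.

```python
import numpy as np

def pick_array_module(backend="auto"):
    """Return (module, name) for the requested backend; a hypothetical helper."""
    if backend == "numpy":
        return np, "numpy"
    try:
        import cupy as cp  # optional dependency
        cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    except Exception:
        if backend == "cupy":
            raise  # the caller asked for the GPU explicitly
        return np, "numpy"
    return cp, "cupy"

# The same array code runs on whichever module was selected.
xp, name = pick_array_module("auto")
a = xp.arange(6, dtype=xp.float64).reshape(2, 3)
gram_total = float((a @ a.T).sum())
print(name, gram_total)
```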

Command Line Interface

secactpy bulk -i diff_expr.tsv -o results.h5ad --differential -v
secactpy scrnaseq -i data.h5ad -o results.h5ad --cell-type-col celltype -v
secactpy visium -i /path/to/visium/ -o results.h5ad -v
secactpy cosmx -i cosmx.h5ad -o results.h5ad --batch-size 50000 -v

| Option | Description |
| --- | --- |
| -i, --input | Input file or directory |
| -o, --output | Output H5AD file |
| -s, --signature | Signature matrix (secact, cytosig) |
| --backend | Computation backend (auto, numpy, cupy) |
| --batch-size | Batch size for large datasets |
| -v, --verbose | Verbose output |

See CLI Reference for all commands and options.
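
When the CLI is driven from a pipeline script, building the argument list in Python avoids shell-quoting problems. The sketch below uses only flags that appear in this README. The helper build_bulk_command is hypothetical, and whether every listed option is accepted by every subcommand should be checked against the CLI Reference.

```python
import shlex

def build_bulk_command(input_path, output_path, signature="secact",
                       differential=True, verbose=True):
    """Assemble argv for `secactpy bulk` from flags shown in this README."""
    cmd = ["secactpy", "bulk", "-i", str(input_path), "-o", str(output_path),
           "-s", signature]
    if differential:
        cmd.append("--differential")
    if verbose:
        cmd.append("-v")
    return cmd

cmd = build_bulk_command("diff_expr.tsv", "results.h5ad")
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```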

Docker

docker pull psychemistz/secactpy:latest      # CPU
docker pull psychemistz/secactpy:gpu         # GPU
docker pull psychemistz/secactpy:with-r      # With R SecAct + RidgeFast (CPU acceleration)
docker pull psychemistz/secactpy:gpu-with-r  # With R SecAct + RidgeFast + RidgeCuda (GPU acceleration)

See DOCKER.md for Docker usage and docs/installation.md for native R-side install on Linux/macOS/Windows.

Reproducibility

SecActPy supports three RNG backends for different reproducibility needs:

| rng_method | Description | Use case |
| --- | --- | --- |
| 'srand' | C stdlib srand()/rand() via ctypes | Match R SecAct (with RidgeFast/RidgeCuda) results on the same platform |
| 'gsl' | Mersenne Twister (GSL-compatible) | Cross-platform reproducibility within SecActPy |
| 'numpy' | Native NumPy RNG (~70x faster) | Fast analysis when reproducibility with R is not needed |

# Match R SecAct on same platform (default)
result = secact_activity_inference(expr, rng_method="srand")

# Cross-platform reproducible
result = secact_activity_inference(expr, rng_method="gsl")

# Fastest (~70x faster permutations)
result = secact_activity_inference(expr, rng_method="numpy")

See Reproducibility for detailed examples.
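
The choice of RNG matters because permutation-based statistics depend on the exact sequence of shuffles. The sketch below shows that a fixed seed yields the same permutation table on every run within one RNG implementation. It assumes permutations are drawn as shuffled gene indices, which this README does not spell out; the helper permutation_table is hypothetical, and NumPy's generator stands in for any of the three backends.

```python
import numpy as np

def permutation_table(n_genes, n_perm, seed):
    """Build an (n_perm x n_genes) table of shuffled gene indices."""
    rng = np.random.default_rng(seed)
    return np.stack([rng.permutation(n_genes) for _ in range(n_perm)])

# Two runs with the same seed produce the same table.
table_a = permutation_table(n_genes=8, n_perm=4, seed=0)
table_b = permutation_table(n_genes=8, n_perm=4, seed=0)

print(np.array_equal(table_a, table_b))
```

A different RNG algorithm seeded with the same value generally produces a different sequence, which is why matching R requires the same generator as well as the same seed.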

Requirements

  • Python ≥ 3.9
  • NumPy ≥ 1.20
  • Pandas ≥ 1.3
  • SciPy ≥ 1.7
  • h5py ≥ 3.0
  • anndata ≥ 0.8
  • scanpy ≥ 1.9

Optional: CuPy ≥ 10.0 (GPU acceleration)

Citation

If you use SecActPy in your research, please cite:

Beibei Ru, Lanqi Gong, Emily Yang, Seongyong Park, George Zaki, Kenneth Aldape, Lalage Wakefield, Peng Jiang. Inference of secreted protein activities in intercellular communication. Nature Methods, 2026 (In press)

Related Projects

  • SecAct - Original R implementation (R-native)
  • RidgeFast - Optional CPU accelerator (R + C, cross-platform)
  • RidgeCuda - Optional GPU accelerator (R + CUDA, Linux only)
  • SpaCET - Spatial transcriptomics cell type analysis
  • CytoSig - Cytokine signaling inference

License

MIT License - see LICENSE for details.

Changelog

See CHANGELOG.md for full version history.

v0.2.5

  • Streaming H5AD: streaming=True for two-pass chunk reading of >5M-cell datasets (~3 GB peak)
  • H5ADChunkReader for memory-efficient H5AD reading via h5py
  • Fixed H5AD index column detection for obs.attrs['_index'] convention

v0.2.4

  • col_center and col_scale parameters for independent control of sparse in-flight normalization

v0.2.3

  • rng_method parameter for explicit RNG selection
  • is_group_sig=True by default
