Skip to main content

Python implementation of the PAGE algorithm

Project description

pyPAGE

pyPAGE is a Python implementation of the conditional-information PAGE framework for gene-set enrichment analysis.

It is designed to infer differential activity of pathways and regulons while accounting for annotation and membership biases using information-theoretic methods.

Approach

Bulk PAGE

Standard gene-set enrichment methods test whether pathway members are non-randomly distributed across a ranked gene list. pyPAGE frames this as an information-theoretic question: how much does knowing a gene's pathway membership tell you about its expression bin?

  1. Discretize continuous expression scores (e.g. log2 fold-change) into equal-frequency bins
  2. Compute mutual information (MI) between expression bins and pathway membership — or conditional MI (CMI), which conditions on how many pathways each gene belongs to, correcting for the bias that heavily-annotated genes drive spurious enrichment
  3. Permutation test to assess significance, with early stopping
  4. Redundancy filtering removes pathways whose signal is explained by an already-accepted pathway (via CMI between memberships)
  5. Hypergeometric enrichment per bin produces the iPAGE-style heatmap showing which expression bins drive each pathway's signal

Single-Cell PAGE

For single-cell data, the question becomes: are pathway scores spatially coherent across the cell manifold? A pathway whose activity varies smoothly across cell states (rather than randomly) is biologically meaningful.

  1. Per-cell scoring — for each cell, compute MI or CMI between gene expression bins and pathway membership across all genes. This produces an (n_cells x n_pathways) score matrix
  2. KNN graph — build a cell-cell k-nearest-neighbor graph from expression (or use a precomputed one from scanpy)
  3. Geary's C — measure spatial autocorrelation of each pathway's scores on the KNN graph. Report C' = 1 - C, where higher values mean the pathway varies coherently across the manifold rather than randomly
  4. Permutation test — generate size-matched random gene sets, compute their C', and derive empirical p-values with BH FDR correction

Installation

Install from PyPI:

pip install bio-pypage

Or install from source:

git clone https://github.com/goodarzilab/pyPAGE
cd pyPAGE
pip install -e .

Quick Start

import pandas as pd
from pypage import PAGE, ExpressionProfile, GeneSets

# 1) Load expression profile (gene, score)
expr = pd.read_csv(
    "example_data/AP2S1.tab.gz",
    sep="\t",
    header=None,
    names=["gene", "score"],
)
exp = ExpressionProfile(expr["gene"], expr["score"], is_bin=True)

# 2) Load annotation (gene, pathway)
ann = pd.read_csv(
    "example_data/GO_BP_2021_index.txt.gz",
    sep="\t",
    header=None,
    names=["gene", "pathway"],
)
gs = GeneSets(ann["gene"], ann["pathway"])

# 3) Run pyPAGE
p = PAGE(exp, gs, n_shuffle=100, k=7, filter_redundant=True)
results, heatmap = p.run()

print(results.head())
heatmap.show()

results contains:

  • pathway
  • CMI — conditional mutual information score
  • z-score — z-score of observed CMI vs. permutation null distribution
  • p-value — empirical p-value from permutation test
  • Regulation pattern (1 for up, -1 for down)

Tutorial: DESeq2 Log Fold-Change with Hallmark Gene Sets

This example runs pyPAGE on DESeq2 differential expression results (log2 fold-change) against MSigDB Hallmark gene sets. The input file and GMT are included in example_data/.

Step 1 — Run pyPAGE:

pypage -e example_data/test_DESeq_logFC.txt.gz \
    --gmt example_data/h.all.v2026.1.Hs.symbols.gmt \
    --cols GENE,log2FoldChange --seed 42

This creates example_data/test_DESeq_logFC_PAGE/ with results, heatmap, and enrichment matrix.

Step 2 — Re-plot with custom color scale:

pypage --draw-only -e example_data/test_DESeq_logFC.txt.gz \
    --min-val -2 --max-val 3 --bar-min -1 --bar-max 1

--draw-only re-renders the heatmap from the saved matrix without re-running the analysis. --min-val/--max-val control the enrichment color scale; --bar-min/--bar-max normalize the bin-edge indicator bar.

Example output:

pyPAGE heatmap

Loading Gene Sets

Gene sets can be loaded from multiple sources:

Paired arrays — gene and pathway name arrays of equal length:

gs = GeneSets(genes=gene_array, pathways=pathway_array)

Annotation index files — tab-delimited files where each line starts with a pathway name followed by its member genes (supports gzip):

gs = GeneSets(ann_file="GO_BP_2021_index.txt.gz")
# If the first column is genes (not pathways):
gs = GeneSets(ann_file="annotations.txt", first_col_is_genes=True)

GMT files — MSigDB .gmt format (plain or gzipped), with optional size filtering:

gs = GeneSets.from_gmt("h.all.v2024.1.Hs.symbols.gmt")
gs = GeneSets.from_gmt("c2.cp.kegg.gmt", min_size=15, max_size=500)

# Export back to GMT
gs.to_gmt("filtered_pathways.gmt")

Loading Expression Data

Expression input can be:

  • Continuous differential scores (is_bin=False, default) — auto-discretized into n_bins equal-frequency bins
  • Pre-binned integer labels (is_bin=True) — used as-is
# Continuous scores
exp = ExpressionProfile(genes, scores, n_bins=10)

# Pre-binned labels
exp = ExpressionProfile(genes, bin_labels, is_bin=True)

Gene ID Conversion

GeneMapper (Recommended)

GeneMapper downloads a gene ID mapping table from Ensembl once and caches it locally (~5 MB at ~/.pypage/) for fast offline lookups. Supported ID types: 'ensg', 'symbol', 'entrez'.

from pypage import GeneMapper, GeneSets

# First call downloads from Ensembl; subsequent calls use cache
mapper = GeneMapper(species='human')

# Convert gene IDs
symbols, unmapped = mapper.convert(
    ['ENSG00000141510', 'ENSG00000012048'],
    from_type='ensg', to_type='symbol',
)
# symbols -> ['TP53', 'BRCA1']

# Convert genes in-place on a GeneSets object
gs = GeneSets.from_gmt("kegg_entrez.gmt")
gs.map_genes(mapper, from_type='entrez', to_type='symbol')

Legacy: convert_from_to() (Requires Network)

ExpressionProfile.convert_from_to(), GeneSets.convert_from_to(), and Heatmap.convert_from_to() use Ensembl BioMart (pybiomart) and require an active internet connection:

exp.convert_from_to("refseq", "ensg", "human")

Command Line

After installation, pypage is available as a command-line tool. All outputs are saved to an auto-created output directory (default: {expression_stem}_PAGE/).

# Basic usage — outputs go to expression_PAGE/ directory
pypage -e expression.tab.gz --genesets-long annotations.txt.gz --is-bin

# With GMT file
pypage -e scores.tab --gmt pathways.gmt --n-bins 10

# Explicit output directory
pypage -e expr.tab.gz --gmt pathways.gmt --outdir my_results/

# Manual pathway mode (bypass significance testing)
pypage -e expr.tab.gz --genesets-long ann.txt.gz --is-bin \
    --manual "apoptotic process,cell cycle"

# With index-format gene sets
pypage -e expr.tab.gz -g index_annotations.txt.gz --is-bin

# Reproducible run with seed
pypage -e expr.tab.gz --gmt pathways.gmt --seed 42

Output Files

The output directory contains:

  • results.tsv — pathway results with CMI, z-score, p-value (scientific notation), and regulation pattern
  • results.matrix.tsv — enrichment score matrix (for re-plotting)
  • results.killed.tsv — redundancy filtering log
  • heatmap.pdf — iPAGE-style enrichment heatmap (editable fonts for Illustrator)
  • heatmap.html — interactive HTML heatmap

Visualization Options

# Custom color scale (asymmetric min/max)
pypage -e expr.tab --gmt pathways.gmt --min-val -2 --max-val 5

# Custom bin-edge bar normalization
pypage -e expr.tab --gmt pathways.gmt --bar-min -1 --bar-max 1

# Different colormap
pypage -e expr.tab --gmt pathways.gmt --cmap RdBu_r

# Re-plot from saved matrix (no re-analysis)
pypage --draw-only -e expr.tab --min-val -2 --max-val 3 --bar-min -1 --bar-max 1

Run pypage --help for a full list of options.

Single-Cell Command Line

After installation, pypage-sc is available for single-cell PAGE analysis. It computes per-cell pathway scores using MI/CMI, tests spatial coherence on the cell-cell KNN graph via Geary's C, and produces an interactive VISION-like HTML report.

Basic Usage

# With AnnData h5ad file
pypage-sc --adata data.h5ad --gmt pathways.gmt

# With expression matrix + gene names
pypage-sc --expression matrix.tsv --genes genes.txt --genesets-long ann.txt.gz

# If adata.var_names are Ensembl IDs and gene symbols are in a column
pypage-sc --adata data.h5ad --gene-column gene --gmt pathways.gmt

# Manual mode (bypass permutation testing)
pypage-sc --adata data.h5ad --gmt pathways.gmt \
    --manual "HALLMARK_INTERFERON_ALPHA_RESPONSE,HALLMARK_G2M_CHECKPOINT"

Example: Colorectal Cancer (CRC) Atlas

This example uses a CRC scRNA-seq dataset from CZ CELLxGENE (13,843 cells) with Hallmark gene sets. Since the AnnData uses Ensembl IDs as var_names, the --gene-column flag maps to gene symbols stored in adata.var['gene'].

pypage-sc --adata CRC.h5ad --gene-column gene \
    --gmt h.all.v2026.1.Hs.symbols.gmt --seed 42 --n-jobs 4

This creates the output directory CRC_scPAGE/ with:

CRC_scPAGE/
  results.tsv          # pathway results (consistency, p-value, FDR)
  ranking.pdf          # consistency ranking bar chart
  ranking.html         # interactive ranking bars
  report.html          # VISION-like interactive report
  adata.h5ad           # AnnData with scPAGE_ scores in .obs
  umap_plots/          # per-pathway UMAP PDFs (top 10)
    HALLMARK_INTERFERON_ALPHA_RESPONSE.pdf
    ...

The top pathway is HALLMARK_INTERFERON_ALPHA_RESPONSE (consistency C' = 0.52, FDR = 0.005), showing strong spatial coherence of interferon signaling across the cell manifold.

report.html is a fully self-contained interactive report (no external dependencies). Open it in any browser to:

  • Browse all pathways in a searchable sidebar, sorted by consistency
  • Click a pathway to color the UMAP by per-cell scores
  • Switch between available embeddings (UMAP, t-SNE, PCA)

adata.h5ad contains all pathway scores as scPAGE_* columns in adata.obs, ready for downstream analysis with scanpy:

import scanpy as sc
adata = sc.read_h5ad("CRC_scPAGE/adata.h5ad")
sc.pl.umap(adata, color="scPAGE_HALLMARK_INTERFERON_ALPHA_RESPONSE")

Output Control

# Disable interactive report
pypage-sc --adata data.h5ad --gmt pathways.gmt --no-report

# Disable saving annotated AnnData
pypage-sc --adata data.h5ad --gmt pathways.gmt --no-save-adata

# Change number of UMAP PDF plots
pypage-sc --adata data.h5ad --gmt pathways.gmt --umap-top-n 20

# Specify embedding for UMAP plots
pypage-sc --adata data.h5ad --gmt pathways.gmt --embedding-key X_tsne

# Save per-cell scores matrix as TSV
pypage-sc --adata data.h5ad --gmt pathways.gmt --scores scores.tsv

Run pypage-sc --help for a full list of options.

Bulk PAGE Analysis

The PAGE class performs pathway enrichment analysis with permutation testing and optional redundancy filtering:

p = PAGE(exp, gs,
    function='cmi',           # 'cmi' (default, corrects annotation bias) or 'mi'
    n_shuffle=10000,          # permutation count
    alpha=0.005,              # p-value threshold
    k=20,                     # early-stopping parameter
    filter_redundant=True,    # remove redundant pathways (default)
    redundancy_ratio=5.0,     # CMI/MI ratio threshold
    n_jobs=1,                 # parallel threads
)
results, heatmap = p.run()

# Enriched genes per pathway
enriched = p.get_enriched_genes("pathway_name")

# Enrichment score matrix (log10 hypergeometric p-values)
es_matrix = p.get_es_matrix()

Advanced Features

Manual Pathway Analysis

Analyze specific pathways without significance testing using run_manual():

p = PAGE(exp, gs)
results, heatmap = p.run_manual(["apoptotic process", "cell cycle"])

This bypasses permutation testing and redundancy filtering, computing enrichment statistics and a heatmap for only the specified pathways. Useful for inspecting known pathways of interest.

Inspecting Redundancy Filtering

After a standard run() with filter_redundant=True, inspect which pathways were removed and why:

p = PAGE(exp, gs)
results, heatmap = p.run()

# DataFrame with columns: rejected_pathway, killed_by, min_ratio
killed = p.get_redundancy_log()
print(killed)

Full Results with Redundancy Flags

full_results contains all informative pathways (before redundancy filtering) with a redundant column:

p = PAGE(exp, gs)
results, heatmap = p.run()

# All informative pathways, with redundant=True/False
print(p.full_results)

Single-Cell Analysis

SingleCellPAGE brings per-cell pathway scoring and spatial coherence testing to pyPAGE, inspired by VISION. It accepts AnnData objects or raw numpy arrays.

import anndata
from pypage import GeneSets, SingleCellPAGE

adata = anndata.read_h5ad("my_data.h5ad")
gs = GeneSets(ann_file="annotations.txt.gz")

sc = SingleCellPAGE(adata=adata, genesets=gs, function='cmi', n_jobs=4)
results = sc.run(n_permutations=1000)

print(results.head())

results contains:

  • pathway
  • consistency — spatial autocorrelation score (C' = 1 - Geary's C; higher = more coherent)
  • p-value — empirical p-value from size-matched random gene sets
  • FDR — Benjamini-Hochberg corrected p-value

Visualization

sc.plot_pathway_on_embedding("MyPathway", embedding_key='X_umap')
sc.plot_consistency_ranking(top_n=20)
sc.plot_pathway_heatmap(adata.obs['leiden'])

Neighborhood Mode

Aggregate cells by cluster labels and run standard bulk PAGE per group:

summary, group_results = sc.run_neighborhoods(labels=adata.obs['leiden'])

Input Options

Input How
AnnData SingleCellPAGE(adata=adata, genesets=gs)
Numpy arrays SingleCellPAGE(expression=X, genes=gene_names, genesets=gs)
Precomputed KNN SingleCellPAGE(adata=adata, genesets=gs, connectivity=W)

Parameter Reference

PAGE

Parameter Default Description
function 'cmi' 'cmi' (conditional MI, corrects annotation bias) or 'mi'
n_shuffle 10000 Number of permutations for significance testing
alpha 0.005 P-value threshold for informative pathways
k 20 Early-stopping: stop after k consecutive non-significant pathways
filter_redundant True Remove redundant pathways via CMI
redundancy_ratio 5.0 CMI/MI ratio threshold; pathways with all ratios above this are kept
n_jobs 1 Number of parallel threads

SingleCellPAGE

Parameter Default Description
function 'cmi' 'cmi' or 'mi'
n_bins 10 Number of bins for expression discretization
n_neighbors ceil(sqrt(n_cells)) KNN neighbors (capped at 100)
connectivity None Precomputed cell-cell connectivity matrix
n_jobs 1 Number of parallel threads (0 or None for all available)

GeneMapper

Parameter Default Description
species 'human' 'human' or 'mouse'
cache_dir '~/.pypage/' Directory for cached mapping file

Reproducibility Tips

For deterministic benchmark-style runs:

import numpy as np
np.random.seed(0)
p = PAGE(exp, gs, n_shuffle=100, n_jobs=1)

Tutorials

Testing

Fast local test profile (default CI profile):

pytest -q -m "not slow and not online"

Full test profile (includes long and network-dependent tests):

PYPAGE_RUN_ONLINE_TESTS=1 pytest -q

Documentation

For full API details, see MANUAL.md.

Citation

Bakulin A, Teyssier NB, Kampmann M, Khoroshkin M, Goodarzi H (2024) pyPAGE: A framework for Addressing biases in gene-set enrichment analysis—A case study on Alzheimer's disease. PLoS Computational Biology 20(9): e1012346. https://doi.org/10.1371/journal.pcbi.1012346

License

MIT

About

pyPAGE was developed in the Goodarzi Lab at UCSF by Artemy Bakulin, Noam B. Teyssier, and Hani Goodarzi.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bio_pypage-0.2.0.tar.gz (70.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bio_pypage-0.2.0-py3-none-any.whl (62.9 kB view details)

Uploaded Python 3

File details

Details for the file bio_pypage-0.2.0.tar.gz.

File metadata

  • Download URL: bio_pypage-0.2.0.tar.gz
  • Upload date:
  • Size: 70.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for bio_pypage-0.2.0.tar.gz
Algorithm Hash digest
SHA256 7fc1acb62c8f713a0c171264f9bfed33011121dc153b12753a5f3586126bd569
MD5 a47e3e8bd951f11e085e08e9823e41ae
BLAKE2b-256 0393e50da9065b7441c000c5d5a5f93b5a8f4658eb906b14dcd3f7979bc9ba2d

See more details on using hashes here.

File details

Details for the file bio_pypage-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: bio_pypage-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 62.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for bio_pypage-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d628232e360a98f994d1986a506cb909ad32512563370d4409d063ec1cbbd793
MD5 4c719890def9f3a6bb6ff70df82a3408
BLAKE2b-256 d68887726e43dff3b790531881b136bd9af96cfb5280fab6fc02f21dfd9231d6

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page