Python implementation of the PAGE algorithm
pyPAGE
pyPAGE is a Python implementation of the conditional-information PAGE framework for gene-set enrichment analysis.
It is designed to infer differential activity of pathways and regulons while accounting for annotation and membership biases using information-theoretic methods.
Approach
Bulk PAGE
Standard gene-set enrichment methods test whether pathway members are non-randomly distributed across a ranked gene list. pyPAGE frames this as an information-theoretic question: how much does knowing a gene's pathway membership tell you about its expression bin?
- Discretize continuous expression scores (e.g. log2 fold-change) into equal-frequency bins
- Compute mutual information (MI) between expression bins and pathway membership — or conditional MI (CMI), which conditions on how many pathways each gene belongs to, correcting for the bias that heavily-annotated genes drive spurious enrichment
- Permutation test to assess significance, with early stopping
- Redundancy filtering removes pathways whose signal is explained by an already-accepted pathway (via CMI between memberships)
- Hypergeometric enrichment per bin produces the iPAGE-style heatmap showing which expression bins drive each pathway's signal
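The MI and permutation steps above can be sketched in plain Python. This is an illustration under stated assumptions, not pyPAGE's implementation: the function names are hypothetical, the null is built by shuffling bin labels, and the p-value uses an add-one correction.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum of p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def permutation_test(bins, membership, n_shuffle=1000, seed=0):
    """Shuffle bin labels to build a null; return (observed MI, z-score, p-value)."""
    rng = random.Random(seed)
    observed = mutual_information(bins, membership)
    shuffled = list(bins)
    null = []
    for _ in range(n_shuffle):
        rng.shuffle(shuffled)
        null.append(mutual_information(shuffled, membership))
    mean = sum(null) / n_shuffle
    sd = math.sqrt(sum((v - mean) ** 2 for v in null) / n_shuffle)
    z = (observed - mean) / sd if sd > 0 else float("inf")
    # Add-one correction (an assumption here) keeps the p-value strictly positive.
    p = (1 + sum(v >= observed for v in null)) / (1 + n_shuffle)
    return observed, z, p

# Toy data: 30 genes in 3 expression bins; all 8 pathway members sit in the top bin.
bins = [0] * 10 + [1] * 10 + [2] * 10
membership = [0] * 20 + [1] * 8 + [0] * 2
mi, z, p = permutation_test(bins, membership)
print(f"MI={mi:.3f} bits, z={z:.1f}, p={p:.4f}")
```

Because every member falls in one bin, the observed MI sits far in the tail of the shuffled distribution, so the z-score is large and the p-value is close to its minimum of 1 / (n_shuffle + 1).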
Single-Cell PAGE
For single-cell data, the question becomes: are pathway scores spatially coherent across the cell manifold? A pathway whose activity varies smoothly across cell states (rather than randomly from cell to cell) is more likely to reflect real biology than noise.
- Per-cell scoring — for each cell, compute MI or CMI between gene expression bins and pathway membership across all genes. This produces an (n_cells x n_pathways) score matrix
- KNN graph — build a cell-cell k-nearest-neighbor graph from expression (or use a precomputed one from scanpy)
- Geary's C — measure spatial autocorrelation of each pathway's scores on the KNN graph. Report C' = 1 - C, where higher values mean the pathway varies coherently across the manifold rather than randomly
- Permutation test — generate size-matched random gene sets, compute their C', and derive empirical p-values with BH FDR correction
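To make the Geary's C step concrete, here is a minimal sketch on a six-cell toy graph. The function name `gearys_c` and the edge-list representation are assumptions for illustration; pyPAGE works on a KNN connectivity matrix and its internals may differ.

```python
def gearys_c(values, edges):
    """Geary's C for `values` on an undirected graph given as (i, j, weight) edges."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    # Each undirected edge appears twice in the double sum over ordered pairs (i, j).
    num = sum(2 * w * (values[i] - values[j]) ** 2 for i, j, w in edges)
    total_w = sum(2 * w for _, _, w in edges)
    return (n - 1) * num / (2 * total_w * denom)

# A path graph of 6 cells: 0-1-2-3-4-5, all edges with weight 1.
edges = [(i, i + 1, 1.0) for i in range(5)]
smooth = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]  # neighbors have similar scores
jumpy = [0.0, 1.0, 0.1, 0.9, 0.2, 0.8]   # neighbors have very different scores

for name, scores in [("smooth", smooth), ("jumpy", jumpy)]:
    c = gearys_c(scores, edges)
    print(f"{name}: C={c:.3f}, C'={1 - c:.3f}")
```

The smooth profile gives a C well below 1 (so a positive C'), while the alternating profile gives a C above 1 (a negative C'), which is the contrast the consistency score is meant to capture.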
Installation
Install from PyPI:
pip install bio-pypage
Or install from source:
git clone https://github.com/goodarzilab/pyPAGE
cd pyPAGE
pip install -e .
Quick Start
import pandas as pd
from pypage import PAGE, ExpressionProfile, GeneSets
# 1) Load expression profile (gene, score)
expr = pd.read_csv(
    "example_data/AP2S1.tab.gz",
    sep="\t",
    header=None,
    names=["gene", "score"],
)
exp = ExpressionProfile(expr["gene"], expr["score"], is_bin=True)
# 2) Load annotation (gene, pathway)
ann = pd.read_csv(
    "example_data/GO_BP_2021_index.txt.gz",
    sep="\t",
    header=None,
    names=["gene", "pathway"],
)
gs = GeneSets(ann["gene"], ann["pathway"])
# 3) Run pyPAGE
p = PAGE(exp, gs, n_shuffle=100, k=7, filter_redundant=True)
results, heatmap = p.run()
print(results.head())
heatmap.show()
results contains:
- pathway
- CMI: conditional mutual information score
- z-score: z-score of observed CMI vs. permutation null distribution
- p-value: empirical p-value from permutation test
- Regulation pattern (1 for up, -1 for down)
Tutorial: DESeq2 Log Fold-Change with Hallmark Gene Sets
This example runs pyPAGE on DESeq2 differential expression results (log2 fold-change) against MSigDB Hallmark gene sets. The input file and GMT are included in example_data/.
Step 1 — Run pyPAGE:
pypage -e example_data/test_DESeq_logFC.txt.gz \
--gmt example_data/h.all.v2026.1.Hs.symbols.gmt \
--cols GENE,log2FoldChange --seed 42
This creates example_data/test_DESeq_logFC_PAGE/ with results, heatmap, and enrichment matrix.
Step 2 — Re-plot with custom color scale:
pypage --draw-only -e example_data/test_DESeq_logFC.txt.gz \
--min-val -2 --max-val 3 --bar-min -1 --bar-max 1
--draw-only re-renders the heatmap from the saved matrix without re-running the analysis. --min-val/--max-val control the enrichment color scale; --bar-min/--bar-max normalize the bin-edge indicator bar.
Loading Gene Sets
Gene sets can be loaded from multiple sources:
Paired arrays — gene and pathway name arrays of equal length:
gs = GeneSets(genes=gene_array, pathways=pathway_array)
Annotation index files — tab-delimited files where each line starts with a pathway name followed by its member genes (supports gzip):
gs = GeneSets(ann_file="GO_BP_2021_index.txt.gz")
# If the first column is genes (not pathways):
gs = GeneSets(ann_file="annotations.txt", first_col_is_genes=True)
GMT files — MSigDB .gmt format (plain or gzipped), with optional size filtering:
gs = GeneSets.from_gmt("h.all.v2024.1.Hs.symbols.gmt")
gs = GeneSets.from_gmt("c2.cp.kegg.gmt", min_size=15, max_size=500)
# Export back to GMT
gs.to_gmt("filtered_pathways.gmt")
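For readers unfamiliar with the GMT layout (one pathway per line: name, description, then member genes, all tab-separated), the following sketch parses it with size filtering. `read_gmt` is a hypothetical helper, not the code behind `GeneSets.from_gmt`, and treating the size bounds as inclusive is an assumption.

```python
import io

def read_gmt(handle, min_size=None, max_size=None):
    """Parse GMT lines (name, description, gene1, gene2, ...) into {pathway: [genes]}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name = fields[0]
        genes = [g for g in fields[2:] if g]  # drop empty trailing fields
        if min_size is not None and len(genes) < min_size:
            continue
        if max_size is not None and len(genes) > max_size:
            continue
        gene_sets[name] = genes
    return gene_sets

# In-memory stand-in for a .gmt file.
example = io.StringIO(
    "SET_A\thttp://example.org/a\tTP53\tBRCA1\tMYC\n"
    "SET_B\thttp://example.org/b\tEGFR\n"
)
print(read_gmt(example, min_size=2))
```

With `min_size=2`, the single-gene `SET_B` is dropped and only `SET_A` remains.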
Loading Expression Data
Expression input can be:
- Continuous differential scores (is_bin=False, default): auto-discretized into n_bins equal-frequency bins
- Pre-binned integer labels (is_bin=True): used as-is
# Continuous scores
exp = ExpressionProfile(genes, scores, n_bins=10)
# Pre-binned labels
exp = ExpressionProfile(genes, bin_labels, is_bin=True)
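Equal-frequency discretization can be illustrated with a rank-based sketch. This is not `ExpressionProfile`'s code: `equal_frequency_bins` is a hypothetical name, and breaking ties by input order is an assumption.

```python
def equal_frequency_bins(scores, n_bins):
    """Assign each score a label in 0..n_bins-1 so bins hold (nearly) equal counts."""
    # Rank the scores; sorted() is stable, so ties keep their input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, idx in enumerate(order):
        labels[idx] = rank * n_bins // len(scores)
    return labels

log2fc = [-2.1, 0.3, 1.7, -0.4, 0.9, -1.2, 2.5, 0.0, -0.8]
print(equal_frequency_bins(log2fc, n_bins=3))
# -> [0, 1, 2, 1, 2, 0, 2, 1, 0]
```

Each of the three bins receives three genes, regardless of how the scores are spread along the real line.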
Gene ID Conversion
GeneMapper (Recommended)
GeneMapper downloads a gene ID mapping table from Ensembl once and caches it locally (~5 MB at ~/.pypage/) for fast offline lookups. Supported ID types: 'ensg', 'symbol', 'entrez'.
from pypage import GeneMapper, GeneSets
# First call downloads from Ensembl; subsequent calls use cache
mapper = GeneMapper(species='human')
# Convert gene IDs
symbols, unmapped = mapper.convert(
    ['ENSG00000141510', 'ENSG00000012048'],
    from_type='ensg', to_type='symbol',
)
# symbols -> ['TP53', 'BRCA1']
# Convert genes in-place on a GeneSets object
gs = GeneSets.from_gmt("kegg_entrez.gmt")
gs.map_genes(mapper, from_type='entrez', to_type='symbol')
Legacy: convert_from_to() (Requires Network)
ExpressionProfile.convert_from_to(), GeneSets.convert_from_to(), and Heatmap.convert_from_to() use Ensembl BioMart (pybiomart) and require an active internet connection:
exp.convert_from_to("refseq", "ensg", "human")
Command Line
After installation, pypage is available as a command-line tool. All outputs are saved to an auto-created output directory (default: {expression_stem}_PAGE/).
# Basic usage — outputs go to expression_PAGE/ directory
pypage -e expression.tab.gz --genesets-long annotations.txt.gz --is-bin
# With GMT file
pypage -e scores.tab --gmt pathways.gmt --n-bins 10
# Explicit output directory
pypage -e expr.tab.gz --gmt pathways.gmt --outdir my_results/
# Manual pathway mode (bypass significance testing)
pypage -e expr.tab.gz --genesets-long ann.txt.gz --is-bin \
--manual "apoptotic process,cell cycle"
# With index-format gene sets
pypage -e expr.tab.gz -g index_annotations.txt.gz --is-bin
# Reproducible run with seed
pypage -e expr.tab.gz --gmt pathways.gmt --seed 42
Output Files
The output directory contains:
- results.tsv: pathway results with CMI, z-score, p-value (scientific notation), and regulation pattern
- results.matrix.tsv: enrichment score matrix (for re-plotting)
- results.killed.tsv: redundancy filtering log
- heatmap.pdf: iPAGE-style enrichment heatmap (editable fonts for Illustrator)
- heatmap.html: interactive HTML heatmap
Visualization Options
# Custom color scale (asymmetric min/max)
pypage -e expr.tab --gmt pathways.gmt --min-val -2 --max-val 5
# Custom bin-edge bar normalization
pypage -e expr.tab --gmt pathways.gmt --bar-min -1 --bar-max 1
# Different colormap
pypage -e expr.tab --gmt pathways.gmt --cmap RdBu_r
# Re-plot from saved matrix (no re-analysis)
pypage --draw-only -e expr.tab --min-val -2 --max-val 3 --bar-min -1 --bar-max 1
Run pypage --help for a full list of options.
Single-Cell Command Line
After installation, pypage-sc is available for single-cell PAGE analysis. It computes per-cell pathway scores using MI/CMI, tests spatial coherence on the cell-cell KNN graph via Geary's C, and produces an interactive VISION-like HTML report.
Basic Usage
# With AnnData h5ad file
pypage-sc --adata data.h5ad --gmt pathways.gmt
# With expression matrix + gene names
pypage-sc --expression matrix.tsv --genes genes.txt --genesets-long ann.txt.gz
# If adata.var_names are Ensembl IDs and gene symbols are in a column
pypage-sc --adata data.h5ad --gene-column gene --gmt pathways.gmt
# Manual mode (bypass permutation testing)
pypage-sc --adata data.h5ad --gmt pathways.gmt \
--manual "HALLMARK_INTERFERON_ALPHA_RESPONSE,HALLMARK_G2M_CHECKPOINT"
Example: Colorectal Cancer (CRC) Atlas
This example uses a CRC scRNA-seq dataset from CZ CELLxGENE (13,843 cells) with Hallmark gene sets. Since the AnnData uses Ensembl IDs as var_names, the --gene-column flag maps to gene symbols stored in adata.var['gene'].
pypage-sc --adata CRC.h5ad --gene-column gene \
--gmt h.all.v2026.1.Hs.symbols.gmt --seed 42 --n-jobs 4
This creates the output directory CRC_scPAGE/ with:
CRC_scPAGE/
results.tsv # pathway results (consistency, p-value, FDR)
ranking.pdf # consistency ranking bar chart
ranking.html # interactive ranking bars
report.html # VISION-like interactive report
adata.h5ad # AnnData with scPAGE_ scores in .obs
umap_plots/ # per-pathway UMAP PDFs (top 10)
HALLMARK_INTERFERON_ALPHA_RESPONSE.pdf
...
The top pathway is HALLMARK_INTERFERON_ALPHA_RESPONSE (consistency C' = 0.52, FDR = 0.005), showing strong spatial coherence of interferon signaling across the cell manifold.
report.html is a fully self-contained interactive report (no external dependencies). Open it in any browser to:
- Browse all pathways in a searchable sidebar, sorted by consistency
- Click a pathway to color the UMAP by per-cell scores
- Switch between available embeddings (UMAP, t-SNE, PCA)
adata.h5ad contains all pathway scores as scPAGE_* columns in adata.obs, ready for downstream analysis with scanpy:
import scanpy as sc
adata = sc.read_h5ad("CRC_scPAGE/adata.h5ad")
sc.pl.umap(adata, color="scPAGE_HALLMARK_INTERFERON_ALPHA_RESPONSE")
Output Control
# Disable interactive report
pypage-sc --adata data.h5ad --gmt pathways.gmt --no-report
# Disable saving annotated AnnData
pypage-sc --adata data.h5ad --gmt pathways.gmt --no-save-adata
# Change number of UMAP PDF plots
pypage-sc --adata data.h5ad --gmt pathways.gmt --umap-top-n 20
# Specify embedding for UMAP plots
pypage-sc --adata data.h5ad --gmt pathways.gmt --embedding-key X_tsne
# Save per-cell scores matrix as TSV
pypage-sc --adata data.h5ad --gmt pathways.gmt --scores scores.tsv
Run pypage-sc --help for a full list of options.
Bulk PAGE Analysis
The PAGE class performs pathway enrichment analysis with permutation testing and optional redundancy filtering:
p = PAGE(exp, gs,
    function='cmi',         # 'cmi' (default, corrects annotation bias) or 'mi'
    n_shuffle=10000,        # permutation count
    alpha=0.005,            # p-value threshold
    k=20,                   # early-stopping parameter
    filter_redundant=True,  # remove redundant pathways (default)
    redundancy_ratio=5.0,   # CMI/MI ratio threshold
    n_jobs=1,               # parallel threads
)
results, heatmap = p.run()
# Enriched genes per pathway
enriched = p.get_enriched_genes("pathway_name")
# Enrichment score matrix (log10 hypergeometric p-values)
es_matrix = p.get_es_matrix()
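The per-bin hypergeometric scores behind the enrichment matrix can be sketched as follows. The helper names are hypothetical, and the sign convention (positive for over-representation, negative for under-representation) is an assumption modeled on iPAGE-style heatmaps rather than a statement of what `get_es_matrix()` returns.

```python
import math

def hypergeom_tail(k, N, K, n, upper=True):
    """P(X >= k) if upper else P(X <= k): n genes drawn from N, K of them members."""
    lo, hi = max(0, n - (N - K)), min(K, n)
    support = range(max(k, lo), hi + 1) if upper else range(lo, min(k, hi) + 1)
    hits = sum(math.comb(K, i) * math.comb(N - K, n - i) for i in support)
    return hits / math.comb(N, n)

def enrichment_row(bins, membership, n_bins):
    """Signed log10 p-value per bin: positive = over-, negative = under-represented."""
    N, K = len(bins), sum(membership)
    row = []
    for b in range(n_bins):
        n = sum(1 for x in bins if x == b)                       # genes in this bin
        k = sum(m for x, m in zip(bins, membership) if x == b)   # members in this bin
        p_over = hypergeom_tail(k, N, K, n, upper=True)
        p_under = hypergeom_tail(k, N, K, n, upper=False)
        row.append(-math.log10(p_over) if p_over <= p_under else math.log10(p_under))
    return row

# Toy data: 30 genes in 3 bins; all 8 pathway members are in the top bin.
bins = [0] * 10 + [1] * 10 + [2] * 10
membership = [0] * 20 + [1] * 8 + [0] * 2
print([round(v, 2) for v in enrichment_row(bins, membership, 3)])
```

The top bin gets a strongly positive score and the two depleted bins get negative scores, which is the pattern a heatmap row would display for an upregulated pathway.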
Advanced Features
Manual Pathway Analysis
Analyze specific pathways without significance testing using run_manual():
p = PAGE(exp, gs)
results, heatmap = p.run_manual(["apoptotic process", "cell cycle"])
This bypasses permutation testing and redundancy filtering, computing enrichment statistics and a heatmap for only the specified pathways. Useful for inspecting known pathways of interest.
Inspecting Redundancy Filtering
After a standard run() with filter_redundant=True, inspect which pathways were removed and why:
p = PAGE(exp, gs)
results, heatmap = p.run()
# DataFrame with columns: rejected_pathway, killed_by, min_ratio
killed = p.get_redundancy_log()
print(killed)
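One way to read the CMI/MI ratio used in redundancy filtering is sketched below. This is an interpretation, not pyPAGE's code: it assumes the ratio is I(candidate; expression | accepted) / I(candidate; accepted), that a candidate is rejected as soon as one accepted pathway yields a ratio below the threshold, and all function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) in bits from counts."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def conditional_mi(xs, ys, zs):
    """I(X; Y | Z) in bits: sum of p(x,y,z) * log2(p(z) p(x,y,z) / (p(x,z) p(y,z)))."""
    n = len(xs)
    pxyz, pz = Counter(zip(xs, ys, zs)), Counter(zs)
    pxz, pyz = Counter(zip(xs, zs)), Counter(zip(ys, zs))
    return sum((c / n) * math.log2(c * pz[z] / (pxz[(x, z)] * pyz[(y, z)]))
               for (x, y, z), c in pxyz.items())

def is_redundant(candidate, accepted, bins, ratio_threshold=5.0):
    """True if any accepted pathway already explains the candidate's signal."""
    for acc in accepted:
        shared = mutual_information(candidate, acc)
        if shared == 0:
            continue  # memberships are independent: ratio is effectively infinite
        novel = conditional_mi(candidate, bins, acc)
        if novel / shared < ratio_threshold:
            return True
    return False

bins = [0] * 10 + [1] * 10 + [2] * 10
accepted = [[0] * 20 + [1] * 8 + [0] * 2]          # 8 members, all in the top bin
near_duplicate = [0] * 20 + [1] * 7 + [0] * 3      # 7 of the same 8 genes
distinct = [1] * 6 + [0] * 14 + [1] * 2 + [0] * 8  # mostly bottom-bin genes
print(is_redundant(near_duplicate, accepted, bins))  # True
print(is_redundant(distinct, accepted, bins))        # False
```

The near-duplicate carries no information about expression once the accepted pathway is known, so its ratio is 0 and it is filtered. The distinct pathway shares almost no information with the accepted one, so its ratio is far above 5 and it is kept.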
Full Results with Redundancy Flags
full_results contains all informative pathways (before redundancy filtering) with a redundant column:
p = PAGE(exp, gs)
results, heatmap = p.run()
# All informative pathways, with redundant=True/False
print(p.full_results)
Single-Cell Analysis
SingleCellPAGE brings per-cell pathway scoring and spatial coherence testing to pyPAGE, inspired by VISION. It accepts AnnData objects or raw numpy arrays.
import anndata
from pypage import GeneSets, SingleCellPAGE
adata = anndata.read_h5ad("my_data.h5ad")
gs = GeneSets(ann_file="annotations.txt.gz")
sc = SingleCellPAGE(adata=adata, genesets=gs, function='cmi', n_jobs=4)
results = sc.run(n_permutations=1000)
print(results.head())
results contains:
- pathway
- consistency: spatial autocorrelation score (C' = 1 - Geary's C; higher = more coherent)
- p-value: empirical p-value from size-matched random gene sets
- FDR: Benjamini-Hochberg corrected p-value
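The p-value and FDR columns can be understood through the short sketch below. It is illustrative only: the function names are hypothetical, and the add-one correction in the empirical p-value is an assumption rather than a confirmed detail of `SingleCellPAGE`.

```python
def empirical_p(observed, null_scores):
    """Share of null scores at least as large as the observed one (add-one corrected)."""
    return (1 + sum(s >= observed for s in null_scores)) / (1 + len(null_scores))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Observed C' of 0.5 against three random gene sets of matching size.
print(empirical_p(0.5, [0.1, 0.2, 0.6]))
print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

Note that the adjusted values for 0.04 and 0.03 coincide: the monotonicity step caps the smaller p-value's adjustment at that of the next larger one.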
Visualization
sc.plot_pathway_on_embedding("MyPathway", embedding_key='X_umap')
sc.plot_consistency_ranking(top_n=20)
sc.plot_pathway_heatmap(adata.obs['leiden'])
Neighborhood Mode
Aggregate cells by cluster labels and run standard bulk PAGE per group:
summary, group_results = sc.run_neighborhoods(labels=adata.obs['leiden'])
Input Options
| Input | How |
|---|---|
| AnnData | SingleCellPAGE(adata=adata, genesets=gs) |
| Numpy arrays | SingleCellPAGE(expression=X, genes=gene_names, genesets=gs) |
| Precomputed KNN | SingleCellPAGE(adata=adata, genesets=gs, connectivity=W) |
Parameter Reference
PAGE
| Parameter | Default | Description |
|---|---|---|
| function | 'cmi' | 'cmi' (conditional MI, corrects annotation bias) or 'mi' |
| n_shuffle | 10000 | Number of permutations for significance testing |
| alpha | 0.005 | P-value threshold for informative pathways |
| k | 20 | Early-stopping: stop after k consecutive non-significant pathways |
| filter_redundant | True | Remove redundant pathways via CMI |
| redundancy_ratio | 5.0 | CMI/MI ratio threshold; pathways with all ratios above this are kept |
| n_jobs | 1 | Number of parallel threads |
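The interaction between alpha and k can be sketched as a scan over ranked pathways. This is a hedged illustration: `select_informative` and `pvalue_of` are hypothetical names, and it assumes pathways are visited in order of decreasing information and accepted when p <= alpha.

```python
def select_informative(ranked_names, pvalue_of, alpha=0.005, k=20):
    """Test pathways in rank order; stop after k consecutive non-significant ones."""
    accepted, misses = [], 0
    for name in ranked_names:
        if pvalue_of(name) <= alpha:
            accepted.append(name)
            misses = 0  # a hit resets the run of misses
        else:
            misses += 1
            if misses >= k:
                break  # remaining pathways are never permutation-tested
    return accepted

# Toy p-values, with a wrapper that records which pathways were actually tested.
pvals = {"A": 0.001, "B": 0.2, "C": 0.001, "D": 0.3, "E": 0.4, "F": 0.001}
tested = []

def pvalue_of(name):
    tested.append(name)
    return pvals[name]

print(select_informative("ABCDEF", pvalue_of, alpha=0.005, k=2))
print(tested)
```

With k=2 the scan stops after D and E fail in a row, so F is never tested even though it would have passed. A larger k trades extra permutation work for a lower chance of missing such late hits.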
SingleCellPAGE
| Parameter | Default | Description |
|---|---|---|
| function | 'cmi' | 'cmi' or 'mi' |
| n_bins | 10 | Number of bins for expression discretization |
| n_neighbors | ceil(sqrt(n_cells)) | KNN neighbors (capped at 100) |
| connectivity | None | Precomputed cell-cell connectivity matrix |
| n_jobs | 1 | Number of parallel threads (0 or None for all available) |
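As a worked example of the n_neighbors default, the sketch below evaluates ceil(sqrt(n_cells)) with a cap of 100 for a few dataset sizes. `default_n_neighbors` is a hypothetical helper, and reading "capped at 100" as a simple minimum is an assumption.

```python
import math

def default_n_neighbors(n_cells, cap=100):
    """ceil(sqrt(n_cells)), limited to at most `cap` neighbors."""
    return min(cap, math.ceil(math.sqrt(n_cells)))

for n_cells in (400, 5000, 13843):
    print(n_cells, default_n_neighbors(n_cells))
```

Under this reading, a dataset the size of the CRC example (13,843 cells) would hit the cap, since its square root is above 100.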
GeneMapper
| Parameter | Default | Description |
|---|---|---|
| species | 'human' | 'human' or 'mouse' |
| cache_dir | '~/.pypage/' | Directory for cached mapping file |
Reproducibility Tips
For deterministic benchmark-style runs:
import numpy as np
np.random.seed(0)
p = PAGE(exp, gs, n_shuffle=100, n_jobs=1)
Tutorials
- Comprehensive Tutorial — End-to-end walkthrough covering all features (GMT, GeneMapper, bulk PAGE, single-cell PAGE)
- Bulk PAGE Tutorial — DESeq2 log fold-change with Hallmark gene sets
- Single-Cell PAGE Tutorial — CRC atlas from CELLxGENE with interactive report
- Single-Cell PAGE (Synthetic) — Detailed walkthrough with synthetic data
Testing
Fast local test profile (default CI profile):
pytest -q -m "not slow and not online"
Full test profile (includes long and network-dependent tests):
PYPAGE_RUN_ONLINE_TESTS=1 pytest -q
Documentation
For full API details, see MANUAL.md.
Citation
Bakulin A, Teyssier NB, Kampmann M, Khoroshkin M, Goodarzi H (2024) pyPAGE: A framework for Addressing biases in gene-set enrichment analysis—A case study on Alzheimer's disease. PLoS Computational Biology 20(9): e1012346. https://doi.org/10.1371/journal.pcbi.1012346
License
MIT
About
pyPAGE was developed in the Goodarzi Lab at UCSF by Artemy Bakulin, Noam B. Teyssier, and Hani Goodarzi.