A disease-agnostic framework for identifying molecular subtypes through pathway-based analysis of rare genetic variants
Pathway Subtyping Framework
A Disease-Agnostic Tool for Pathway-Based Molecular Subtype Discovery
Overview
The Pathway Subtyping Framework is an open-source computational tool for identifying molecular subtypes in genetically heterogeneous diseases. Instead of analyzing individual genes, it aggregates rare variant burden at the biological pathway level, enabling:
- Better signal detection across genetically diverse cohorts
- Identification of biologically coherent patient subgroups
- Cross-cohort validation of discovered subtypes
Originally developed for autism research, this generalized version can be adapted for any disease with:
- Genetic heterogeneity (many implicated genes)
- Convergent pathway biology
- Available exome/genome sequencing data
Supported Disease Areas
| Disease | Status | Pathway File |
|---|---|---|
| Autism Spectrum Disorder | Validated | autism_pathways.gmt |
| Schizophrenia | Template | schizophrenia_pathways.gmt |
| Epilepsy | Template | epilepsy_pathways.gmt |
| Intellectual Disability | Template | intellectual_disability_pathways.gmt |
| Parkinson's Disease | Template | parkinsons_pathways.gmt |
| Bipolar Disorder | Template | bipolar_pathways.gmt |
| Your disease | Adapt it → | your_pathways.gmt |
Key Features
| Feature | Description |
|---|---|
| Pathway Scoring | Aggregate gene burdens across biological pathways |
| Expression Scoring | Bulk RNA-seq pathway scoring via ssGSEA, GSVA, or mean-Z methods |
| Single-Cell Scoring | Per-cell and pseudobulk pathway scoring from scRNA-seq (h5ad/CSV) |
| Multi-Omic Fusion | Fuse VCF + expression + single-cell scores (concatenate, weighted, intersection) |
| Bulk Deconvolution | Estimate cell-type proportions from bulk RNA-seq via NNLS; cell-type-aware subtypes |
| Multiple Clustering | GMM, K-means, Hierarchical, Spectral with cross-validation |
| Ancestry Correction | PCA-based population stratification correction with independence testing |
| Batch Correction | ComBat-style batch effect detection and correction |
| Sensitivity Analysis | Parameter robustness testing across algorithms, features, normalization |
| Threshold Calibration | Data-driven validation thresholds that adjust for sample size and cluster count |
| Variant QC | QUAL, call rate, HWE, MAF filters before burden computation |
| Validation Gates | 5 gates: two negative controls (label shuffle, random gene sets), bootstrap stability, ancestry independence, cross-modal concordance |
| Statistical Rigor | FDR correction, effect sizes, confidence intervals |
| Power Analysis | Sample size recommendations, Type I error estimation |
| Simulation | Synthetic data generation with ground truth for validation |
| Cross-Cohort Validation | Transfer learning and projection-based replication testing |
| Visualization | Interactive Plotly HTML reports, UMAP/t-SNE scatter plots, radar charts, multi-format export |
| Molecular QC | 12-feature manufacturing QC: cascade detection, dosage gating, off-target activation, drift detection, stress fingerprinting |
| Knowledge Graph | Topology-aware pathway scoring, hierarchical queries, cross-omics entity resolution, pathway crosstalk quantification |
| Uncertainty Quantification (v0.6) | Split-conformal prediction intervals, bootstrap, Bayesian pathway-GMM drop-in, calibration diagnostics |
| Cross-Platform Harmonization (v0.6) | UCE-anchored aligner for 10x / Smart-seq2 / bulk / spatial, with per-cell confidence scoring |
| KG Refresh Infrastructure (v0.6) | Pinned source manifest (OmniPath / SIGNOR / Reactome), diff + regression tools, reproducibility hashing |
| AlphaMissense-modulated Cascade (v0.6) | Variant-carrier down-weighting for pathway-cascade scoring |
| In-silico Perturbation (v0.6) | Geneformer wrapper + MSV-from-embedding head + batch screens with F1 uncertainty intervals |
| scGPT / Nicheformer Embeddings (v0.6) | Backend-pluggable cell embedders with content-hashed cache + joint dissociated/spatial analysis |
| Regulatory Gene-Set Expansion (v0.6) | Borzoi or co-expression-backed suggestion tool for custom seed sets |
| CRISPR Sequence Off-Target (v0.6) | Evo 2 sequence-level scoring complementing pathway-level off-target detection |
| Multi-Omics Fusion (v0.6) | RNA + ATAC + proteomics weighted fusion with RNA/protein discordance flagger |
| Causal Inference (v0.6) | Invariant causal prediction identifies pathway parents across environments |
| Active Learning (v0.6) | Uncertainty / diversity / hybrid sample selection under a fixed label budget |
| GNN Embeddings | TransE/RotatE KG embeddings, OntologyAwareGNN with biological attention, gene risk classification (experimental) |
| Autism Interpretation | Neuro-symbolic rules (R1-R7), therapeutic hypothesis ranking with safety flags (autism-only) |
| Performance | tqdm progress bars, chunked VCF processing, 10K+ sample support |
| Reproducibility | Deterministic execution, pinned dependencies, Docker |
| Config-Driven | YAML configuration for all parameters |
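As one illustration of the Bulk Deconvolution row above, the following is a minimal sketch of estimating cell-type proportions with non-negative least squares. It uses `scipy.optimize.nnls` directly; the `deconvolve` function, its argument layout, and the toy signature matrix are assumptions made for this example and are not the framework's own API.

```python
import numpy as np
from scipy.optimize import nnls


def deconvolve(bulk, signature):
    """Estimate cell-type proportions per bulk sample via NNLS.

    bulk:      genes x samples expression matrix
    signature: genes x cell_types reference profile (hypothetical input)
    """
    props = np.zeros((bulk.shape[1], signature.shape[1]))
    for j in range(bulk.shape[1]):
        coef, _residual = nnls(signature, bulk[:, j])
        total = coef.sum()
        # Rescale the non-negative coefficients so each row sums to 1.
        props[j] = coef / total if total > 0 else coef
    return props


rng = np.random.default_rng(0)
toy_signature = rng.uniform(0.1, 10.0, size=(50, 3))    # 50 genes, 3 cell types
true_props = np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])
toy_bulk = toy_signature @ true_props.T                  # noiseless mixtures
estimated = deconvolve(toy_bulk, toy_signature)
print(estimated.round(3))
```

With noiseless toy mixtures the estimates should match the simulated proportions closely; real bulk data will carry noise and platform effects that this sketch does not model.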
Quick Start
Installation
pip install pathway-subtyping
Optional extras:
pip install pathway-subtyping[vcf] # VCF file processing (pysam)
pip install pathway-subtyping[viz] # Interactive visualizations (Plotly, UMAP)
pip install pathway-subtyping[sc] # Single-cell support (AnnData)
pip install pathway-subtyping[graph] # Network analysis (NetworkX, py4cytoscape)
pip install pathway-subtyping[qc] # Molecular QC layer for manufacturing
pip install pathway-subtyping[harmonize] # v0.6 F2 — UCE cross-platform harmonization
pip install pathway-subtyping[perturb] # v0.6 F5 — Geneformer in-silico perturbation
pip install pathway-subtyping[embed] # v0.6 F6/F8 — scGPT + Nicheformer embedders
pip install pathway-subtyping[genesets] # v0.6 F7 — Borzoi regulatory gene-set expansion
pip install pathway-subtyping[qc-sequence] # v0.6 F9 — Evo 2 CRISPR off-target sequence scoring
pip install pathway-subtyping[gnn] # Graph neural networks (PyTorch) — experimental
pip install pathway-subtyping[autism] # Autism-specific interpretation (pure Python)
pip install pathway-subtyping[all] # Everything
The v0.6 foundation-model extras (harmonize, perturb, embed,
genesets, qc-sequence) each install the PyTorch substrate that the
corresponding Official*Backend needs (torch>=2.0; perturb additionally
pulls transformers>=4.35). The model-specific upstream package
(geneformer, scgpt, nicheformer, borzoi, evo2, uce) and its
checkpoint must be installed separately because not all of them ship
stable PyPI wheels yet. Every wrapper also ships a deterministic PCA-based
fallback that works with none of these extras, so tests and local smoke
runs don't depend on heavyweight checkpoints.
Network Visualization (Cytoscape)
The [graph] extra enables publication-ready network figure generation via
Cytoscape, an open-source desktop application for
network visualization. The py4cytoscape library communicates with Cytoscape's
CyREST API on localhost:1234.
Setup:
- Download and install Cytoscape desktop (v3.10+)
- Launch Cytoscape and wait for it to fully load
- Install the Python extra:
pip install pathway-subtyping[graph]
- Verify the connection:
python -c "import py4cytoscape as p4c; print(p4c.cytoscape_ping())"
- Run the figure generator:
python scripts/generate_cytoscape_figures.py
Try in Browser (No Installation)
60-second demo — generates a synthetic cohort, discovers subtypes, validates them, and visualizes results. No data needed. Click the badge above to launch in Binder.
Full tutorial: 01_getting_started.ipynb (Binder) | Local
Notebooks
21 Jupyter notebooks covering tutorials through full manuscript reproduction. See docs/notebook-guide.md for execution order and dependencies.
Tutorials (00-09) -- Synthetic data, standalone, any order:
| # | Topic |
|---|---|
| 00 | Quick demo (60 seconds) |
| 01 | Getting started (installation, pipeline, validation) |
| 02 | Expression scoring (ssGSEA, GSVA, mean-Z) |
| 03 | Multi-omic fusion |
| 04 | Cell-type deconvolution |
| 05 | Visualization (PCA, t-SNE, UMAP, Plotly) |
| 06 | Ancestry and batch correction |
| 07 | Sensitivity analysis |
| 08 | Subtype characterization |
| 09 | Signaling database integration |
Real Data Validation (10-19) -- GEO/TCGA datasets, run in order (later notebooks use earlier outputs):
| # | Dataset | Tissue | N | Manuscript Section |
|---|---|---|---|---|
| 10 | GSE28521 | Brain (frontal + temporal cortex) | 79 | Section 5 |
| 11 | GSE64018 | Brain (temporal cortex, RNA-seq) | 24 | Section 5.9 |
| 12 | GSE80655 | Brain (3 regions, multi-diagnosis) | 281 | Section 6 |
| 12b | GSE80655 | Null ARI permutation test | 141 | Section 6.5 |
| 13 | GSE111175 | Blood + ADOS clinical scores | 141 | Section 6.10 |
| 14 | GSE18123 | Blood (largest cohort, 2 platforms) | 285 | Section 6.10 |
| 15 | GSE53987 | Brain (3 regions, Affymetrix, 4 diagnoses) | 205 | Section 6.11 |
| 16 | Multi-dataset | Knowledge graph (STRING PPI + DGIdb) | 1,075 | Section 7 |
| 17 | TCGA-COAD | Colon adenocarcinoma (452 tumors, CMS + survival) | 452 | Section 7 |
| 18 | GSE15402 | LCL (ADI-R clinical phenotype) | 116 | Section 6.10 |
| 19 | Multi-GEO | SCZ blood multi-cohort (Hertzberg replication) | 407 | Section 6.11 |
Run with Sample Data
# Clone for sample data and configs
git clone https://codeberg.org/pathways/pathway-subtyping-framework.git
cd pathway-subtyping-framework
# Run the pipeline with synthetic test data
psf --config configs/test_synthetic.yaml
# View results
cat outputs/synthetic_test/report.md
Run with Your Data
# Copy and customize a config
cp configs/example_autism.yaml configs/my_analysis.yaml
# Edit paths in my_analysis.yaml, then run
psf --config configs/my_analysis.yaml
Docker
# Pull from Docker Hub
docker pull rohitdataops/pathway-subtyping:latest # CLI runtime
docker pull rohitdataops/pathway-subtyping:0.6.0-jupyter # Jupyter + notebooks
# Run pipeline
docker compose run pipeline
# Run tests (1612 tests on the public edition)
docker compose run test
# Start Jupyter notebook
docker compose up jupyter
# Open http://localhost:8888
Adapting for Your Disease
- Create a pathway GMT file with disease-relevant gene sets
- Copy an example config and point to your data
- Run the pipeline — validation gates will tell you if subtypes are meaningful
See the full guide: Adapting for Your Disease
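For the first step, a GMT file is plain tab-separated text: one gene set per line, with the set name, a description, and then the gene symbols. The sketch below writes and re-reads such a file. The helper names `write_gmt` and `read_gmt` and the two gene sets are hypothetical and serve only to show the file layout; the framework's own loader is not shown here.

```python
import tempfile
from pathlib import Path


def write_gmt(pathways, path):
    """Write {name: (description, genes)} as tab-separated GMT lines."""
    with open(path, "w", encoding="utf-8") as handle:
        for name, (description, genes) in pathways.items():
            handle.write("\t".join([name, description, *genes]) + "\n")


def read_gmt(path):
    """Read a GMT file back into {name: [genes]}; the description is dropped."""
    gene_sets = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:  # name, description, at least one gene
                gene_sets[fields[0]] = [g for g in fields[2:] if g]
    return gene_sets


# Hypothetical gene sets; replace with curated sets for your disease.
my_pathways = {
    "SYNAPTIC_SCAFFOLD": ("illustrative set", ["SHANK3", "NLGN3", "NRXN1"]),
    "CHROMATIN_REMODELING": ("illustrative set", ["CHD8", "ARID1B"]),
}
gmt_path = Path(tempfile.mkdtemp()) / "your_pathways.gmt"
write_gmt(my_pathways, gmt_path)
loaded = read_gmt(gmt_path)
print(loaded)
```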
How It Works
VCF / Expression / scRNA-seq → Pathway Scoring → [Ancestry Correction] → [Batch Correction] → GMM Clustering → [Sensitivity Analysis] → Validation → Report
5-Layer Architecture:
| Layer | Extra | What It Adds |
|---|---|---|
| Core | (none) | GMT scoring, GMM clustering, validation gates, simulation |
| Graph | [graph] | KG builder, network propagation, topology-weighted scoring |
| QC | [qc] | 12-feature manufacturing QC (cascade, dosage, drift, off-target, stress) |
| GNN | [gnn] | TransE/RotatE embeddings, OntologyAwareGNN, biological attention (experimental) |
| Autism | [autism] | Neuro-symbolic rules (R1-R7), therapeutic hypothesis ranking (autism-only) |
1. Pathway Scoring
Multiple input modalities are supported, each producing the same Z-normalized pathway score matrix:
- VCF: Rare damaging variants aggregated with LoF/CADD weights
- Expression: Bulk RNA-seq scored via ssGSEA, GSVA, or mean-Z
- Single-cell: Pseudobulk or per-cell scoring from scRNA-seq
- Multi-omic: Fuse scores from multiple modalities for unified subtype discovery
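The common output of these modalities, a Z-normalized pathway score matrix, can be sketched as follows. This is a simplified illustration that averages per-gene values within each pathway and then standardizes across samples. It does not reproduce the LoF/CADD weighting, ssGSEA, or GSVA; `score_pathways`, the toy burden values, and the pathway names are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def score_pathways(gene_values: pd.DataFrame, pathways: dict[str, list[str]]) -> pd.DataFrame:
    """Aggregate a samples x genes matrix into Z-normalized pathway scores."""
    scores = {}
    for name, genes in pathways.items():
        present = [g for g in genes if g in gene_values.columns]
        if not present:
            continue  # skip pathways with no measured genes
        scores[name] = gene_values[present].mean(axis=1)
    raw = pd.DataFrame(scores, index=gene_values.index)
    # Z-normalize each pathway across samples; constant columns become 0.
    std = raw.std(axis=0, ddof=0).replace(0, np.nan)
    return ((raw - raw.mean(axis=0)) / std).fillna(0.0)


# Toy per-gene burden values for four samples (illustrative numbers only).
burden = pd.DataFrame(
    {"SHANK3": [1.0, 0.0, 0.0, 2.0], "CHD8": [0.0, 0.0, 1.0, 1.0], "SCN2A": [0.0, 1.0, 0.0, 0.0]},
    index=["s1", "s2", "s3", "s4"],
)
toy_pathways = {"SYNAPSE": ["SHANK3", "SCN2A"], "CHROMATIN": ["CHD8", "MISSING_GENE"]}
pathway_scores = score_pathways(burden, toy_pathways)
print(pathway_scores.round(2))
```

Because every modality ends in the same samples-by-pathways layout, downstream clustering and validation do not need to know which input type produced the scores.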
2. Subtype Discovery
Multiple clustering algorithms identify patient subgroups:
- GMM (default): Soft assignments, automatic selection via BIC
- K-means: Fast, spherical clusters
- Hierarchical: Dendrogram-based, no K required
- Spectral: Nonlinear boundaries
- Cross-validation for stability assessment
- Algorithm comparison with pairwise ARI
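The default GMM path, with the number of subtypes chosen by BIC, can be sketched with scikit-learn. This is a minimal example on synthetic scores, not the framework's clustering module; `fit_gmm_by_bic`, the candidate range of K, and the toy data are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_gmm_by_bic(scores, k_range=range(1, 7), seed=0):
    """Fit one GMM per candidate K and keep the model with the lowest BIC."""
    best_model, best_bic = None, np.inf
    for k in k_range:
        model = GaussianMixture(
            n_components=k, covariance_type="full", n_init=5, random_state=seed
        )
        model.fit(scores)
        bic = model.bic(scores)
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model


rng = np.random.default_rng(0)
# Toy pathway-score matrix: two well-separated groups of 60 samples, 4 pathways.
toy_scores = np.vstack([rng.normal(-3.0, 1.0, (60, 4)), rng.normal(3.0, 1.0, (60, 4))])
gmm = fit_gmm_by_bic(toy_scores)
hard_labels = gmm.predict(toy_scores)        # one subtype label per sample
soft_probs = gmm.predict_proba(toy_scores)   # soft assignments, rows sum to 1
print(gmm.n_components, np.bincount(hard_labels))
```

The soft assignment matrix is what distinguishes GMM from K-means here: samples near a subtype boundary receive split membership probabilities instead of a forced label.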
3. Validation Gates
Built-in tests prevent overfitting:
- Label shuffle: Randomized labels should NOT cluster (ARI < 0.15)
- Random genes: Fake pathways should NOT work (ARI < 0.15)
- Bootstrap: Clusters should be stable under resampling (ARI > 0.8)
- Ancestry independence: Clusters should not correlate with ancestry PCs (when provided)
- Cross-modal concordance: Subtypes should replicate across data modalities (when multi-omic)
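Two of these gates can be illustrated with adjusted Rand index (ARI) computations. The sketch below is one possible reading of the bootstrap and label-shuffle gates, using the thresholds quoted above (ARI > 0.8 and ARI < 0.15). The function names and the exact resampling scheme are assumptions, and the framework's implementation may differ.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture


def cluster(scores, k, seed=0):
    """Cluster with a GMM and return hard labels."""
    return GaussianMixture(n_components=k, n_init=3, random_state=seed).fit_predict(scores)


def bootstrap_stability(scores, k, n_boot=20, seed=0):
    """Mean ARI between reference labels and labels refit on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    reference = cluster(scores, k, seed)
    aris = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(scores), len(scores))  # sample with replacement
        boot_labels = cluster(scores[idx], k, seed)
        aris.append(adjusted_rand_score(reference[idx], boot_labels))
    return float(np.mean(aris))


def label_shuffle_null(scores, k, n_perm=20, seed=0):
    """Mean ARI between reference labels and randomly permuted copies of them."""
    rng = np.random.default_rng(seed)
    reference = cluster(scores, k, seed)
    aris = [adjusted_rand_score(reference, rng.permutation(reference)) for _ in range(n_perm)]
    return float(np.mean(aris))


rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(-3.0, 1.0, (60, 4)), rng.normal(3.0, 1.0, (60, 4))])
stability = bootstrap_stability(toy, k=2)
null_ari = label_shuffle_null(toy, k=2)
print(f"bootstrap ARI={stability:.2f} (gate: > 0.8), shuffle ARI={null_ari:.2f} (gate: < 0.15)")
```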
4. Statistical Rigor
Publication-quality statistics:
- FDR correction: Benjamini-Hochberg for multiple testing
- Effect sizes: Cohen's d with 95% bootstrap confidence intervals
- Power analysis: Sample size recommendations for target effect sizes
- Type I error: Estimation via null simulations
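The first two items can be sketched in plain NumPy. This is a minimal illustration of Benjamini-Hochberg adjustment and of Cohen's d with a percentile bootstrap interval. The function names, the number of bootstrap draws, and the toy data are assumptions, not the framework's API.

```python
import numpy as np


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downward, then cap at 1.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(ranked, 0.0, 1.0)
    return q


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Cohen's d."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    draws = [cohens_d(rng.choice(a, len(a)), rng.choice(b, len(b))) for _ in range(n_boot)]
    low, high = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
rng = np.random.default_rng(0)
subtype_a = rng.normal(1.0, 1.0, 50)   # toy pathway scores for one subtype
subtype_b = rng.normal(0.0, 1.0, 50)   # toy pathway scores for another subtype
d = cohens_d(subtype_a, subtype_b)
ci_low, ci_high = bootstrap_ci(subtype_a, subtype_b)
print(q_values, round(d, 2), (round(ci_low, 2), round(ci_high, 2)))
```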
See docs/how-it-works.md for a plain-language conceptual guide, or docs/METHODS.md for full statistical methodology.
Data Requirements
| Input | Format | Notes |
|---|---|---|
| Variants | VCF | Annotated with gene symbols, consequences |
| Bulk Expression | CSV/TSV | Gene expression matrix (samples x genes) |
| Single-Cell | h5ad/CSV | AnnData or cell-by-gene matrix with cell type annotations |
| Phenotypes | CSV | Sample IDs + clinical features |
| Pathways | GMT | Gene sets for your disease |
Your data stays on your infrastructure. The framework runs locally or in your cloud environment.
Data Provenance and Integrity
This project contains zero proprietary, commercial, or third-party customer data.
Every data file in this repository was either:
- Computationally generated: The synthetic VCF and phenotype files in data/sample/ were created by our SyntheticDataGenerator using random number generators with fixed seeds. They contain no real patient or clinical data whatsoever.
- Curated from public scientific literature: The pathway GMT files in data/pathways/ contain gene symbol lists assembled exclusively from publicly available, peer-reviewed sources: SFARI Gene, KEGG, Reactome, MSigDB, and Gene Ontology. Gene symbols (e.g., SHANK3, CHD8) are standard scientific identifiers published in thousands of research papers.
- Open-source code only: All algorithms are original implementations or standard open-source libraries (scikit-learn, scipy, numpy, pandas). No proprietary software, commercial code, or licensed algorithms were used.
No data from any employer, client, institution, or commercial entity was used at any stage of this project — not in development, testing, validation, or documentation. The framework is designed so that users supply their own data; it does not ship with, embed, or depend on any private or restricted datasets.
For full details, see DISCLAIMER.md and docs/contributor-kit/04-research-compliance.md.
Project Structure
pathway-subtyping-framework/
├── src/pathway_subtyping/ # Core Python package
│ ├── pipeline.py # Main pipeline orchestrator
│ ├── clustering.py # GMM, K-means, Hierarchical, Spectral
│ ├── validation.py # Validation gates (5 gates)
│ ├── statistical_rigor.py # FDR, effect sizes, burden weights
│ ├── simulation.py # Synthetic data & power analysis
│ ├── expression.py # Bulk RNA-seq pathway scoring (ssGSEA, GSVA, mean-Z)
│ ├── single_cell.py # Single-cell scRNA-seq scoring (pseudobulk + per-cell)
│ ├── multi_omic.py # Multi-omic pathway score fusion
│ ├── deconvolution.py # Bulk deconvolution (NNLS cell-type proportions)
│ ├── cross_modal_validation.py # Cross-modal concordance gate (Gate 5)
│ ├── visualization.py # Interactive Plotly reports, UMAP/t-SNE, export
│ ├── characterization.py # Subtype profiling, heatmaps, gene contributions
│ ├── ancestry.py # Population stratification correction
│ ├── batch_correction.py # Batch effect detection & correction
│ ├── sensitivity.py # Parameter sensitivity analysis
│ ├── benchmark.py # Method comparison benchmarks
│ ├── cross_cohort.py # Cross-cohort validation
│ ├── threshold_calibration.py # Data-driven threshold calibration
│ ├── network_propagation.py # PPI network propagation (RWR, PageRank)
│ ├── variant_qc.py # Variant QC (QUAL, HWE, MAF, call rate)
│ ├── validation_datasets.py # ClinVar/Reactome integration
│ ├── data_quality.py # VCF quality checks
│ ├── utils/ # Performance, seeding, progress tracking
│ ├── knowledge_graph/ # [graph] KG builder, schema, exporters
│ ├── qc/ # [qc] 12-feature molecular QC layer
│ ├── gnn/ # [gnn] GNN embeddings, attention, model (experimental)
│ └── autism/ # [autism] Neuro-symbolic rules, therapeutic ranking
├── configs/ # Example YAML configurations
├── data/
│ ├── pathways/ # Pathway GMT files (6 diseases)
│ └── sample/ # Synthetic test data
├── docs/
│ ├── METHODS.md # Statistical methods documentation
│ ├── api/ # API reference (22+ modules)
│ └── guides/ # User guides
├── scripts/ # Utility scripts
│ ├── generate_cytoscape_figures.py # Publication-ready network figures (requires [graph] + Cytoscape desktop)
│ ├── validate_with_public_data.py # ClinVar/Reactome validation
│ └── benchmark_performance.py # Performance benchmarks
├── examples/notebooks/ # Jupyter tutorials
├── tests/ # Test suite (1363 tests)
├── Dockerfile # Container support
└── docker-compose.yml # Easy orchestration
Development
# Install with dev dependencies (from cloned repo)
pip install -e ".[dev,vcf,viz,sc,graph,qc,autism]"
# Run tests
pytest tests/ -v
# Run linting
black src/ tests/
isort src/ tests/
flake8 src/ tests/
# Set up pre-commit hooks
pre-commit install
Related Projects
- Autism Pathway Framework — The original autism-focused implementation with SFARI cohort validation
Contributing
Contributions welcome! Areas where help is needed:
- Additional disease pathway definitions
- Multi-omic integration (spatial transcriptomics, proteomics)
- Documentation and tutorials
See CONTRIBUTING.md for guidelines.
Citation
If you use this framework in your research, please cite the preprint:
Chauhan R. Pathway-Based Molecular Subtyping Reveals a GABA-Collapsed Autism Subgroup
and Cross-Disease Convergence with Schizophrenia in Human Cerebral Cortex.
Research Square. 2026. DOI: 10.21203/rs.3.rs-8913089/v1
https://www.researchsquare.com/article/rs-8913089/v1
To cite the software specifically:
Chauhan R. Pathway Subtyping Framework. Zenodo. 2026.
DOI: 10.5281/zenodo.18442426
https://codeberg.org/pathways/pathway-subtyping-framework
For autism-specific work, also cite:
Chauhan R. Autism Pathway Framework. Zenodo. 2026.
DOI: 10.5281/zenodo.18403844
License
MIT License — see LICENSE for details.
Contact
Rohit Chauhan
- Email: info@topmist.com
- Codeberg: @pathways
- ORCID: 0009-0003-9895-4629
RESEARCH USE ONLY — This framework is for hypothesis generation. Not for clinical diagnosis or treatment decisions.
File details
Details for the file pathway_subtyping-0.6.3.tar.gz.
File metadata
- Download URL: pathway_subtyping-0.6.3.tar.gz
- Upload date:
- Size: 670.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a4cb6f078d051fb9e5951916ae756c401251b6e519b67e9f663455ecbf593294 |
| MD5 | 2f4befb59924d9be9330786dd0ea19b5 |
| BLAKE2b-256 | f27f44ff028fb7c1ce6b0096b69b5de14f00770821c6fbaf0ce06e8f6281ef6d |
File details
Details for the file pathway_subtyping-0.6.3-py3-none-any.whl.
File metadata
- Download URL: pathway_subtyping-0.6.3-py3-none-any.whl
- Upload date:
- Size: 382.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e91bfb3af1d7d17cf90ec2f9a0c457418d9ba3a25641fcc42800e23fabf0aa49 |
| MD5 | 270a1864063951b6c94eb60af7f14841 |
| BLAKE2b-256 | 3e166d66dbba591634708bbc159e91e2d9a2d03abc4e7e5a01a975980d551d26 |