A disease-agnostic framework for identifying molecular subtypes through pathway-based analysis of rare genetic variants
Reason this release was yanked:
"Breaks Colab/NumPy 2.x — use 0.2.2"
Pathway Subtyping Framework
A Disease-Agnostic Tool for Pathway-Based Molecular Subtype Discovery
Overview
The Pathway Subtyping Framework is an open-source computational tool for identifying molecular subtypes in genetically heterogeneous diseases. Instead of analyzing individual genes, it aggregates rare variant burden at the biological pathway level, enabling:
- Better signal detection across genetically diverse cohorts
- Identification of biologically coherent patient subgroups
- Cross-cohort validation of discovered subtypes
Originally developed for autism research, this generalized version can be adapted for any disease with:
- Genetic heterogeneity (many implicated genes)
- Convergent pathway biology
- Available exome/genome sequencing data
Supported Disease Areas
| Disease | Status | Pathway File |
|---|---|---|
| Autism Spectrum Disorder | Validated | autism_pathways.gmt |
| Schizophrenia | Template | schizophrenia_pathways.gmt |
| Epilepsy | Template | epilepsy_pathways.gmt |
| Intellectual Disability | Template | intellectual_disability_pathways.gmt |
| Parkinson's Disease | Template | parkinsons_pathways.gmt |
| Bipolar Disorder | Template | bipolar_pathways.gmt |
| Your disease | Adapt it → | your_pathways.gmt |
Key Features
| Feature | Description |
|---|---|
| Pathway Scoring | Aggregate gene burdens across biological pathways |
| Multiple Clustering | GMM, K-means, Hierarchical, Spectral with cross-validation |
| Ancestry Correction | PCA-based population stratification correction with independence testing |
| Batch Correction | ComBat-style batch effect detection and correction |
| Sensitivity Analysis | Parameter robustness testing across algorithms, features, normalization |
| Validation Gates | Negative controls + bootstrap stability + ancestry independence testing |
| Statistical Rigor | FDR correction, effect sizes, confidence intervals |
| Power Analysis | Sample size recommendations, Type I error estimation |
| Simulation | Synthetic data generation with ground truth for validation |
| Reproducibility | Deterministic execution, pinned dependencies, Docker |
| Config-Driven | YAML configuration for all parameters |
Quick Start
Installation
# Clone the repository
git clone https://github.com/topmist-admin/pathway-subtyping-framework
cd pathway-subtyping-framework
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # or `venv\Scripts\activate` on Windows
# Install the package
pip install -e .
# Verify installation
psf --version
Run with Sample Data
# Run the pipeline with synthetic test data
psf --config configs/test_synthetic.yaml
# View results
cat outputs/synthetic_test/report.md
Run with Your Data
# Copy and customize a config
cp configs/example_autism.yaml configs/my_analysis.yaml
# Edit paths in my_analysis.yaml, then run
psf --config configs/my_analysis.yaml
Docker
# Run pipeline
docker-compose run pipeline
# Run tests
docker-compose run test
# Start Jupyter notebook
docker-compose up jupyter
# Open http://localhost:8888
Adapting for Your Disease
- Create a pathway GMT file with disease-relevant gene sets
- Copy an example config and point to your data
- Run the pipeline — validation gates will tell you if subtypes are meaningful
See the full guide: Adapting for Your Disease
How It Works
VCF Input → Variant Filter → Gene Burden → Pathway Aggregation → [Ancestry Correction] → [Batch Correction] → GMM Clustering → [Sensitivity Analysis] → Validation → Report
1. Pathway Scoring
Rare damaging variants are aggregated into pathway-level disruption scores:
- Loss-of-function variants weighted higher
- Missense variants weighted by CADD score
- Scores normalized across samples
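The scoring step can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the package's implementation: the weight constants (`LOF_WEIGHT`, `CADD_SCALE`), the variant record layout, the function names, and the choice of z-scoring as the normalization are all hypothetical, and the gene and pathway names are only examples.

```python
from statistics import mean, pstdev

# Hypothetical weights: the framework's actual values are not stated in this README.
LOF_WEIGHT = 1.0
CADD_SCALE = 40.0  # assumed divisor that maps a CADD phred score into roughly [0, 1]


def variant_weight(consequence, cadd=None):
    """Weight one rare variant: LoF gets full weight, missense scales with CADD."""
    if consequence == "lof":
        return LOF_WEIGHT
    if consequence == "missense" and cadd is not None:
        return min(cadd / CADD_SCALE, 1.0)
    return 0.0


def gene_burden(variants):
    """Sum variant weights per (sample, gene) pair."""
    burden = {}
    for v in variants:
        key = (v["sample"], v["gene"])
        burden[key] = burden.get(key, 0.0) + variant_weight(v["consequence"], v.get("cadd"))
    return burden


def pathway_scores(burden, pathways, samples):
    """Sum gene burdens per pathway, then z-score each pathway across samples."""
    scores = {}
    for name, genes in pathways.items():
        raw = [sum(burden.get((s, g), 0.0) for g in genes) for s in samples]
        mu, sd = mean(raw), pstdev(raw)
        scores[name] = [(x - mu) / sd if sd > 0 else 0.0 for x in raw]
    return scores


# Toy input: three samples, two illustrative pathways.
variants = [
    {"sample": "S1", "gene": "SHANK3", "consequence": "lof"},
    {"sample": "S2", "gene": "SHANK3", "consequence": "missense", "cadd": 20.0},
    {"sample": "S3", "gene": "CHD8", "consequence": "lof"},
]
pathways = {"synaptic": ["SHANK3", "NRXN1"], "chromatin": ["CHD8"]}
samples = ["S1", "S2", "S3"]
scores = pathway_scores(gene_burden(variants), pathways, samples)
```

In this toy cohort the sample carrying the loss-of-function variant ends up with the highest synaptic score, the missense carrier sits in the middle, and the non-carrier is lowest.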
2. Subtype Discovery
Multiple clustering algorithms identify patient subgroups:
- GMM (default): Soft assignments, automatic model selection via BIC
- K-means: Fast, spherical clusters
- Hierarchical: Dendrogram-based, no K required
- Spectral: Nonlinear boundaries
- Cross-validation for stability assessment
- Algorithm comparison with pairwise ARI
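To illustrate what BIC-based selection means for the default GMM, here is a small, self-contained sketch that fits one-dimensional Gaussian mixtures by EM and keeps the component count with the lowest BIC. It is a toy under stated assumptions: the real pipeline works on multi-dimensional pathway scores, and the function names, the chunk-based initialization, and the variance floor are hypothetical choices made for this example.

```python
import math


def log_normal(x, mu, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def fit_gmm_1d(xs, k, n_iter=200, var_floor=1e-3):
    """Fit a 1-D Gaussian mixture by EM; return (log_likelihood, means)."""
    xs = sorted(xs)
    n = len(xs)
    # Deterministic init: split the sorted values into k contiguous chunks.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    mus = [sum(c) / len(c) for c in chunks]
    vars_ = [max(sum((x - m) ** 2 for x in c) / len(c), var_floor)
             for c, m in zip(chunks, mus)]
    ws = [len(c) / n for c in chunks]
    ll = float("-inf")
    for _ in range(n_iter):
        # E-step: responsibilities via log-sum-exp for numerical stability.
        resp, ll = [], 0.0
        for x in xs:
            logs = [math.log(w) + log_normal(x, m, v)
                    for w, m, v in zip(ws, mus, vars_)]
            top = max(logs)
            lse = top + math.log(sum(math.exp(l - top) for l in logs))
            ll += lse
            resp.append([math.exp(l - lse) for l in logs])
        # M-step: update weights, means, and (floored) variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            ws[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            vars_[j] = max(
                sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj,
                var_floor,
            )
    return ll, mus  # ll is the log-likelihood from the final E-step


def select_k_by_bic(xs, k_values):
    """Return (best_k, {k: BIC}); a 1-D mixture with k components has 3k - 1 free parameters."""
    n = len(xs)
    bics = {}
    for k in k_values:
        ll, _ = fit_gmm_1d(xs, k)
        bics[k] = (3 * k - 1) * math.log(n) - 2 * ll
    return min(bics, key=bics.get), bics


# Toy pathway scores with two well-separated groups.
toy_scores = [-2.2, -2.1, -2.0, -1.9, -1.8, 1.8, 1.9, 2.0, 2.1, 2.2]
best_k, bics = select_k_by_bic(toy_scores, [1, 2, 3])
```

The variance floor matters here: without it, a component can collapse onto a single point and inflate the likelihood enough to mislead BIC.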
3. Validation Gates
Built-in tests prevent overfitting:
- Label shuffle: Randomized labels should NOT cluster (ARI < 0.15)
- Random genes: Fake pathways should NOT work (ARI < 0.15)
- Bootstrap: Clusters should be stable under resampling (ARI > 0.8)
- Ancestry independence: Clusters should not correlate with ancestry PCs (when provided)
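The gates above are expressed as Adjusted Rand Index (ARI) thresholds. The sketch below computes ARI from scratch and applies the 0.15 null threshold to shuffled labels. It shows one plausible reading of the label-shuffle gate, not the framework's actual procedure: the function name, the number of shuffles, and the use of the mean ARI are assumptions.

```python
import random
from collections import Counter
from math import comb

NULL_ARI_MAX = 0.15       # threshold quoted for the negative-control gates
BOOTSTRAP_ARI_MIN = 0.8   # threshold quoted for the bootstrap stability gate


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same samples (1.0 = identical partitions)."""
    n = len(labels_a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_ij - expected) / (max_index - expected)


# Negative control: shuffled labels should agree with the original only by chance.
rng = random.Random(0)
labels = [0] * 30 + [1] * 30
null_aris = []
for _ in range(200):
    shuffled = labels[:]
    rng.shuffle(shuffled)
    null_aris.append(adjusted_rand_index(labels, shuffled))

mean_null_ari = sum(null_aris) / len(null_aris)
passes_null_gate = mean_null_ari < NULL_ARI_MAX
```

The same `adjusted_rand_index` function could be used for the bootstrap gate by comparing labels from resampled runs against the original labels and requiring the result to exceed `BOOTSTRAP_ARI_MIN`.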
4. Statistical Rigor
Publication-quality statistics:
- FDR correction: Benjamini-Hochberg for multiple testing
- Effect sizes: Cohen's d with 95% bootstrap confidence intervals
- Power analysis: Sample size recommendations for target effect sizes
- Type I error: Estimation via null simulations
See docs/METHODS.md for full statistical methodology.
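For readers who want a concrete picture of two of these statistics, the following sketch implements Benjamini-Hochberg adjustment and Cohen's d with a percentile bootstrap interval using only the standard library. The function names, the pooled-SD form of Cohen's d, the percentile method, and the resample count are assumptions for illustration; docs/METHODS.md remains the reference for what the package actually does.

```python
import random
from math import sqrt
from statistics import mean, variance


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for Cohen's d."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        ra = rng.choices(a, k=len(a))
        rb = rng.choices(b, k=len(b))
        try:
            stats.append(cohens_d(ra, rb))
        except ZeroDivisionError:
            continue  # resample had zero pooled variance; skip it
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


# Toy pathway scores for two hypothetical subtypes.
cluster_a = [1.2, 0.8, 1.5, 1.1, 0.9, 1.4]
cluster_b = [0.2, -0.1, 0.4, 0.0, 0.3, -0.2]
d = cohens_d(cluster_a, cluster_b)
ci_low, ci_high = bootstrap_ci(cluster_a, cluster_b)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```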
Data Requirements
| Input | Format | Notes |
|---|---|---|
| Variants | VCF | Annotated with gene symbols, consequences |
| Phenotypes | CSV | Sample IDs + clinical features |
| Pathways | GMT | Gene sets for your disease |
Your data stays on your infrastructure. The framework runs locally or in your cloud environment.
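A GMT file is tab-delimited, with one gene set per line: a set name, a description field, then the gene symbols. A minimal parser might look like the sketch below; the function name `parse_gmt` and the example pathway lines are hypothetical and are not taken from the files shipped with the framework.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into {pathway_name: set_of_gene_symbols}."""
    pathways = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and lines without any genes
        name, _description, *genes = fields
        pathways[name] = {g for g in genes if g}
    return pathways


# Illustrative content; real files would be read with open(path) and passed in.
example_lines = [
    "SYNAPTIC_SIGNALING\tillustrative set\tSHANK3\tNRXN1\tSYNGAP1\n",
    "CHROMATIN_REMODELING\tillustrative set\tCHD8\tARID1B\n",
    "\n",
]
gene_sets = parse_gmt(example_lines)
```

Storing each pathway as a set makes membership checks cheap when gene burdens are later summed per pathway.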
Project Structure
pathway-subtyping-framework/
├── src/pathway_subtyping/ # Core Python package
│ ├── pipeline.py # Main pipeline
│ ├── clustering.py # Multiple clustering algorithms
│ ├── statistical_rigor.py # FDR, effect sizes, burden weights
│ ├── simulation.py # Synthetic data & power analysis
│ ├── validation.py # Validation gates
│ ├── ancestry.py # Population stratification correction
│ ├── batch_correction.py # Batch effect detection & correction
│ ├── sensitivity.py # Parameter sensitivity analysis
│ └── data_quality.py # VCF quality checks
├── configs/ # Example YAML configurations
├── data/
│ ├── pathways/ # Pathway GMT files (6 diseases)
│ └── sample/ # Synthetic test data
├── docs/
│ ├── METHODS.md # Statistical methods documentation
│ └── guides/ # User guides
├── examples/notebooks/ # Jupyter tutorials
├── tests/ # Test suite (347 tests)
├── Dockerfile # Container support
└── docker-compose.yml # Easy orchestration
Development
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run linting
black src/ tests/
isort src/ tests/
flake8 src/ tests/
# Set up pre-commit hooks
pre-commit install
Related Projects
- Autism Pathway Framework — The original autism-focused implementation with SFARI cohort validation
Contributing
Contributions welcome! Areas where help is needed:
- Additional disease pathway definitions
- Performance optimization for large cohorts
- Documentation and tutorials
See CONTRIBUTING.md for guidelines.
Citation
If you use this framework, please cite:
Chauhan R. Pathway Subtyping Framework. GitHub. 2026.
https://github.com/topmist-admin/pathway-subtyping-framework
For autism-specific work, also cite:
Chauhan R. Autism Pathway Framework. Zenodo. 2026.
DOI: 10.5281/zenodo.18403844
License
MIT License — see LICENSE for details.
Contact
Rohit Chauhan
- Email: info@topmist.com
- GitHub: @topmist-admin
RESEARCH USE ONLY — This framework is for hypothesis generation. Not for clinical diagnosis or treatment decisions.