Clinical Phenotype Discovery using Latent Class / Profile Analysis with Automatic Model Selection
A flexible data-driven framework for identifying clinical phenotypes using latent class and profile analysis
Overview
PhenoCluster is a Python framework for unsupervised discovery of clinical phenotypes from heterogeneous patient data. It implements an end-to-end pipeline, from data preprocessing and latent class identification through outcome association analysis, survival modelling, and multistate transition modelling.
The framework is domain-agnostic and can be applied to any clinical cohort study where the goal is to identify latent patient subgroups and characterise their relationship with clinical outcomes. Users supply a dataset and a YAML configuration file; PhenoCluster handles model selection, phenotype assignment, and downstream inference automatically.
Key capabilities
- Latent Class / Profile Analysis via the StepMix framework with native support for mixed continuous/categorical data and missing values
- Automatic model selection using information criteria (BIC, AIC, ICL, CAIC, SABIC) with configurable cluster-size constraints
- Classification quality assessment with per-phenotype Average Posterior Probability (AvePP) and assignment confidence metrics
- Outcome association analysis with logistic regression yielding odds ratios, confidence intervals, and FDR-corrected p-values
- Survival analysis with Cox proportional hazards models (hazard ratios) and log-rank tests
- Multistate modelling with transition-specific Cox PH analysis, Monte Carlo simulation for state occupation probabilities with confidence interval bands, and clinical pathway enumeration
- Temporal and multi-site generalizability (v0.3.0): validate phenotypes across time windows or sites/centers (cutoff, sliding/expanding windows, leave-one-site-out), with apply-only or refit-and-match modes, calibration metrics (Brier, ECE), drift detection (PSI, KS, chi-square), and per-phenotype OR/HR concordance with FDR-corrected delta tests
- Optional Streamlit dashboard (v0.3.0) for interactive exploration of saved results: phenocluster dashboard <results_dir>
- Comprehensive output including an interactive HTML report (toggleable via generate_html_report or --no-html-report), forest plots with confidence intervals, Kaplan-Meier and Nelson-Aalen curves, heatmaps, and JSON/CSV data exports
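To make the model-selection capability concrete, the sketch below computes four of the listed information criteria from a fitted model's log-likelihood and picks the best cluster count subject to a minimum cluster size. It uses the textbook definitions and is not PhenoCluster's own code: the function names, the candidate dictionary keys, and the `min_cluster_fraction` parameter are hypothetical. ICL is left out because it additionally needs the posterior entropy.

```python
import math


def information_criteria(log_likelihood: float, n_params: int, n_samples: int) -> dict[str, float]:
    """Textbook penalised-likelihood criteria; lower is better for all of them."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2.0 * n_params,
        "BIC": deviance + n_params * math.log(n_samples),
        "CAIC": deviance + n_params * (math.log(n_samples) + 1.0),
        "SABIC": deviance + n_params * math.log((n_samples + 2.0) / 24.0),
    }


def select_k(candidates, n_samples, criterion="BIC", min_cluster_fraction=0.05):
    """Return (k, score) for the lowest-scoring candidate whose smallest class is large enough.

    Each candidate is a dict with the hypothetical keys
    k, log_likelihood, n_params and class_sizes. Returns None if no candidate qualifies.
    """
    best = None
    for cand in candidates:
        # Size constraint: skip solutions that contain a very small class
        if min(cand["class_sizes"]) < min_cluster_fraction * n_samples:
            continue
        score = information_criteria(cand["log_likelihood"], cand["n_params"], n_samples)[criterion]
        if best is None or score < best[1]:
            best = (cand["k"], score)
    return best


# Made-up fits for illustration only
candidates = [
    {"k": 2, "log_likelihood": -5200.0, "n_params": 11, "class_sizes": [600, 400]},
    {"k": 3, "log_likelihood": -5100.0, "n_params": 17, "class_sizes": [500, 300, 200]},
    {"k": 4, "log_likelihood": -5050.0, "n_params": 23, "class_sizes": [500, 300, 180, 20]},
]
print(select_k(candidates, n_samples=1000))
```

In this toy input the four-class solution has the lowest BIC, but its smallest class holds only 2% of the sample, so the size constraint rules it out and the three-class solution is selected.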
Installation
Requires Python >= 3.11
pip install phenocluster
To enable the optional interactive dashboard:
pip install 'phenocluster[dashboard]'
Quick start
1. Generate a configuration file
phenocluster create-config -p complete -o config.yaml
2. Edit the configuration
Open config.yaml and fill in your dataset-specific parameters:
global:
  project_name: "My Study"
  output_dir: "results"
  random_state: 42
data:
  continuous_columns:
    - age
    - bmi
    - lab_value_1
  categorical_columns:
    - sex
    - smoking_status
    - disease_stage
  split:
    test_size: 0.2
outcome:
  enabled: true
  outcome_columns:
    - mortality_30d
    - readmission_30d
survival:
  enabled: true
  targets:
    - name: "overall_survival"
      time_column: "time_to_death"
      event_column: "death_indicator"
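Before running the pipeline it helps to confirm that every column named in the configuration exists in the dataset. The supported route is phenocluster validate-config -c CONFIG -d DATA; the sketch below only illustrates the idea with the standard library. The `missing_columns` helper is hypothetical, the config is mirrored as a plain dict rather than parsed from YAML, and the CSV header is a made-up stand-in for data.csv.

```python
import csv
import io


def missing_columns(config: dict, header: list[str]) -> list[str]:
    """Return configured column names that are absent from the data header."""
    data = config.get("data", {})
    wanted = list(data.get("continuous_columns", []))
    wanted += data.get("categorical_columns", [])
    wanted += config.get("outcome", {}).get("outcome_columns", [])
    for target in config.get("survival", {}).get("targets", []):
        wanted += [target["time_column"], target["event_column"]]
    present = set(header)
    return [name for name in wanted if name not in present]


# The example configuration, written out as a dict for illustration
config = {
    "data": {
        "continuous_columns": ["age", "bmi", "lab_value_1"],
        "categorical_columns": ["sex", "smoking_status", "disease_stage"],
    },
    "outcome": {"enabled": True, "outcome_columns": ["mortality_30d", "readmission_30d"]},
    "survival": {
        "enabled": True,
        "targets": [
            {
                "name": "overall_survival",
                "time_column": "time_to_death",
                "event_column": "death_indicator",
            }
        ],
    },
}

# Stand-in for the first line of data.csv; a real check would open the file instead
sample_csv = io.StringIO("age,bmi,sex,mortality_30d,time_to_death,death_indicator\n")
header = next(csv.reader(sample_csv))
print(missing_columns(config, header))
```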
3. Run the pipeline
phenocluster run -d data.csv -c config.yaml
4. Inspect results
Results are written to the output directory (default: results/):
| File | Description |
|---|---|
| analysis_report.html | Comprehensive HTML report (skip with generate_html_report: false or --no-html-report) |
| cluster_statistics.json | Phenotype sizes, feature distributions, and classification quality |
| outcome_results.json | Odds ratios with confidence intervals and p-values |
| survival_results.json | Kaplan-Meier estimates and Cox PH hazard ratios |
| multistate_results.json | Transition-specific hazard ratios, pathways, and state occupation |
| data/model_fit_metrics.csv | Information criteria, entropy, and average posterior probabilities |
| data/phenotypes_data.csv | Original data augmented with phenotype assignments |
| data/posterior_probabilities.csv | Posterior class membership probabilities |
| results/model_selection_summary.json | Model selection comparison table and best model info |
| results/feature_importance.json | Feature characterisation per phenotype |
| results/validation_report.json | Internal validation metrics (train/test comparison) |
| results/stability_results.json | Consensus clustering stability metrics |
| results/split_info.json | Train/test split details |
| results/external_validation_results.json | External validation results (when enabled) |
| results/temporal_validation_results.json | Temporal generalizability results (when enabled, v0.3.0) |
| results/multisite_validation_results.json | Multi-site (LOGO / holdout) generalizability results (v0.3.0) |
| results/external_cohorts_results.json | External-CSV generalizability results (v0.3.0) |
| results/generalizability_summary.json | Aggregate ARI / PSI summary across cohorts plus training_scope flag (v0.3.0) |
| data/generalizability/ | Per-cohort cluster_distribution_<label>.csv and drift_<label>.csv (v0.3.0) |
| phenocluster.log | Pipeline execution log |
| artifacts/ | Cached intermediate results for incremental re-runs |
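Several of these outputs report average posterior probabilities (AvePP). As a rough illustration of what that metric measures, the sketch below computes per-class AvePP from a small in-memory matrix of posterior probabilities using the usual definition: the mean winning probability among rows assigned to each class. It does not read data/posterior_probabilities.csv, whose column layout is not described here, and the function name is hypothetical.

```python
def average_posterior_probability(posteriors: list[list[float]]) -> dict[int, float]:
    """AvePP per class: mean of the winning posterior among rows assigned to that class."""
    sums: dict[int, float] = {}
    counts: dict[int, int] = {}
    for row in posteriors:
        # Modal assignment: the class with the highest posterior probability
        assigned = max(range(len(row)), key=row.__getitem__)
        sums[assigned] = sums.get(assigned, 0.0) + row[assigned]
        counts[assigned] = counts.get(assigned, 0) + 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}


# Made-up posteriors: one row per patient, one column per class
posteriors = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.30, 0.70],
    [0.05, 0.95],
]
print(average_posterior_probability(posteriors))
```

Values close to 1 indicate that patients assigned to a class belong to it with little ambiguity; lower values point to overlapping classes.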
5. Validate phenotypes across time or sites (v0.3.0)
Add a generalizability block to the config to enable temporal, multi-site, and/or external-CSV validation. The default training_scope: per_split fits a fresh preprocessor and StepMix model on the derivation rows of each in-CSV split and applies it to the validation rows. The pipeline's full-cohort model stays untouched for descriptive analyses.
generalizability:
  enabled: true
  training_scope: per_split          # per_split (default) | global
  feature_selector_scope: auto       # auto (default) | global | per_split
  refit: true                        # refit-and-match Hungarian alignment
  min_validation_size_for_refit: 100
  temporal:
    time_column: admission_date
    scheme: cutoff                   # cutoff | fraction | sliding | expanding
    time_cutoff: "2020-12-31"
  multisite:
    site_column: center
    scheme: logo                     # logo | holdout | pairwise
    min_site_size: 30
  external_cohorts:                  # optional, one or more separate CSVs
    - { path: ./cohort_B.csv, label: hospital_X, kind: site }
    - { path: ./cohort_2024.csv, label: era_2024, kind: temporal }
  drift: { enabled: true, n_bins: 10, top_k: 20 }
  calibration: { enabled: true, n_bins: 10, strategy: quantile }
  outcome_concordance: { enabled: true, fdr_method: bh, alpha: 0.05 }
Each cohort yields a phenotype distribution, drift table, refit-and-match metrics (ARI / NMI / Hungarian-matched accuracy), calibration metrics, and per-phenotype OR/HR concordance with FDR-corrected delta tests. Cohort reports also expose a fit_mode field (per_split for in-CSV splits under the default scope; global for external CSVs and the legacy permissive path) and derivation_only_ari showing how the fresh derivation-only fit compares to the global model.
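For readers unfamiliar with the drift metrics, the following sketch shows one common way to compute the Population Stability Index (PSI) for a single continuous feature: bin both samples using quantile cut points taken from the derivation sample, then sum (actual - expected) * ln(actual / expected) over the bins. The quantile binning, the `eps` floor for empty bins, and the function name are assumptions made for illustration; PhenoCluster's exact binning may differ.

```python
import math


def psi(expected: list[float], actual: list[float], n_bins: int = 10, eps: float = 1e-6) -> float:
    """PSI between two samples, with quantile cut points taken from `expected`."""
    ordered = sorted(expected)
    # Interior cut points at the 1/n_bins, 2/n_bins, ... quantiles of the reference sample
    edges = [ordered[min(len(ordered) - 1, (i * len(ordered)) // n_bins)] for i in range(1, n_bins)]

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * n_bins
        for v in values:
            # Bin index = number of cut points at or below the value
            counts[sum(1 for e in edges if v >= e)] += 1
        # Floor each proportion so empty bins do not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    p_exp, p_act = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p_exp, p_act))


derivation = [float(v) for v in range(100)]   # made-up reference sample
validation = [v + 30.0 for v in derivation]   # the same sample shifted upwards
print(psi(derivation, derivation), psi(derivation, validation))
```

A widely used rule of thumb reads PSI below 0.1 as stable and above 0.25 as a substantial shift; whether PhenoCluster applies these thresholds is not stated here.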
6. Explore results interactively (v0.3.0)
pip install 'phenocluster[dashboard]'
phenocluster dashboard ./results/
Streamlit launches at http://127.0.0.1:8501 with tabs for Overview, Phenotypes, Outcomes, Survival, Multistate, and Generalizability, plus a per-cohort Drift explorer.
Pipeline overview
PhenoCluster executes the following stages in order:
- Data quality assessment. Missingness patterns, correlations, variance, and MCAR testing.
- Train/test split. Stratified splitting with configurable test size, performed before preprocessing to prevent data leakage.
- Preprocessing. Imputation, outlier handling, categorical encoding, standardization, and feature selection -- fit on training data only, then applied to the test set.
- Model selection. Cross-validated information criterion search over cluster counts (training set only).
- Full-cohort refit. Once K is selected, the preprocessing and the LCA/LPA model are refit on the entire cohort, and phenotypes are reordered by size (largest = Phenotype 0).
- Stability analysis. Consensus clustering over subsampled runs.
- Internal validation. Train/test log-likelihood comparison, cluster proportion stability, and outcome OR consistency.
- Outcome association. Logistic regression for binary outcomes with FDR-corrected p-values (optional).
- Survival analysis. Kaplan-Meier curves, Nelson-Aalen estimators, log-rank tests, and Cox PH hazard ratios (optional).
- Multistate modelling. Transition-specific Cox PH models, transition hazard ratios, and Monte Carlo simulation (optional).
- Temporal / multi-site generalizability. Re-evaluate the derivation phenotypes on later time windows, held-out sites, and external CSVs; report ARI / NMI / matched accuracy, calibration, drift, and OR/HR concordance (optional, v0.3.0).
- Report generation. Interactive HTML report with all figures and tables.
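Several stages above report FDR-corrected p-values. As a minimal sketch of what such a correction does, the function below implements the standard Benjamini-Hochberg step-up adjustment, which corresponds to the fdr_method: bh option shown in the generalizability configuration. Whether every stage uses this exact method is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values, e.g. one per phenotype-versus-reference comparison
raw = [0.01, 0.04, 0.03, 0.20]
print(benjamini_hochberg(raw))
```

An association is then declared significant when its adjusted p-value falls below the chosen alpha, for example 0.05.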
CLI reference
| Command | Description |
|---|---|
| phenocluster run -d DATA -c CONFIG [--force-rerun] [-v] [-q] [--html-report/--no-html-report] | Run the full pipeline |
| phenocluster create-config [-p PROFILE] [-o OUTPUT] | Generate a config YAML from a profile template |
| phenocluster validate-config -c CONFIG [-d DATA] | Validate config structure; cross-check columns against data |
| phenocluster list-profiles | List available configuration profile templates |
| phenocluster show-profile NAME | Print the resolved YAML for a profile with syntax highlighting |
| phenocluster dashboard RESULTS_DIR [--port 8501] [--host 127.0.0.1] [--headless/--browser] | Launch the optional Streamlit dashboard (requires pip install 'phenocluster[dashboard]') |
| phenocluster version | Show version, repository link, and documentation link |
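When the pipeline is driven from a script or scheduler rather than an interactive shell, the command line can be assembled programmatically. The sketch below builds the argument list for phenocluster run using only the flags listed in the table above; the `build_run_command` helper is hypothetical and not part of the package.

```python
import shlex


def build_run_command(data_path: str, config_path: str,
                      html_report: bool = True, force_rerun: bool = False) -> list[str]:
    """Assemble the argv for `phenocluster run` from documented flags only."""
    cmd = ["phenocluster", "run", "-d", data_path, "-c", config_path]
    if force_rerun:
        cmd.append("--force-rerun")
    cmd.append("--html-report" if html_report else "--no-html-report")
    return cmd


cmd = build_run_command("data.csv", "config.yaml", html_report=False)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.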
Configuration profiles
Profiles set sensible defaults for common use-cases. Generate one with phenocluster create-config -p <profile>:
| Profile | Description | Inference | Stability | Multistate |
|---|---|---|---|---|
| descriptive | Phenotype discovery only, no statistical inference | off | on | off |
| complete | All analyses enabled (outcomes, survival, multistate) | on | on | on |
| quick | Fast iteration for development | on | off | off |
Configuration reference
See the full Configuration Reference in the documentation.
Documentation
Full documentation (statistical methods, configuration reference, output descriptions) is available at ettorerocchi.github.io/phenocluster.
License
This project is licensed under the MIT License.
Citation
If you use PhenoCluster in your research, please cite:
Available soon.
Acknowledgment
This project relies on StepMix, a Python package for pseudo-likelihood estimation of generalized mixture models with external variables. We thank the authors for making their work openly available.
If you use this framework, please also cite:
Morin, S., Legault, R., Laliberté, F., Bakk, Z., Giguère, C.-É., de la Sablonnière, R., & Lacourse, É. (2025). StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables. Journal of Statistical Software, 113(8), 1-39. doi: 10.18637/jss.v113.i08