
Clinical Phenotype Discovery using Latent Class / Profile Analysis with Automatic Model Selection


PhenoCluster

A flexible data-driven framework for identifying clinical phenotypes using latent class and profile analysis



Overview

PhenoCluster is a Python framework for unsupervised discovery of clinical phenotypes from heterogeneous patient data. It implements an end-to-end pipeline, from data preprocessing and latent class identification to outcome association analysis, survival modelling, and multistate transition modelling.

The framework is domain-agnostic and can be applied to any clinical cohort study where the goal is to identify latent patient subgroups and characterise their relationship with clinical outcomes. Users supply a dataset and a YAML configuration file; PhenoCluster handles model selection, phenotype assignment, and downstream inference automatically.

Key capabilities

  • Latent Class / Profile Analysis via the StepMix framework with native support for mixed continuous/categorical data and missing values
  • Automatic model selection using information criteria (BIC, AIC, ICL, CAIC, SABIC) with configurable cluster-size constraints
  • Classification quality assessment with per-phenotype Average Posterior Probability (AvePP) and assignment confidence metrics
  • Outcome association analysis with logistic regression yielding odds ratios, confidence intervals, and FDR-corrected p-values
  • Survival analysis with Cox proportional hazards models producing hazard ratios and log-rank tests
  • Multistate modelling with transition-specific Cox PH analysis, Monte Carlo simulation for state occupation probabilities with confidence interval bands, and clinical pathway enumeration
  • Comprehensive output including an interactive HTML report, forest plots with confidence intervals, Kaplan-Meier and Nelson-Aalen curves, heatmaps, and JSON/CSV data exports
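
The information criteria named above all trade model fit against complexity. The sketch below uses the standard textbook formulas for AIC, BIC, CAIC, and SABIC and picks the number of classes that minimises the chosen one. It is an illustration only, not PhenoCluster's or StepMix's code: the function names `information_criteria` and `select_k` are hypothetical, the log-likelihoods are made-up numbers, and ICL is left out because it additionally needs the posterior entropy.

```python
import math


def information_criteria(log_likelihood: float, n_params: int, n_samples: int) -> dict[str, float]:
    """Return common information criteria (lower is better) for one fitted model."""
    deviance = -2.0 * log_likelihood
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * math.log(n_samples),
        "CAIC": deviance + n_params * (math.log(n_samples) + 1),
        "SABIC": deviance + n_params * math.log((n_samples + 2) / 24),
    }


def select_k(candidates: dict[int, tuple[float, int]], n_samples: int, criterion: str = "BIC") -> int:
    """Pick the number of classes whose model minimises the chosen criterion.

    candidates maps K -> (log-likelihood, number of free parameters).
    """
    scores = {
        k: information_criteria(ll, p, n_samples)[criterion]
        for k, (ll, p) in candidates.items()
    }
    return min(scores, key=scores.get)


# Hypothetical fits for K = 2, 3, 4 on 500 patients (made-up numbers).
fits = {2: (-2450.0, 11), 3: (-2380.0, 17), 4: (-2372.0, 23)}
best_k = select_k(fits, n_samples=500)
```

With these made-up numbers BIC selects K = 3 while AIC selects K = 4, which shows why the choice of criterion is worth setting deliberately in the configuration.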

Installation

Requires Python ≥ 3.11

From PyPI

pip install phenocluster

From source

git clone https://github.com/EttoreRocchi/PhenoCluster.git
cd PhenoCluster
pip install -e ".[dev]"

Quick start

1. Generate a configuration file

phenocluster create-config -p complete -o config.yaml

2. Edit the configuration

Open config.yaml and fill in your dataset-specific parameters:

global:
  project_name: "My Study"
  output_dir: "results"
  random_state: 42

data:
  continuous_columns:
    - age
    - bmi
    - lab_value_1
  categorical_columns:
    - sex
    - smoking_status
    - disease_stage
  split:
    test_size: 0.2

outcome:
  enabled: true
  outcome_columns:
    - mortality_30d
    - readmission_30d

survival:
  enabled: true
  targets:
    - name: "overall_survival"
      time_column: "time_to_death"
      event_column: "death_indicator"

3. Run the pipeline

phenocluster run -d data.csv -c config.yaml

4. Inspect results

Results are written to the output directory (default: results/):

| File | Description |
| --- | --- |
| analysis_report.html | Comprehensive HTML report with all results and visualisations |
| cluster_statistics.json | Phenotype sizes, feature distributions, and classification quality |
| outcome_results.json | Odds ratios with confidence intervals and p-values |
| survival_results.json | Kaplan-Meier estimates and Cox PH hazard ratios |
| multistate_results.json | Transition-specific hazard ratios, pathways, and state occupation |
| data/model_fit_metrics.csv | Information criteria, entropy, and average posterior probabilities |
| data/phenotypes_data.csv | Original data augmented with phenotype assignments |
| data/posterior_probabilities.csv | Posterior class membership probabilities |
| results/model_selection_summary.json | Model selection comparison table and best model info |
| results/feature_importance.json | Feature characterisation per phenotype |
| results/validation_report.json | Internal validation metrics (train/test comparison) |
| results/stability_results.json | Consensus clustering stability metrics |
| results/split_info.json | Train/test split details |
| results/external_validation_results.json | External validation results (when enabled) |
| phenocluster.log | Pipeline execution log |
| artifacts/ | Cached intermediate results for incremental re-runs |
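
The average posterior probability (AvePP) reported alongside the posterior class membership probabilities can be recomputed by hand as a sanity check. The sketch below assumes one row per patient and one probability per phenotype. The actual column layout of data/posterior_probabilities.csv is not described here, so the example works on an in-memory list with made-up values, and the function name is hypothetical.

```python
def average_posterior_probability(posteriors: list[list[float]]) -> dict[int, float]:
    """Mean posterior probability of the assigned class, per class.

    Each row holds one patient's class membership probabilities (summing to 1).
    A patient is assigned to the class with the highest probability.
    """
    sums: dict[int, float] = {}
    counts: dict[int, int] = {}
    for row in posteriors:
        assigned = max(range(len(row)), key=row.__getitem__)
        sums[assigned] = sums.get(assigned, 0.0) + row[assigned]
        counts[assigned] = counts.get(assigned, 0) + 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}


# Made-up posteriors for five patients and two phenotypes.
posteriors = [
    [0.95, 0.05],
    [0.85, 0.15],
    [0.60, 0.40],
    [0.10, 0.90],
    [0.30, 0.70],
]
avepp = average_posterior_probability(posteriors)
```

Values close to 1 mean that patients assigned to a phenotype belong to it with little ambiguity; lower values point to overlap between phenotypes.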

Pipeline overview

PhenoCluster executes the following stages in order:

  1. Data quality assessment. Missingness patterns, correlations, variance, and MCAR testing.
  2. Train/test split. Stratified splitting with configurable test size, performed before preprocessing to prevent data leakage.
  3. Preprocessing. Imputation, outlier handling, categorical encoding, standardisation, and feature selection, all fitted on the training data only and then applied to the test set.
  4. Model selection. Cross-validated information criterion search over cluster counts (training set only).
  5. Full-cohort refit. Once K is selected, the preprocessing steps and the LCA/LPA model are refitted on the entire cohort, and phenotypes are reordered by size (largest = Phenotype 0).
  6. Stability analysis. Consensus clustering over subsampled runs.
  7. Internal validation. Train/test log-likelihood comparison, cluster proportion stability, and outcome OR consistency.
  8. Outcome association. Logistic regression for binary outcomes with FDR-corrected p-values (optional).
  9. Survival analysis. Kaplan-Meier curves, Nelson-Aalen estimators, log-rank tests, and Cox PH hazard ratios (optional).
  10. Multistate modelling. Transition-specific Cox PH models, transition hazard ratios, and Monte Carlo simulation (optional).
  11. Report generation. Interactive HTML report with all figures and tables.
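
Stages 2 and 3 split the data before any preprocessing statistics are learned, so that nothing about the test set leaks into the fitted transformations. A minimal standard-library sketch of that ordering for a single continuous feature follows. It is not PhenoCluster's implementation: the helper names are hypothetical, the split is a plain random one rather than the stratified split the pipeline uses, and the ages are made-up values.

```python
import random
import statistics


def train_test_split(rows, test_size, seed):
    """Shuffle row indices reproducibly and split the rows into train and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


def fit_scaler(values):
    """Learn the mean and standard deviation from the training values only."""
    return statistics.fmean(values), statistics.pstdev(values)


def apply_scaler(values, mean, std):
    """Standardise values with previously learned statistics."""
    return [(v - mean) / std for v in values]


# Made-up ages for ten patients.
ages = [34.0, 51.0, 47.0, 62.0, 29.0, 73.0, 58.0, 45.0, 66.0, 39.0]

# 1) Split first, 2) fit on the training part, 3) reuse those statistics on the test part.
train, test = train_test_split(ages, test_size=0.2, seed=42)
mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)
```

The training values end up with mean 0 and unit variance by construction, while the test values generally do not, which is expected: they are transformed with statistics they played no part in estimating.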

CLI reference

| Command | Description |
| --- | --- |
| phenocluster run -d DATA -c CONFIG [--force-rerun] | Run the full pipeline |
| phenocluster create-config [-p PROFILE] [-o OUTPUT] | Generate a config YAML from a profile template |
| phenocluster validate-config -c CONFIG [-d DATA] | Validate config structure; cross-check columns against data |
| phenocluster version | Show version, repository link, and documentation link |
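
To make the idea of cross-checking configured columns against the data concrete, here is a small standard-library sketch. It is not the code behind validate-config: the function `missing_columns` is hypothetical, the configuration is a plain dict that mirrors the keys shown in the Quick start YAML, and the CSV content is made up.

```python
import csv
import io


def missing_columns(config: dict, header: list[str]) -> list[str]:
    """Return configured column names that are absent from the data header."""
    data_cfg = config.get("data", {})
    wanted = list(data_cfg.get("continuous_columns", []))
    wanted += data_cfg.get("categorical_columns", [])
    if config.get("outcome", {}).get("enabled"):
        wanted += config["outcome"].get("outcome_columns", [])
    if config.get("survival", {}).get("enabled"):
        for target in config["survival"].get("targets", []):
            wanted += [target["time_column"], target["event_column"]]
    available = set(header)
    return [name for name in wanted if name not in available]


# Configuration as a dict, using the same keys as the Quick start example.
config = {
    "data": {"continuous_columns": ["age", "bmi"], "categorical_columns": ["sex"]},
    "outcome": {"enabled": True, "outcome_columns": ["mortality_30d"]},
    "survival": {
        "enabled": True,
        "targets": [
            {
                "name": "overall_survival",
                "time_column": "time_to_death",
                "event_column": "death_indicator",
            }
        ],
    },
}

# Made-up CSV that lacks two of the configured columns.
csv_text = "age,sex,mortality_30d,time_to_death\n63,F,0,412\n"
header = next(csv.reader(io.StringIO(csv_text)))
missing = missing_columns(config, header)
```

Here `missing` lists `bmi` and `death_indicator`, the two configured columns that the made-up CSV does not contain. Catching such mismatches before a run is cheaper than failing partway through the pipeline.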

Configuration profiles

Profiles set sensible defaults for common use-cases. Generate one with phenocluster create-config -p <profile>:

| Profile | Description | Inference | Stability | Multistate |
| --- | --- | --- | --- | --- |
| descriptive | Phenotype discovery only, no statistical inference | off | on | off |
| complete | All analyses enabled (outcomes, survival, multistate) | on | on | on |
| quick | Fast iteration for development | on | off | off |

Configuration reference

See the full Configuration Reference in the documentation.

Documentation

Full documentation (statistical methods, configuration reference, output descriptions) is available at ettorerocchi.github.io/PhenoCluster.

Testing

pip install -e ".[dev]"
pytest tests/ -v

License

This project is licensed under the MIT License.

Citation

If you use PhenoCluster in your research, please cite:


Acknowledgment

This project relies on StepMix, a Python package for pseudo-likelihood estimation of generalized mixture models with external variables. We thank the authors for making their work openly available.

If you use this framework, please also cite:

Morin, S., Legault, R., Laliberté, F., Bakk, Z., Giguère, C.-É., de la Sablonnière, R., & Lacourse, É. (2025). StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables. Journal of Statistical Software, 113(8), 1-39. doi: 10.18637/jss.v113.i08

