Stability-based relative clustering validation algorithm for neuroimaging data
NeuReval
A stability-based relative clustering validation method to determine the best number of clusters based on neuroimaging data.
1. Project overview
NeuReval implements a stability-based relative clustering validation approach within a cross-validation framework to identify the clustering solution that best replicates on unseen data. Compared with commonly used internal measures, which rely on the inherent characteristics of the data, this approach has the advantage of identifying clusters that are robust and reproducible in other samples of the same population. NeuReval is based on the reval Python package (https://github.com/IIT-LAND/reval_clustering) and extends its application to neuroimaging data. For more details about the theoretical background of reval, please see Landi et al. (2021).
This package allows you to:
- Select any classification algorithm from the sklearn library;
- Select a clustering algorithm with n_clusters parameter (i.e., KMeans, AgglomerativeClustering, and SpectralClustering), Gaussian Mixture Models with n_components parameter, and HDBSCAN density-based algorithm;
- Perform (repeated) k-fold cross-validation to determine the best number of clusters;
- Test the final model on a held-out dataset.
The following changes were made to reval to adapt it to neuroimaging data:
- Standardization and covariate adjustment within cross-validation;
- Combination of different kinds of neuroimaging data, with a different set of covariates applied to each neuroimaging modality;
- Implementation of data reduction techniques (e.g., PCA, UMAP) and optimization of their parameters within cross-validation.
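The point of adjusting for covariates inside cross-validation is that the correction is estimated on the training fold only and then applied unchanged to the validation fold, so no information leaks across the split. The following is a minimal sketch of that idea using plain NumPy and ordinary least squares. The function names `fit_adjustment` and `apply_adjustment` and the synthetic data are hypothetical, and NeuReval's internal implementation may differ.

```python
import numpy as np

def fit_adjustment(features, covariates):
    """Estimate standardization and covariate effects on the training fold only."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    z = (features - mean) / std
    # Design matrix: intercept plus covariates
    design = np.column_stack([np.ones(len(covariates)), covariates])
    beta, *_ = np.linalg.lstsq(design, z, rcond=None)
    return mean, std, beta

def apply_adjustment(features, covariates, mean, std, beta):
    """Apply training-fold parameters to any fold and return the residuals."""
    z = (features - mean) / std
    design = np.column_stack([np.ones(len(covariates)), covariates])
    return z - design @ beta

# Synthetic data: 3 features that depend on 2 covariates (illustrative only)
rng = np.random.default_rng(0)
cov = rng.normal(size=(100, 2))
weights = np.array([[1.0, 0.5, 0.0], [0.0, 2.0, -1.0]])
x = cov @ weights + rng.normal(size=(100, 3))

train, test = np.arange(70), np.arange(70, 100)
params = fit_adjustment(x[train], cov[train])
x_train_adj = apply_adjustment(x[train], cov[train], *params)
x_test_adj = apply_adjustment(x[test], cov[test], *params)
```

After the adjustment, the training residuals are uncorrelated with the covariates by construction, while the validation residuals depend only on parameters learned from the training fold.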
2. Installation and Requirements
Work in progress
3. How to use NeuReval
i. Input structure
NeuReval requires input features and covariates to be organized as Excel files in the following way:
For the database with input features (database.xlsx):
- First column: subject ID
- Second column: diagnosis (e.g., patients=1, healthy controls=0). If NeuReval is run on a single diagnostic group, provide a constant value for all subjects.
- From the third column: features
Notes: if you want to combine these features with tract-based features extracted from diffusion tensor imaging (DTI), please add the DTI features after all the other neuroimaging features. The first DTI feature should be "ACR".
Example of database structure for input features:
Subject_ID | Diagnosis | Feature_01 | Feature_02 |
---|---|---|---|
sub_0001 | 0 | 0.26649221 | 2.13888054 |
sub_0002 | 1 | 0.32667590 | 0.67116539 |
sub_0003 | 0 | 0.35406757 | 2.35572978 |
For the database with covariates (covariates.xlsx):
- First column: subject ID
- Second column: diagnosis (e.g., patients=1, healthy controls=0). If NeuReval is run on a single diagnostic group, provide a constant value for all subjects.
- From the third column: covariates
Notes: if you also want to correct neuroimaging features for total intracranial volume (TIV), please add it as the last column of the database.
Example of database structure for covariates:
Subject_ID | Diagnosis | Age | Sex | TIV |
---|---|---|---|---|
sub_0001 | 0 | 54 | 0 | 1213.76 |
sub_0002 | 1 | 37 | 1 | 1372.93 |
sub_0003 | 0 | 43 | 0 | 1285.88 |
Templates for both datasets are provided in the folder NeuReval/example_data.
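As a quick sanity check before running an analysis, the two tables can be compared to confirm that they describe the same subjects in the same order. The sketch below builds the example tables in memory with pandas; in practice they would presumably be loaded with `pandas.read_excel`. The helper `check_layout` is hypothetical and is not part of NeuReval.

```python
import pandas as pd

# Toy tables mirroring the layout described above (values from the examples)
database = pd.DataFrame({
    "Subject_ID": ["sub_0001", "sub_0002", "sub_0003"],
    "Diagnosis": [0, 1, 0],
    "Feature_01": [0.26649221, 0.32667590, 0.35406757],
    "Feature_02": [2.13888054, 0.67116539, 2.35572978],
})
covariates = pd.DataFrame({
    "Subject_ID": ["sub_0001", "sub_0002", "sub_0003"],
    "Diagnosis": [0, 1, 0],
    "Age": [54, 37, 43],
    "Sex": [0, 1, 0],
    "TIV": [1213.76, 1372.93, 1285.88],
})

def check_layout(database, covariates):
    """Hypothetical helper: verify both tables share IDs and diagnosis columns."""
    if not database.iloc[:, 0].equals(covariates.iloc[:, 0]):
        raise ValueError("Subject IDs differ between features and covariates")
    if not database.iloc[:, 1].equals(covariates.iloc[:, 1]):
        raise ValueError("Diagnosis columns differ between the two tables")
    # Everything from the third column onwards is features / covariates
    return database.iloc[:, 2:], covariates.iloc[:, 2:]

features, confounds = check_layout(database, covariates)
print(list(features.columns), list(confounds.columns))
```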
ii. Grid-search cross-validation for parameter tuning
First, the parameters of the chosen classifier, clustering, and preprocessing algorithms can be optimized through a grid-search cross-validation. This can be done with the ParamSelectionConfounds class:
ParamSelectionConfounds(params, cv, s, c, preprocessing, nrand=10, n_jobs=-1, iter_cv=1, strat=None, clust_range=None, combined_data=False)
Parameters to be specified:
- params: dictionary of dictionaries of the form {'s': {classifier parameter grid}, 'c': {clustering parameter grid}}, listing the classifier and clustering parameter values to fit to the data. If you also want to optimize preprocessing parameters (e.g., PCA or UMAP components), add {'preprocessing': {preprocessing parameter grid}} to the dictionary.
- cv: cross-validation folds
- s: classifier object
- c: clustering object
- preprocessing: data reduction algorithm object
- nrand: number of random labelling iterations, default 10
- n_jobs: number of jobs to run in parallel, default (number of cpus - 1)
- iter_cv: number of repeated cross-validation, default 1
- clust_range: list with number of clusters, default None
- strat: stratification vector for cross-validation splits, default None
- combined_data: define whether multimodal data are used as input features. If True, different sets of covariates will be applied for each modality (e.g. correction for TIV only for grey matter features). Default False
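To make the shape of the params argument concrete, here is a small sketch of a grid for a Support Vector Machine classifier, a Gaussian Mixture Model, and UMAP, together with a helper that enumerates the combinations such a grid implies. The parameter values are illustrative only, the helper `expand_grid` is hypothetical, and how NeuReval iterates over the grid internally is an assumption.

```python
from itertools import product

# Illustrative grids; keys are real parameter names of SVC, GaussianMixture and UMAP
params = {
    "s": {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    "c": {"covariance_type": ["full", "diag"]},
    "preprocessing": {"n_components": [2, 5], "n_neighbors": [15, 30]},
}

def expand_grid(grid):
    """Hypothetical helper: list every combination of one parameter grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

# Cartesian product across classifier, clustering and preprocessing grids
combinations = [
    {"s": s, "c": c, "preprocessing": p}
    for s, c, p in product(*(expand_grid(params[k]) for k in ("s", "c", "preprocessing")))
]
print(len(combinations))  # 6 * 2 * 4 = 48
```

Because the number of combinations grows multiplicatively, and each one is evaluated with cross-validation, it is worth keeping the grids small at first.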
Once the ParamSelectionConfounds class is initialized, the fit(data_tr, cov_tr, nclass=None) method can be used to run grid-search cross-validation.
It returns the optimal number of clusters (i.e., minimum normalized stability), the corresponding normalized stability, and the selected classifier/clustering/preprocessing parameters.
iii. Run NeuReval with optimized clustering/classifier/preprocessing algorithms
After selecting the best clustering/classifier/preprocessing parameters through grid-search cross-validation, we can initialize the FindBestClustCVConfounds class to assess the normalized stability associated with the best clustering solution and the corresponding cluster labels:
FindBestClustCVConfounds(s, c, preprocessing=None, nrand=10, nfold=2, n_jobs=-1, nclust_range=None)
Parameters to be specified:
- s: classifier object (with optimized parameters)
- c: clustering object (with optimized parameters)
- preprocessing: data reduction algorithm object (with optimized parameters), default None
- nrand: number of random labelling iterations, default 10
- nfold: number of cross-validation folds, default 2
- n_jobs: number of jobs to run in parallel, default (number of cpus - 1)
- nclust_range: list with number of clusters, default None
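The nfold parameter, together with the iter_cv and strat_vect arguments of best_nclust_confounds, describes a repeated, stratified k-fold scheme. The sketch below illustrates what such a scheme looks like using scikit-learn's `RepeatedStratifiedKFold` on synthetic data. That NeuReval uses this particular splitter internally is an assumption; the example only shows how many splits the defaults imply and that each fold preserves the class proportions.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
data = rng.normal(size=(40, 5))              # synthetic features
diagnosis = np.array([0] * 20 + [1] * 20)    # stratification vector

# 2 folds repeated 10 times, mirroring nfold=2 and iter_cv=10
splitter = RepeatedStratifiedKFold(n_splits=2, n_repeats=10, random_state=42)
splits = list(splitter.split(data, diagnosis))

for train_idx, val_idx in splits[:2]:
    # Each validation fold keeps the 50/50 balance of the two groups
    print(len(train_idx), len(val_idx), diagnosis[val_idx].mean())
```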
Once the class has been initialized, the best_nclust_confounds(data, covariates, iter_cv=10, strat_vect=None, combined_data=False) method can be used to obtain the normalized stability, the number of clusters associated with the optimal clustering solution, and the cluster labels. It returns:
- metrics: normalized stability
- bestncl: best number of clusters
- tr_lab: clusters' labels
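To give an intuition for what a normalized stability score measures, here is a simplified sketch of the general idea behind stability-based validation for a single train/validation split. It assumes that the score is the misclassification error between classifier predictions and validation cluster labels, divided by the average error obtained when the classifier is trained on randomly shuffled labels. KMeans and k-nearest neighbours are used as stand-ins, covariate adjustment is omitted, and the functions `matched_error` and `normalized_stability` are hypothetical; this is not NeuReval's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def matched_error(true_labels, pred_labels):
    """Misclassification rate after the best one-to-one relabelling."""
    classes = np.union1d(true_labels, pred_labels)
    # contingency[i, j]: points with true class i and predicted class j
    contingency = np.array([
        [np.sum((true_labels == a) & (pred_labels == b)) for b in classes]
        for a in classes
    ])
    rows, cols = linear_sum_assignment(-contingency)  # maximise agreement
    return 1.0 - contingency[rows, cols].sum() / len(true_labels)

def normalized_stability(x_train, x_val, n_clusters, nrand=10, seed=0):
    rng = np.random.default_rng(seed)
    km = dict(n_clusters=n_clusters, n_init=10, random_state=seed)
    train_labels = KMeans(**km).fit_predict(x_train)
    val_labels = KMeans(**km).fit_predict(x_val)
    clf = KNeighborsClassifier(n_neighbors=5)
    # Error of a classifier trained on the training-set clustering
    error = matched_error(val_labels, clf.fit(x_train, train_labels).predict(x_val))
    # Reference error with randomly shuffled training labels
    random_errors = [
        matched_error(val_labels,
                      clf.fit(x_train, rng.permutation(train_labels)).predict(x_val))
        for _ in range(nrand)
    ]
    return error / np.mean(random_errors)

# Three well-separated synthetic clusters
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [10, 0], [0, 10]])
x = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in centers])
rng.shuffle(x)
x_train, x_val = x[:90], x[90:]

scores = {k: normalized_stability(x_train, x_val, k) for k in (2, 3, 4)}
print(scores)
```

Lower values indicate a clustering that transfers better from the training to the validation set, which matches the use of the minimum normalized stability as the selection criterion.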
iv. Compute internal measures
Together with normalized stability, NeuReval also allows you to compute internal measures, so that stability-based relative validation can be compared with internal validation approaches. This can be done with the neureval.internal_baselines_confounds module, whose select_best function selects the number of clusters that maximizes or minimizes the chosen internal measure:
neureval.internal_baselines_confounds.select_best(data, covariates, c, int_measure, preprocessing=None,
select='max', nclust_range=None, combined_data=False)
Parameters to be specified:
- data: features dataset
- covariates: covariates dataset
- c: clustering algorithm class (with optimized parameters)
- int_measure: internal measure function (e.g., silhouette score, Davies-Bouldin score)
- preprocessing: data reduction algorithm object (with optimized parameters), default None
- select: it can be 'min' if the internal measure is to be minimized, or 'max' if the internal measure should be maximized
- nclust_range: range of clusters to consider, default None
- combined_data: define whether multimodal data are used as input features. If True, different sets of covariates will be applied for each modality (e.g. correction for TIV only for grey matter features). Default False
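The choice between 'max' and 'min' depends on the measure: the silhouette score is better when higher, whereas the Davies-Bouldin score is better when lower. The sketch below shows this selection logic with scikit-learn metrics on synthetic data. The function `best_by_internal_measure` is a hypothetical stand-in with KMeans hard-coded and without covariate adjustment or preprocessing; it is not the NeuReval select_best function.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

def best_by_internal_measure(x, int_measure, select, nclust_range):
    """Hypothetical helper: score each k and pick the max or min."""
    scores = {}
    for k in nclust_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
        scores[k] = int_measure(x, labels)
    chooser = max if select == "max" else min
    return chooser(scores, key=scores.get), scores

# Three well-separated synthetic clusters
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
x = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in centers])

best_sil, sil_scores = best_by_internal_measure(x, silhouette_score, "max", range(2, 6))
best_db, db_scores = best_by_internal_measure(x, davies_bouldin_score, "min", range(2, 6))
print(best_sil, best_db)
```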
Notes: if a Gaussian Mixture Model was used as the clustering algorithm, the select_best_bic_aic function can be used to compute the Akaike and Bayesian Information Criteria (AIC, BIC) and use them for model selection.
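For reference, AIC and BIC both trade off goodness of fit against model complexity, and lower values are preferred. The sketch below computes them directly with scikit-learn's `GaussianMixture.aic` and `GaussianMixture.bic` on synthetic data. It only illustrates the criteria themselves and makes no claim about how select_best_bic_aic is implemented or called.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic clusters
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
x = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

bic, aic = {}, {}
for n_components in range(1, 6):
    gmm = GaussianMixture(n_components=n_components, n_init=5, random_state=0).fit(x)
    bic[n_components] = gmm.bic(x)
    aic[n_components] = gmm.aic(x)

# The preferred model is the one with the lowest criterion value
best_bic = min(bic, key=bic.get)
best_aic = min(aic, key=aic.get)
print(best_bic, best_aic)
```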
4. Example
An example of how to run NeuReval can be found in the folder NeuReval/scripts. These scripts show the application of NeuReval using a Gaussian Mixture Model as the clustering algorithm, a Support Vector Machine as the classifier, and UMAP as the dimensionality reduction algorithm:
- 01_grid_search: code to perform grid-search cross-validation for tuning the clustering/classifier/preprocessing parameters
- 02_run_findbestclustcv: code to run NeuReval with the optimized clustering/classifier/preprocessing algorithms. This script also provides code to compute different kinds of internal measures
- 03_visualization: code to create a plot for clusters' representation
5. References
Landi, I., Mandelli, V., & Lombardo, M. V. (2021). reval: A Python package to determine best clustering solutions with stability-based relative clustering validation. Patterns, 2(4), 100228.