
Bayesian optimized parameter selection for density-based clustering algorithms


DBOpt

DBOpt is a Python program enabling reproducible and robust parameter selection for density-based clustering algorithms. The method combines an efficient implementation of density-based cluster validation (DBCV) with Bayesian optimization to find the clustering algorithm parameters that maximize the DBCV score. DBOpt is currently compatible with the density-based clustering algorithms DBSCAN, HDBSCAN, and OPTICS. For more information about the DBOpt method, see Hammer et al., preprint at https://www.biorxiv.org/content/10.1101/2024.11.01.621498v1 (2024).

Getting Started

Dependencies

  • k-DBCV
  • BayesianOptimization
  • scikit-learn
  • NumPy

Installation

DBOpt can be installed via pip:

pip install DBOpt

Usage

The DBOpt class is initialized by setting hyperparameters for the optimization. These include the algorithm to be optimized, the number of optimization iterations (runs), the number of initial parameter combinations to probe (rand_n), and the parameter space to be optimized. Each algorithm has its own set of parameters that can be optimized. More information about these parameters can be found in the corresponding scikit-learn documentation.
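
To make the role of rand_n more concrete, the sketch below shows only an initial random probing stage on a toy objective. It is not DBOpt's implementation: toy_score is a hypothetical stand-in for the DBCV score, random_probe is a hypothetical helper, and the later Bayesian-guided iterations are omitted entirely.

```python
import random

def toy_score(eps, min_samples):
    # Hypothetical stand-in for a DBCV score; it peaks at eps=50, min_samples=10.
    return -((eps - 50) / 200) ** 2 - ((min_samples - 10) / 200) ** 2

def random_probe(bounds, rand_n, seed=0):
    # Draw rand_n parameter combinations uniformly from the bounds
    # and return the best (score, params) pair seen.
    rng = random.Random(seed)
    best = None
    for _ in range(rand_n):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        score = toy_score(**params)
        if best is None or score > best[0]:
            best = (score, params)
    return best

bounds = {"eps": (3, 200), "min_samples": (3, 200)}
best_score, best_params = random_probe(bounds, rand_n=40)
```

In a Bayesian optimization, probes like these typically seed a surrogate model that proposes the remaining parameter combinations.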

DBOpt-DBSCAN

For DBSCAN, the relevant parameters are eps and min_samples. Bounds for one or both of these parameters must be set.

import DBOpt

model = DBOpt.DBOpt(algorithm = 'DBSCAN', runs = 200, rand_n = 40,
                    eps = [3,200], min_samples = [3,200])

Parameters can be held constant:

model = DBOpt.DBOpt(algorithm = 'DBSCAN', runs = 200, rand_n = 40,
                    eps = [4,200], min_samples = 6)
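
The two examples above follow one convention: a two-element list is a range to search, and a single value is held constant. The sketch below illustrates that convention only. split_parameters is a hypothetical helper, not part of the DBOpt API, and DBOpt may handle its arguments differently internally.

```python
def split_parameters(**params):
    # Separate [low, high] bounds from single constant values.
    bounds, constants = {}, {}
    for name, value in params.items():
        if isinstance(value, (list, tuple)) and len(value) == 2:
            bounds[name] = (value[0], value[1])
        else:
            constants[name] = value
    return bounds, constants

bounds, constants = split_parameters(eps=[4, 200], min_samples=6)
```

Here eps ends up in bounds as a range to optimize, while min_samples ends up in constants.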

DBOpt-HDBSCAN

HDBSCAN has two primary parameters, min_cluster_size and min_samples.

model = DBOpt.DBOpt(algorithm = 'HDBSCAN', runs = 200, rand_n = 40,
                    min_cluster_size = [4,200], min_samples = [4,200])

DBOpt can also optimize additional HDBSCAN parameters, including cluster_selection_epsilon, cluster_selection_method, and alpha. When the parameter spaces differ greatly in size, as they do here, it can help to scale all parameters to the same range by setting scale_params = True. scale_params is False by default.

model = DBOpt.DBOpt(algorithm = 'HDBSCAN',  runs = 200, rand_n = 40,
                    min_cluster_size = [4,200], min_samples = [4,200], eps = [0,200], method = [0,1], alpha = [0,1],
                    scale_params = True)
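
To see why scaling can help, compare a range such as [4,200] with [0,1]: the same step size covers very different fractions of each range. One common approach, sketched below, is min-max scaling each parameter to the unit interval. This is an assumption about how such scaling could work, not a description of what scale_params does inside DBOpt; to_unit and from_unit are hypothetical helpers.

```python
def to_unit(value, low, high):
    # Map a value from [low, high] onto [0, 1].
    return (value - low) / (high - low)

def from_unit(u, low, high):
    # Map a value from [0, 1] back onto [low, high].
    return low + u * (high - low)

bounds = {"min_cluster_size": (4, 200), "alpha": (0, 1)}

# A point chosen in the scaled space, halfway along every axis.
unit_point = {"min_cluster_size": 0.5, "alpha": 0.5}

# Convert back to real parameter values before running the clustering.
real_point = {name: from_unit(u, *bounds[name]) for name, u in unit_point.items()}
```

With this mapping, a step of 0.1 in the scaled space moves every parameter by one tenth of its own range.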

DBOpt-OPTICS

OPTICS can currently be optimized with the xi method.

model = DBOpt.DBOpt(algorithm = 'OPTICS', runs = 200, rand_n = 40,
                    xi = [0.05,0.5], min_samples = [4,200])

Optimizing the parameters

Importing Data

The data can be multidimensional coordinates. Here we use the C01 simulation from the data folder.

We create a 2D array X with x positions in column 0 and y positions in column 1.
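
A minimal sketch of building such an array from comma-separated text is shown below. The column names x and y and the inline sample values are assumptions made for illustration; the actual format of the C01 simulation file is not described here.

```python
import csv
import io

# Hypothetical localization table; in practice this text would come from a file.
csv_text = "x,y\n10.5,20.0\n11.0,19.5\n300.2,120.7\n"

reader = csv.DictReader(io.StringIO(csv_text))

# One row per point: x position in column 0, y position in column 1.
X = [[float(row["x"]), float(row["y"])] for row in reader]
```

Passing the result to numpy.asarray(X) gives a NumPy array of shape (3, 2).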

Optimizing parameters for the data

Once hyperparameters have been set, the algorithm can be optimized for the data.

model.optimize(X)

Information about the chosen parameters and the full parameter sweep can be extracted after optimizing.

parameter_sweep_arr = model.parameter_sweep_
DBOpt_selected_parameters = model.parameters_

The optimization can be plotted:

parameter_sweep_plot = model.plot_optimization()

Clustering

The data is clustered via the fit function.

model.fit(X)

The optimization step and fit step can be performed together:

model.optimize_fit(X)

After fitting, the labels and DBCV score can be retrieved:

labels = model.labels_
DBCV_score = model.DBCV_score_
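
The labels can then be summarized, for example by counting clusters and noise points. The sketch below assumes the labels follow the scikit-learn convention in which noise points are labeled -1; summarize_labels is a hypothetical helper and example_labels is made-up data.

```python
from collections import Counter

def summarize_labels(labels, noise_label=-1):
    # Count points per label, then separate the noise count from the clusters.
    counts = Counter(labels)
    n_noise = counts.pop(noise_label, 0)
    return {"n_clusters": len(counts), "n_noise": n_noise, "sizes": dict(counts)}

example_labels = [0, 0, 1, -1, 1, 1, -1, 2]
summary = summarize_labels(example_labels)
```

For a NumPy label array, labels.tolist() can be passed instead of the array itself.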

The clusters can be plotted with plot_clusters. show_noise determines whether noise points are shown (default: True). Setting ind_cluster_scores = True colors each cluster by its individual cluster score instead of a random color (default: False):

cluster_plot = model.plot_clusters()

cluster_plot_modified = model.plot_clusters(show_noise = True, ind_cluster_scores = True)

License

DBOpt is licensed under the MIT license. See the LICENSE file for more information.

Referencing

If you use DBOpt for your work, cite with the following (currently in preprint):

Hammer, J. L., Devanny, A. J. & Kaufman, L. J. Density-based optimization for unbiased, reproducible clustering applied to single molecule localization microscopy. Preprint at https://www.biorxiv.org/content/10.1101/2024.11.01.621498v1 (2024)

Contact

kaufmangroup.rubylab@gmail.com
