
Hyperparameter optimization for multiple clustering algorithms using Optuna, with Scikit-learn API


optuclust

optuclust is a Python module for optimizing clustering algorithms using the Optuna framework. It provides a scikit-learn compatible API with support for a variety of clustering methods and offers additional capabilities such as the calculation of centroids, medoids, and modes for clusters.


Features

  • Parameter Optimization: Optimize clustering parameters for various algorithms using Optuna.
  • Supported Clustering Methods:
    • Algorithms from scikit-learn, such as KMeans, DBSCAN, and Agglomerative Clustering.
    • Advanced methods like HDBSCAN, Self-Organizing Maps (SOM), and kMedoids.
  • Metrics and Scoring:
    • silhouette_score
    • calinski_harabasz_score
    • davies_bouldin_score (automatically minimized)
    • Noise points (label=-1) are filtered out before score computation for density-based algorithms.
  • Clustering Insights: Provides centroids (arithmetic mean), medoids (Euclidean), and modes (KDE with Scott's bandwidth) for clusters, even if the algorithm does not natively support these features. All descriptors are computed eagerly during fit() and work in any number of dimensions.
  • Scikit-learn Compatible: Inherits from BaseEstimator and ClusterMixin. Works with clone(), check_is_fitted(), and scikit-learn pipelines.
  • ClustGridSearch Class: A utility to test all clustering algorithms and identify the best one.
  • Timeout Management: Separate timeouts for optimization runs (timeout) and individual trials (trial_timeout).
  • Storage and Resume: Store optimization results in a SQLite database for future analysis, and resume the optimization process later.
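For intuition, the three cluster descriptors follow standard definitions (arithmetic mean, Euclidean medoid, KDE mode with Scott's rule), with noise points (label -1) filtered out first. A minimal NumPy/SciPy sketch of those definitions — illustrative only, not optuclust's internals:

```python
import numpy as np
from scipy.stats import gaussian_kde

def cluster_descriptors(X, labels):
    """Per-cluster centroid, medoid, and mode, ignoring noise (label == -1)."""
    descriptors = {}
    for k in sorted(set(labels) - {-1}):
        pts = X[labels == k]
        centroid = pts.mean(axis=0)                       # arithmetic mean
        dists = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
        medoid = pts[dists.sum(axis=1).argmin()]          # most central point
        kde = gaussian_kde(pts.T)                         # Scott's rule is the default
        mode = pts[kde(pts.T).argmax()]                   # highest-density point
        descriptors[k] = (centroid, medoid, mode)
    return descriptors

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.2],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],
              [9.0, 9.0]])
labels = np.array([0, 0, 0, 1, 1, 1, -1])  # last point is noise
desc = cluster_descriptors(X, labels)
```

Because the medoid and mode are selected from the actual samples, both are guaranteed to be real data points, which is what makes them useful for algorithms that have no native notion of a cluster center.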

Installation

  1. Clone this repository:

    git clone git@github.com:filipsPL/optuclust.git
    
  2. Navigate to the cloned directory and install the required dependencies:

    cd optuclust
    pip install -r requirements.txt
    
  3. Install optuclust:

    pip install .
    
Alternatively, install the released package directly from PyPI:

    pip install optuclust
    

Requires: Python >= 3.8, scikit-learn >= 1.1

Usage

1. Optimizing a Clustering Algorithm

from optuclust import Optimizer
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

# Instantiate and fit the optimizer for KMeans
optimizer = Optimizer(algorithm="kmeans", n_trials=50, scoring="silhouette_score", verbose=True)
optimizer.fit(X)

# Access cluster details
print("Cluster Labels:", optimizer.labels_)
print("Centroids:", optimizer.centroids_)
print("Medoids:", optimizer.medoids_)
print("Modes:", optimizer.modes_)
print("Cluster Centers (native):", optimizer.cluster_centers_)
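Under the hood, each Optuna trial fits the algorithm with sampled hyperparameters and scores the resulting labels with the chosen metric. A stripped-down, hand-rolled equivalent for KMeans — plain grid instead of Optuna's sampler, scikit-learn only — looks like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

best_score, best_k = -1.0, None
for k in range(2, 9):                       # the "search space" for n_clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)     # metric the optimizer maximizes
    if score > best_score:
        best_score, best_k = score, k

print(f"best n_clusters={best_k}, silhouette={best_score:.3f}")
```

Optimizer automates the same loop, but replaces exhaustive enumeration with Optuna's adaptive sampler and handles multi-parameter spaces, timeouts, and persistence.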

2. ClustGridSearch

from optuclust import ClustGridSearch
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

# Initialize ClustGridSearch to test all algorithms
grid_search = ClustGridSearch(mode="full", scoring="silhouette_score", verbose=True)

# Fit and get the best method
grid_search.fit(X)
print("Best Algorithm:", grid_search.best_estimator_.algorithm)
print("Best Score:", grid_search.best_score_)
print("Best Parameters:", grid_search.best_params_)
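Conceptually, ClustGridSearch runs one Optimizer per algorithm and keeps the winner. A minimal sklearn-only analogue — with fixed parameters instead of per-algorithm optimization, and with DBSCAN noise dropped before scoring, as described under Features — is sketched below:

```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=42)

candidates = {
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=42),
    "agglomerativeclustering": AgglomerativeClustering(n_clusters=4),
    "dbscan": DBSCAN(eps=1.0, min_samples=5),
}

best_name, best_score = None, -1.0
for name, est in candidates.items():
    labels = est.fit_predict(X)
    mask = labels != -1                     # drop noise before scoring
    if len(set(labels[mask])) < 2:
        continue                            # silhouette needs >= 2 clusters
    score = silhouette_score(X[mask], labels[mask])
    if score > best_score:
        best_name, best_score = name, score
```

The real class additionally tunes each algorithm's hyperparameters with Optuna before comparing, so the comparison is between each algorithm's best configuration rather than an arbitrary fixed one.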

3. Benchmark Example

To benchmark different clustering algorithms, you can use the provided example script:

python example-loop.py

The benchmark will evaluate different clustering methods on various datasets and save the performance metrics and plots.

Supported Algorithms

algorithms = [
    'kmeans', 'kmedoids', 'minibatchkmeans', 'dbscan', 'agglomerativeclustering',
    'meanshift', 'spectralclustering', 'gaussianmixture', 'hdbscan',
    'affinitypropagation', 'birch', 'optics', 'som'
]

Note: Not all algorithms support predict() on new data. Algorithms with inductive prediction: kmeans, minibatchkmeans, meanshift, birch, gaussianmixture, kmedoids, som. Calling predict() on other algorithms (e.g. dbscan, hdbscan) will raise a TypeError.
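This mirrors scikit-learn itself, where only inductive estimators expose predict() while transductive ones (DBSCAN, OPTICS, etc.) only provide labels_ for the training data. A quick check, independent of optuclust:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.predict([[0.0, 0.0]]))     # inductive: can label unseen points

db = DBSCAN(eps=0.5).fit(X)
print(hasattr(db, "predict"))       # False: transductive, labels_ only
```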

Parameters

Optimizer Class

  • algorithm: The clustering algorithm to optimize. Options include those listed in Supported Algorithms.
  • n_trials: Number of Optuna trials for optimization. Default is 50.
  • scoring: The metric to optimize. Options are silhouette_score, calinski_harabasz_score, and davies_bouldin_score.
  • verbose: Enable additional logging if set to True. Can also be an int to set Optuna's verbosity level directly.
  • show_progress_bar: Display a progress bar during optimization. Default is True.
  • timeout: Maximum duration (in seconds) for all trials in the optimization process.
  • trial_timeout: Maximum duration (in seconds) for each individual trial (Unix only, uses SIGALRM).
  • storage: Optuna storage URI, e.g. sqlite:///optimization.db. When provided, enables resuming a previous optimization run.
  • logfile: Reserved for future use.

Fitted Attributes

After calling fit(X):

  • labels_: Cluster labels for each sample.
  • best_params_: Dictionary of the best hyperparameters found.
  • model_: The fitted clustering model with the best parameters.
  • study_: The Optuna Study object with full trial history.
  • centroids_: Arithmetic mean of each cluster (excludes noise points).
  • medoids_: Most central data point in each cluster (Euclidean distance).
  • modes_: Highest density point in each cluster (KDE with Scott's rule bandwidth).
  • cluster_centers_: Native cluster centers from the model (if available), otherwise None.

ClustGridSearch Class

  • mode:
    • full: Test all algorithms.
    • fast: Test a subset of algorithms (kmeans and hdbscan).
  • n_trials: Number of Optuna trials for each algorithm. Default is 20.
  • scoring: Metric to select the best clustering algorithm. Options are silhouette_score, calinski_harabasz_score, and davies_bouldin_score.
  • verbose: Enable detailed logging if set to True.
  • show_progress_bar: Display a progress bar for each algorithm.

Running Tests

We use pytest for testing. To run the test suite:

pytest -v
