A package for clustering distributions

Project description

A Python package implementing the clustering algorithm proposed in the paper
"An Agglomerative Clustering Algorithm for Simulation Output Distributions Using Regularized Wasserstein Distance", accepted to the INFORMS Journal on Data Science. The preprint is available on arXiv:2407.12100.
This link will be updated once the final published version becomes available.

The package can be installed using

pip install --index-url https://pypi.org/simple/ --no-deps distclust==1.0.0

Main function

dict_clusters, dict_barycenters = cluster_distributions(
    dist_file,
    reg=0.5,
    n_clusters=None,
    calculate_barycenter=False,
    stop_threshold=10 ** -9,
    num_of_iterations=1000,
    plt_dendrogram=True,
    path_dendrogram=None,
    sup_barycenter=100,
    t0=0.005,
    theta=0.005,
)

Description

This function performs hierarchical (agglomerative) clustering of empirical probability distributions using the regularized (entropic) Wasserstein distance.
It takes the distributions as a JSON-style dictionary (see dist_file below), computes all pairwise regularized Wasserstein distances, and then performs agglomerative clustering on the resulting distance matrix.

  • Returns one dictionary with each distribution and its assigned cluster.
  • If calculate_barycenter=True, it also computes barycenters of each cluster and returns a second dictionary with the barycenters.
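The distance step described above can be illustrated with a minimal, pure-Python sketch of Sinkhorn iterations. It assumes 1-D samples with uniform weights and a squared-difference cost; the helper names sinkhorn_cost and pairwise_costs are hypothetical and are not part of distclust, and the stopping rule shown here is an assumption rather than the package's actual criterion. The parameter names reg, stop_threshold, and num_of_iterations are borrowed from the signature above for readability only.

```python
import math


def sinkhorn_cost(x, y, reg=0.5, stop_threshold=1e-9, num_of_iterations=1000):
    """Entropic-regularized transport cost between two 1-D samples (uniform weights)."""
    n, m = len(x), len(y)
    a = [1.0 / n] * n
    b = [1.0 / m] * m
    # Squared-difference cost matrix and Gibbs kernel K = exp(-C / reg).
    C = [[(xi - yj) ** 2 for yj in y] for xi in x]
    K = [[math.exp(-c / reg) for c in row] for row in C]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_of_iterations):
        u_prev = u
        # Alternating scaling updates that push the plan's marginals toward a and b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        # Assumed stopping rule: the row scaling vector has stopped changing.
        if max(abs(ui - pi) for ui, pi in zip(u, u_prev)) < stop_threshold:
            break
    # Transport plan P = diag(u) K diag(v); the returned value is <P, C>.
    return sum(u[i] * K[i][j] * v[j] * C[i][j] for i in range(n) for j in range(m))


def pairwise_costs(samples, **kwargs):
    """Symmetric matrix of regularized costs; the diagonal is set to 0 by convention."""
    k = len(samples)
    D = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            D[i][j] = D[j][i] = sinkhorn_cost(samples[i], samples[j], **kwargs)
    return D


x = [0.0, 0.1, 0.2]
y_near = [0.05, 0.15, 0.25]
y_far = [2.0, 2.1, 2.2]

near = sinkhorn_cost(x, y_near)
far = sinkhorn_cost(x, y_far)
D = pairwise_costs([x, y_near, y_far])
print(near < far)
```

A matrix such as D is the kind of input an agglomerative procedure would then merge bottom-up. Very small values of reg relative to the cost scale can underflow the kernel in this naive form, so the sketch is only suitable for small, well-scaled examples.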

Function Parameters

  • dist_file (dict):
    A dictionary of distributions.
    Each key is a distribution number, mapped to a dictionary with:

    • "id": The identifier of the distribution.
    • "data_points": A list of tuples representing the data points.

    Example format: see the example on GitHub.

  • reg (float):
    Entropic regularization parameter for the Wasserstein distance. Must be positive.

  • n_clusters (int or None):
    Number of clusters to form. If None, the optimal number is chosen using the silhouette index.

  • calculate_barycenter (bool):
    If True, compute a regularized Wasserstein barycenter for each cluster.
    If False, only clustering results are returned.

  • stop_threshold (float):
    Convergence threshold for the Sinkhorn iterations.

  • num_of_iterations (int):
    Maximum number of iterations for each regularized Wasserstein distance computation.

  • plt_dendrogram (bool, optional; default True):
    If True, display a dendrogram of the hierarchical clustering. If path_dendrogram is provided, the plot is also saved to the specified path.

  • path_dendrogram (str or None, optional; default None):
    If provided, path to save the dendrogram plot. Ignored if plt_dendrogram=False.

  • sup_barycenter (int):
    Number of support points to initialize for barycenter computation.

  • t0 (float):
    Base step size for the barycenter probability vector (a) update.

  • theta (float):
    Relaxation parameter for the barycenter support (X) update.
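As a rough illustration of the dist_file layout described above, the following sketch packs raw samples into a nested dictionary. The helper build_dist_file, the integer keys starting at 0, and the use of 1-tuples for scalar observations are all assumptions made for this example; the exact layout should be checked against the example on GitHub. The commented-out call also assumes that cluster_distributions can be imported from the top level of distclust.

```python
import random


def build_dist_file(samples_by_id):
    """Pack {id: [points]} into a nested dictionary with "id" and "data_points" fields."""
    dist_file = {}
    for number, (dist_id, points) in enumerate(samples_by_id.items()):
        dist_file[number] = {
            "id": dist_id,
            # Data points are stored as tuples; scalar values become 1-tuples (assumption).
            "data_points": [
                tuple(p) if isinstance(p, (list, tuple)) else (p,) for p in points
            ],
        }
    return dist_file


rng = random.Random(0)  # fixed seed so the example is reproducible
samples = {
    "system_a": [rng.gauss(0.0, 1.0) for _ in range(5)],
    "system_b": [rng.gauss(3.0, 1.0) for _ in range(5)],
}
dist_file = build_dist_file(samples)

# Hypothetical call (import path assumed, requires distclust to be installed):
# from distclust import cluster_distributions
# dict_clusters, dict_barycenters = cluster_distributions(
#     dist_file, reg=0.5, n_clusters=2, plt_dendrogram=False
# )
```

Setting plt_dendrogram=False in the commented call is meant to keep the run non-interactive, following the parameter description above.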


Returns

If calculate_barycenter=False:

  • dict_clusters (dict): A dictionary with each distribution's ID, real data points, and assigned cluster label.

If calculate_barycenter=True:

  • dict_clusters (dict): A dictionary with each distribution's ID, real data points, and assigned cluster label.
  • dict_barycenters (dict): A dictionary with each cluster's barycenter, including unnormalized supports and probability masses.

If plt_dendrogram=True:

  • Displays the dendrogram plot.
  • If path_dendrogram is provided, saves the dendrogram as a PNG file to that path.

Other Functions in distclust

We also provide the following functions that might be useful to some users:

  1. density_calc – Compute empirical probability masses.
  2. density_calc_list – Batch probability mass computation.
  3. fill_ot_distance – Compute and store regularized Wasserstein distances between all systems (Cuturi and Doucet, 2014 [1]).
  4. plot_dendrogram – Dendrogram visualization.
  5. silhouette_score_agglomerative – Choose number of clusters.
  6. find_barycenter – Compute Wasserstein barycenter (Cuturi and Doucet, 2014 [1]).
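For readers unfamiliar with the silhouette index used to choose the number of clusters, here is a minimal sketch of the standard definition computed from a precomputed distance matrix. It is not the implementation of silhouette_score_agglomerative; the function name silhouette_from_distances and the toy data are hypothetical, and the sketch assumes at least two clusters.

```python
def silhouette_from_distances(dist, labels):
    """Mean silhouette width from a precomputed distance matrix (needs >= 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy example: four 1-D "distributions" summarized by positions 0, 1, 10, 11.
positions = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in positions] for p in positions]

candidates = {
    "split_by_gap": [0, 0, 1, 1],
    "interleaved": [0, 1, 0, 1],
}
scores = {name: silhouette_from_distances(dist, labs) for name, labs in candidates.items()}
best = max(scores, key=scores.get)
print(best)  # prints "split_by_gap"
```

In a setting like the one described here, the candidate labelings would come from cutting the same hierarchy at different numbers of clusters, and the cut with the highest mean silhouette would be kept.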

References

[1] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 685–693, 2014.
