A package for clustering distributions

Project description

A Python package implementing the clustering algorithm proposed in the paper
"An Agglomerative Clustering Algorithm for Simulation Output Distributions Using Regularized Wasserstein Distance",
accepted for publication in the INFORMS Journal on Data Science. A preprint is available as arXiv:2407.12100.
This reference will be updated once the final published version becomes available.

The package can be installed with pip:

pip install --index-url https://pypi.org/simple/ --no-deps distclust==0.0.4

Main function

cluster_distributions(
    dist_file,
    reg=0.5,
    n_clusters=None,
    calculate_barycenter=False,
    stop_threshold=10 ** -9,
    num_of_iterations=1000,
    plt_dendrogram=True,
    path_dendrogram=None,
    sup_barycenter=100,
    t0=0.005,
    theta=0.005,
):

Description

This function performs hierarchical (agglomerative) clustering of empirical probability distributions using the regularized (entropic) Wasserstein distance.
It takes a JSON-formatted string that encodes a set of distributions, computes all pairwise regularized Wasserstein distances, and then merges the distributions into clusters agglomeratively based on those distances.

  • Returns one JSON string with each distribution and its assigned cluster.
  • If calculate_barycenter=True, it also computes the barycenter of each cluster and returns a second JSON string with the barycenters.
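
To make the regularized (entropic) Wasserstein distance named in the description more concrete, here is a minimal pure-Python sketch of Sinkhorn iterations between two small samples. It is an illustration only, not the package's implementation. The function names (sq_euclidean, sinkhorn_cost), the uniform point masses, the squared Euclidean ground cost, and the marginal-error stopping rule are all assumptions; the keyword names reg, stop_threshold, and num_of_iterations simply mirror the parameters of cluster_distributions.

```python
import math


def sq_euclidean(x, y):
    """Squared Euclidean distance between two points (assumed ground cost)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def sinkhorn_cost(points_a, points_b, reg=0.5, stop_threshold=1e-9,
                  num_of_iterations=1000):
    """Transport cost under the entropy-regularized plan (hypothetical helper).

    Both samples get uniform masses. Very large costs relative to `reg`
    can underflow in exp(); this sketch does not guard against that.
    """
    n, m = len(points_a), len(points_b)
    a = [1.0 / n] * n
    b = [1.0 / m] * m
    cost = [[sq_euclidean(x, y) for y in points_b] for x in points_a]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_of_iterations):
        # Alternating scaling updates: v = b / (K^T u), then u = a / (K v).
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Stop once the column marginals of the plan match b closely enough.
        col_sums = [v[j] * sum(kernel[i][j] * u[i] for i in range(n))
                    for j in range(m)]
        if max(abs(col_sums[j] - b[j]) for j in range(m)) < stop_threshold:
            break
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return total, plan


near_a = [(0.0,), (1.0,)]
near_b = [(0.0,), (1.0,)]
far = [(3.0,), (4.0,)]
cost_near, plan_near = sinkhorn_cost(near_a, near_b)
cost_far, plan_far = sinkhorn_cost(near_a, far)
print("near:", cost_near, "far:", cost_far)
```

Because of the entropic term, the cost between two identical samples is small but not exactly zero; a smaller reg brings it closer to the unregularized value at the price of slower, less stable iterations.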

Function Parameters

  • dist_file (str):
    A JSON-formatted string containing a dictionary of distributions.
    Each key in the dictionary is a distribution number, mapped to another dictionary with:

    • "id": The identifier of the distribution.
    • "data_points": A list of tuples representing the data points.

    An example input file is available at https://github.com/mohammadmgh78/Agglomerative_Clustering_Distribution/blob/main/distclust/JSON_test.txt

  • reg (float):
    Entropic regularization parameter for the Wasserstein distance. Must be positive.

  • n_clusters (int or None):
    Number of clusters to form. If None, the optimal number is chosen using the silhouette index.

  • calculate_barycenter (bool):
    If True, compute a regularized Wasserstein barycenter for each cluster.
    If False, only clustering results are returned.

  • stop_threshold (float):
    Convergence threshold for the Sinkhorn iterations.

  • num_of_iterations (int):
    Maximum number of Sinkhorn iterations for each OT distance computation.

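  • plt_dendrogram (bool):
    If True, generate and display the dendrogram plot.

  • path_dendrogram (str or None):
    File path for saving the dendrogram. When plt_dendrogram=True and a path is provided, the dendrogram is also saved as a PNG file to that path.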

  • sup_barycenter (int):
    Number of support points to initialize for barycenter computation.

  • t0 (float):
    Base step size for the barycenter probability vector (a) update.

  • theta (float):
    Relaxation parameter for the barycenter support (X) update.
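
As an illustration of the dist_file layout described in the parameter list, the sketch below builds such a JSON string from synthetic one-dimensional samples using only the standard library. The helper name build_dist_json, the choice of keys starting at "0", the integer ids, and the Gaussian test data are assumptions made for this example; the linked JSON_test.txt remains the authoritative reference for the expected format.

```python
import json
import random


def build_dist_json(samples):
    """Encode samples as a JSON string (hypothetical helper, assumed layout).

    `samples` is a list of samples; each sample is a list of points and each
    point is a tuple of coordinates.
    """
    payload = {}
    for number, points in enumerate(samples):
        payload[str(number)] = {
            "id": number,
            # JSON has no tuple type, so each point becomes a JSON array.
            "data_points": [list(point) for point in points],
        }
    return json.dumps(payload)


random.seed(0)
# Four synthetic distributions: two centered near 0 and two centered near 5.
samples = [
    [(random.gauss(mu, 1.0),) for _ in range(50)]
    for mu in (0.0, 0.2, 5.0, 5.3)
]
dist_file = build_dist_json(samples)
print(dist_file[:80])
```

The resulting string would then be passed as the first argument, for example cluster_distributions(dist_file, n_clusters=2, plt_dendrogram=False). That call is not executed here because it requires the package to be installed.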


Returns

If calculate_barycenter=False:

  • json_clusters (str): JSON with each distribution's ID, real data points, and assigned cluster label.

If calculate_barycenter=True:

  • json_clusters (str): JSON with each distribution's ID, real data points, and assigned cluster label.
  • json_barycenters (str): JSON with each cluster's barycenter, including unnormalized supports and probability masses.

If plt_dendrogram=True:

  • Displays the dendrogram plot.
  • If path_dendrogram is provided, saves the dendrogram as a PNG file to that path.

Other Functions in distclust

We also provide the following functions that might be useful to some users:

  1. density_calc – Compute empirical probability masses.
  2. density_calc_list – Batch probability mass computation.
  3. fill_ot_distance – Compute and store regularized Wasserstein distances between all pairs of distributions (systems).
  4. plot_dendrogram – Dendrogram visualization.
  5. silhouette_score_agglomerative – Choose number of clusters.
  6. find_barycenter – Compute Wasserstein barycenter.
  7. calculate_OT_cost_bary – OT computation for barycenter step.
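
The idea of choosing the number of clusters with the silhouette index, which applies when n_clusters=None, can be sketched as follows. This is a pure-Python illustration on a precomputed distance matrix, not the code behind silhouette_score_agglomerative. The function names, the average-linkage merge rule, and the search range of 2 to n-1 clusters are assumptions, and a small hand-made matrix stands in for the pairwise regularized Wasserstein distances.

```python
def agglomerate(dist, n_clusters):
    """Average-linkage agglomerative clustering (linkage is an assumption)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = sum(dist[i][j] for i in clusters[p] for j in clusters[q])
                d /= len(clusters[p]) * len(clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]  # merge the closest pair
        del clusters[q]
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def silhouette(dist, labels):
    """Mean silhouette coefficient computed from precomputed distances."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in groups.items() if other != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_n_clusters(dist):
    """Return the cluster count in 2..n-1 with the highest mean silhouette."""
    best_k, best_score = None, float("-inf")
    for k in range(2, len(dist)):
        score = silhouette(dist, agglomerate(dist, k))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Stand-in distances: five items located at these positions on a line.
positions = [0.0, 1.0, 2.0, 50.0, 51.0]
dist = [[abs(x - y) for y in positions] for x in positions]
best_k, best_score = choose_n_clusters(dist)
labels = agglomerate(dist, best_k)
print(best_k, labels)
```

In this toy matrix the first three items are close to each other and far from the last two, so the silhouette criterion favors two clusters.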
