A package for clustering distributions
A Python package implementing the clustering algorithm proposed in the paper
"An Agglomerative Clustering Algorithm for Simulation Output Distributions Using Regularized Wasserstein Distance",
accepted in the INFORMS Journal on Data Science.
The preprint is available on arXiv:2407.12100.
This link will be updated once the final published version becomes available.
The package can be installed using
pip install --index-url https://pypi.org/simple/ --no-deps distclust==0.0.6
Main function
cluster_distributions(
dist_file,
reg=0.5,
n_clusters=None,
calculate_barycenter=False,
stop_threshold=10 ** -9,
num_of_iterations=1000,
plt_dendrogram=True,
path_dendrogram=None,
sup_barycenter=100,
t0=0.005,
theta=0.005,
):
Description
This function performs hierarchical (agglomerative) clustering of empirical probability distributions using the regularized (entropic) Wasserstein distance.
It takes a JSON-formatted string that encodes a list of distributions, computes all pairwise regularized Wasserstein distances, and then performs agglomerative clustering.
- Returns one JSON string with each distribution and its assigned cluster.
- If calculate_barycenter=True, it also computes the barycenter of each cluster and returns a second JSON string with the barycenters.
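To make the distance concrete, here is a minimal pure-Python sketch of an entropic-regularized optimal transport cost between two small one-dimensional empirical distributions, computed with Sinkhorn iterations. It is not the package's implementation: the function name `sinkhorn_cost`, the squared-distance ground cost, and the convergence check on the scaling vector are assumptions made for illustration. The `reg`, `stop_threshold`, and `num_of_iterations` arguments play the same roles as the parameters of the same names in `cluster_distributions`.

```python
import math


def sinkhorn_cost(xs, a, ys, b, reg=0.5, stop_threshold=1e-9, num_of_iterations=1000):
    """Entropic-regularized OT cost between two discrete 1-D distributions.

    xs, ys: support points; a, b: probability masses (each sums to 1).
    Illustrative sketch only, using a squared-distance ground cost.
    """
    # Ground cost matrix C and its Gibbs kernel K = exp(-C / reg).
    cost = [[(x - y) ** 2 for y in ys] for x in xs]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]

    u = [1.0] * len(xs)
    v = [1.0] * len(ys)
    for _ in range(num_of_iterations):
        # Alternate the scaling updates so the plan matches both marginals.
        new_u = [a[i] / sum(kernel[i][j] * v[j] for j in range(len(ys)))
                 for i in range(len(xs))]
        v = [b[j] / sum(kernel[i][j] * new_u[i] for i in range(len(xs)))
             for j in range(len(ys))]
        change = max(abs(p - q) for p, q in zip(new_u, u))
        u = new_u
        if change < stop_threshold:
            break

    # Transport plan P = diag(u) K diag(v); the returned value is <P, C>.
    return sum(u[i] * kernel[i][j] * v[j] * cost[i][j]
               for i in range(len(xs)) for j in range(len(ys)))


# Two uniform distributions on two points each: one close, one far away.
near = sinkhorn_cost([0.0, 1.0], [0.5, 0.5], [0.1, 1.1], [0.5, 0.5])
far = sinkhorn_cost([0.0, 1.0], [0.5, 0.5], [2.0, 3.0], [0.5, 0.5])
print(near, far)
```

A larger `reg` spreads the transport plan more evenly and speeds up convergence, at the price of a cost that drifts further from the unregularized Wasserstein distance.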
Function Parameters
- dist_file (str):
  A JSON-formatted string containing a dictionary of distributions.
  Each key in the dictionary is a distribution number, mapped to another dictionary with:
  - "id": The identifier of the distribution.
  - "data_points": A list of tuples representing the data points.
  Example format: see the example on GitHub.
- reg (float):
  Entropic regularization parameter for the Wasserstein distance. Must be positive.
- n_clusters (int or None):
  Number of clusters to form. If None, the optimal number is chosen using the silhouette index.
- calculate_barycenter (bool):
  If True, compute a regularized Wasserstein barycenter for each cluster.
  If False, only clustering results are returned.
- stop_threshold (float):
  Convergence threshold for the Sinkhorn iterations.
- num_of_iterations (int):
  Maximum number of iterations for each regularized Wasserstein distance computation.
- plt_dendrogram (bool, optional, default=True):
  If True, display a dendrogram of the hierarchical clustering based on OT distances. If path_dendrogram is provided, the plot is also saved to the specified path.
- path_dendrogram (str or None, optional, default=None):
  If provided, path to save the dendrogram plot. Ignored if plt_dendrogram=False.
- sup_barycenter (int):
  Number of support points to initialize for barycenter computation.
- t0 (float):
  Base step size for the barycenter probability vector (a) update.
- theta (float):
  Relaxation parameter for the barycenter support (X) update.
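The sketch below shows one way to build a `dist_file` string that follows the description of `dist_file` above: a dictionary keyed by distribution number, where each entry holds an `"id"` and a list of `"data_points"`. The system names, the sample values, and the choice to number distributions from 1 are made up for illustration. JSON has no tuple type, so Python tuples are serialized as lists; whether the package expects exactly this layout should be checked against the example on GitHub.

```python
import json
import random

random.seed(0)  # reproducible sample data

# Hypothetical simulation outputs: three systems with 2-D observations.
raw_samples = {
    "system_A": [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(5)],
    "system_B": [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(5)],
    "system_C": [(random.gauss(5, 1), random.gauss(5, 1)) for _ in range(5)],
}

# Key each distribution by a number (assumed to start at 1) and attach
# the "id" and "data_points" fields named in the parameter description.
distributions = {
    str(number): {"id": name, "data_points": points}
    for number, (name, points) in enumerate(raw_samples.items(), start=1)
}

# The function takes the JSON text itself, not a file path or a dict.
dist_file = json.dumps(distributions)
```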
Returns
If calculate_barycenter=False:
- json_clusters (str): JSON with each distribution's ID, real data points, and assigned cluster label.
If calculate_barycenter=True:
- json_clusters (str): JSON with each distribution's ID, real data points, and assigned cluster label.
- json_barycenters (str): JSON with each cluster's barycenter, including unnormalized supports and probability masses.
If plt_dendrogram=True:
- Displays the dendrogram plot.
- If path_dendrogram is provided, saves the dendrogram as a PNG file to that path.
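The key names used inside `json_clusters` are not spelled out here, so the following sketch works on a hypothetical stand-in string in which each entry carries `"id"`, `"data_points"`, and `"cluster"` fields. Those field names and the sample values are assumptions for illustration only; inspect the real output before relying on them. The sketch groups distribution IDs by their assigned cluster label.

```python
import json
from collections import defaultdict

# Hypothetical stand-in for the returned string; real key names may differ.
json_clusters = json.dumps({
    "1": {"id": "system_A", "data_points": [[0.1, 0.2]], "cluster": 0},
    "2": {"id": "system_B", "data_points": [[0.0, 0.3]], "cluster": 0},
    "3": {"id": "system_C", "data_points": [[5.1, 4.9]], "cluster": 1},
})

# Collect the IDs that share a cluster label.
members = defaultdict(list)
for entry in json.loads(json_clusters).values():
    members[entry["cluster"]].append(entry["id"])

for label, ids in sorted(members.items()):
    print(f"cluster {label}: {ids}")
```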
Other Functions in distclust
We also provide the following functions that might be useful to some users:
- density_calc – Compute empirical probability masses.
- density_calc_list – Batch probability mass computation.
- fill_ot_distance – Compute and store regularized Wasserstein distances between all systems (Cuturi and Doucet (2014) [1]).
- plot_dendrogram – Dendrogram visualization.
- silhouette_score_agglomerative – Choose number of clusters.
- find_barycenter – Compute Wasserstein barycenter (Cuturi and Doucet (2014) [1]).
- calculate_OT_cost_bary – OT computation for barycenter step.
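As an illustration of what computing empirical probability masses means, the sketch below merges repeated observations and assigns each unique point a mass equal to its relative frequency. The function name `empirical_masses` and its return shape are hypothetical; this is not the signature or implementation of `density_calc`.

```python
from collections import Counter


def empirical_masses(data_points):
    """Return (support, masses) for a list of observations.

    Repeated observations are merged, and each unique point gets
    mass count / n. Illustrative only; not the package's density_calc.
    """
    counts = Counter(tuple(p) for p in data_points)
    n = len(data_points)
    support = list(counts)  # unique points in first-seen order
    masses = [counts[p] / n for p in support]
    return support, masses


support, masses = empirical_masses([(0, 0), (1, 2), (0, 0), (3, 1)])
```

The resulting support and mass vectors are the kind of discrete distribution that a regularized Wasserstein distance takes as input.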
References
[1] Marco Cuturi and Arnaud Doucet.
Fast computation of Wasserstein barycenters.
In Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 685–693, 2014.