Skip to main content

No project description provided

Project description

Revisiting DP-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation

Paper

Introduction

DP-means (Kulis and Jordan, ICML 2012), a nonparametric generalization of K-means, extends the latter to the case where the number of clusters is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation hindering its usage in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly-scalable solution. In our paper, we first study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on delayed creation of clusters. Compared with DP-Means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a Mini-Batch setting (with an optional support for an online mode), allowing for another major speedup. We verify the utility of the pro-posed methods on multiple datasets. We also show that the proposed methods outperform other non-parametric methods (e.g., DBSCAN). Our highly-efficient code, available in this git repository, can be used to reproduce our experiments.

Installation

pip install pdc-dp-means

Installation requires scikit-learn>=1.2,<1.3 and numpy >= 1.23.0.

Quick Start

from sklearn.datasets import make_blobs
from pdc_dp_means import DPMeans

# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply DPMeans clustering
dpmeans = DPMeans(n_clusters=1,n_init=10, delta=10)  # n_init and delta parameters
dpmeans.fit(X)

# Predict the cluster for each data point
y_dpmeans = dpmeans.predict(X)

# Plotting clusters and centroids
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=y_dpmeans, s=50, cmap='viridis')
centers = dpmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()

One thing to note is that we replace the \lambda parameter from the paper with delta in the code, as lambda is a reserved word in python.

Usage

Please refer to the documentation: https://pdc-dp-means.readthedocs.io/en/latest/

Code

The code described here is under the folder paper_code. The supplied code has 3 parts -

  • The cluster directory, which contains an extension to sklearn with our proposed algorithms, PDC-DP-Means and its MiniBatch version.
  • the file date_pdpmeans.py which contains our implementation of DACE (in three versions, see below) and PDP-Means.
  • Three notebooks that contain the experiment with the other non-parametric methods.

PDC-DP-Means and MiniBatch PDC-DP-Means

In order to install this, you must clone scikit-learn from: https://github.com/scikit-learn/scikit-learn.git.

Navigate to the directory sklearn/cluster and replace the files __init__.py, _k_means_lloyd.pyx and _kmeans.py with the respective files under the cluster directory. Next, you need to install sklearn from source. To do so, follow the directions here: https://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge.

Now, in order to use it, you can simply use from sklearn.cluster import MiniBatchDPMeans, DPMeans. In general, the parameters are the same as the K-Means counterpart: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

The only differences are:

  1. instead of the n_clusters parameter (which stands, in K-Means, for the number fo clusters), there is a new parameter called delta (in our papers it was lambda but avoided this vairable name here since lambda is a reserved word in Python);
  2. When DPMeans is used the algorithm parameter is removed.

DACE and PDP-Means

In the file dace_dpmeans.py there are 4 relevant algorithms -

parallel_dp(data,delta,processes,iters)' - PDP-Means. As before, delta replaces lambda, data' is the data, 'processes' is the amount of parallelization, and `iters' is the maximum iterations (it will stop before if converged).

DACE(data,delta,num_of_processes) - The original DACE algorithm. as before, delta replaces lambda, 'data' is the data, num_of_processes is the amount of parallelization.

DACE_DDP(data,delta,num_of_processes) - DACE using PDC-DP-Means, but with no inner parallelization.

DACE_DDP_SPAWN(data,delta,num_of_processes) - DACE using PDP-DP-Means with inner parallelization, due to different Multi Processing scheme, this might take abit longer to start.

Note that in order to run this file some extra dependencies are required, evaluations.py file contain several functions, and while some packages required are quite standard - torchvision,scikit-learn,annoy,pandas,numpy, it is also required to have a valid R enviroment, and the R package maotai installed, and the python-R interface package rpy2.

Experiment notebooks

We have included the experiments which does not require additional installations apart from the build-from-source scikit-learn, the three attached notebooks are used to recreate the experiments with the other non-parametric methods. Note that the blackbox optimization (while we supplied the code to run it), need to run separately, as it's multiprocess does not play well with Jupyter Notebook.

Citing this work

If you use this code for your work, please cite the following:

@inproceedings{dinari2022revisiting,
  title={Revisiting DP-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation},
  author={Dinari, Or and Freifeld, Oren},
  booktitle={The 38th Conference on Uncertainty in Artificial Intelligence},
  year={2022}
}

License

Our code is licensed under the BDS-3-Clause license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

pdc_dp_means-0.0.6-cp311-cp311-win_amd64.whl (2.5 MB view details)

Uploaded CPython 3.11Windows x86-64

pdc_dp_means-0.0.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

pdc_dp_means-0.0.6-cp311-cp311-macosx_10_9_x86_64.whl (2.5 MB view details)

Uploaded CPython 3.11macOS 10.9+ x86-64

pdc_dp_means-0.0.6-cp310-cp310-win_amd64.whl (2.5 MB view details)

Uploaded CPython 3.10Windows x86-64

pdc_dp_means-0.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

pdc_dp_means-0.0.6-cp310-cp310-macosx_10_9_x86_64.whl (2.5 MB view details)

Uploaded CPython 3.10macOS 10.9+ x86-64

pdc_dp_means-0.0.6-cp39-cp39-win_amd64.whl (2.5 MB view details)

Uploaded CPython 3.9Windows x86-64

pdc_dp_means-0.0.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.9manylinux: glibc 2.17+ x86-64

pdc_dp_means-0.0.6-cp39-cp39-macosx_10_9_x86_64.whl (2.5 MB view details)

Uploaded CPython 3.9macOS 10.9+ x86-64

pdc_dp_means-0.0.6-cp38-cp38-win_amd64.whl (2.5 MB view details)

Uploaded CPython 3.8Windows x86-64

pdc_dp_means-0.0.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

pdc_dp_means-0.0.6-cp38-cp38-macosx_10_9_x86_64.whl (2.5 MB view details)

Uploaded CPython 3.8macOS 10.9+ x86-64

File details

Details for the file pdc_dp_means-0.0.6-cp311-cp311-win_amd64.whl.

File metadata

File hashes

Hashes for pdc_dp_means-0.0.6-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 304329d5350d062995fee9db6df4625d3954d325c21b10e424fb6888320f9fb7
MD5 677ccccb80b8d3794f165e790ccfed34
BLAKE2b-256 1027c7e788093da58aa205d1e2f03017e22566f1f102becc1320e35dc6940bf4

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pdc_dp_means-0.0.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 bc6f25301e54aa593f05bcc1c7739fafabd66e467e8620bfff0f0bfcd78ac037
MD5 caa98c4545a1ca3c43f20eebc4973a3c
BLAKE2b-256 f785d90a74da3cf80fc4d3b113e3630c96ca98c817355c6e7ff0db2b854a031e

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.6-cp311-cp311-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pdc_dp_means-0.0.6-cp311-cp311-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 d7d844191e7762c02140fb1a4f4d4f232f95dc216d479323ea448e07b4480c3b
MD5 bc2cbef1d87c16cca2b7c28ea07fead8
BLAKE2b-256 cfd6395109ae84fa4f7c1cfc5117c13e2fa8d6b4650851f8b36d0c1d30472b01

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.6-cp310-cp310-win_amd64.whl.

File metadata

File hashes

Hashes for pdc_dp_means-0.0.6-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 1f7eaf6a1b32ba4807c2c425d964de52b24f298dc2634fdaba3e5522cc04e8f0
MD5 47a12092eef4b9d66dc87a2ea2d5c701
BLAKE2b-256 5b1c2f82e671a9b6cd7a8f4d599d595fffbff8bf3d78a2f2f3891cf0c941170e

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pdc_dp_means-0.0.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e282d8be16415aae5b482e7a672aede91b14f97f0c1d3bf7b128c26230e893ca
MD5 fdccc033841a22f989fa33b18606dabb
BLAKE2b-256 632d25949f061f5ed9e681c6623cda94ad11c802d673c6caa9f1a8233e46efac

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.6-cp310-cp310-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pdc_dp_means-0.0.6-cp310-cp310-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 9a0442fac24a5a0532e6b0bb95919465a57e58235096573da8035dd96f815f12
MD5 f53e6d09fd1b0b4f070da3085aecb33c
BLAKE2b-256 b0a12b3ded4df761de2a08f18f01c2d59ae62ad94e3f47395e17ce6824d37ef2

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.6-cp39-cp39-win_amd64.whl.

File metadata

  • Download URL: pdc_dp_means-0.0.6-cp39-cp39-win_amd64.whl
  • Upload date:
  • Size: 2.5 MB
  • Tags: CPython 3.9, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for pdc_dp_means-0.0.6-cp39-cp39-win_amd64.whl
Algorithm Hash digest
SHA256 62371efac2d09c6747dff85b7f7c84fd4cd4b3160dc4fc851f0fa8af4f0ab8a9
MD5 1dce56dc54d1f1a815fad439b3e87821
BLAKE2b-256 4633b9a22c2888fff6335f1bd970c51243f0767dc21b17de0d2c632daac933eb

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pdc_dp_means-0.0.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 f3082649900a7e95d26a6b192ac15c29bfdc7102073bed4e38375d7f68a789af
MD5 b05621260703e265388b4fd8c365cd67
BLAKE2b-256 6972170285437f9cb555ed1e2e42d7fb15a8756fdaf0a05af800c19daa29bf2d

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.6-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pdc_dp_means-0.0.6-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 3b8a06f3f2f44f88ff3842eb9922bd1be023a5560274c9d367afd977ecf9654f
MD5 36eb815be363f7854926d9108075cb65
BLAKE2b-256 436f4ecc56e70226e470c398ee00e5df181005f3f61e691b6a6651e05934ea57

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.6-cp38-cp38-win_amd64.whl.

File metadata

  • Download URL: pdc_dp_means-0.0.6-cp38-cp38-win_amd64.whl
  • Upload date:
  • Size: 2.5 MB
  • Tags: CPython 3.8, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for pdc_dp_means-0.0.6-cp38-cp38-win_amd64.whl
Algorithm Hash digest
SHA256 9a384e6de5dbfe3d88973267f4e7e499f71239220498f4d29a37f52c55e8c43f
MD5 618801125a93aa5bfc2e610cebc0e832
BLAKE2b-256 1c7bb41920dad1601e1e836086bde2c1b05a6e04860be9cb2759b167fab593a7

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for pdc_dp_means-0.0.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 624403e1006d8dce0182b2bbd6786969ba331acf7293444fe3790675cbe6b25c
MD5 20449fd1e50fa384f783749db621d63f
BLAKE2b-256 c406a8f485c99b9ae18e97058122c15b78fe0ad5967d8f77797a4ec397a67f43

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.6-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

File hashes

Hashes for pdc_dp_means-0.0.6-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 6aacc3bd98998471f70c4c3d8dc2b5201c65e028aa1f4331344d22132a069e47
MD5 4b0c9a6c57b32fd6d821b5a3659f5050
BLAKE2b-256 2e1909301fdbbeaa827ba46898c16ca78a9a5b24c9d54265fb6bc43c6db7172f

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page