Skip to main content

No project description provided

Project description

Revisiting DP-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation

Paper

Introduction

DP-means (Kulis and Jordan, ICML 2012), a nonparametric generalization of K-means, extends the latter to the case where the number of clusters is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation hindering its usage in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly-scalable solution. In our paper, we first study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on delayed creation of clusters. Compared with DP-Means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a Mini-Batch setting (with an optional support for an online mode), allowing for another major speedup. We verify the utility of the pro-posed methods on multiple datasets. We also show that the proposed methods outperform other non-parametric methods (e.g., DBSCAN). Our highly-efficient code, available in this git repository, can be used to reproduce our experiments.

Code

The supplied code has 3 parts -

  • The cluster directory, which contains an extension to sklearn with our proposed algorithms, PDC-DP-Means and its MiniBatch version.
  • the file date_pdpmeans.py which contains our implementation of DACE (in three versions, see below) and PDP-Means.
  • Three notebooks that contain the experiment with the other non-parametric methods.

PDC-DP-Means and MiniBatch PDC-DP-Means

In order to install this, you must clone scikit-learn from: https://github.com/scikit-learn/scikit-learn.git.

Navigate to the directory sklearn/cluster and replace the files __init__.py, _k_means_lloyd.pyx and _kmeans.py with the respective files under the cluster directory. Next, you need to install sklearn from source. To do so, follow the directions here: https://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge.

Now, in order to use it, you can simply use from sklearn.cluster import MiniBatchDPMeans, DPMeans. In general, the parameters are the same as the K-Means counterpart: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

The only differences are:

  1. instead of the n_clusters parameter (which stands, in K-Means, for the number fo clusters), there is a new parameter called delta (in our papers it was lambda but avoided this vairable name here since lambda is a reserved word in Python);
  2. When DPMeans is used the algorithm parameter is removed.

DACE and PDP-Means

In the file dace_dpmeans.py there are 4 relevant algorithms -

parallel_dp(data,delta,processes,iters)' - PDP-Means. As before, delta replaces lambda, data' is the data, 'processes' is the amount of parallelization, and `iters' is the maximum iterations (it will stop before if converged).

DACE(data,delta,num_of_processes) - The original DACE algorithm. as before, delta replaces lambda, 'data' is the data, num_of_processes is the amount of parallelization.

DACE_DDP(data,delta,num_of_processes) - DACE using PDC-DP-Means, but with no inner parallelization.

DACE_DDP_SPAWN(data,delta,num_of_processes) - DACE using PDP-DP-Means with inner parallelization, due to different Multi Processing scheme, this might take abit longer to start.

Note that in order to run this file some extra dependencies are required, evaluations.py file contain several functions, and while some packages required are quite standard - torchvision,scikit-learn,annoy,pandas,numpy, it is also required to have a valid R enviroment, and the R package maotai installed, and the python-R interface package rpy2.

Experiment notebooks

We have included the experiments which does not require additional installations apart from the build-from-source scikit-learn, the three attached notebooks are used to recreate the experiments with the other non-parametric methods. Note that the blackbox optimization (while we supplied the code to run it), need to run separately, as it's multiprocess does not play well with Jupyter Notebook.

Citing this work

If you use this code for your work, please cite the following:

@inproceedings{dinari2022revisiting,
  title={Revisiting DP-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation},
  author={Dinari, Or and Freifeld, Oren},
  booktitle={The 38th Conference on Uncertainty in Artificial Intelligence},
  year={2022}
}

License

While our additional code is under the GPL license, some code snippets here are taken from sklearn, and using them (e.g., the code under the cluster directory) is subject to sklearn's license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

pdc_dp_means-0.0.2-cp39-cp39-manylinux1_x86_64.whl (133.7 kB view details)

Uploaded CPython 3.9

pdc_dp_means-0.0.2-cp39-cp39-any.whl (133.7 kB view details)

Uploaded CPython 3.9

pdc_dp_means-0.0.2-cp38-cp38-any.whl (131.3 kB view details)

Uploaded CPython 3.8

File details

Details for the file pdc_dp_means-0.0.2-cp39-cp39-manylinux1_x86_64.whl.

File metadata

File hashes

Hashes for pdc_dp_means-0.0.2-cp39-cp39-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 fd5c9717c2262e008a757dd788e0908f64a6d568d2ba144b8341306f7aacd6e1
MD5 d37e163f353cf873f5bcea12fadf90d8
BLAKE2b-256 4906d25fcd97ba450d5a918a32db0e9def1babcfd9598237fc521030bac6750e

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.2-cp39-cp39-any.whl.

File metadata

  • Download URL: pdc_dp_means-0.0.2-cp39-cp39-any.whl
  • Upload date:
  • Size: 133.7 kB
  • Tags: CPython 3.9
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.15

File hashes

Hashes for pdc_dp_means-0.0.2-cp39-cp39-any.whl
Algorithm Hash digest
SHA256 fb114540b0f002452f9a0e67429a1a0f017cbb5017fb9fd311b8cd7375b5b703
MD5 6cdbbeb2a296bfc419698e674c7f09e0
BLAKE2b-256 b6a0a24cd4a16ff4548a06ddc90106091b465157f51820bc2addffe228324cf0

See more details on using hashes here.

File details

Details for the file pdc_dp_means-0.0.2-cp38-cp38-any.whl.

File metadata

  • Download URL: pdc_dp_means-0.0.2-cp38-cp38-any.whl
  • Upload date:
  • Size: 131.3 kB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.16

File hashes

Hashes for pdc_dp_means-0.0.2-cp38-cp38-any.whl
Algorithm Hash digest
SHA256 58d4ef1692b6cd4df737fc5dda87514ef3a4d5364a18fbff8fa33521194500ce
MD5 bb08f1984b236a4d87a6266ef5df7268
BLAKE2b-256 b3734c7d66bec32194e7129cd8ff77c68f355068e218e1bc09472ecfd6ff2985

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page