Revisiting DP-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation

Paper

Introduction

DP-means (Kulis and Jordan, ICML 2012), a nonparametric generalization of K-means, extends the latter to the case where the number of clusters is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation hindering its usage in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly scalable solution. In our paper, we first study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on delayed creation of clusters. Compared with DP-Means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a Mini-Batch setting (with optional support for an online mode), allowing for another major speedup. We verify the utility of the proposed methods on multiple datasets. We also show that the proposed methods outperform other nonparametric methods (e.g., DBSCAN). Our highly efficient code, available in this git repository, can be used to reproduce our experiments.
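To make the idea of delayed cluster creation more concrete, here is a minimal NumPy sketch. It is not the packaged implementation. It assumes one common reading of the approach: all points are assigned in a single batch (the part that can be parallelized), and at most one new cluster is opened per iteration, seeded by the point farthest from its nearest center when that squared distance exceeds delta. The function name pdc_dp_means_sketch, the single-cluster initialization, and the use of squared Euclidean distance are assumptions made for illustration.

```python
import numpy as np


def pdc_dp_means_sketch(X, delta, max_iter=100):
    """Illustrative delayed-cluster-creation loop (NOT the packaged code)."""
    X = np.asarray(X, dtype=float)
    centers = X.mean(axis=0, keepdims=True)  # assumption: start with one cluster
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: each point is handled independently of the others,
        # so this computation could be split across workers.
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dists.argmin(axis=1)
        nearest = sq_dists[np.arange(len(X)), new_labels]

        # Delayed creation (assumed rule): open at most one new cluster per
        # iteration, seeded by the point farthest from its nearest center.
        far = int(nearest.argmax())
        created = bool(nearest[far] > delta)
        if created:
            centers = np.vstack([centers, X[far]])
            new_labels[far] = len(centers) - 1

        # Update step: recompute each non-empty cluster center as a mean.
        for k in range(len(centers)):
            members = X[new_labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)

        if not created and np.array_equal(new_labels, labels):
            break  # assignments are stable and no cluster was added
        labels = new_labels
    return centers, labels


# Two well-separated synthetic blobs (made-up data for the demonstration).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(10.0, 0.1, (50, 2))])
centers_demo, labels_demo = pdc_dp_means_sketch(X_demo, delta=4.0)
print("number of clusters:", len(centers_demo))
```

Because only one cluster can appear per iteration, the number of clusters grows gradually, and a larger delta yields fewer clusters.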

Installation

pip install pdc-dp-means

Usage

Please refer to the documentation: https://pdc-dp-means.readthedocs.io/en/latest/

Code

The code described here is under the folder paper_code. The supplied code has three parts:

  • The cluster directory, which contains an extension to sklearn with our proposed algorithms, PDC-DP-Means and its MiniBatch version.
  • The file dace_dpmeans.py, which contains our implementation of DACE (in three versions, see below) and PDP-Means.
  • Three notebooks that contain the experiments with the other non-parametric methods.

PDC-DP-Means and MiniBatch PDC-DP-Means

In order to install this, you must clone scikit-learn from: https://github.com/scikit-learn/scikit-learn.git.

Navigate to the directory sklearn/cluster and replace the files __init__.py, _k_means_lloyd.pyx and _kmeans.py with the respective files under the cluster directory. Next, you need to install sklearn from source. To do so, follow the directions here: https://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge.

To use the algorithms, import them with from sklearn.cluster import MiniBatchDPMeans, DPMeans. In general, the parameters are the same as those of the K-Means counterparts:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

The only differences are:

  1. Instead of the n_clusters parameter (which, in K-Means, stands for the number of clusters), there is a new parameter called delta. In our paper it is called lambda, but we avoided that variable name here since lambda is a reserved word in Python.
  2. When DPMeans is used, the algorithm parameter is removed.
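As a hedged usage sketch, the snippet below shows how the estimator might be called once the patched scikit-learn build is installed. It relies only on what is stated above (the import path and the delta parameter); the helper name fit_dpmeans, the synthetic data, and the value of delta are made up for illustration, and the exact constructor defaults are an assumption that should be checked against the documentation. On a stock scikit-learn installation, where DPMeans does not exist, the helper returns None instead of failing.

```python
import numpy as np


def fit_dpmeans(X, delta):
    """Fit DPMeans if the patched build is importable, otherwise return None."""
    try:
        # Only available after replacing the files in sklearn/cluster and
        # building scikit-learn from source, as described above.
        from sklearn.cluster import DPMeans
    except ImportError:
        return None
    # Assumption: all other parameters keep their K-Means-style defaults.
    model = DPMeans(delta=delta)
    return model.fit(X)


# Made-up data: two well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(10.0, 0.1, (50, 2))])

model = fit_dpmeans(X_demo, delta=4.0)
if model is None:
    print("DPMeans is not available in this scikit-learn installation.")
else:
    print("clusters found:", len(model.cluster_centers_))
```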

DACE and PDP-Means

In the file dace_dpmeans.py there are four relevant algorithms:

parallel_dp(data,delta,processes,iters) - PDP-Means. As before, delta replaces lambda, data is the data, processes is the amount of parallelization, and iters is the maximum number of iterations (it will stop earlier if it converges).

DACE(data,delta,num_of_processes) - The original DACE algorithm. As before, delta replaces lambda, data is the data, and num_of_processes is the amount of parallelization.

DACE_DDP(data,delta,num_of_processes) - DACE using PDC-DP-Means, but with no inner parallelization.

DACE_DDP_SPAWN(data,delta,num_of_processes) - DACE using PDC-DP-Means with inner parallelization. Due to a different multiprocessing scheme, this might take a bit longer to start.

Note that running this file requires some extra dependencies, as the evaluations.py file contains several functions that need them. Some of the required packages are quite standard (torchvision, scikit-learn, annoy, pandas, numpy), but you also need a valid R environment with the R package maotai installed, as well as the Python-R interface package rpy2.
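Before running the file, it can help to check which of the Python packages listed above are importable. The following is a small sketch using only the standard library; the mapping from package names to import names (for example, scikit-learn is imported as sklearn) and the helper name missing_packages are ours, not part of the supplied code. It does not check the R environment or the maotai R package, which need to be verified separately.

```python
from importlib.util import find_spec

# Package name (as installed) -> module name (as imported).
REQUIRED_MODULES = {
    "torchvision": "torchvision",
    "scikit-learn": "sklearn",
    "annoy": "annoy",
    "pandas": "pandas",
    "numpy": "numpy",
    "rpy2": "rpy2",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the sorted names of packages whose module cannot be found."""
    return sorted(
        package for package, module in required.items() if find_spec(module) is None
    )


missing = missing_packages()
if missing:
    print("Missing Python packages:", ", ".join(missing))
else:
    print("All listed Python packages are importable.")
```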

Experiment notebooks

We have included the experiments that do not require additional installations apart from the build-from-source scikit-learn. The three attached notebooks are used to recreate the experiments with the other non-parametric methods. Note that the blackbox optimization (for which we supplied the code) needs to be run separately, as its multiprocessing does not play well with Jupyter Notebook.

Citing this work

If you use this code for your work, please cite the following:

@inproceedings{dinari2022revisiting,
  title={Revisiting DP-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation},
  author={Dinari, Or and Freifeld, Oren},
  booktitle={The 38th Conference on Uncertainty in Artificial Intelligence},
  year={2022}
}

License

Our code is licensed under the BSD-3-Clause license.

Download files

Release 0.0.5 provides built distributions (wheels) only; no source distribution is available for this release. Wheels are provided for CPython 3.8, 3.9, 3.10, and 3.11 on each of the following platforms:

  • Windows x86-64 (win_amd64), 2.5 MB
  • Linux x86-64 (manylinux_2_17 / manylinux2014, glibc 2.17+), 3.0 MB
  • macOS 10.9+ x86-64 (macosx_10_9_x86_64), 2.5 MB
