Revisiting DP-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation

Paper

Introduction

DP-means (Kulis and Jordan, ICML 2012), a nonparametric generalization of K-means, extends the latter to the case where the number of clusters is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation hindering its usage in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly-scalable solution. In our paper, we first study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on delayed creation of clusters. Compared with DP-Means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a Mini-Batch setting (with optional support for an online mode), allowing for another major speedup. We verify the utility of the proposed methods on multiple datasets. We also show that the proposed methods outperform other non-parametric methods (e.g., DBSCAN). Our highly-efficient code, available in this git repository, can be used to reproduce our experiments.
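For readers new to DP-means, the following is a minimal pure-Python sketch of the classical serial algorithm of Kulis and Jordan, which is the baseline this work starts from. It is not the PDC-DP-Means implementation shipped in this package. The function names (sq_dist, mean, serial_dp_means), the single-cluster initialization, and the handling of empty clusters are illustrative assumptions; the threshold is named delta here and corresponds to lambda in the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def serial_dp_means(data, delta, max_iter=100):
    """Illustrative serial DP-means; `delta` plays the role of lambda."""
    centers = [mean(data)]  # assumption: start from one cluster at the global mean
    labels = [0] * len(data)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(data):
            dists = [sq_dist(x, c) for c in centers]
            k = min(range(len(centers)), key=dists.__getitem__)
            if dists[k] > delta:
                # Too far from every center: open a new cluster right away.
                # Later points in this same pass already see it, which is the
                # sequential dependency that makes this loop hard to parallelize.
                centers.append(list(x))
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # Update step: recompute centers and drop clusters that became empty.
        members = [[x for x, lab in zip(data, labels) if lab == k]
                   for k in range(len(centers))]
        kept = [k for k, m in enumerate(members) if m]
        remap = {old: new for new, old in enumerate(kept)}
        centers = [mean(members[k]) for k in kept]
        labels = [remap[lab] for lab in labels]
        if not changed:
            break
    return centers, labels
```

In this sketch a small delta yields many clusters and a large delta yields few, since a point opens a new cluster only when its squared distance to the nearest center exceeds delta.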

Installation

pip install pdc-dp-means

Usage

Please refer to the documentation: https://pdc-dp-means.readthedocs.io/en/latest/

Code

The code described here is under the folder paper_code. The supplied code has three parts:

  • The cluster directory, which contains an extension to sklearn with our proposed algorithms, PDC-DP-Means and its MiniBatch version.
  • The file dace_dpmeans.py, which contains our implementation of DACE (in three versions, see below) and PDP-Means.
  • Three notebooks that contain the experiments with the other non-parametric methods.

PDC-DP-Means and MiniBatch PDC-DP-Means

In order to install this, you must clone scikit-learn from: https://github.com/scikit-learn/scikit-learn.git.

Navigate to the directory sklearn/cluster and replace the files __init__.py, _k_means_lloyd.pyx and _kmeans.py with the respective files under the cluster directory. Next, you need to install sklearn from source. To do so, follow the directions here: https://scikit-learn.org/stable/developers/advanced_installation.html#install-bleeding-edge.

To use it, import the estimators with from sklearn.cluster import MiniBatchDPMeans, DPMeans. In general, the parameters are the same as those of the K-Means counterparts:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

The only differences are:

  1. Instead of the n_clusters parameter (which, in K-Means, stands for the number of clusters), there is a new parameter called delta (in our paper it is called lambda, but we avoid that variable name here since lambda is a reserved word in Python).
  2. When DPMeans is used, the algorithm parameter is removed.
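As a rough illustration of the import and the delta parameter described above, here is a small helper. It assumes the patched scikit-learn build from this section is installed; stock scikit-learn does not provide DPMeans or MiniBatchDPMeans. The helper name cluster_with_dpmeans is hypothetical, and the fitted attributes cluster_centers_ and labels_ are assumed to mirror those of KMeans rather than confirmed here.

```python
def cluster_with_dpmeans(X, delta, minibatch=False, random_state=0):
    """Fit DPMeans (or MiniBatchDPMeans) on X and return (centers, labels).

    Hypothetical helper: it only works with the patched scikit-learn build
    described in this README.
    """
    # Imported lazily so this module still loads where the patched build is absent.
    from sklearn.cluster import DPMeans, MiniBatchDPMeans

    estimator_cls = MiniBatchDPMeans if minibatch else DPMeans
    # `delta` takes the place of KMeans' `n_clusters`;
    # `random_state` is an ordinary KMeans parameter.
    model = estimator_cls(delta=delta, random_state=random_state)
    model.fit(X)
    # Assumption: fitted attributes mirror KMeans.
    return model.cluster_centers_, model.labels_
```

A call such as cluster_with_dpmeans(X, delta=10.0) would then return the learned centers and the per-sample labels, with the number of clusters determined by delta instead of being fixed in advance.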

DACE and PDP-Means

In the file dace_dpmeans.py there are four relevant algorithms:

parallel_dp(data,delta,processes,iters) - PDP-Means. As before, delta replaces lambda, data is the data, processes is the amount of parallelization, and iters is the maximum number of iterations (it stops earlier if it has converged).

DACE(data,delta,num_of_processes) - The original DACE algorithm. As before, delta replaces lambda, data is the data, and num_of_processes is the amount of parallelization.

DACE_DDP(data,delta,num_of_processes) - DACE using PDC-DP-Means, but with no inner parallelization.

DACE_DDP_SPAWN(data,delta,num_of_processes) - DACE using PDC-DP-Means with inner parallelization. Due to a different multiprocessing scheme, this might take a bit longer to start.

Note that running this file requires some extra dependencies. The evaluations.py file contains several functions, and while some of the required packages are quite standard (torchvision, scikit-learn, annoy, pandas, numpy), a valid R environment with the R package maotai installed is also required, along with the Python-R interface package rpy2.

Experiment notebooks

We have included the experiments that do not require additional installations apart from the build-from-source scikit-learn. The three attached notebooks are used to recreate the experiments with the other non-parametric methods. Note that the blackbox optimization (for which we supply the code) needs to be run separately, as its multiprocessing does not play well with Jupyter Notebook.

Citing this work

If you use this code for your work, please cite the following:

@inproceedings{dinari2022revisiting,
  title={Revisiting DP-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation},
  author={Dinari, Or and Freifeld, Oren},
  booktitle={The 38th Conference on Uncertainty in Artificial Intelligence},
  year={2022}
}

License

Our code is licensed under the BSD-3-Clause license.


