Skip to main content

An Implementation of Component-wise Peak Finding Clustering Method

Project description

Build Status

CPFcluster

An implementation of the Component-wise Peak-Finding (CPF) clustering method, presented in 'Scalable Clustering of Mixed Data using Nearest Neighbor Graphs and Density Peaks'

Dependencies

CPFcluster supports Python 3, with numpy, scipy, itertools, multiprocessing and aghasher. These should be linked with a BLAS implementation (e.g., OpenBLAS, ATLAS, Intel MKL). The package aghasher is used to implement the k nearest neighbour graph approximation introduced in Zhang et. al. (2013).

Installation

CPFcluster is available on PyPI, the Python Package Index.

$ pip install CPFcluster

How To Use

To use CPFcluster, first import the CPFcluster module.

import CPFcluster

Clustering a Dataset

A CPFclustering is constructed using the train method, which returns an CPFclustering of a dataset.

result = CPFcluster.CPFclustering.train(X, k, K, beta, reps, num_hashbits, blocksz, n_core)

CPFclustering.train takes 8 arguments:

  • X An n-by-d numpy.ndarray with training data. The rows correspond to n observations, and the columns correspond to d dimensions.
  • k Number of nearest-neighbors used to create connected components from the dataset.
  • K Number of nearest-neighbors used to compute the local density of each instance.
  • beta (optional; defaults to 30) Number of clusters to be tested for each component in the center selection method.
  • reps (optional; defaults to 50) Number of repetitions of the locality sensitive hashing method used in computing the k nearest-neighbor graphs.
  • num_hashbits (optional; defaults to 12) Number of hashbits used in locality sensitive hashing method.
  • blocksz (optional; defaults to 100) Size of the neighborhood on which brute force kNN is computed in locality sensitive hashing method.
  • n_core (optional; defaults to 1) Number of processors to be used when computing nearest-neighbor graph. If set to 1, parallel processing does not take place.

The result object contains:

  • edge_arr An n-by-K array of the indexes of the K nearest neighbors within a component for each instance.
  • w_arr An n-by-K array of the distances from each instance to the corresponding node in the edge_arr.
  • components A vector containing the index of the component to which the instance belongs. If the instance is an outlying point, the value will be NaN.
  • density A vector of the local density values.
  • dists A vector containing the distance to the nearest neighbor of higher local density for each instance.
  • bb A vector containing the index of the nearest neighbor of higher local density for each instance.
  • y The final cluster labellings. Tests

CPFcluster

CPFcluster has an MIT License.

See LICENSE.

References

Zhang, Yan-Ming, et al. “Fast kNN graph construction with locality sensitive hashing.“ Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Berlin, Heidelberg, 2013.

Liu, Wei, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang. 2011. “Hashing with Graphs.” In Proceedings of the 28th International Conference on Machine Learning (ICML-11), edited by Lise Getoor and Tobias Scheffer, 1–8. ICML ’11. New York, NY, USA: ACM.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

CPFcluster-0.8.0.tar.gz (11.1 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page