DBSCAN_multiplex

Fast and memory-efficient DBSCAN clustering,possibly on various subsamples out of a common dataset

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Overview

A fast and memory-efficient implementation of DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

It is especially suited for multiple rounds of down-sampling and clustering from a joint dataset: after an initial overhead O(N log(N)), each subsequent run of clustering will have O(N) time complexity.

As illustrated by a doctest embedded in the present module’s docstring, on a dataset of 15,000 samples and 47 features, DBSCAN_multiplex processes 50 rounds of sub-sampling and clustering in about 4 minutes, whereas Scikit-learn’s implementation of DBSCAN performs the same task in more than 28 minutes.

Such a test can be performed quite conveniently on your machine: simply entering python DBSCAN_multiplexin your terminal will prompt a doctest to start comparing the performance of the two afore-mentioned implementations of DBSCAN.

This 7-fold gain in performance proved critical to the statistical learning application that prompted the design of this algorithm.

Installation and Requirements

DBSCAN_multiplex requires Python 2.7 along with the following packages and a few modules from the Standard Python Library: * NumPy >= 1.9 * PyTables * scikit-learn

Upon checking that the required dependencies are installed, you can upload DBSCAN_multiplex from the official Python Package Index (PyPI) as follows: * open a terminal window; * type the command: pip install DBSCAN_multiplex

Usage and Example

See the docstrings associated to each function of the DBSCAN_multiplex module for more information; in a Python interpreter console, they can be viewed by calling the built-in help system, e.g., help(DBSCAN_multiplex.load).

The following few lines show how DBSCAN_multiplex can be used for clustering 50 randomly selected subsamples out of a common Gaussian distributed dataset. This situation arises in consensus clustering where one might want to obtain and then combine multiple vectors of cluster labels.

>>> import numpy as np
>>> import DBSCAN_multiplex as DB

>>> data = np.random.randn(15000, 7)
>>> N_iterations = 50
>>> N_sub = 9 * data.shape[0] / 10
>>> subsamples_matrix = np.zeros((N_iterations, N_sub), dtype = int)
>>> for i in xrange(N_iterations): subsamples_matrix[i] = np.random.choice(data.shape[0], N_sub, replace = False)
>>> eps, labels_matrix = DB.DBSCAN(data, minPts = 3, subsamples_matrix = subsamples_matrix, verbose = True)

References

* Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. (1996) “A density-based algorithm for discovering clusters in large spatial databases with noise”. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press. pp. 226-231.

* Kriegel, H.-P., Kroeger, P., Sander, J. and Zimek, A. (2011) “Density-based Clustering”. WIREs Data Mining and Knowledge Discovery 1 (3): 231-240.

Project details

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

1.5

Dec 7, 2015

This version

1.4

Nov 18, 2015

1.3

Nov 17, 2015

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

DBSCAN_multiplex-1.4.tar.gz (10.0 kB view hashes)

Uploaded Nov 18, 2015 Source

Hashes for DBSCAN_multiplex-1.4.tar.gz

Hashes for DBSCAN_multiplex-1.4.tar.gz
Algorithm	Hash digest
SHA256	`90b79aa61f111d4a721571c1e96b106ff3f4f35cbf7e9acbf9a09ae756461292`
MD5	`8242cc10df56cbf43506b6c715473c73`
BLAKE2b-256	`c5a61f53d611a65a27b043e1f2c17eb9cb62ab2489e8f5057c68f2224f6ba09d`