A fast multicore version of hdbscan for low-dimensional Euclidean spaces

Fast Multicore HDBSCAN

The fast_hdbscan library provides a simple implementation of the HDBSCAN clustering algorithm designed specifically for high performance on multicore machines with low-dimensional data (2D to about 20D). The algorithm runs in parallel and can make effective use of as many cores as you wish to throw at a problem. It is thus well suited to large SMP systems, and even to modern multicore laptops.

This library provides a re-implementation of a subset of the HDBSCAN algorithm that is compatible with the hdbscan library for data that is Euclidean and low dimensional. The primary advantages of this library over the standard hdbscan library are:

  • this library can easily use all available cores to speed up computation;

  • this library has much faster implementations of tree condensing and cluster extraction;

  • this library is much simpler and more approachable, whether you want to extend it or reuse its components;

  • this library is built on numba and has fewer issues with binaries and compilation.

This library does not support all the features and input formats available in the hdbscan library.
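Since the library is built on numba, one way to control how many cores it uses is numba's own `NUMBA_NUM_THREADS` environment variable. The sketch below is an illustration under assumptions, not part of fast_hdbscan: the helper name `cap_numba_threads` is hypothetical, and it assumes fast_hdbscan relies on numba's standard thread pool so that the setting is honoured.

```python
import os


def cap_numba_threads(max_threads):
    """Set NUMBA_NUM_THREADS to at most max_threads and return the value used.

    Hypothetical helper. Call it before numba (and therefore fast_hdbscan)
    is first imported, because numba reads the variable at import time.
    """
    available = os.cpu_count() or 1
    chosen = max(1, min(int(max_threads), available))
    os.environ["NUMBA_NUM_THREADS"] = str(chosen)
    return chosen


threads = cap_numba_threads(4)
print(f"NUMBA_NUM_THREADS set to {threads}")

# Import only after the variable is set (assumes fast_hdbscan is installed):
# import fast_hdbscan
```

Leaving the variable unset lets numba default to all available cores, which matches the behaviour described above.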

Basic Usage

The fast_hdbscan library follows the hdbscan library in using the sklearn API. You can use the fast_hdbscan class HDBSCAN exactly as you would that of the hdbscan library, with the caveat that fast_hdbscan only supports a subset of the parameters and options of hdbscan. Nonetheless, if you have low-dimensional Euclidean data (e.g., the output of UMAP), you can use this library as a straightforward drop-in replacement for hdbscan:

import fast_hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = fast_hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)
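
To read the result, it can help to count how many points landed in each cluster and how many were left as noise. The following is a minimal sketch with a hypothetical helper, `summarize_labels`; it assumes noise points are labelled -1, as in the hdbscan library, and uses a hand-written label list in place of real `fit_predict` output so that it runs without any extra packages.

```python
from collections import Counter

NOISE_LABEL = -1  # assumption: noise is labelled -1, as in the hdbscan library


def summarize_labels(labels):
    """Return (cluster_sizes, noise_count) for a sequence of integer labels."""
    counts = Counter(int(label) for label in labels)
    noise_count = counts.pop(NOISE_LABEL, 0)
    return dict(sorted(counts.items())), noise_count


# Hand-written labels standing in for the output of fit_predict
example_labels = [0, 0, 1, -1, 1, 1, 2, -1, 0, 2]
cluster_sizes, noise_count = summarize_labels(example_labels)
print(cluster_sizes)  # {0: 3, 1: 3, 2: 2}
print(noise_count)    # 2
```

The same helper should accept the `cluster_labels` array from the example above, since each label is converted to a plain `int` before counting.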

Installation

fast_hdbscan requires:

  • numba

  • numpy

  • scikit-learn

fast_hdbscan can be installed via pip:

pip install fast_hdbscan

To manually install this package:

wget https://github.com/TutteInstitute/fast_hdbscan/archive/main.zip
unzip main.zip
rm main.zip
cd fast_hdbscan-main
python setup.py install
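
After either install route, a quick check that the package and its requirements can be found may save some debugging. This is only a sketch: the helper `missing_modules` is hypothetical, and the module list simply mirrors the requirements named above (scikit-learn is imported as `sklearn`).

```python
import importlib.util

# Import names for the requirements listed above; scikit-learn imports as "sklearn"
REQUIRED_MODULES = ["numba", "numpy", "sklearn", "fast_hdbscan"]


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All required modules are importable")
```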

References

The algorithm used here is an adaptation of the algorithms described in the papers:

L. McInnes and J. Healy, Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017

R. Campello, D. Moulavi, and J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172. 2013

License

fast_hdbscan is BSD (2-clause) licensed. See the LICENSE file for details.

Contributing

Contributions are more than welcome! If you have ideas for features or projects, please get in touch. Everything from code to notebooks to examples and documentation is equally valuable, so please don’t feel you can’t contribute. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged in.
