A fast multicore version of HDBSCAN and PLSCAN clustering algorithms.

Fast Multicore HDBSCAN

The fast_hdbscan library provides an implementation of the HDBSCAN clustering algorithm designed specifically for high performance on multicore machines. The algorithm runs in parallel and can make effective use of as many cores as you wish to throw at a problem. It is thus ideal for large SMP systems, and even modern multicore laptops.
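If you would rather cap the number of cores than use all of them, numba's own `NUMBA_NUM_THREADS` environment variable is one plausible lever, since the library is built on numba. The sketch below assumes fast_hdbscan does not override numba's thread settings; `configure_numba_threads` is a hypothetical helper and not part of either library.

```python
import os


def configure_numba_threads(n_threads):
    """Request at most `n_threads` numba worker threads (hypothetical helper).

    Assumption: fast_hdbscan inherits numba's thread settings. The variable
    has to be set before numba is first imported in the process.
    """
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    available = os.cpu_count() or 1
    # Never request more threads than the machine reports.
    chosen = min(n_threads, available)
    os.environ["NUMBA_NUM_THREADS"] = str(chosen)
    return chosen
```

Call `configure_numba_threads(4)` at the very top of a script, before any import that pulls in numba, for the setting to take effect.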

This library provides a re-implementation of a subset of the HDBSCAN algorithm that is compatible with the hdbscan library. There are specific optimizations for data that is Euclidean and low-dimensional. Other distance metrics and high-dimensional data fall back to alternative parallel approaches that are faster than the hdbscan library, but not necessarily as performant as the highly optimized low-dimensional Euclidean case. The primary advantages of this library over the standard hdbscan library are:

  • this library can easily use all available cores to speed up computation;

  • this library has much faster implementations of tree condensing and cluster extraction;

  • this library is much simpler, which makes it more approachable to extend or to reuse components from;

  • this library is built on numba and has fewer issues with binaries and compilation;

  • this library provides features such as semi-supervision, linking constraints, sample weights, and branch detection from FLASC, and an implementation of PLSCAN.

This library does not support all the features and input formats available in the hdbscan library, but covers the most common use cases.

This library does support a number of research extensions to HDBSCAN, including branch detection from FLASC, semi-supervised clustering methods, and sample weights.

As a bonus, this library also provides an easy-to-use implementation of the PLSCAN algorithm for automated cluster resolution selection and layered clustering.

Basic Usage

The fast_hdbscan library follows the hdbscan library in using the sklearn API. You can use the fast_hdbscan class HDBSCAN exactly as you would that of the hdbscan library, with the caveat that fast_hdbscan only supports a subset of the parameters and options of hdbscan. Nonetheless, if you have low-dimensional Euclidean data (e.g. the output of UMAP), you can use this library as a straightforward drop-in replacement for hdbscan:

import fast_hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = fast_hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)
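To get a quick feel for the result, you can count clusters and noise points in the returned labels. This is a minimal sketch that assumes noise points are labelled -1, as in the hdbscan library and sklearn; `summarize_labels` and `cluster_blobs` are hypothetical helpers written for this example.

```python
import numpy as np


def summarize_labels(labels):
    """Return (number of clusters, number of noise points).

    Assumption: noise points carry the label -1.
    """
    labels = np.asarray(labels)
    n_clusters = int(np.unique(labels[labels >= 0]).size)
    n_noise = int(np.count_nonzero(labels == -1))
    return n_clusters, n_noise


def cluster_blobs(n_samples=1000, min_cluster_size=10):
    """Cluster synthetic blobs and summarize the labels."""
    # Imported here so summarize_labels stays usable without fast_hdbscan.
    import fast_hdbscan
    from sklearn.datasets import make_blobs

    data, _ = make_blobs(n_samples)
    clusterer = fast_hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return summarize_labels(clusterer.fit_predict(data))
```

Calling `cluster_blobs()` returns a `(n_clusters, n_noise)` tuple; the exact numbers vary from run to run because `make_blobs` draws random data.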

The first import of the package will take a while, as numba functions will be compiled for the first time. These functions are cached by default; you can tell numba to ignore the cache by setting the environment variable FAST_HDBSCAN_NUMBA_CACHE to ‘false’.
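If you want to set that variable from Python and see how long the import takes, something like the following may help. It assumes the variable is read when the package is imported, so it has to be set first; `timed_import` is a hypothetical helper.

```python
import importlib
import os
import time

# Assumption: fast_hdbscan checks this variable at import time, so it must
# be set before the first import of the package in this process.
os.environ["FAST_HDBSCAN_NUMBA_CACHE"] = "false"


def timed_import(module_name):
    """Import a module by name and return (module, seconds taken).

    A module that is already loaded comes straight from sys.modules,
    so only the first import in a process is a meaningful measurement.
    """
    start = time.perf_counter()
    module = importlib.import_module(module_name)
    return module, time.perf_counter() - start
```

With the package installed, `timed_import("fast_hdbscan")` gives a rough idea of the compilation cost you pay when the cache is ignored.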

Alternatively, you can use the PLSCAN class to perform automated multiscale clustering:

import fast_hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = fast_hdbscan.PLSCAN()
cluster_labels = clusterer.fit_predict(data)
print(len(clusterer.cluster_layers_))  # number of layers found -- each layer is a clustering at a different resolution
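To compare resolutions, you could count the clusters in each layer. The sketch below assumes that every entry of `cluster_layers_` is an array of per-point labels with -1 marking noise; that structure is not confirmed above, so check it against your installed version. `clusters_per_layer` is a hypothetical helper.

```python
import numpy as np


def clusters_per_layer(cluster_layers):
    """Return the number of distinct clusters found in each layer.

    Assumption: each layer is a sequence of integer labels, one per data
    point, where -1 means noise.
    """
    counts = []
    for layer in cluster_layers:
        layer = np.asarray(layer)
        counts.append(int(np.unique(layer[layer >= 0]).size))
    return counts
```

After fitting, `clusters_per_layer(clusterer.cluster_layers_)` would give one count per layer under that assumption.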

Installation

fast_hdbscan requires:

  • numba

  • numpy

  • scikit-learn

If you need more than just Euclidean distance, or support for high-dimensional data, you will also need:

  • pynndescent
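A quick way to see which of these dependencies are importable in your environment is sketched below. Note that scikit-learn is imported as `sklearn`; `missing_modules` is a hypothetical helper and uses only the standard library.

```python
from importlib.util import find_spec

# Import names for the packages listed above.
REQUIRED = ("numba", "numpy", "sklearn")
OPTIONAL = ("pynndescent",)


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]
```

For example, `missing_modules(REQUIRED + OPTIONAL)` lists whatever still needs installing, and an empty list means everything was found.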

fast_hdbscan can be installed via pip:

pip install fast_hdbscan

To manually install this package:

wget https://github.com/TutteInstitute/fast_hdbscan/archive/main.zip
unzip main.zip
rm main.zip
cd fast_hdbscan-main
python setup.py install

References

The algorithm used here is an adaptation of the algorithms described in the papers:

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017

R. Campello, D. Moulavi, and J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172. 2013

The branch-detection functionality is adapted from:

D.M. Bot, J. Peeters, J. Liesenborgs, J. Aerts. FLASC: a flare-sensitive clustering algorithm. In: PeerJ Computer Science, Volume 11, e2792, 2025. https://doi.org/10.7717/peerj-cs.2792.

The PLSCAN functionality is adapted from:

D.M. Bot, L. McInnes, J. Aerts. Persistent Multiscale Density-based Clustering. In: arXiv preprint arXiv:2512.16558, 2025. https://arxiv.org/abs/2512.16558.

License

fast_hdbscan is BSD (2-clause) licensed. See the LICENSE file for details.

Contributing

Contributions are more than welcome! If you have ideas for features or projects, please get in touch. Everything from code to notebooks to examples and documentation is equally valuable, so please don’t feel you can’t contribute. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged in.

Download files

Source Distribution

fast_hdbscan-0.3.2.tar.gz (59.8 kB)

Uploaded Source

Built Distribution

fast_hdbscan-0.3.2-py3-none-any.whl (62.6 kB)

Uploaded Python 3

File details

Details for the file fast_hdbscan-0.3.2.tar.gz.

File metadata

  • Download URL: fast_hdbscan-0.3.2.tar.gz
  • Upload date:
  • Size: 59.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for fast_hdbscan-0.3.2.tar.gz
Algorithm Hash digest
SHA256 248e09202eda04da4b84dd819de3cc714469912205bb447ad4c7136db580f205
MD5 33e7c041b2d4020df3be7e48cef9788d
BLAKE2b-256 2b3b235c47dd8282610522f9fa6869ad8b896c2d5927f82039f66d547229fc07
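To check a downloaded source archive against the SHA256 digest listed above, a standard-library sketch such as the following could be used. The function names are hypothetical, and the local file path is whatever location you saved the archive to.

```python
import hashlib

# SHA256 digest listed above for fast_hdbscan-0.3.2.tar.gz
EXPECTED_SHA256 = "248e09202eda04da4b84dd819de3cc714469912205bb447ad4c7136db580f205"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of_file(path) == expected.lower()
```

`matches_expected("fast_hdbscan-0.3.2.tar.gz")` should return True for an intact download placed in the current directory.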

File details

Details for the file fast_hdbscan-0.3.2-py3-none-any.whl.

File metadata

  • Download URL: fast_hdbscan-0.3.2-py3-none-any.whl
  • Upload date:
  • Size: 62.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.20

File hashes

Hashes for fast_hdbscan-0.3.2-py3-none-any.whl
Algorithm Hash digest
SHA256 619b42963935e6c3ab1e51de7895dd35260f3ef9ebe4627ff4317073fe51a667
MD5 bacf9c6aecf6ce9ebc0b9fb713477259
BLAKE2b-256 d09234dac2dff0877d637535bd8852bc68323f155799521c9ee94c27c548426f
