A fast multicore version of the HDBSCAN and PLSCAN clustering algorithms.

Fast Multicore HDBSCAN

The fast_hdbscan library provides an implementation of the HDBSCAN clustering algorithm designed specifically for high performance on multicore machines. The algorithm runs in parallel and can make effective use of as many cores as you wish to throw at a problem. It is thus ideal for large SMP systems, and even modern multicore laptops.

This library provides a re-implementation of a subset of the HDBSCAN algorithm that is compatible with the hdbscan library. There are specific optimizations for low-dimensional Euclidean data. Other distance metrics and high-dimensional data fall back to alternative parallel approaches that are faster than the hdbscan library, but not necessarily as performant as the highly optimized low-dimensional Euclidean case. The primary advantages of this library over the standard hdbscan library are:

  • this library can easily use all available cores to speed up computation;

  • this library has much faster implementations of tree condensing and cluster extraction;

  • this library is much simpler, which makes it more approachable to extend or to borrow components from;

  • this library is built on numba and has fewer issues with binaries and compilation;

  • this library provides features such as semi-supervision, linking constraints, sample weights, branch detection from FLASC, and an implementation of PLSCAN.

This library does not support all the features and input formats available in the hdbscan library, but covers the most common use cases.

This library does support a number of research extensions to HDBSCAN including branch detection from FLASC and the semi-supervised clustering methods, as well as support for sample weights.

As a bonus, this library also provides an easy-to-use implementation of the PLSCAN algorithm for automated cluster resolution selection and layered clustering.

Basic Usage

The fast_hdbscan library follows the hdbscan library in using the sklearn API. You can use the fast_hdbscan class HDBSCAN exactly as you would that of the hdbscan library, with the caveat that fast_hdbscan only supports a subset of the parameters and options of hdbscan. Nonetheless, if you have low-dimensional Euclidean data (e.g. the output of UMAP), you can use this library as a straightforward drop-in replacement for hdbscan:

import fast_hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = fast_hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)
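
Once you have a label vector, a quick summary is often the first thing you want. The sketch below is not part of fast_hdbscan: summarize_labels is a hypothetical helper, and it assumes the hdbscan convention that noise points are labelled -1. It uses a hand-written label list so that it runs without the library installed.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (n_clusters, n_noise, cluster_sizes) for a vector of cluster labels.

    Hypothetical helper, not part of fast_hdbscan. Assumes that the label -1
    marks noise points, as in the hdbscan library.
    """
    # int() lets the same code accept plain ints and numpy integer labels.
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)
    return len(counts), n_noise, dict(sorted(counts.items()))


# Hand-written labels standing in for the output of fit_predict.
example_labels = [0, 0, 1, -1, 1, 1, -1, 2]
n_clusters, n_noise, sizes = summarize_labels(example_labels)
print(n_clusters, n_noise, sizes)
```

Passing the cluster_labels array from the example above in place of example_labels should work the same way.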

The first import of the package will take a while, as numba functions will be compiled for the first time. These functions are cached by default; you can tell numba to ignore the cache by setting the environment variable FAST_HDBSCAN_NUMBA_CACHE to ‘false’.
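
If you would rather set the variable from Python than from the shell, something like the following may work. It is a sketch under one assumption that the text above does not confirm: that the variable is read while the package is being imported, so it must be set before the first import in the process. The helper name import_with_numba_cache_disabled is made up for illustration.

```python
import importlib
import os


def import_with_numba_cache_disabled(module_name="fast_hdbscan"):
    """Set FAST_HDBSCAN_NUMBA_CACHE to 'false', then import the named module.

    Hypothetical helper. Assumption: the variable is read at import time,
    so it has to be set before the package is first imported.
    """
    os.environ["FAST_HDBSCAN_NUMBA_CACHE"] = "false"
    return importlib.import_module(module_name)


# Typical use at the very top of a script (requires fast_hdbscan to be installed):
# fast_hdbscan = import_with_numba_cache_disabled()
```

Setting the variable in the shell before starting Python avoids the ordering question entirely.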

Alternatively, you can use the PLSCAN class to perform automated multiscale clustering:

import fast_hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = fast_hdbscan.PLSCAN()
cluster_labels = clusterer.fit_predict(data)
print(len(clusterer.cluster_layers_))  # number of layers found -- each layer is a clustering at a different resolution
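
To compare the resolutions, you can count the clusters in each layer. The sketch below assumes that every entry of cluster_layers_ is a per-point label sequence with -1 marking noise; the text above does not spell this out, so check it against your installed version. clusters_per_layer is a hypothetical helper, and the layers are hand-written so that the snippet runs on its own.

```python
def clusters_per_layer(cluster_layers):
    """Count the distinct non-noise labels in each layer.

    Hypothetical helper, not part of fast_hdbscan. Assumes each layer is a
    sequence with one integer label per point and that -1 marks noise.
    """
    return [
        len({int(label) for label in layer if int(label) != -1})
        for layer in cluster_layers
    ]


# Hand-written stand-in for clusterer.cluster_layers_ (two layers, seven points).
example_layers = [
    [0, 0, 1, 1, 2, 2, -1],
    [0, 0, 0, 0, 1, 1, -1],
]
print(clusters_per_layer(example_layers))  # prints [3, 2]
```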

Installation

fast_hdbscan requires:

  • numba

  • numpy

  • scikit-learn

If you need more than just Euclidean distance, or support for high-dimensional data, you will also need:

  • pynndescent

fast_hdbscan can be installed via pip:

pip install fast_hdbscan
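
To check which of the packages listed above are importable in the current environment, a small standard-library script along these lines can help. The helper name missing_packages is illustrative. The only mapping that is not one-to-one is scikit-learn, which is imported as sklearn.

```python
from importlib.util import find_spec

# pip name -> import name, for the dependencies listed above.
REQUIRED = {"numba": "numba", "numpy": "numpy", "scikit-learn": "sklearn"}
OPTIONAL = {"pynndescent": "pynndescent"}


def missing_packages(packages):
    """Return the pip names of the packages that cannot be found for import."""
    return [
        pip_name
        for pip_name, import_name in packages.items()
        if find_spec(import_name) is None
    ]


print("missing required:", missing_packages(REQUIRED))
print("missing optional:", missing_packages(OPTIONAL))
```

An empty list means every package in that group was found. A package that is found can still fail at import time, for example because of a version mismatch.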

To manually install this package:

wget https://github.com/TutteInstitute/fast_hdbscan/archive/main.zip
unzip main.zip
rm main.zip
cd fast_hdbscan-main
pip install .

References

The algorithm used here is an adaptation of the algorithms described in the papers:

L. McInnes, J. Healy. Accelerated Hierarchical Density Based Clustering. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017

R. Campello, D. Moulavi, and J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172. 2013

The branch-detection functionality is adapted from:

D.M. Bot, J. Peeters, J. Liesenborgs, J. Aerts. FLASC: a flare-sensitive clustering algorithm. In: PeerJ Computer Science, Volume 11, e2792, 2025. https://doi.org/10.7717/peerj-cs.2792.

The PLSCAN functionality is adapted from:

D.M. Bot, L. McInnes, J. Aerts. Persistent Multiscale Density-based Clustering. In: arXiv preprint arXiv:2512.16558, 2025. https://arxiv.org/abs/2512.16558.

License

fast_hdbscan is BSD (2-clause) licensed. See the LICENSE file for details.

Contributing

Contributions are more than welcome! If you have ideas for features or projects, please get in touch. Everything from code to notebooks to examples and documentation is equally valuable, so please don’t feel you can’t contribute. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged in.
