
Clustering based on density with variable density clusters



HDBSCAN

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning – and the primary parameter, minimum cluster size, is intuitive and easy to select.

HDBSCAN is ideal for exploratory data analysis; it’s a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).

Based on the papers:

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017 [pdf]

R. Campello, D. Moulavi, and J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172. 2013

Documentation, including tutorials, is available on ReadTheDocs at http://hdbscan.readthedocs.io/en/latest/.

Notebooks comparing HDBSCAN to other clustering algorithms, explaining how HDBSCAN works and comparing performance with other python clustering implementations are available.

How to use HDBSCAN

The hdbscan package inherits from sklearn classes, and thus drops in neatly next to other sklearn clusterers with an identical calling API. Similarly it supports input in a variety of formats: an array (or pandas dataframe, or sparse matrix) of shape (num_samples x num_features); an array (or sparse matrix) giving a distance matrix between samples.

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)

Performance

Significant effort has been put into making the hdbscan implementation as fast as possible. It is orders of magnitude faster than the reference implementation in Java, and is currently faster than highly optimized single linkage implementations in C and C++. Version 0.7 performance can be seen in this notebook. In particular, performance on low-dimensional data is better than sklearn’s DBSCAN, and via support for caching with joblib, re-clustering with different parameters can be almost free.

Additional functionality

The hdbscan package comes equipped with visualization tools to help you understand your clustering results. After fitting data the clusterer object has attributes for:

  • The condensed cluster hierarchy

  • The robust single linkage cluster hierarchy

  • The reachability distance minimal spanning tree

All of these come equipped with methods for plotting and converting to Pandas or NetworkX for further analysis. See the notebook on how HDBSCAN works for examples and further details.

The clusterer objects also have an attribute providing cluster membership strengths, which gives optional soft clustering at no further compute expense. Finally, each cluster also receives a persistence score giving the stability of the cluster over the range of distance scales present in the data. This provides a measure of the relative strength of clusters.

Outlier Detection

The HDBSCAN clusterer objects also support the GLOSH outlier detection algorithm. After fitting the clusterer to data, the outlier scores can be accessed via the outlier_scores_ attribute. The result is a vector of score values, one for each data point that was fit. Higher scores represent more outlier-like objects. Selecting outliers via upper quantiles is often a good approach.

Based on the paper:

R.J.G.B. Campello, D. Moulavi, A. Zimek and J. Sander Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection, ACM Trans. on Knowledge Discovery from Data, Vol 10, 1 (July 2015), 1-51.
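
The upper-quantile selection described above can be sketched with the standard library alone. The helper name `select_outliers`, the example scores, and the 0.8 quantile are all hypothetical; with a fitted clusterer the scores would come from `clusterer.outlier_scores_`.

```python
import statistics


def select_outliers(scores, quantile=0.9):
    """Return indices of points whose score exceeds the given quantile."""
    # With n=100 this yields the 99 percentile cut points of the scores
    cut_points = statistics.quantiles(scores, n=100, method="inclusive")
    threshold = cut_points[round(quantile * 100) - 1]
    return [index for index, score in enumerate(scores) if score > threshold]


# Hypothetical scores standing in for clusterer.outlier_scores_
example_scores = [0.02, 0.05, 0.01, 0.10, 0.95, 0.03, 0.07, 0.04, 0.06, 0.88]
outlier_indices = select_outliers(example_scores, quantile=0.8)
print(outlier_indices)  # [4, 9]
```

The right quantile depends on how many outliers you expect in the data.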

Robust single linkage

The hdbscan package also provides support for the robust single linkage clustering algorithm of Chaudhuri and Dasgupta. As with the HDBSCAN implementation, this is a high-performance version of the algorithm, outperforming scipy’s standard single linkage implementation. The robust single linkage hierarchy is available as an attribute of the robust single linkage clusterer, again with the ability to plot or export the hierarchy, and to extract flat clusterings at a given cut level and gamma value.

Example usage:

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
cluster_labels = clusterer.fit_predict(data)
hierarchy = clusterer.cluster_hierarchy_
alt_labels = hierarchy.get_clusters(0.100, 5)
hierarchy.plot()

Based on the paper:

K. Chaudhuri and S. Dasgupta. “Rates of convergence for the cluster tree.” In Advances in Neural Information Processing Systems, 2010.

Branch detection

The hdbscan package supports a branch-detection post-processing step by Bot et al. Cluster shapes, such as branching structures, can reveal interesting patterns that are not expressed in density-based cluster hierarchies. The BranchDetector class mimics the HDBSCAN API and can be used to detect branching hierarchies in clusters. It provides condensed branch hierarchies, branch persistences, and branch memberships, and supports joblib’s caching functionality. A notebook demonstrating the BranchDetector is available.

Example usage:

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(branch_detection_data=True).fit(data)
branch_detector = hdbscan.BranchDetector().fit(clusterer)
branch_detector.cluster_approximation_graph_.plot(edge_width=0.1)

Based on the paper:

D.M. Bot, J. Peeters, J. Liesenborgs and J. Aerts FLASC: a flare-sensitive clustering algorithm. PeerJ Computer Science, Vol 11, April 2025, e2792. https://doi.org/10.7717/peerj-cs.2792.

Installing

Easiest install, if you have Anaconda (thanks to conda-forge which is awesome!):

conda install -c conda-forge hdbscan

PyPI install, presuming you have an up to date pip:

pip install hdbscan

Binary wheels for a number of platforms are available thanks to the work of Ryan Helinski <rlhelinski@gmail.com>.

If pip has difficulty pulling the dependencies, we suggest first upgrading pip to at least version 10 and trying again:

pip install --upgrade pip
pip install hdbscan

Otherwise, install the dependencies manually using Anaconda, then pull hdbscan from pip:

conda install cython
conda install numpy scipy
conda install scikit-learn
pip install hdbscan

For a manual install of the latest code directly from GitHub:

pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan

Alternatively download the package, install requirements, and manually run the installer:

wget https://github.com/scikit-learn-contrib/hdbscan/archive/master.zip
unzip master.zip
rm master.zip
cd hdbscan-master

pip install -r requirements.txt

python setup.py install

Running the Tests

The package tests can be run after installation using the command:

nosetests -s hdbscan

or, if nose is installed but nosetests is not in your PATH variable:

python -m nose -s hdbscan

If one or more of the tests fail, please report a bug at https://github.com/scikit-learn-contrib/hdbscan/issues/new

Python Version

The hdbscan library supports both Python 2 and Python 3. However, we recommend Python 3 as the better option if it is available to you.

Help and Support

For simple issues you can consult the FAQ in the documentation. If your issue is not suitably resolved there, please check the issues on GitHub. Finally, if no solution is available there, feel free to open an issue; the authors will attempt to respond in a reasonably timely fashion.

Contributing

We welcome contributions in any form! Assistance with documentation, particularly expanding tutorials, is always welcome. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Citing

If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Open Source Software article.

L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering In: Journal of Open Source Software, The Open Journal, volume 2, number 11. 2017

@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}

To reference the high performance algorithm developed in this library please cite our paper in ICDMW 2017 proceedings.

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017

@inproceedings{mcinnes2017accelerated,
  title={Accelerated Hierarchical Density Based Clustering},
  author={McInnes, Leland and Healy, John},
  booktitle={Data Mining Workshops (ICDMW), 2017 IEEE International Conference on},
  pages={33--42},
  year={2017},
  organization={IEEE}
}

If you used the branch-detection functionality in this library please cite our PeerJ paper:

Bot DM, Peeters J, Liesenborgs J, Aerts J. FLASC: a flare-sensitive clustering algorithm. In: PeerJ Computer Science, Volume 11, e2792, 2025. https://doi.org/10.7717/peerj-cs.2792

@article{bot2025flasc,
    title   = {{FLASC: a flare-sensitive clustering algorithm}},
    author  = {Bot, Dani{\"{e}}l M. and Peeters, Jannes and Liesenborgs, Jori and Aerts, Jan},
    year    = {2025},
    month   = {apr},
    journal = {PeerJ Comput. Sci.},
    volume  = {11},
    pages   = {e2792},
    issn    = {2376-5992},
    doi     = {10.7717/peerj-cs.2792},
    url     = {https://peerj.com/articles/cs-2792},
}

Licensing

The hdbscan package is 3-clause BSD licensed. Enjoy.
