Clustering based on density with variable density clusters

HDBSCAN

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) performs DBSCAN over varying epsilon values and integrates the results to find the clustering that is most stable over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN) and makes it more robust to parameter selection.

In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning – and the primary parameter, minimum cluster size, is intuitive and easy to select.

HDBSCAN is ideal for exploratory data analysis; it’s a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).

Based on the papers:

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017 [pdf]

R. Campello, D. Moulavi, and J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172. 2013

Documentation, including tutorials, is available on ReadTheDocs at http://hdbscan.readthedocs.io/en/latest/ .

Notebooks comparing HDBSCAN to other clustering algorithms, explaining how HDBSCAN works, and comparing performance with other Python clustering implementations are available.

How to use HDBSCAN

The hdbscan package inherits from sklearn classes, and thus drops in neatly next to other sklearn clusterers with an identical calling API. Similarly it supports input in a variety of formats: an array (or pandas dataframe, or sparse matrix) of shape (num_samples x num_features); an array (or sparse matrix) giving a distance matrix between samples.

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)

Performance

Significant effort has been put into making the hdbscan implementation as fast as possible. It is orders of magnitude faster than the reference implementation in Java, and is currently faster than highly optimized single linkage implementations in C and C++. Version 0.7 performance can be seen in this notebook. In particular, performance on low dimensional data is better than sklearn’s DBSCAN, and via support for caching with joblib, re-clustering with different parameters can be almost free.

Additional functionality

The hdbscan package comes equipped with visualization tools to help you understand your clustering results. After fitting data the clusterer object has attributes for:

  • The condensed cluster hierarchy

  • The robust single linkage cluster hierarchy

  • The reachability distance minimal spanning tree

All of these come equipped with methods for plotting and converting to Pandas or NetworkX for further analysis. See the notebook on how HDBSCAN works for examples and further details.

The clusterer objects also have an attribute providing cluster membership strengths, which enables optional soft clustering at no further compute expense. Finally, each cluster also receives a persistence score giving the stability of the cluster over the range of distance scales present in the data. This provides a measure of the relative strength of clusters.

Outlier Detection

The HDBSCAN clusterer objects also support the GLOSH outlier detection algorithm. After fitting the clusterer to data, the outlier scores can be accessed via the outlier_scores_ attribute. The result is a vector of score values, one for each data point that was fit. Higher scores represent more outlier-like objects. Selecting outliers via upper quantiles is often a good approach.

Based on the paper:

R.J.G.B. Campello, D. Moulavi, A. Zimek and J. Sander Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection, ACM Trans. on Knowledge Discovery from Data, Vol 10, 1 (July 2015), 1-51.

Robust single linkage

The hdbscan package also provides support for the robust single linkage clustering algorithm of Chaudhuri and Dasgupta. As with the HDBSCAN implementation, this is a high performance version of the algorithm, outperforming scipy’s standard single linkage implementation. The robust single linkage hierarchy is available as an attribute of the robust single linkage clusterer, again with the ability to plot or export the hierarchy, and to extract flat clusterings at a given cut level and gamma value.

Example usage:

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
cluster_labels = clusterer.fit_predict(data)
hierarchy = clusterer.cluster_hierarchy_
alt_labels = hierarchy.get_clusters(0.100, 5)
hierarchy.plot()

Based on the paper:

K. Chaudhuri and S. Dasgupta. “Rates of convergence for the cluster tree.” In Advances in Neural Information Processing Systems, 2010.

Branch detection

The hdbscan package supports a branch-detection post-processing step by Bot et al. Cluster shapes, such as branching structures, can reveal interesting patterns that are not expressed in density-based cluster hierarchies. The BranchDetector class mimics the HDBSCAN API and can be used to detect branching hierarchies in clusters. It provides condensed branch hierarchies, branch persistences, and branch memberships, and supports joblib’s caching functionality. A notebook demonstrating the BranchDetector is available.

Example usage:

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(branch_detection_data=True).fit(data)
branch_detector = hdbscan.BranchDetector().fit(clusterer)
branch_detector.cluster_approximation_graph_.plot(edge_width=0.1)

Based on the paper:

D. M. Bot, J. Peeters, J. Liesenborgs and J. Aerts “FLASC: A Flare-Sensitive Clustering Algorithm: Extending HDBSCAN* for Detecting Branches in Clusters” Arxiv 2311.15887, 2023.

Installing

Easiest install, if you have Anaconda (thanks to conda-forge which is awesome!):

conda install -c conda-forge hdbscan

PyPI install, presuming you have an up to date pip:

pip install hdbscan

Binary wheels for a number of platforms are available thanks to the work of Ryan Helinski <rlhelinski@gmail.com>.

If pip has difficulty pulling the dependencies, we suggest first upgrading pip to at least version 10 and trying again:

pip install --upgrade pip
pip install hdbscan

Otherwise, install the dependencies manually using Anaconda, then install hdbscan from pip:

conda install cython
conda install numpy scipy
conda install scikit-learn
pip install hdbscan

For a manual install of the latest code directly from GitHub:

pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan

Alternatively download the package, install requirements, and manually run the installer:

wget https://github.com/scikit-learn-contrib/hdbscan/archive/master.zip
unzip master.zip
rm master.zip
cd hdbscan-master

pip install -r requirements.txt

python setup.py install

Running the Tests

The package tests can be run after installation using the command:

nosetests -s hdbscan

or, if nose is installed but nosetests is not in your PATH variable:

python -m nose -s hdbscan

If one or more of the tests fail, please report a bug at https://github.com/scikit-learn-contrib/hdbscan/issues/new

Python Version

The hdbscan library supports both Python 2 and Python 3. However we recommend Python 3 as the better option if it is available to you.

Help and Support

For simple issues you can consult the FAQ in the documentation. If your issue is not suitably resolved there, please check the issues on GitHub. Finally, if no solution is available there, feel free to open an issue; the authors will attempt to respond in a reasonably timely fashion.

Contributing

We welcome contributions in any form! Assistance with documentation, particularly expanding tutorials, is always welcome. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Citing

If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Open Source Software article.

L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering In: Journal of Open Source Software, The Open Journal, volume 2, number 11. 2017

@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}

To reference the high performance algorithm developed in this library, please cite our paper in the ICDMW 2017 proceedings.

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017

@inproceedings{mcinnes2017accelerated,
  title={Accelerated Hierarchical Density Based Clustering},
  author={McInnes, Leland and Healy, John},
  booktitle={Data Mining Workshops (ICDMW), 2017 IEEE International Conference on},
  pages={33--42},
  year={2017},
  organization={IEEE}
}

If you have used the branch-detection functionality in this codebase in a scientific publication and wish to cite it, please use the arXiv preprint:

D. M. Bot, J. Peeters, J. Liesenborgs and J. Aerts “FLASC: A Flare-Sensitive Clustering Algorithm: Extending HDBSCAN* for Detecting Branches in Clusters” Arxiv 2311.15887, 2023.

@misc{bot2023flasc,
    title={FLASC: A Flare-Sensitive Clustering Algorithm: Extending HDBSCAN* for Detecting Branches in Clusters},
    author={D. M. Bot and J. Peeters and J. Liesenborgs and J. Aerts},
    year={2023},
    eprint={2311.15887},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2311.15887},
}

Licensing

The hdbscan package is 3-clause BSD licensed. Enjoy.
