Skip to main content

Hierarchical Boundary Coefficient Clustering.

Project description

PyPI version Tests

Hierarchical Boundary Coefficient Clustering

HBCC is a clustering algorithm is inspired by the work of Peng et al. [4]. Their CDC algorithm detects regions that separate clusters by quantifying how spread out points' neighbors are in feature space. The intuition for this process is as follows: (1) points within a clusters have near neighbors on all sides, and (2) points in-between clusters or on the edge of a cluster predominantly have neighbors in the direction of the cluster's core.

The same intuition was previously used by Vandaele et al. [2] to quantify how close points are to the boundary of the data's manifold. They used their boundary coefficient to extract manifold skeletons.

HBCC combines Vandaele et al. [2]'s boundary coefficient with HDBSCAN's [1], [3] principled density-based cluster selection strategies. This approach removes CDC's ratio parameter. Instead a clustering hierarchy is evaluated over all boundary coefficient values and clusters are selected from the hierarchy.

The implementation relies on McInnes' fast_hdbscan and its support for lensed clustering introduced for Bot et al. [5] to detect branches.

The HBCC algorithm works as follows:

  1. Compute k nearest neighbors and minimum spanning tree [3].
  2. Compute n-hop distances from nearest neighbors [2]
  3. Compute the boundary coefficient for each point [2].
  4. Compute minimum spanning tree from boundary coefficient weighted knn--mst graph union [5].
  5. Compute HDBSCAN cluster hierarchy and selection [1].

How to use HBCC

import numpy as np
import matplotlib.pyplot as plt
from fast_hbcc import HBCC

data = np.load('data/flareable/flared_clusterable_data.npy')
c = HBCC(
    num_hops=4,
    min_samples=10,
    min_cluster_size=100,
    cluster_selection_method="leaf",
).fit(data)
plt.scatter(*data.T, c=c.labels_ % 10, s=2, cmap='tab10', vmin=0, vmax=9)
plt.axis("off")
plt.show()

HBCC cluster scatterplot

Example Notebooks

A notebook demonstrating how the algorithm works is available on nbviewer.

Citing

Citing information not available yet.

Licensing

The fast_hbcc package has a 2-Clause BSD license.

References

  • [1] Campello, R. J., Moulavi, D., & Sander, J. (2013, April). Density-based clustering based on hierarchical density estimates. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 160-172). Springer Berlin Heidelberg.

  • [2] Vandaele, R., Saeys, Y., & De Bie, T. (2019). The Boundary Coefficient : a Vertex Measure for Visualizing and Finding Structure in Weighted Graphs. 15th International Workshop on Mining and Learning with Graphs (MLG).

  • [3] McInnes, L., & Healy, J. (2017). Accelerated Hierarchical Density Based Clustering. 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 2017-Novem, 33–42. https://doi.org/10.1109/ICDMW.2017.12.

  • [4] Peng, D., Gui, Z., Wang, D., Ma, Y., Huang, Z., Zhou, Y., & Wu, H. (2022). Clustering by measuring local direction centrality for data with heterogeneous density and weak connectivity. Nature Communications, 13(1), 1–14. https://doi.org/10.1038/s41467-022-33136-9.

  • [5] Bot, Daniël M., et al. "FLASC: A Flare-Sensitive Clustering Algorithm: Extending HDBSCAN* for Detecting Branches in Clusters." arXiv preprint arXiv:2311.15887 (2023). https://arxiv.org/abs/2311.15887.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fast_hbcc-0.1.0.tar.gz (16.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fast_hbcc-0.1.0-py3-none-any.whl (16.8 kB view details)

Uploaded Python 3

File details

Details for the file fast_hbcc-0.1.0.tar.gz.

File metadata

  • Download URL: fast_hbcc-0.1.0.tar.gz
  • Upload date:
  • Size: 16.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.1 CPython/3.11.11

File hashes

Hashes for fast_hbcc-0.1.0.tar.gz
Algorithm Hash digest
SHA256 548a3f93221d87f2685f28c535d906dd5ff2d9d8381129358a1cd4c962ca6379
MD5 a081e7717e4dc5099e420494023a4ef9
BLAKE2b-256 ed0abac275dd1306aef202c50dcf67d148b164acfc2952064ee1fdf75aadc85c

See more details on using hashes here.

File details

Details for the file fast_hbcc-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: fast_hbcc-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 16.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/4.0.1 CPython/3.11.11

File hashes

Hashes for fast_hbcc-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 448de6f10ccc3a26f04e32e0402cc63d03f8b18e279819620e2e4ea572209ac5
MD5 75590a40c2ac724233b09b2d94dc5118
BLAKE2b-256 1a66d6b523b1fa6d877d92611e11dbce466c95f60fd5484307aea1a2f004181d

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page