Dense Clustering for Mixed Data Types

Amazon DenseClus

DenseClus is a Python module for clustering mixed-type data using UMAP and HDBSCAN. By handling both categorical and numerical data, DenseClus makes it possible to incorporate all features in clustering.

Installation

python3 -m pip install Amazon-DenseClus

Quick Start

DenseClus requires a pandas DataFrame as input, with both numerical and categorical columns. All preprocessing and extraction are done under the hood: just call fit and then retrieve the clusters.

from denseclus import DenseClus
from denseclus.utils import make_dataframe

df = make_dataframe()

clf = DenseClus()
clf.fit(df)

print(clf.score())
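
To try DenseClus on your own data instead of make_dataframe, the input can be assembled along the lines below. This is a minimal sketch: the column names and value ranges are hypothetical, and it assumes numeric dtypes are treated as numerical and string (object) columns as categorical, which is worth confirming against the DenseClus source for your version.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type data; names and distributions are illustrative only.
rng = np.random.default_rng(42)
n_rows = 200

df = pd.DataFrame(
    {
        "age": rng.integers(18, 80, size=n_rows),
        "income": rng.normal(50_000, 15_000, size=n_rows),
        "plan": rng.choice(["basic", "plus", "premium"], size=n_rows),
        "region": rng.choice(["north", "south", "east", "west"], size=n_rows),
    }
)

# Check which columns are numeric and which are not before fitting.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("numerical:", numerical_cols)
print("categorical:", categorical_cols)

# With DenseClus installed, the frame can then be passed to fit:
# from denseclus import DenseClus
# clf = DenseClus()
# clf.fit(df)
```

Checking the dtype split first makes it easier to spot a column that was read in with an unintended type, such as numbers stored as strings.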

Usage

For slower but more stable results, select intersection_union_mapper to combine the embedding layers via a third UMAP, which gives equal weight to the numerical and categorical columns. By default, the random seed is set, which prevents UMAP from running in parallel but helps reduce some of the randomness of the algorithm.

clf = DenseClus(
    umap_combine_method="intersection_union_mapper",
)

Advanced Usage

For advanced users, finer-grained control of the underlying algorithms is possible by passing dictionaries into the DenseClus class.

For example:

from denseclus import DenseClus
from denseclus.utils import make_dataframe

umap_params = {
    "categorical": {"n_neighbors": 15, "min_dist": 0.1},
    "numerical": {"n_neighbors": 20, "min_dist": 0.1},
}
hdbscan_params = {"min_cluster_size": 10}

df = make_dataframe()

clf = DenseClus(
    umap_combine_method="union",
    umap_params=umap_params,
    hdbscan_params=hdbscan_params,
    random_state=None,  # no fixed seed, so this will run in parallel
)

clf.fit(df)
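
When adjusting hdbscan_params, it can help to check how many clusters were found and how many points were left as noise. The helper below is a sketch, not part of DenseClus: summarize_labels is a hypothetical name, and it assumes the labels follow the HDBSCAN convention of one integer per row with -1 marking noise. The example labels are made up for illustration.

```python
from collections import Counter


def summarize_labels(labels):
    """Summarize cluster labels where -1 marks noise (HDBSCAN convention)."""
    counts = Counter(int(label) for label in labels)
    n_total = sum(counts.values())
    n_noise = counts.pop(-1, 0)  # remove noise so only real clusters remain
    return {
        "n_clusters": len(counts),
        "noise_fraction": n_noise / n_total if n_total else 0.0,
        "cluster_sizes": dict(sorted(counts.items())),
    }


# Made-up labels for illustration; in practice, pass the fitted labels instead.
example_labels = [0, 0, 1, -1, 1, 1, 2, -1, 0, 2]
summary = summarize_labels(example_labels)
print(summary)
```

A high noise fraction or many very small clusters can be a sign that min_cluster_size needs another look.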

Examples

Notebooks

A hands-on example with an overview of how to use DenseClus is currently available in the form of an Example Jupyter Notebook.

Should you need to tune HDBSCAN, here is an optional approach: Tuning with HDBSCAN Notebook

Should you need to validate UMAP embeddings, there is an approach to do so in the Validation for UMAP Notebook.

Blogs

AWS Blog: Introducing DenseClus, an open source clustering package for mixed-type data

TDS Blog: How To Tune HDBSCAN

TDS Blog: On the Validation of UMAP

References

@article{mcinnes2018umap-software,
  title={UMAP: Uniform Manifold Approximation and Projection},
  author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
  journal={The Journal of Open Source Software},
  volume={3},
  number={29},
  pages={861},
  year={2018}
}
@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}
