Dense Clustering for Mixed Data Types

Amazon DenseClus

DenseClus is a Python module for clustering mixed-type data using UMAP and HDBSCAN. By accepting both categorical and numerical data, DenseClus makes it possible to incorporate all features in clustering.

Installation

python3 -m pip install amazon-denseclus

Quick Start

DenseClus requires a pandas DataFrame as input, with both numerical and categorical columns. All preprocessing and extraction are done under the hood: just call fit and then retrieve the clusters.

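from denseclus import DenseClus
from denseclus.utils import make_dataframe

# Build an example DataFrame.
df = make_dataframe()

# Configure the model with its defaults; the data is passed to fit(),
# not to the constructor.
clf = DenseClus()
clf.fit(df)

# Retrieve the clustering results and show the first ten entries.
scores = clf.evaluate()
print(scores[0:10])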
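
Before calling fit on your own data, it can help to confirm that the DataFrame really contains both kinds of columns. The following is a minimal sketch using only pandas. The helper name split_column_types and the sample columns are hypothetical, and treating every non-numeric column as categorical is an assumption made for illustration, not a description of how DenseClus detects column types internally.

```python
import pandas as pd


def split_column_types(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (numerical, categorical) column names.

    Assumption: any column that is not numeric counts as categorical.
    """
    numerical = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


# Hypothetical sample data with two numerical and one categorical column.
sample = pd.DataFrame(
    {
        "age": [23, 45, 31, 52],
        "income": [40_000.0, 82_500.0, 58_000.0, 91_000.0],
        "segment": ["a", "b", "a", "c"],
    }
)

numerical, categorical = split_column_types(sample)
if not numerical or not categorical:
    raise ValueError("Expected both numerical and categorical columns.")
print(numerical, categorical)
```

A check like this fails early with a clear message instead of surfacing a harder-to-read error during fitting.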

Usage

Prediction

DenseClus provides a predict method when umap_combine_method is set to ensemble. Results are returned in a 2D array, with the labels first and the probabilities second.

from denseclus import DenseClus
from denseclus.utils import make_dataframe

RANDOM_STATE = 10

df = make_dataframe(random_state=RANDOM_STATE)
train = df.sample(frac=0.8, random_state=RANDOM_STATE)
test = df.drop(train.index)
clf = DenseClus(random_state=RANDOM_STATE, umap_combine_method='ensemble')
clf.fit(train)

predictions = clf.predict(test)
print(predictions) # labels, probabilities
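
The returned array can then be summarized per cluster. The sketch below assumes each row of the result is a (label, probability) pair, which is one reading of the description above and should be checked against the actual output shape. The helper name count_confident_labels, the 0.5 threshold, and the stand-in rows are all hypothetical.

```python
from collections import Counter


def count_confident_labels(predictions, min_probability=0.5):
    """Count predicted labels whose probability meets the threshold.

    Assumption: `predictions` is an iterable of (label, probability) rows.
    """
    counts = Counter()
    for label, probability in predictions:
        if float(probability) >= min_probability:
            counts[int(label)] += 1
    return counts


# Stand-in rows used here instead of real clf.predict(test) output.
example_predictions = [[0, 0.91], [1, 0.42], [0, 0.77], [2, 0.65]]
confident_counts = count_confident_labels(example_predictions)
print(dict(confident_counts))
```

Because the helper only iterates over rows, it should work the same way on a list of pairs or on a two-column array.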

On Combination Method

For slower but more stable results, select intersection_union_mapper to combine the embedding layers via a third UMAP, which gives equal weight to the numerical and categorical columns. By default, the random seed is set, which prevents UMAP from running in parallel but helps reduce some of the randomness of the algorithm.

clf = DenseClus(
    umap_combine_method="intersection_union_mapper",
)
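
The trade-off between reproducibility and parallelism can be captured in a small helper that assembles the constructor arguments. This is a sketch only: the function name denseclus_kwargs and its reproducible flag are hypothetical, while umap_combine_method and random_state are the parameters used elsewhere in this README. The seed value 42 is arbitrary.

```python
def denseclus_kwargs(reproducible: bool = True, seed: int = 42) -> dict:
    """Build keyword arguments for DenseClus (hypothetical helper).

    reproducible=True fixes the seed, trading UMAP parallelism for
    repeatable results; reproducible=False passes random_state=None.
    """
    return {
        "umap_combine_method": "intersection_union_mapper",
        "random_state": seed if reproducible else None,
    }


stable = denseclus_kwargs(reproducible=True)
parallel = denseclus_kwargs(reproducible=False)
print(stable)
print(parallel)
# With the package installed: clf = DenseClus(**stable)
```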

To Use a GPU with Ensemble

To use a GPU, first install RAPIDS. You can do this during setup by specifying the CUDA version:

pip install denseclus[gpu-cu12]

Then to run:

clf = DenseClus(
    umap_combine_method="ensemble",
    use_gpu=True
)
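
Since use_gpu=True depends on optional GPU libraries, a guard can fall back to the CPU when they are missing. The sketch below checks for an importable module with the standard library. Probing for cuml (the RAPIDS machine learning library) is an assumption about what the GPU extra installs, and gpu_backend_available is a hypothetical helper name.

```python
import importlib.util


def gpu_backend_available(module_name: str = "cuml") -> bool:
    """Return True if the named top-level module can be imported.

    Assumption: the presence of `cuml` indicates a usable RAPIDS install.
    """
    return importlib.util.find_spec(module_name) is not None


# Enable the GPU path only when the backend appears to be installed.
gpu_kwargs = {
    "umap_combine_method": "ensemble",
    "use_gpu": gpu_backend_available(),
}
print(gpu_kwargs)
# With the package installed: clf = DenseClus(**gpu_kwargs)
```

This only confirms that the module is importable; it does not verify that a compatible GPU or driver is present.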

Advanced Usage

Advanced users can exercise finer-grained control over the underlying algorithms by passing parameter dictionaries for UMAP or HDBSCAN to the DenseClus class.

For example:

from denseclus import DenseClus
from denseclus.utils import make_dataframe

umap_params = {
    "categorical": {"n_neighbors": 15, "min_dist": 0.1},
    "numerical": {"n_neighbors": 20, "min_dist": 0.1},
}
hdbscan_params = {"min_cluster_size": 10}

df = make_dataframe()

clf = DenseClus(
    umap_combine_method="union",
    umap_params=umap_params,
    hdbscan_params=hdbscan_params,
    random_state=None,  # no fixed seed, so this will run in parallel
)

clf.fit(df)
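
When the categorical and numerical embeddings share most settings, the nested dictionary can be built from a common base plus per-type overrides. This is a plain-Python sketch; build_umap_params is a hypothetical helper, and the keys mirror the example above.

```python
def build_umap_params(base: dict, overrides: dict) -> dict:
    """Merge shared UMAP settings with per-data-type overrides."""
    return {
        data_type: {**base, **overrides.get(data_type, {})}
        for data_type in ("categorical", "numerical")
    }


# Same values as the example above, written as base settings plus one override.
merged_umap_params = build_umap_params(
    base={"n_neighbors": 15, "min_dist": 0.1},
    overrides={"numerical": {"n_neighbors": 20}},
)
print(merged_umap_params)
```

Keeping the shared values in one place makes it harder for the two embeddings to drift apart by accident when tuning.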

Examples

Notebooks

A hands-on example with an overview of how to use DenseClus is currently available in the form of an Example Jupyter Notebook.

Should you need to tune HDBSCAN, here is an optional approach: Tuning with HDBSCAN Notebook

Should you need to validate UMAP embeddings, there is an approach to do so in the Validation for UMAP Notebook.

Blogs

AWS Blog: Introducing DenseClus, an open source clustering package for mixed-type data

TDS Blog: How To Tune HDBSCAN

TDS Blog: On the Validation of UMAP

References

@article{mcinnes2018umap-software,
  title={UMAP: Uniform Manifold Approximation and Projection},
  author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
  journal={The Journal of Open Source Software},
  volume={3},
  number={29},
  pages={861},
  year={2018}
}
@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}

