Dense Clustering for Mixed Data Types
Amazon DenseClus
DenseClus is a Python module for clustering mixed-type data using UMAP and HDBSCAN. Because it accepts both categorical and numerical data, DenseClus makes it possible to incorporate all features in clustering.
Installation
python3 -m pip install amazon-denseclus
Quick Start
DenseClus requires a pandas DataFrame as input, with both numerical and categorical columns. All preprocessing and extraction are done under the hood: just call fit and then retrieve the clusters.
from denseclus import DenseClus
from denseclus.utils import make_dataframe
df = make_dataframe()
clf = DenseClus()
clf.fit(df)
scores = clf.evaluate()
print(scores[0:10])
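To cluster your own data instead of the synthetic frame from make_dataframe, the input only needs to be a pandas DataFrame that mixes numeric and non-numeric columns. The sketch below is a minimal illustration: the column names and the fit_clusters helper are hypothetical, and it assumes DenseClus tells the two kinds of column apart by dtype, so it is worth confirming how pandas has typed each column before fitting.

```python
import pandas as pd

# Hypothetical customer data: two numeric and two categorical columns.
df = pd.DataFrame(
    {
        "age": [34, 51, 27, 45],
        "annual_spend": [1200.5, 340.0, 87.25, 910.0],
        "plan": ["basic", "premium", "basic", "trial"],
        "region": ["east", "west", "west", "north"],
    }
)

# Inspect which columns pandas treats as numeric and which it does not.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)


def fit_clusters(frame: pd.DataFrame):
    """Fit DenseClus on a mixed-type frame (hypothetical helper)."""
    from denseclus import DenseClus  # lazy import; requires amazon-denseclus

    clf = DenseClus()
    clf.fit(frame)
    return clf.evaluate()
```

The four-row frame is only there to show the expected shape of the input; a real dataset needs far more rows before density-based clustering produces meaningful groups.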
Usage
Prediction
DenseClus provides a predict method when umap_combine_method is set to ensemble. Results are returned in a 2D array, with the labels first and the probabilities second.
from denseclus import DenseClus
from denseclus.utils import make_dataframe
RANDOM_STATE = 10
df = make_dataframe(random_state=RANDOM_STATE)
train = df.sample(frac=0.8, random_state=RANDOM_STATE)
test = df.drop(train.index)
clf = DenseClus(random_state=RANDOM_STATE, umap_combine_method='ensemble')
clf.fit(train)
predictions = clf.predict(test)
print(predictions) # labels, probabilities
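One way to unpack the result into separate labels and probabilities is sketched below. It assumes each row of predictions holds a (label, probability) pair, which is one reading of the description above; the split_predictions helper and the sample values are illustrative only, so check the shape returned by your installed version. HDBSCAN marks points it cannot assign to any cluster with the label -1.

```python
from typing import Iterable, List, Sequence, Tuple


def split_predictions(
    predictions: Iterable[Sequence[float]],
) -> Tuple[List[int], List[float]]:
    """Split rows of (label, probability) into two parallel lists."""
    labels, probabilities = [], []
    for label, probability in predictions:
        labels.append(int(label))
        probabilities.append(float(probability))
    return labels, probabilities


# Stand-in for the output of clf.predict(test); values are made up.
sample = [(0, 0.93), (1, 0.71), (-1, 0.0), (0, 0.88)]
labels, probabilities = split_predictions(sample)

# HDBSCAN uses -1 for noise points that belong to no cluster.
noise_count = sum(1 for label in labels if label == -1)
print(labels, probabilities, noise_count)
```

Because the helper only iterates over rows, it accepts a list of pairs as well as a two-column NumPy array.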
On Combination Method
For slower but more stable results, select intersection_union_mapper to combine the embedding layers via a third UMAP, which gives equal weight to the numerical and categorical columns. By default the random seed is set, which prevents UMAP from running in parallel but helps reduce some of the randomness of the algorithm.
clf = DenseClus(
    umap_combine_method="intersection_union_mapper",
)
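The trade-off between reproducibility and speed comes down to the random_state argument. As a rough sketch, the hypothetical helper below builds the constructor arguments for either mode; the seed value 42 is an arbitrary choice for illustration.

```python
from typing import Any, Dict, Optional


def denseclus_kwargs(reproducible: bool, seed: int = 42) -> Dict[str, Any]:
    """Build DenseClus constructor arguments (hypothetical helper)."""
    # A fixed seed gives repeatable embeddings but keeps UMAP single-threaded;
    # None lets UMAP run in parallel at the cost of run-to-run variation.
    random_state: Optional[int] = seed if reproducible else None
    return {
        "umap_combine_method": "intersection_union_mapper",
        "random_state": random_state,
    }


stable = denseclus_kwargs(reproducible=True)
fast = denseclus_kwargs(reproducible=False)
print(stable)
print(fast)

# With amazon-denseclus installed, pass them straight to the constructor:
#     from denseclus import DenseClus
#     clf = DenseClus(**stable)
```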
To Use with GPU with Ensemble
To use the GPU, first install RAPIDS.
You can do this during setup by choosing the extra that matches your CUDA version.
pip install "amazon-denseclus[gpu-cu12]"
Then to run:
clf = DenseClus(
    umap_combine_method="ensemble",
    use_gpu=True,
)
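Because use_gpu=True depends on RAPIDS being present, it can help to check for it before constructing the clusterer. The sketch below assumes the RAPIDS install exposes the cuml module (the import name of RAPIDS cuML); rapids_available and make_clusterer are hypothetical helpers, not part of DenseClus.

```python
import importlib.util


def rapids_available() -> bool:
    """Return True if the cuml module can be found, without importing it."""
    return importlib.util.find_spec("cuml") is not None


def make_clusterer():
    """Create an ensemble DenseClus, using the GPU only when RAPIDS is found."""
    from denseclus import DenseClus  # lazy import; requires amazon-denseclus

    return DenseClus(
        umap_combine_method="ensemble",
        use_gpu=rapids_available(),
    )


print("RAPIDS found:", rapids_available())
```

Falling back to the CPU this way keeps the same script usable on machines without a CUDA-capable device.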
Advanced Usage
Advanced users can get more fine-grained control of the underlying algorithms by passing dictionaries of UMAP or HDBSCAN parameters to the DenseClus class.
For example:
from denseclus import DenseClus
from denseclus.utils import make_dataframe
umap_params = {
    "categorical": {"n_neighbors": 15, "min_dist": 0.1},
    "numerical": {"n_neighbors": 20, "min_dist": 0.1},
}
hdbscan_params = {"min_cluster_size": 10}
df = make_dataframe()
clf = DenseClus(
    umap_combine_method="union",
    umap_params=umap_params,
    hdbscan_params=hdbscan_params,
    random_state=None,  # this will run in parallel
)
clf.fit(df)
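After fitting with custom parameters, a quick look at cluster sizes helps judge whether settings such as min_cluster_size are reasonable. The sketch below uses only the standard library and assumes clf.evaluate() yields one integer label per row, with -1 marking HDBSCAN noise; the summarize_labels helper and the sample labels are illustrative.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_labels(labels: Iterable[int]) -> Dict[str, object]:
    """Count points per cluster and report the share labeled as noise (-1)."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    noise = counts.pop(-1, 0)
    return {
        "n_clusters": len(counts),
        "cluster_sizes": dict(sorted(counts.items())),
        "noise_fraction": noise / total if total else 0.0,
    }


# Stand-in for clf.evaluate(); values are made up.
sample_labels = [0, 0, 1, -1, 1, 1, 2, -1]
summary = summarize_labels(sample_labels)
print(summary)
```

A large noise fraction or a single dominant cluster is often a sign that the UMAP or HDBSCAN parameters need another pass.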
Examples
Notebooks
A hands-on example with an overview of how to use DenseClus is currently available in the form of an Example Jupyter Notebook.
Should you need to tune HDBSCAN, here is an optional approach: Tuning with HDBSCAN Notebook
Should you need to validate UMAP embeddings, there is an approach to do so in the Validation for UMAP Notebook
Blogs
AWS Blog: Introducing DenseClus, an open source clustering package for mixed-type data
TDS Blog: On the Validation of UMAP
References
@article{mcinnes2018umap-software,
title={UMAP: Uniform Manifold Approximation and Projection},
author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
journal={The Journal of Open Source Software},
volume={3},
number={29},
pages={861},
year={2018}
}
@article{mcinnes2017hdbscan,
title={hdbscan: Hierarchical density based clustering},
author={McInnes, Leland and Healy, John and Astels, Steve},
journal={The Journal of Open Source Software},
volume={2},
number={11},
pages={205},
year={2017}
}