
A Python library for advanced clustering algorithms




The package provides a simple way to perform clustering in Python. For this purpose it provides a variety of algorithms from different domains. Additionally, ClustPy includes methods that are often needed for research purposes, such as plots, clustering metrics or evaluation methods. Further, it integrates various frequently used datasets (e.g., from the UCI repository) through largely automated loading options.

ClustPy does not focus on efficiency (for that we recommend, e.g., pyclustering) but on making it easy to try out a wide range of modern scientific methods. In particular, it aims to make lesser-known methods accessible in a simple and convenient way.

Since it largely follows the implementation conventions of sklearn clustering, it can be combined with many other packages (see below).
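To make the convention concrete, here is a minimal sketch of the estimator interface used by sklearn clustering: hyperparameters are set in the constructor, fit returns the estimator itself, and the result is exposed as labels_. The class ThresholdClusterer is a hypothetical toy and is not part of ClustPy or sklearn; it simply splits one-dimensional values at a threshold.

```python
class ThresholdClusterer:
    """Hypothetical toy estimator that mimics the sklearn clustering interface."""

    def __init__(self, threshold=None):
        # Hyperparameters are only stored in the constructor, never validated here
        self.threshold = threshold

    def fit(self, X, y=None):
        # Use the first feature of every sample; y is ignored (unsupervised)
        values = [row[0] for row in X]
        threshold = self.threshold
        if threshold is None:
            # Assumption for this toy: fall back to the mean as split point
            threshold = sum(values) / len(values)
        # Learned attributes carry a trailing underscore
        self.labels_ = [int(v > threshold) for v in values]
        return self

    def fit_predict(self, X, y=None):
        # Convenience method: fit and return the labels in one call
        return self.fit(X).labels_


X = [[0.1], [0.2], [0.15], [5.0], [5.2]]
model = ThresholdClusterer()
labels = model.fit_predict(X)
print(labels)
```

Any code that only relies on fit and labels_ can work with such an object, which is why following this interface makes estimators interchangeable.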

Installation

For Users

Stable Version

The current stable version can be installed by the following command:

pip install clustpy

Note that a C compiler (e.g., gcc) is required for installation. Therefore, in case of an installation error, make sure that:

  • Windows: Microsoft C++ Build Tools is installed
  • Linux/Mac: the Python development headers are installed (e.g., by running apt-get install python-dev; the exact command may differ depending on the Linux distribution)

The error messages may look like this:

  • 'error: command 'gcc' failed: No such file or directory'
  • 'Could not build wheels for clustpy, which is required to install pyproject.toml-based projects'
  • 'Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools"'
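
Before reinstalling, it can help to check whether a C compiler is visible on the PATH at all. The snippet below is a small sketch using only the standard library; the helper name find_c_compiler and the list of candidate executables are assumptions for illustration. A missing result does not rule out other causes, such as missing Python headers, and on Windows the cl executable is typically only on the PATH inside a developer command prompt.

```python
import shutil


def find_c_compiler(candidates=("gcc", "cc", "clang", "cl")):
    """Return the path of the first candidate found on the PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


compiler = find_c_compiler()
if compiler is None:
    print("No C compiler found on the PATH")
else:
    print("Found C compiler:", compiler)
```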

Development Version

The current development version can be installed directly from git by executing:

sudo pip install git+https://github.com/collinleiber/ClustPy.git

Alternatively, clone the repository, go to the directory and execute:

sudo python setup.py install

If you have no sudo rights you can use:

python setup.py install --prefix ~/.local

For Developers

Clone the repository, go to the directory and do the following (NumPy must be installed beforehand).

Install package locally and compile C files:

python setup.py install --prefix ~/.local

Copy compiled C files to correct file location:

python setup.py build_ext --inplace

Remove clustpy via pip to avoid ambiguities during development, e.g., when changing files in the code:

pip uninstall clustpy

Components

Clustering Algorithms

  • Partition-based clustering
  • Density-based clustering
    • Multi Density DBSCAN [Paper]
  • Hierarchical clustering
  • Alternative clustering / Non-redundant clustering
  • Deep clustering

Other implementations

  • Metrics
    • Confusion Matrix
    • Fair Normalized Mutual Information (FNMI) [Paper]
    • Information-Theoretic External Cluster-Validity Measure (DOM) [Paper]
    • Pair Counting Scores (f1, rand, jaccard, recall, precision) [Paper]
    • Scores for multiple labelings (see alternative clustering algorithms)
      • Multiple Labelings Confusion Matrix
      • Multiple Labelings Pair Counting Scores [Paper]
    • Unsupervised Clustering Accuracy [Paper]
    • Variation of information [Paper]
  • Utils
    • Automatic evaluation methods
    • Hartigan's Dip-test [Paper]
    • Various plots
  • Datasets
    • Synthetic dataset creators for subspace and alternative clustering
    • Real-world dataset loaders (e.g., Iris, Wine, Mice protein, Optdigits, MNIST, ...)
    • Dataset loaders for datasets with multiple labelings
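
As an illustration of what the pair counting scores listed under Metrics measure, the sketch below checks for every pair of samples whether two labelings agree on placing the pair in the same cluster. It follows the standard textbook definitions in plain Python and is not ClustPy's implementation; the function name pair_counting_scores is hypothetical.

```python
from itertools import combinations


def pair_counting_scores(labels_true, labels_pred):
    """Compute pair counting scores from two flat labelings of equal length."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both labelings
        elif same_pred:
            fp += 1  # grouped together only in the prediction
        elif same_true:
            fn += 1  # grouped together only in the ground truth
        else:
            tn += 1  # separated in both labelings
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "rand": (tp + tn) / total if total else 1.0,
        "jaccard": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }


scores = pair_counting_scores([0, 0, 1, 1], [0, 0, 1, 2])
print(scores)
```

The loop over all pairs is quadratic in the number of samples, so this form is meant for understanding the scores rather than for large datasets.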

Compatible packages

We stick as closely as possible to the implementation details of sklearn clustering. Therefore, our methods are compatible with many other packages.

Coding Examples

1)

In this first example, the subspace algorithm SubKmeans is run on a synthetic subspace dataset. Afterwards, the clustering accuracy is calculated to evaluate the result.

from clustpy.partition import SubKmeans
from clustpy.data import create_subspace_data
from clustpy.metrics import unsupervised_clustering_accuracy as acc

data, labels = create_subspace_data(1000, n_clusters=4, subspace_features=[2,5])
sk = SubKmeans(4)
sk.fit(data)
acc_res = acc(labels, sk.labels_)
print("Clustering accuracy:", acc_res)
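
For intuition, unsupervised clustering accuracy is commonly defined as the share of samples that are labeled correctly under the best one-to-one mapping between predicted cluster ids and ground-truth classes. The sketch below finds that mapping by brute force over permutations, which is only feasible for a handful of clusters; practical implementations usually solve the assignment with the Hungarian algorithm instead. The function name brute_force_clustering_accuracy is hypothetical and this is not ClustPy's implementation.

```python
from itertools import permutations


def brute_force_clustering_accuracy(labels_true, labels_pred):
    """Best-match accuracy over all one-to-one cluster-to-class mappings."""
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    # Pad the shorter side with None so the mapping stays one-to-one
    size = max(len(classes), len(clusters))
    classes = classes + [None] * (size - len(classes))
    clusters = clusters + [None] * (size - len(clusters))
    best_hits = 0
    for perm in permutations(classes):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(labels_true, labels_pred))
        best_hits = max(best_hits, hits)
    return best_hits / len(labels_true)


true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [1, 1, 0, 0, 2, 1]  # same grouping with swapped ids, one mistake
acc_value = brute_force_clustering_accuracy(true_labels, pred_labels)
print("Clustering accuracy:", acc_value)
```

Because the cluster ids are arbitrary, a plain comparison of the two label lists would report a much lower value than this matched accuracy.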

2)

The second example covers the topic of non-redundant/alternative clustering. Here, the NrKmeans algorithm is run on the Fruit dataset. Beware that NrKmeans as a non-redundant clustering algorithm returns multiple labelings. Therefore, we calculate the confusion matrix by comparing each combination of labels using the normalized mutual information (nmi). The confusion matrix will be printed and finally the best matching nmi will be stated for each set of labels.

from clustpy.alternative import NrKmeans
from clustpy.data import load_fruit
from clustpy.metrics import MultipleLabelingsConfusionMatrix
from sklearn.metrics import normalized_mutual_info_score as nmi
import numpy as np

data, labels = load_fruit()
nk = NrKmeans([3, 3])
nk.fit(data)
mlcm = MultipleLabelingsConfusionMatrix(labels, nk.labels_, nmi)
mlcm.rearrange()
print(mlcm.confusion_matrix)
print(np.max(mlcm.confusion_matrix, axis=1))
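
The idea behind this comparison can be sketched without ClustPy: every ground-truth labeling is scored against every predicted labeling, and the row-wise maximum gives the best match for each ground-truth labeling. The sketch below uses plain Python and a small NMI function with arithmetic-mean normalization (the default of sklearn's normalized_mutual_info_score). The names nmi_score and pairwise_label_scores are hypothetical, each labeling is passed as a separate list rather than as a column of an array, and the reordering done by rearrange() is not reproduced.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a flat labeling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi_score(a, b):
    """Mutual information normalized by the mean of both entropies."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * log(c * n / (count_a[x] * count_b[y]))
             for (x, y), c in joint.items())
    norm = (entropy(a) + entropy(b)) / 2
    # Two trivial one-cluster labelings are treated as a perfect match
    return mi / norm if norm > 0 else 1.0


def pairwise_label_scores(true_labelings, pred_labelings, metric):
    """Rows: ground-truth labelings, columns: predicted labelings."""
    return [[metric(t, p) for p in pred_labelings] for t in true_labelings]


true_labelings = [[0, 0, 1, 1], [0, 1, 0, 1]]
pred_labelings = [[1, 0, 1, 0], [1, 1, 0, 0]]
matrix = pairwise_label_scores(true_labelings, pred_labelings, nmi_score)
best_per_truth = [max(row) for row in matrix]
print(matrix)
print(best_per_truth)
```

In this toy input the predicted labelings recover both ground-truth labelings, only in swapped order, which is why the high scores sit off the diagonal.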

3)

One notable feature of the ClustPy package is the ability to run various modern deep clustering algorithms out of the box. For example, the following code runs the DEC algorithm on the Newsgroups dataset. To evaluate the result, we compute the adjusted Rand index (ARI).

from clustpy.deep import DEC
from clustpy.data import load_newsgroups
from sklearn.metrics import adjusted_rand_score as ari

data, labels = load_newsgroups()
dec = DEC(20)
dec.fit(data)
my_ari = ari(labels, dec.labels_)
print(my_ari)

4)

In this more complex example, we use ClustPy's evaluation functions, which automatically run the specified algorithms multiple times on previously defined datasets. All results for the given metrics are stored in a pandas DataFrame.

from clustpy.utils import EvaluationDataset, EvaluationAlgorithm, EvaluationMetric, evaluate_multiple_datasets
from clustpy.partition import ProjectedDipMeans, SubKmeans
from sklearn.metrics import normalized_mutual_info_score as nmi, silhouette_score
from sklearn.cluster import KMeans, DBSCAN
from clustpy.data import load_breast_cancer, load_iris, load_wine
from clustpy.metrics import unsupervised_clustering_accuracy as acc
from sklearn.decomposition import PCA
import numpy as np

def reduce_dimensionality(X, dims):
    pca = PCA(dims)
    X_new = pca.fit_transform(X)
    return X_new

def znorm(X):
    return (X - np.mean(X)) / np.std(X)

def minmax(X):
    return (X - np.min(X)) / (np.max(X) - np.min(X))

datasets = [
    EvaluationDataset("Breast_pca_znorm", data=load_breast_cancer, preprocess_methods=[reduce_dimensionality, znorm],
                      preprocess_params=[{"dims": 0.9}, {}], ignore_algorithms=["pdipmeans"]),
    EvaluationDataset("Iris_pca", data=load_iris, preprocess_methods=reduce_dimensionality,
                      preprocess_params={"dims": 0.9}),
    EvaluationDataset("Wine", data=load_wine),
    EvaluationDataset("Wine_znorm", data=load_wine, preprocess_methods=znorm)]

algorithms = [
    EvaluationAlgorithm("SubKmeans", SubKmeans, {"n_clusters": None}),
    EvaluationAlgorithm("pdipmeans", ProjectedDipMeans, {}),  # Determines n_clusters automatically
    EvaluationAlgorithm("dbscan", DBSCAN, {"eps": 0.01, "min_samples": 5}, preprocess_methods=minmax,
                        deterministic=True),
    EvaluationAlgorithm("kmeans", KMeans, {"n_clusters": None}),
    EvaluationAlgorithm("kmeans_minmax", KMeans, {"n_clusters": None}, preprocess_methods=minmax)]

metrics = [EvaluationMetric("NMI", nmi), EvaluationMetric("ACC", acc),
           EvaluationMetric("Silhouette", silhouette_score, use_gt=False)]

df = evaluate_multiple_datasets(datasets, algorithms, metrics, n_repetitions=5,
                                aggregation_functions=[np.mean, np.std, np.max, np.min],
                                add_runtime=True, add_n_clusters=True, save_path=None,
                                save_intermediate_results=False)
print(df)
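
The core pattern behind such an evaluation, repeating a non-deterministic run several times and aggregating each metric, can be sketched as follows. This is a conceptual illustration using only the standard library and does not show how evaluate_multiple_datasets is implemented; repeat_and_aggregate and noisy_run are hypothetical names, and the random scores merely stand in for real clustering results.

```python
import random
import statistics


def repeat_and_aggregate(run, n_repetitions=5, seed=0):
    """Call run(rng) several times and aggregate every returned metric."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    results = [run(rng) for _ in range(n_repetitions)]
    summary = {}
    for name in results[0]:
        values = [r[name] for r in results]
        summary[name] = {
            "mean": statistics.mean(values),
            # Population standard deviation, matching np.std's default
            "std": statistics.pstdev(values),
            "max": max(values),
            "min": min(values),
        }
    return summary


def noisy_run(rng):
    # Stand-in for fitting an algorithm once and scoring its labels
    return {"NMI": rng.uniform(0.6, 0.9), "ACC": rng.uniform(0.5, 0.8)}


summary = repeat_and_aggregate(noisy_run, n_repetitions=5)
for metric_name, stats in summary.items():
    print(metric_name, stats)
```

Reporting the spread alongside the mean matters for algorithms with random initialization, since a single run can be unrepresentative.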
