Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection

genieclust Package for R and Python

Genie outputs meaningful clusters and is fast even on large data sets.

A comprehensive tutorial, benchmarks, and a reference manual are available at https://genieclust.gagolewski.com/.

When using genieclust in research publications, please cite (Gagolewski, 2021) and (Gagolewski, Bartoszuk, Cena, 2016) as specified below. Thank you.

About

genieclust is a faster and more powerful version of Genie, a robust and outlier-resistant clustering algorithm (see Gagolewski, Bartoszuk, Cena, 2016) originally included in the R package genie.

The idea behind Genie is beautifully simple. First, make each individual point the only member of its own cluster. Then, keep merging pairs of the closest clusters, one after another. However, to prevent the formation of clusters of highly imbalanced sizes, a point group of the smallest size will sometimes be matched with its nearest neighbour.
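The merging rule described above can be illustrated with a deliberately naive sketch. It is not the package's implementation: it assumes single linkage between clusters and assumes that the "sometimes" is decided by comparing the normalised Gini index of the cluster sizes with a threshold, as in the 2016 paper. The names genie_sketch and gini and the default threshold of 0.3 are chosen for illustration only, and the quadratic search is far slower than what genieclust actually does.

```python
from itertools import combinations
from math import dist


def gini(sizes):
    """Normalised Gini index: 0 for equal sizes, 1 for maximal inequality."""
    n = len(sizes)
    if n < 2:
        return 0.0
    diff = sum(abs(a - b) for a, b in combinations(sizes, 2))
    return diff / ((n - 1) * sum(sizes))


def genie_sketch(points, n_clusters, gini_threshold=0.3):
    """Naive illustration of the Genie merging rule (assumed single linkage)."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def linkage(a, b):
        # single linkage: distance between the two closest members
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        sizes = [len(c) for c in clusters]
        pairs = list(combinations(range(len(clusters)), 2))
        if gini(sizes) > gini_threshold:
            # sizes are too imbalanced: one side of the merge must be
            # a cluster of the smallest size
            smallest = min(sizes)
            pairs = [(a, b) for a, b in pairs
                     if sizes[a] == smallest or sizes[b] == smallest]
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)

    labels = [0] * len(points)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# two compact squares and one distant outlier
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 0), (5, 1), (6, 0), (6, 1),
          (20, 20)]
print(genie_sketch(points, 2))                      # [0, 0, 0, 0, 1, 1, 1, 1, 1]
print(genie_sketch(points, 2, gini_threshold=1.0))  # [0, 0, 0, 0, 0, 0, 0, 0, 1]
```

With the threshold set to 1.0 the correction never triggers and the sketch reduces to plain single linkage, which spends one of the two clusters on the lone outlier. With the correction active, the two squares are recovered and the outlier is attached to the nearer one.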

Genie's appealing simplicity goes hand in hand with its usability; it often outperforms other clustering approaches such as K-means, BIRCH, or average, Ward, and complete linkage on benchmark data.

Genie is also very fast: determining the whole cluster hierarchy for datasets of millions of points can be completed within minutes. It is therefore well suited to extreme clustering tasks (large datasets with any number of clusters to detect), provided that the data fit into memory. Thanks to the use of nmslib, sparse or string inputs are also supported.

It also allows clustering with respect to mutual reachability distances, so it can act as a noise point detector or as a robustified version of HDBSCAN* (see Campello et al., 2015) that is able to detect a predefined number of clusters. Hence, it does not depend on DBSCAN's somewhat difficult-to-set eps parameter.
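For readers unfamiliar with the term, the following minimal sketch shows how a mutual reachability distance can be computed. It assumes the usual definition from the HDBSCAN* literature: the distance between two points is raised to at least the core distance of each of them, where the core distance is taken here to be the distance to the k-th nearest other point. The function name and the parameter k are illustrative, and the exact neighbour-counting convention used by genieclust may differ.

```python
from math import dist


def mutual_reachability(points, k):
    """Pairwise mutual reachability distances (brute force, needs k < len(points))."""
    n = len(points)
    d = [[dist(p, q) for q in points] for p in points]
    # each sorted row starts with the zero self-distance,
    # so index k holds the distance to the k-th nearest other point
    core = [sorted(row)[k] for row in d]
    return [[0.0 if i == j else max(d[i][j], core[i], core[j])
             for j in range(n)]
            for i in range(n)]


points = [(0.0,), (1.0,), (2.0,), (10.0,)]
m = mutual_reachability(points, k=2)
print(m[0][1])  # 2.0: the raw distance 1.0 is lifted to the core distance of point 0
print(m[2][3])  # 9.0: the isolated point has a large core distance
```

Points in sparse regions have large core distances, so they are pushed away from everything else, which is what makes it possible to flag them as noise.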

Author and Contributors

Author and Maintainer: Marek Gagolewski

Contributors: Maciej Bartoszuk, Anna Cena (R packages genie /genieclust's predecessor/ and CVI /some internal cluster validity measures/), Peter M. Larsen (an implementation of the shortest augmenting path algorithm for the rectangular assignment problem which we use for computing the normalised accuracy and pair sets index).

Python and R Package Features

The implemented algorithms include:

  • Genie++ – a reimplementation of the original Genie algorithm (Gagolewski et al., 2016); much faster than the original one; supports approximate disconnected MSTs;

  • Genie+HDBSCAN* – a robustified (Geniefied) retake on the HDBSCAN* (Campello et al., 2015) method that detects noise points in data and outputs a predefined number of clusters;

  • (Python only, experimental preview) Genie+Ic (GIc) – Cena's (2018) algorithm to minimise the information theoretic criterion discussed by Mueller et al. (2012).

See classes genieclust.Genie and genieclust.GIc (in Python) or functions gclust() and genie() (in R).

Other features:

  • inequality measures: the normalised Gini, Bonferroni, and De Vergottini indices;

  • external cluster validity measures: adjusted asymmetric accuracy and partition similarity scores such as normalised accuracy, pair sets index (PSI), adjusted and unadjusted Rand, adjusted and unadjusted Fowlkes-Mallows (FM), and adjusted, normalised, and unadjusted mutual information (MI) indices;

  • internal cluster validity measures: the Caliński-Harabasz, Silhouette, Ball-Hall, Davies-Bouldin, generalised Dunn indices, etc.;

  • (Python only) Union-find (disjoint sets) data structures (with extensions);

  • (Python only) Some R-like plotting functions.
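To give a flavour of what a partition similarity score measures, here is a small self-contained sketch of the unadjusted Rand index, computed directly from its pair-counting definition. It is written for illustration and does not use or reproduce the package's own functions; the name rand_index is chosen here.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of point pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, labels renamed
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # 2 of 6 pairs agree
```

The score depends only on which points are grouped together, not on the label values themselves. The adjusted variants listed above additionally correct for the agreement expected by chance.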

Examples, Tutorials, and Documentation

R's interface is compatible with stats::hclust(), but there is more.

library("genieclust")
X <- ... # some data
h <- gclust(X) # compute the cluster hierarchy
plot(h) # plot cluster dendrogram
cutree(h, k=2) # cut the hierarchy into 2 clusters
# or genie(X, k=2)

The Python language version of genieclust has a familiar scikit-learn-like look-and-feel:

import genieclust
X = ... # some data
g = genieclust.Genie(n_clusters=2)
labels = g.fit_predict(X)

The tutorials and the package documentation are available at https://genieclust.gagolewski.com/.

To learn more about Python, check out Marek's recent open-access (free!) textbook Minimalist Data Wrangling in Python.

How to Install

Python Version

To install via pip (see PyPI):

pip3 install genieclust

The package requires Python 3.7+ together with cython as well as numpy, scipy, matplotlib, nmslib, and scikit-learn. Optional dependency: mlpack.

R Version

To install the most recent release, call:

install.packages("genieclust")

See the package entry on CRAN.

Other

The core functionality is implemented in the form of a header-only C++ library. It can thus be easily adapted for use in other environments.

Contributions are welcome, e.g., interfaces for other environments (Julia, Matlab, ...).

License

Copyright (C) 2018–2022 Marek Gagolewski https://www.gagolewski.com

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License Version 3, 19 November 2007, published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License Version 3 for more details. You should have received a copy of the License along with this program. If not, see (https://www.gnu.org/licenses/).


The file src/c_scipy_rectangular_lsap.h is adapted from the scipy project (https://scipy.org/scipylib/), source: /scipy/optimize/rectangular_lsap/rectangular_lsap.cpp. Author: Peter M. Larsen. Distributed under the BSD-3-Clause license.

The implementation of internal cluster validity measures was adapted from our previous project (Gagolewski, Bartoszuk, Cena, 2021); see optim_cvi. Originally distributed under the GNU Affero General Public License Version 3.

References

Gagolewski M., genieclust: Fast and robust hierarchical clustering, SoftwareX 15, 2021, 100722. DOI: 10.1016/j.softx.2021.100722.

Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8–23. DOI: 10.1016/j.ins.2016.05.003.

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 2021, 620–636. DOI: 10.1016/j.ins.2021.10.004.

Gagolewski M., Adjusted asymmetric accuracy: A well-behaving external cluster validity measure, 2022, submitted for publication.

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, 2022, https://clustering-benchmarks.gagolewski.com.

Cena A., Adaptive hierarchical clustering algorithms based on data aggregation methods, PhD Thesis, Systems Research Institute, Polish Academy of Sciences, 2018.

Campello R., Moulavi D., Zimek A., Sander J., Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data 10(1), 2015, 5:1–5:51. DOI: 10.1145/2733381.

Mueller A., Nowozin S., Lampert C.H., Information Theoretic Clustering using Minimum Spanning Trees, DAGM-OAGM, 2012.

See the package's homepage for more references.
