Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection

genieclust Package for R and Python

Genie finds meaningful clusters quickly – even on large data sets.

A comprehensive tutorial, benchmarks, and a reference manual are available at https://genieclust.gagolewski.com/.

When using genieclust in research publications, please cite (Gagolewski, 2021) and (Gagolewski, Bartoszuk, Cena, 2016) as specified below. Thank you.

About

genieclust is a faster and more powerful version of Genie, a robust and outlier-resistant clustering algorithm (see Gagolewski, Bartoszuk, Cena, 2016) originally included in the R package genie.

The idea behind Genie is beautifully simple. First, make each individual point the only member of its own cluster. Then, keep merging pairs of the closest clusters, one after another. However, to prevent the formation of clusters of highly imbalanced sizes, a point group of the smallest size will sometimes be matched with its nearest neighbour.
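The merging rule can be illustrated with a toy, pure-Python sketch. It is not the package's implementation (which is written in C++ and works on a spanning tree). It assumes, following the 2016 Genie paper, that "sometimes" means "whenever the Gini index of the cluster sizes exceeds a threshold", and that cluster distances are single-linkage ones. The names toy_genie, single_link, and gini are made up for this illustration.

```python
from itertools import combinations
from math import dist


def gini(sizes):
    """Normalised Gini index of cluster sizes (0 means perfectly balanced)."""
    n = len(sizes)
    if n < 2:
        return 0.0
    total_diff = sum(abs(a - b) for a, b in combinations(sizes, 2))
    return total_diff / ((n - 1) * sum(sizes))


def single_link(a, b, points):
    """Smallest distance between a point of cluster a and a point of cluster b."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def toy_genie(points, n_clusters, gini_threshold=0.3):
    """Naive illustration only: cubic-time or worse, fine for a handful of points."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    while len(clusters) > n_clusters:
        pairs = list(combinations(range(len(clusters)), 2))
        if gini([len(c) for c in clusters]) > gini_threshold:
            # Sizes are too imbalanced: only merges that involve a cluster
            # of the smallest size are allowed in this step.
            smallest = min(len(c) for c in clusters)
            pairs = [(u, v) for u, v in pairs
                     if smallest in (len(clusters[u]), len(clusters[v]))]
        u, v = min(pairs, key=lambda p: single_link(clusters[p[0]],
                                                    clusters[p[1]], points))
        merged = clusters.pop(v)  # v > u, so index u stays valid
        clusters[u].extend(merged)
    labels = [0] * len(points)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# Two groups of four points on a line plus one far-away point.
points = [(0, 0), (1, 0), (2, 0), (3, 0),
          (10, 0), (11, 0), (12, 0), (13, 0), (40, 0)]
print(toy_genie(points, 2))                      # size correction active
print(toy_genie(points, 2, gini_threshold=1.0))  # never triggers: plain single linkage
```

With the threshold set to 1.0 the correction never fires, so the sketch degenerates to single linkage and isolates the far-away point in its own cluster. With the lower threshold, the two groups are kept apart and the stray point joins the nearer one.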

Genie's appealing simplicity goes hand in hand with its usability; it often outperforms other clustering approaches such as K-means, BIRCH, or average, Ward, and complete linkage on benchmark data. Of course, there is no, nor will there ever be, a single best universal clustering approach for every kind of problem, but Genie is definitely worth a try!

Because Genie is based on minimal spanning trees of the pairwise distance graphs, it is also very fast: determining the whole cluster hierarchy for datasets of millions of points can be completed within minutes. It is therefore well suited to extreme clustering tasks (large datasets with any number of clusters to detect) on data that fit into memory. Thanks to the use of nmslib (if available), sparse or string inputs are also supported.
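To make the spanning-tree remark concrete, here is a minimal sketch of Prim's algorithm on the complete Euclidean distance graph. It needs quadratic time but only linear memory, because the distance matrix is never stored. It illustrates the general technique only; the function name prim_mst is hypothetical, and the package's own routines may work differently.

```python
from math import dist, inf


def prim_mst(points):
    """Return the MST edges as (weight, i, j) tuples, sorted by weight."""
    n = len(points)
    in_tree = [False] * n
    best = [inf] * n    # cheapest known link from each vertex to the tree
    parent = [-1] * n   # tree vertex that realises that cheapest link
    if n:
        best[0] = 0.0   # grow the tree from vertex 0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        # Distances are computed on the fly; nothing quadratic is kept.
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return sorted(edges)


print(prim_mst([(0, 0), (1, 0), (3, 0), (7, 0)]))
```

Processing the n - 1 tree edges in increasing order of weight reproduces the single-linkage merge sequence, which is why a hierarchy can be derived from the tree alone.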

It also allows clustering with respect to mutual reachability distances, so it can act as a noise point detector or as a robustified version of HDBSCAN* (see Campello et al., 2013) that can detect a predefined number of clusters. Hence, it does not depend on DBSCAN's somewhat difficult-to-set eps parameter.
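The mutual reachability distance between two points is the largest of their ordinary distance and their two core distances. The sketch below assumes one particular convention: the core distance of a point is the distance to its (M-1)-th nearest neighbour, so that M = 1 leaves the distances unchanged. Treat that convention and the function names as assumptions of this example rather than a statement about the package's internals.

```python
from math import dist


def core_distances(points, M):
    """Distance from each point to its (M-1)-th nearest neighbour."""
    cores = []
    for p in points:
        d = sorted(dist(p, q) for q in points)  # d[0] == 0.0 is p itself
        cores.append(d[M - 1])
    return cores


def mutual_reachability(points, M):
    """Matrix of max(d(i, j), core(i), core(j)), with zeros on the diagonal."""
    core = core_distances(points, M)
    n = len(points)
    return [[0.0 if i == j
             else max(dist(points[i], points[j]), core[i], core[j])
             for j in range(n)]
            for i in range(n)]


points = [(0, 0), (1, 0), (2, 0), (10, 0)]
for row in mutual_reachability(points, M=3):
    print(row)
```

Points in sparse regions have large core distances, so they are pushed away from everything else. This is what lets isolated observations be flagged as noise.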

The package also features implementations of economic inequality indices (the Gini and Bonferroni indices), external cluster validity measures (e.g., the normalised clustering accuracy and partition similarity scores such as the adjusted Rand, Fowlkes-Mallows, adjusted mutual information, and the pair sets index), and internal cluster validity indices (e.g., the Calinski-Harabasz, Davies-Bouldin, Ball-Hall, Silhouette, and generalised Dunn indices).
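As an illustration of what a partition similarity score computes, here is a standalone sketch of the adjusted Rand index using only the standard library. It follows the textbook pair-counting formula and is not taken from the package; the function name adjusted_rand is chosen for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_true, labels_pred):
    """Adjusted Rand index of two labelings of the same points."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of points grouped together in both, in the first, in the second.
    pairs_both = sum(comb(c, 2)
                     for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)  # agreement expected by chance
    max_index = (pairs_true + pairs_pred) / 2
    if max_index == expected:  # degenerate, e.g. both have a single cluster
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


print(adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels swapped
print(adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1]))  # partitions that cross each other
```

The score depends only on which points are grouped together, not on the label values, and it is corrected so that random agreement scores around zero.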

Author and Contributors

Author and Maintainer: Marek Gagolewski

Contributors: Maciej Bartoszuk, Anna Cena (R packages genie and CVI), Peter M. Larsen (rectangular_lsap).

Examples, Tutorials, and Documentation

R's interface is compatible with stats::hclust(), but there is more.

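library("genieclust")
X <- ... # some data
h <- gclust(X)  # hclust-compatible cluster hierarchy
plot(h)  # plot cluster dendrogram
cutree(h, k=2)  # cut the hierarchy into 2 clusters
# or, to get the 2-cluster partition directly: genie(X, k=2)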

To learn more about R, check out Marek's open-access (free!) textbook Deep R Programming.

The Python language version of genieclust has a familiar scikit-learn-like look-and-feel:

import genieclust
X = ... # some data
g = genieclust.Genie(n_clusters=2)
labels = g.fit_predict(X)

Tutorials and the package documentation are available at https://genieclust.gagolewski.com/.

To learn more about Python, check out Marek's recent open-access (free!) textbook Minimalist Data Wrangling in Python.

How to Install

Python Version

To install via pip (see PyPI):

pip3 install genieclust

The package requires Python 3.7+ together with cython as well as numpy, scipy, matplotlib, and scikit-learn. Optional dependencies: nmslib and mlpack.
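To see which of these dependencies are already present in an environment, a short standard-library check such as the following may help. The import names are assumptions worth double-checking: cython is imported as Cython and scikit-learn as sklearn, while the remaining names are assumed to match the package names listed above.

```python
from importlib.util import find_spec

REQUIRED = ["Cython", "numpy", "scipy", "matplotlib", "sklearn"]
OPTIONAL = ["nmslib", "mlpack"]


def available(modules):
    """Map each top-level module name to True if it can be found for import."""
    return {name: find_spec(name) is not None for name in modules}


for group, names in (("required", REQUIRED), ("optional", OPTIONAL)):
    for name, found in available(names).items():
        print(f"{group:8s} {name:12s} {'found' if found else 'missing'}")
```

The check only locates the modules without importing them, so it is cheap and has no side effects.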

R Version

To install the most recent release, call:

install.packages("genieclust")

See the package entry on CRAN.

Other

The core functionality is implemented in the form of a header-only C++ library. It can thus be easily adapted for use in other environments.

Any contributions are welcome (e.g., Julia, Matlab, ...).

License

Copyright (C) 2018–2024 Marek Gagolewski https://www.gagolewski.com/

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License Version 3, 19 November 2007, published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License Version 3 for more details. You should have received a copy of the License along with this program. If not, see (https://www.gnu.org/licenses/).


The file src/c_scipy_rectangular_lsap.h is adapted from the scipy project (https://scipy.org/scipylib), source: /scipy/optimize/rectangular_lsap/rectangular_lsap.cpp. Author: Peter M. Larsen. Distributed under the BSD-3-Clause license.

The implementation of internal cluster validity measures was adapted from our previous project (Gagolewski, Bartoszuk, Cena, 2021); see optim_cvi. Originally distributed under the GNU Affero General Public License Version 3.

References

Gagolewski M., genieclust: Fast and robust hierarchical clustering, SoftwareX 15, 2021, 100722. DOI: 10.1016/j.softx.2021.100722. https://genieclust.gagolewski.com/.

Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8–23. DOI: 10.1016/j.ins.2016.05.003.

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 2021, 620–636. DOI: 10.1016/j.ins.2021.10.004.

Gagolewski M., Cena A., Bartoszuk M., Brzozowski L., Clustering with minimum spanning trees: How good can it be?, Journal of Classification, 2024, in press. DOI: 10.1007/s00357-024-09483-1.

Gagolewski M., Normalised clustering accuracy: An asymmetric external cluster validity measure, Journal of Classification, 2024, in press. DOI: 10.1007/s00357-024-09482-2.

Gagolewski M., A framework for benchmarking clustering algorithms, SoftwareX 20, 2022, 101270. DOI: 10.1016/j.softx.2022.101270. https://clustering-benchmarks.gagolewski.com/.

Campello R.J.G.B., Moulavi D., Sander J., Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160–172. DOI: 10.1007/978-3-642-37456-2_14.

Mueller A., Nowozin S., Lampert C.H., Information theoretic clustering using minimum spanning trees, DAGM-OAGM, 2012.

Rezaei M., Fränti P., Set matching measures for external cluster validity, IEEE Transactions on Knowledge and Data Engineering 28(8), 2016, 2173–2186. DOI: 10.1109/TKDE.2016.2551240.

See the package's homepage for more references.
