Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection

genieclust Package for R and Python

Genie finds meaningful clusters quickly – even on large data sets.

A comprehensive tutorial, benchmarks, and a reference manual are available at https://genieclust.gagolewski.com/.

When using genieclust in research publications, please cite (Gagolewski, 2021) and (Gagolewski, Bartoszuk, Cena, 2016) as specified below. Thank you.

About

A faster and more powerful version of Genie – a robust and outlier-resistant clustering algorithm (see Gagolewski, Bartoszuk, Cena, 2016), originally included in the R package genie.

The idea behind Genie is beautifully simple. First, make each individual point the only member of its own cluster. Then, keep merging pairs of the closest clusters, one after another. However, to prevent the formation of clusters of highly imbalanced sizes, a point group of the smallest size will sometimes be matched with its nearest neighbour.
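The merging rule described above can be sketched in a few lines of pure Python. This is only a toy illustration, not the package's implementation: the names `gini` and `toy_genie` are made up for this example, and it assumes that "sometimes" means "whenever the Gini index of the cluster sizes exceeds a threshold" (the 0.3 default is an assumption as well), with single linkage as the notion of closeness.

```python
import math


def gini(sizes):
    """Normalised Gini index of cluster sizes (0 = perfectly balanced)."""
    n = len(sizes)
    if n < 2:
        return 0.0
    s = sorted(sizes)
    # Sum of |s_i - s_j| over all pairs, computed from the sorted values.
    pair_sum = sum((2 * i - n + 1) * x for i, x in enumerate(s))
    return pair_sum / ((n - 1) * sum(s))


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def toy_genie(points, n_clusters, gini_threshold=0.3):
    """Toy O(n^3)-ish sketch; hypothetical helper, NOT the genieclust API."""
    if not 1 <= n_clusters <= len(points):
        raise ValueError("n_clusters must be between 1 and len(points)")
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def linkage(a, b):
        # Single linkage: distance between the two closest members.
        return min(euclidean(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        sizes = [len(c) for c in clusters]
        # Assumed trigger: sizes too unequal -> force a smallest cluster to merge.
        restrict = gini(sizes) > gini_threshold
        smallest = min(sizes)
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                if restrict and smallest not in (sizes[p], sizes[q]):
                    continue
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(points)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# Two groups on a line plus one far-away outlier at 20.
line = [(0,), (1,), (2,), (3,), (5,), (6,), (7,), (8,), (20,)]
genie_like = toy_genie(line, 2)
single_linkage = toy_genie(line, 2, gini_threshold=1.0)  # correction never fires
```

On this data, the thresholded variant should split the points into groups of four and five, whereas with `gini_threshold=1.0` (plain single linkage) the outlier ends up alone in its own cluster.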

Genie's appealing simplicity goes hand in hand with its usability; it often outperforms other clustering approaches such as K-means, BIRCH, or average, Ward, and complete linkage on benchmark data. Of course, there is no, nor will there ever be, a single best universal clustering approach for every kind of problem, but Genie is definitely worth a try!

Because it is based on minimal spanning trees of the pairwise distance graphs, Genie is also very fast – determining the whole cluster hierarchy for datasets of millions of points can be completed within minutes. It is therefore well suited to extreme clustering tasks (large datasets with any number of clusters to detect) on data (also sparse) that fit into memory. Thanks to the use of nmslib (if available), sparse or string inputs are also supported.
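To make the role of the spanning tree more concrete, here is a minimal sketch of Prim's algorithm on the complete Euclidean distance graph, followed by the classical single-linkage cut (keep the n - k lightest tree edges). The function names are hypothetical, the code is quadratic in the number of points, and it shows plain single linkage rather than Genie itself, which consumes the tree edges in a modified order.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def prim_mst(points):
    """Return the n-1 edges (weight, i, j) of a minimum spanning tree."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known link from the tree to each vertex
    best_from = [-1] * n
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Add the vertex that is currently cheapest to attach.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best_dist[v])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_dist[u], best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = euclidean(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u
    return edges


def single_linkage_labels(n, mst_edges, k):
    """Join points along the n-k lightest MST edges (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for _, i, j in sorted(mst_edges)[: n - k]:
        parent[find(i)] = find(j)
    roots = {}
    # Number the clusters in order of first appearance.
    return [roots.setdefault(find(v), len(roots)) for v in range(n)]


pts = [(0, 0), (1, 0), (3, 0), (3, 4)]
tree = prim_mst(pts)
labels = single_linkage_labels(len(pts), tree, k=2)
```

Once the tree is known, every cut of the hierarchy can be derived from its n - 1 edges alone, without revisiting the full distance matrix.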

It also allows clustering with respect to mutual reachability distances, so it can act as a noise point detector or a robustified version of HDBSCAN* (see Campello et al., 2013) that is able to detect a predefined number of clusters; hence, it does not depend on DBSCAN's somewhat difficult-to-set eps parameter.
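As a rough illustration of what a mutual reachability distance is, the sketch below computes it for a small point set in pure Python. The function name and the parameter `M` are assumptions made for this example: the core distance of a point is taken to be the distance to its (M-1)-th nearest neighbour other than itself, so that `M=1` leaves the original distances unchanged. Other conventions for counting neighbours exist.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mutual_reachability(points, M):
    """Pairwise max(d(i, j), core(i), core(j)); requires 1 <= M <= len(points)."""
    n = len(points)
    if not 1 <= M <= n:
        raise ValueError("M must be between 1 and the number of points")
    d = [[euclidean(points[i], points[j]) for j in range(n)] for i in range(n)]
    # sorted(row)[0] is the zero distance of a point to itself, so index M-1
    # is the distance to the (M-1)-th nearest *other* point (assumed convention).
    core = [sorted(row)[M - 1] for row in d]
    return [
        [0.0 if i == j else max(d[i][j], core[i], core[j]) for j in range(n)]
        for i in range(n)
    ]


# Three nearby points and one isolated point at 10.
pts = [(0,), (1,), (2,), (10,)]
plain = mutual_reachability(pts, M=1)     # identical to the Euclidean distances
smoothed = mutual_reachability(pts, M=3)  # isolated point is pushed further away
```

Points in sparse regions have large core distances, so they become far from everything else; this is what makes it possible to flag them as noise.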

Author and Contributors

Author and Maintainer: Marek Gagolewski

Contributors: Maciej Bartoszuk, Anna Cena (R packages genie and CVI), Peter M. Larsen (rectangular_lsap).

Examples, Tutorials, and Documentation

R's interface is compatible with stats::hclust(), but there is more.

library("genieclust")
X <- ... # some data
h <- gclust(X)
plot(h) # plot cluster dendrogram
cutree(h, k=2) # cut the tree into two clusters
# or genie(X, k=2)

The Python language version of genieclust has a familiar scikit-learn-like look-and-feel:

import genieclust
X = ... # some data
g = genieclust.Genie(n_clusters=2)
labels = g.fit_predict(X)

Tutorials and the package documentation are available at https://genieclust.gagolewski.com/.

To learn more about Python, check out Marek's recent open-access (free!) textbook Minimalist Data Wrangling in Python.

How to Install

Python Version

To install via pip (see PyPI):

pip3 install genieclust

The package requires Python 3.7+ together with cython as well as numpy, scipy, matplotlib, and scikit-learn. Optional dependencies: nmslib and mlpack.
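Before installing, it may help to check which of the listed dependencies are already present. The following is a small sketch using only the standard library; `dependency_report` is a hypothetical helper, and the import names (`Cython` for cython, `sklearn` for scikit-learn) are those under which these packages are normally imported.

```python
import importlib.util
import sys

# Import names of the packages listed above.
REQUIRED = ("Cython", "numpy", "scipy", "matplotlib", "sklearn")
OPTIONAL = ("nmslib", "mlpack")


def dependency_report():
    """Report the Python version check and which modules can be located."""

    def available(name):
        # find_spec returns None when a top-level module cannot be found.
        return importlib.util.find_spec(name) is not None

    return {
        "python_ok": sys.version_info >= (3, 7),
        "required": {name: available(name) for name in REQUIRED},
        "optional": {name: available(name) for name in OPTIONAL},
    }


report = dependency_report()
for group in ("required", "optional"):
    for name, found in report[group].items():
        print(f"{group:8s} {name:12s} {'found' if found else 'missing'}")
```

Missing optional modules are not an error; per the description above, nmslib is only used when available.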

R Version

To install the most recent release, call:

install.packages("genieclust")

See the package entry on CRAN.

Other

The core functionality is implemented in the form of a header-only C++ library. It can thus be easily adapted for use in other environments.

Any contributions are welcome (e.g., Julia, Matlab, ...).

License

Copyright (C) 2018–2022 Marek Gagolewski https://www.gagolewski.com

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License Version 3, 19 November 2007, published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License Version 3 for more details. You should have received a copy of the License along with this program. If not, see (https://www.gnu.org/licenses/).


The file src/c_scipy_rectangular_lsap.h is adapted from the scipy project (https://scipy.org/scipylib/), source: /scipy/optimize/rectangular_lsap/rectangular_lsap.cpp. Author: Peter M. Larsen. Distributed under the BSD-3-Clause license.

The implementation of internal cluster validity measures was adapted from our previous project (Gagolewski, Bartoszuk, Cena, 2021); see optim_cvi. Originally distributed under the GNU Affero General Public License Version 3.

References

Gagolewski M., genieclust: Fast and robust hierarchical clustering, SoftwareX 15, 2021, 100722. DOI: 10.1016/j.softx.2021.100722.

Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm, Information Sciences 363, 2016, 8–23. DOI: 10.1016/j.ins.2016.05.003.

Gagolewski M., Bartoszuk M., Cena A., Are cluster validity measures (in)valid?, Information Sciences 581, 2021, 620–636. DOI: 10.1016/j.ins.2021.10.004.

Gagolewski M., Adjusted asymmetric accuracy: A well-behaving external cluster validity measure, under review (preprint), DOI: 10.48550/arXiv.2209.02935.

Gagolewski M., A Framework for Benchmarking Clustering Algorithms, 2022, https://clustering-benchmarks.gagolewski.com.

Campello R.J.G.B., Moulavi D., Sander J., Density-based clustering based on hierarchical density estimates, Lecture Notes in Computer Science 7819, 2013, 160–172. DOI: 10.1007/978-3-642-37456-2_14.

Mueller A., Nowozin S., Lampert C.H., Information Theoretic Clustering using Minimum Spanning Trees, DAGM-OAGM, 2012.

Rezaei M., Fränti P., Set matching measures for external cluster validity, IEEE Transactions on Knowledge and Data Engineering 28(8), 2016, 2173–2186. DOI: 10.1109/TKDE.2016.2551240.

See the package's homepage for more references.
