Density-based clustering for exploratory data analysis based on multi-parameter persistence

Persistent and stable clustering (Persistable) is a density-based clustering algorithm intended for exploratory data analysis. What distinguishes Persistable from other clustering algorithms is its visualization capabilities: its interactive mode lets you visualize the multi-scale and multi-density cluster structure present in the data, which in turn guides the choice of parameters that lead to the final clustering.

Usage

Here is a brief outline of the main functionality; see the documentation for details, including the API reference.

To start Persistable's interactive mode from a Jupyter notebook, run the following in a cell:

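import persistable
from sklearn.datasets import make_blobs

# Sample 2000 points from 4 blobs; make_blobs returns (X, y), so keep only X.
X = make_blobs(2000, centers=4, random_state=1)[0]

# Build the Persistable object and launch the interactive UI.
p = persistable.Persistable(X)
pi = persistable.PersistableInteractive(p)
pi.start_ui()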

The last command returns the localhost port that serves the UI, which is 8050 by default. Open localhost:8050 in your web browser to access the graphical user interface.
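
Because start_ui returns the port, a small helper can turn it into an address and confirm that a server is listening before you open the browser. The following is a minimal sketch using only the Python standard library; the helper names ui_url and is_listening are hypothetical and are not part of Persistable.

```python
import socket


def ui_url(port, host="localhost"):
    """Build the address to open in a browser (hypothetical helper)."""
    return f"http://{host}:{port}"


def is_listening(port, host="localhost", timeout=1.0):
    """Return True if a TCP server accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 8050 is the default port mentioned above; with Persistable running you
# would instead pass the value returned by pi.start_ui().
print(ui_url(8050))
```

Calling is_listening with the returned port is a quick way to tell a UI that failed to start apart from a browser or proxy problem.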

After choosing your parameters using the user interface, you can get your clustering in another Jupyter cell by running:

clustering_labels = pi.cluster()
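
For a quick overview of the result, you can count how many points fall in each cluster. This is a sketch under assumptions: it treats the labels as a sequence with one integer label per data point and, following the scikit-learn convention, assumes that -1 marks noise points. Check the API reference for what pi.cluster() actually returns. The labels below are stand-in data, not real output, and cluster_sizes is a hypothetical helper.

```python
from collections import Counter


def cluster_sizes(labels, noise_label=-1):
    """Count points per cluster, reporting noise separately.

    Assumes one integer label per point and that `noise_label`
    marks unclustered points (an assumption, not confirmed here).
    """
    counts = Counter(int(label) for label in labels)
    noise = counts.pop(noise_label, 0)
    return dict(sorted(counts.items())), noise


# Stand-in labels for illustration; replace with clustering_labels.
example_labels = [0, 0, 1, -1, 2, 1, 0, -1]
sizes, n_noise = cluster_sizes(example_labels)
print(sizes, n_noise)
```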

Note: To display the graphical user interface directly in the Jupyter notebook, use pi.start_ui(jupyter_mode="inline").

Installing

Make sure you are using Python 3. Persistable depends on the following Python packages, which are installed automatically when you install with pip: numpy, scipy, scikit-learn, cython, plotly, dash, diskcache, multiprocess, psutil. To install from PyPI, run the following:

pip install persistable-clustering
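
To confirm that the installation succeeded, you can query the installed distribution from Python. This is a minimal sketch that relies on importlib.metadata from the standard library (Python 3.8 or later); installed_version is a hypothetical helper, and the distribution name is the one used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="persistable-clustering"):
    """Return the installed version string, or None if not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version())
```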

Documentation and support

You can find the documentation at persistable.readthedocs.io. If you have further questions, please open an issue and we will do our best to help you. Please include as much information as possible, including your system's information, warnings, logs, screenshots, and anything else you think may be of use. If you do not wish to open an issue, you are also welcome to contact Luis Scoccola directly. Please be patient if it takes us a bit to get back to you.

Running the tests

You can run the tests with the following commands from the root directory of a clone of this repository. If a test fails, please report a bug and include as much information as possible, including your system's information, warnings, logs, screenshots, and anything else you think may be of use.

pip install pytest playwright pytest-playwright
python -m playwright install --with-deps
pip install -r requirements.txt
python -m setup build_ext --inplace
pytest .

Details about theory and implementation

Persistable is based on multi-parameter persistence [4], a method from topological data analysis. The theory behind Persistable is developed in [1], while this implementation uses the high-performance algorithms for density-based clustering developed in [2] and implemented in [3]. Persistable's interactive mode is inspired by RIVET [5] and is implemented in Dash.

Contributing

To contribute, you can fork the project, make your changes, and submit a pull request. You may want to contact Luis Scoccola first, to make sure your work does not overlap with ongoing work.

Authors

Luis Scoccola and Alexander Rolle.

Citing

If you use this package in your work, you may cite the corresponding paper using the following BibTeX entry:

@article{Scoccola2023,
    doi = {10.21105/joss.05022},
    url = {https://doi.org/10.21105/joss.05022},
    year = {2023},
    publisher = {The Open Journal},
    volume = {8},
    number = {83},
    pages = {5022},
    author = {Luis Scoccola and Alexander Rolle},
    title = {Persistable: persistent and stable clustering},
    journal = {Journal of Open Source Software}
}

References

[1] Stable and consistent density-based clustering. A. Rolle and L. Scoccola. arXiv:2005.09048

[2] Accelerated Hierarchical Density Based Clustering. L. McInnes, J. Healy. 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017

[3] hdbscan: Hierarchical density based clustering. L. McInnes, J. Healy, S. Astels. Journal of Open Source Software, The Open Journal, volume 2, number 11. 2017

[4] An Introduction to Multiparameter Persistence. M. B. Botnan, M. Lesnick. Proceedings of the 2020 International Conference on Representations of Algebras. 2022

[5] RIVET. The RIVET Developers.

License

This software is published under the 3-clause BSD license.
