
A generalised method of producing probabilistic clusters from a series of clusterings that have been generated from different representations of the same point-based data.


Welcome to FuzzyCat


FuzzyCat is a general-purpose soft-clustering algorithm that, given a series of clusterings on point-based data, is able to produce data-driven fuzzy clusters whose membership functions encapsulate the effects of any changes in the clusterings due to changes in the feature space of the data. The fuzzy clusters are produced empirically by finding groups of clusters within the many clusterings. The different clusterings may be governed by any underlying process that affects the clusters (e.g. stochastic sampling from uncertain data, temporal evolution of the data, clustering algorithm hyperparameter variation, etc.). In effect, FuzzyCat propagates the effects of the underlying process(es) into a soft-clustering which has had these effects abstracted away into the membership functions of the original point-based data.

Installation

The Python package fuzzycat is published on PyPI under the distribution name fuzzycategories and can be installed with pip:

python -m pip install fuzzycategories

Basic usage

FuzzyCat can be easily applied to any series of clusters that have been found from different representations of fuzzy point-based data. If the fuzzy data is actually uncertain data then these representations could (for example) be a series of random samples generated by sampling independently over each point's probability distribution. In such a scenario, the clusters found from each representation would differ by some amount that depends on the effect that the uncertainties have on the structure within the data set.

Getting some fuzzy data

To demonstrate this, we first need some data...

import numpy as np
import sklearn.datasets as data

# Generate some structured data with noise
np.random.seed(0)
background = np.random.uniform(-2, 2, (1000, 2))
moons, _ = data.make_moons(n_samples = 2000, noise = 0.1)
moons -= np.array([[0.5, 0.25]])    # centres moons on origin
gauss_1 = np.random.normal(-1.25, 0.2, (500, 2))
gauss_2 = np.random.normal(1.25, 0.2, (500, 2))

P = np.vstack([background, moons, gauss_1, gauss_2])

... however, this is not fuzzy data.

To make it fuzzy (in this scenario), we also need some description of the probability distribution of each point. The simplest version of this is to take the probability distributions as homogeneous and spherically-symmetric 2-dimensional Gaussians. This means that the uncertainty of every point is described by a Gaussian probability distribution, each with the same covariance matrix, i.e. $\sigma^2 I$ (where $\sigma$ is a constant and $I$ is the identity matrix).

So let's simply take $\sigma = 0.05$ by setting a variable covP = 0.05.
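To make the uncertainty model concrete, here is a minimal NumPy sketch of a single random realisation drawn under the covariance $\sigma^2 I$. It is an illustration only and not the FuzzyCat utility itself: the names P_demo, cov, and realisation are hypothetical, and P_demo is a small stand-in for P.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the mean positions P (5 points in 2 dimensions)
P_demo = rng.uniform(-2, 2, (5, 2))
sigma = 0.05

# Homogeneous, spherically-symmetric covariance matrix: sigma^2 * I
cov = sigma**2 * np.eye(P_demo.shape[1])

# One realisation: each point is perturbed independently by a draw from N(0, sigma^2 I)
offsets = rng.multivariate_normal(np.zeros(P_demo.shape[1]), cov, size=P_demo.shape[0])
realisation = P_demo + offsets
```

Repeating the last two lines with fresh draws gives further realisations, each of which could then be clustered separately.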

Generating different representations of the fuzzy data

In our scenario, we can generate the different representations by creating random samples of the data. Luckily, for Gaussian uncertainties, FuzzyCat comes prepared with a utility function to do this for us. At a minimum, it requires the mean values, P, and the covariances, covP. With only these arguments, it produces 100 representations, runs the AstroLink algorithm with its default parameters over each one, and saves the resultant clusters in the correct format within a new 'Clusters/' folder in the current directory.

The code to do this is simply...

from fuzzycat import FuzzyData

FuzzyData.clusteringsFromRandomSamples(P, covP)

Note: covP can also be a 1-, 2-, or 3-dimensional np.ndarray.

For clarity, here's a gif showing the clusterings produced for each realisation...

A gif showing the random sample clusterings from AstroLink.

Applying FuzzyCat

We have now generated various clusterings from our fuzzy data, and we can see that the uncertainties affect the clusters, as the clusters change between resamplings. By applying FuzzyCat, we can collate this information into one soft clustering that encapsulates these effects.

We just need to tell it that we have nSamples = 100 clusterings of nPoints = P.shape[0] (= 4000) points and then run it...

from fuzzycat import FuzzyCat

nSamples, nPoints = 100, P.shape[0]
fc = FuzzyCat(nSamples, nPoints)
fc.run()

... and it's done! FuzzyCat has found a representative soft clustering that has propagated the effects of the uncertainties into the AstroLink cluster model.
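For rough intuition about what a data-driven membership can look like, the toy sketch below takes a group of clusters that are assumed to correspond to one another across several clusterings and computes, for each point, the fraction of those clusterings in which it appears. This is an assumption made purely for illustration and is not FuzzyCat's actual procedure; matched_clusters and every other name here is hypothetical.

```python
import numpy as np

n_points = 6

# Hypothetical group of corresponding clusters, one per clustering,
# each listed as the indices of the points it contains
matched_clusters = [
    [0, 1, 2],
    [0, 1, 2, 3],
    [0, 1],
    [0, 1, 2],
]

# Count how many of the matched clusters contain each point
counts = np.zeros(n_points)
for members in matched_clusters:
    counts[members] += 1

# Empirical membership: fraction of clusterings that include the point
membership = counts / len(matched_clusters)
print(membership)
```

Points 0 and 1 appear in every cluster of the group and get a membership of 1, point 3 appears in only one of the four and gets 0.25, and points 4 and 5 never appear and get 0.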

Visualising the soft clusters

With the soft clustering found, we would like to see what it looks like. This is easy, because FuzzyCat also comes equipped with some useful plotting functions. As such, we can visualise the soft structure with...

from fuzzycat import FuzzyPlots

FuzzyPlots.plotFuzzyLabelsOnX(fc, P)

... which produces a figure whereby the points of P are coloured according to their membership within each of the fuzzy clusters. For this scenario, we get...

AstroLink clusters with propagated uncertainties.

... which shows that the effect of these uncertainties on the AstroLink clusters is to give them fuzzy borders, indicated by the colours of the points fading to black and/or mixing around the boundaries of these clusters.
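One simple way such a colouring could be built is a membership-weighted blend of one colour per cluster, so that shared points take a mixed colour and points with low total membership fade towards black. The sketch below shows that idea with NumPy. It is not the FuzzyPlots implementation, and the arrays memberships and cluster_colours are hypothetical.

```python
import numpy as np

# Hypothetical memberships: rows are points, columns are fuzzy clusters
memberships = np.array([
    [1.0, 0.0],   # firmly in cluster 0
    [0.5, 0.5],   # shared equally between both clusters
    [0.2, 0.0],   # weak member of cluster 0
    [0.0, 0.0],   # member of no cluster
])

# One RGB colour per cluster (red and blue)
cluster_colours = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Membership-weighted blend of the cluster colours, clipped to valid RGB values
point_colours = np.clip(memberships @ cluster_colours, 0.0, 1.0)
```

Here the shared point becomes purple, the weak member becomes dark red, and the point with no membership is black.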

Development installation

If you want to contribute to the development of fuzzycat, we recommend the following editable installation from this repository:

git clone https://github.com/william-h-oliver/fuzzycat.git
cd fuzzycat
python -m pip install --editable .[tests]

Having done so, the test suite can be run using pytest:

python -m pytest

Acknowledgments

This repository was set up using the SSC Cookiecutter for Python Packages.
