A generalised method of producing probabilistic clusters from a series of clusterings that have been generated from different representations of the same point-based data.
Welcome to FuzzyCat
FuzzyCat is a general-purpose soft-clustering algorithm that, given a series of clusterings on point-based data, is able to produce data-driven fuzzy clusters whose membership functions encapsulate the effects of any changes in the clusterings due to changes in the feature space of the data. The fuzzy clusters are produced empirically by finding groups of clusters within the many clusterings. The different clusterings may be governed by any underlying process that affects the clusters (e.g. stochastic sampling from uncertain data, temporal evolution of the data, clustering algorithm hyperparameter variation, etc.). In effect, FuzzyCat propagates the effects of the underlying process(es) into a soft-clustering which has had these effects abstracted away into the membership functions of the original point-based data.
Installation
The Python package fuzzycat can be installed from PyPI:
python -m pip install fuzzycategories
Basic usage
FuzzyCat can be easily applied to any series of clusters that have been found from different representations of fuzzy point-based data. If the fuzzy data is actually uncertain data then these representations could (for example) be a series of random samples generated by sampling independently over each point's probability distribution. In such a scenario, the clusters found from each representation would differ by some amount that depends on the effect that the uncertainties have on the structure within the data set.
Getting some fuzzy data
To demonstrate this, we first need some data...
import numpy as np
import sklearn.datasets as data
# Generate some structured data with noise
np.random.seed(0)
background = np.random.uniform(-2, 2, (1000, 2))
moons, _ = data.make_moons(n_samples = 2000, noise = 0.1)
moons -= np.array([[0.5, 0.25]]) # centres moons on origin
gauss_1 = np.random.normal(-1.25, 0.2, (500, 2))
gauss_2 = np.random.normal(1.25, 0.2, (500, 2))
P = np.vstack([background, moons, gauss_1, gauss_2])
... however, this is not fuzzy data.
To make it fuzzy (in this scenario), we also need some description of the probability distribution of each point. The simplest version of this is to take the probability distributions as homogeneous and spherically-symmetric 2-dimensional Gaussians. This means that the uncertainty of every point is described by a Gaussian probability distribution, each having the same covariance matrix, i.e. $\sigma^2 I$ (where $\sigma$ is a constant and $I$ is the identity matrix).
So let's simply take $\sigma = 0.05$ by setting a variable covP = 0.05.
Generating different representations of the fuzzy data
In our scenario, we can generate the different representations by creating random samples of the data. Luckily, for Gaussian uncertainties, FuzzyCat comes prepared with a utility function to do this for us. At a minimum, it requires the mean values, P, and the covariances, covP. Given only these, it will produce 100 representations, run the AstroLink algorithm with its default parameters over each, and save the resultant clusters in the correct format within a new 'Clusters/' folder located in the current directory.
The code to do this is simply...
from fuzzycat import FuzzyData
FuzzyData.clusteringsFromRandomSamples(P, covP)
[!NOTE]
covP can also be a 1-, 2-, or 3-dimensional np.ndarray.
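For intuition, a single random representation under the Gaussian model described above can also be drawn by hand. The following is a minimal sketch and not the internals of the utility function: it treats 0.05 as the standard deviation $\sigma$, as stated in the text, and the names P_demo, sigma, and realisations are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the mean positions P (any (nPoints, 2) array works here)
P_demo = rng.uniform(-2, 2, (5, 2))
sigma = 0.05  # assumed standard deviation of each point's isotropic Gaussian

# Draw three independent realisations: every coordinate of every point is
# perturbed by zero-mean Gaussian noise with standard deviation sigma,
# which corresponds to the covariance sigma^2 * I for each point
realisations = [P_demo + rng.normal(0.0, sigma, P_demo.shape) for _ in range(3)]

for sample in realisations:
    # Largest coordinate shift in this realisation
    print(np.abs(sample - P_demo).max())
```

Each realisation has the same shape as the original data, so any clustering algorithm that runs on P can be run on each realisation in turn.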
For clarity, here's a gif showing the clusterings produced for each realisation...
Applying FuzzyCat
We have now generated various clusterings from our fuzzy data, and we can see that the uncertainties affect the clusters, since the clusters change between resamplings. By applying FuzzyCat, we can collate this information into one soft clustering that encapsulates these effects.
We just need to tell it that we have nSamples = 100 clusterings of nPoints = P.shape[0] (= 4000) points and then run it...
from fuzzycat import FuzzyCat
nSamples, nPoints = 100, P.shape[0]
fc = FuzzyCat(nSamples, nPoints)
fc.run()
... and it's done! FuzzyCat has found a representative soft clustering that has propagated the effects of the uncertainties into the AstroLink cluster model.
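To build intuition for what a soft clustering encodes, the toy sketch below estimates a membership value as the fraction of clusterings in which a point falls inside a given cluster. It assumes that the clusters have already been matched across clusterings (the grouping step that FuzzyCat handles itself), and the labels array is hypothetical illustration data, not FuzzyCat's input or output format.

```python
import numpy as np

# Hypothetical hard labels: rows are clusterings, columns are points.
# -1 marks a point left unclustered in that clustering.
labels = np.array([
    [0, 0, 1, 1, -1],
    [0, 0, 0, 1, -1],
    [0, 0, 1, 1,  1],
    [0, 1, 1, 1, -1],
])

n_clusters = 2

# memberships[k, i] = fraction of clusterings in which point i is in cluster k
memberships = np.stack([(labels == k).mean(axis=0) for k in range(n_clusters)])
print(memberships)
```

In this toy example, points 1 and 2 sit near the boundary, so their membership is split between the two clusters, while point 4 is rarely clustered at all, so its total membership is low.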
Visualising the soft clusters
With the soft clustering found, we would like to see what it looks like. This is easy, because FuzzyCat also comes equipped with some useful plotting functions. As such, we can visualise the soft structure with...
from fuzzycat import FuzzyPlots
FuzzyPlots.plotFuzzyLabelsOnX(fc, P)
... which produces a figure whereby the points of P are coloured according to their membership within each of the fuzzy clusters. For this scenario, we get...
... which shows that the effect of these uncertainties on the AstroLink clusters is to give them fuzzy borders, indicated by the colours of the points fading to black and/or mixing around the boundaries of these clusters.
Development installation
If you want to contribute to the development of fuzzycat, we recommend the following editable installation from this repository:
git clone https://github.com/william-h-oliver/fuzzycat.git
cd fuzzycat
python -m pip install --editable .[tests]
Having done so, the test suite can be run using pytest:
python -m pytest
Acknowledgments
This repository was set up using the SSC Cookiecutter for Python Packages.