
Not Too Deep Clustering

This is a library implementation of n2d. To learn more about N2D and clustering manifolds of autoencoded embeddings, please refer to the amazing paper published in August 2019.

What is it?

Not Too Deep clustering is a state-of-the-art "deep" clustering technique in which the data is first embedded using an autoencoder. Then, instead of clustering the embedding with a deep clustering network, we use a manifold learner to find the underlying (local) manifold of the embedding, and cluster that manifold. In the paper, this was shown to produce high-quality clusters without the extreme feature engineering usually required for clustering.
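
Conceptually, the whole pipeline fits in a few lines. The following is a minimal sketch of the idea only, using a toy one-layer autoencoder and made-up hyperparameters; it is not the library's actual implementation:

import umap
from keras.layers import Dense, Input
from keras.models import Model
from sklearn.mixture import GaussianMixture

def n2d_sketch(x, n_clusters, latent_dim = 10):
    # 1) train a small autoencoder and keep the encoder half
    inp = Input(shape=(x.shape[1],))
    encoded = Dense(latent_dim, activation='relu')(inp)
    decoded = Dense(x.shape[1])(encoded)
    autoencoder = Model(inp, decoded)
    autoencoder.compile(optimizer='adam', loss='mse')
    autoencoder.fit(x, x, batch_size=256, epochs=10, verbose=0)
    embedding = Model(inp, encoded).predict(x)

    # 2) learn the local manifold of the embedding
    manifold = umap.UMAP(n_components=2, n_neighbors=10,
                         min_dist=0.0).fit_transform(embedding)

    # 3) cluster the manifold
    return GaussianMixture(n_components=n_clusters).fit_predict(manifold)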

This repository provides a framework for A) reproducing the study and B) extending it, for further research and for use in a variety of applications.

Usage

First, let's load in some data. In this example, we will use the Human Activity Recognition (HAR) dataset, in which time series collected from mobile devices are used to classify what a person is doing (walking, sitting, etc.).

import datasets as data

# x: feature matrix, y: integer labels, y_names: activity names
x, y, y_names = data.load_har()
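
As a quick sanity check, we can look at what came back. The shapes below assume the standard 561-feature UCI HAR extraction; your copy may differ:

# shapes assume the standard UCI HAR feature set; adjust if your copy differs
print(x.shape)   # e.g. (10299, 561): samples x extracted features
print(y.shape)   # e.g. (10299,): integer activity labels
print(y_names)   # names of the activities, e.g. "WALKING", "SITTING"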

Next, let's set up our deep learning environment and load the necessary libraries:

import os
import random as rn
import numpy as np

import matplotlib
matplotlib.use('agg')  # non-interactive backend; set before importing pyplot
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use(['seaborn-white', 'seaborn-paper'])
sns.set_context("paper", font_scale=1.3)

import tensorflow as tf
from keras import backend as K

# set up the environment for reproducibility
os.environ['PYTHONHASHSEED'] = '0'

# seed every source of randomness: Python, TensorFlow, and NumPy
rn.seed(0)
tf.set_random_seed(0)
np.random.seed(0)

if len(K.tensorflow_backend._get_available_gpus()) > 0:
    print("Using GPU")
    # single-threaded ops keep GPU runs closer to deterministic
    session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                                  inter_op_parallelism_threads=1)
    sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
    K.set_session(sess)

Finally, we are ready to get clustering!

import n2d as nd

n_clusters = 6  # there are 6 classes in HAR

# Initialize everything
harcluster = nd.n2d(x, nclust = n_clusters)

The first step in using this framework is to initialize an n2d object with the dataset and the number of clusters. The primary purpose of this step is to set up the autoencoder for training.

Next, we pretrain the autoencoder. In this step, you can fiddle with the batch size, etc. On the first run of the autoencoder, we want to include the weight_id parameter, which saves the weights in weights/ so that we do not have to retrain the autoencoder during later experiments.

harcluster.preTrainEncoder(weight_id = "har")
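
If you want to adjust the training itself, the settings would be passed here. Note that the parameter names below are purely illustrative; check the n2d source for the actual signature of preTrainEncoder:

# hypothetical call; batch_size and epochs are illustrative names, not confirmed n2d API
harcluster.preTrainEncoder(weight_id = "har", batch_size = 256, epochs = 1000)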

The next time we want to use this autoencoder, we will instead use the weights argument:

harcluster.preTrainEncoder(weights = "har-1000-ae_weights.h5")

The next important step is to define the manifold clustering method to be used:

manifoldGMM = nd.UmapGMM(n_clusters)
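
UmapGMM pairs UMAP, which learns the manifold of the embedding, with a Gaussian mixture model, which clusters it. A rough sketch of that pairing is shown below; the actual class in n2d may differ in its defaults and details:

import umap
from sklearn.mixture import GaussianMixture

# rough sketch of a UMAP + GMM pairing; the real UmapGMM may differ
class UmapGMMSketch:
    def __init__(self, nclust, umapdim = 2, umapN = 10, umapMd = float(0)):
        self.manifoldInEmbedding = umap.UMAP(n_components = umapdim,
                                             n_neighbors = umapN,
                                             min_dist = umapMd)
        self.clusterManifold = GaussianMixture(n_components = nclust)

    def predict(self, hl):
        # learn the manifold of the autoencoded embedding, then cluster it
        hle = self.manifoldInEmbedding.fit_transform(hl)
        return self.clusterManifold.fit_predict(hle)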

Now we can make a prediction, as well as visualize and assess the results:

harcluster.predict(manifoldGMM)
# predictions are stored in harcluster.preds
harcluster.visualize(y, y_names, dataset = "har", nclust = n_clusters)
print(harcluster.assess(y))
# (0.81212, 0.71669, 0.64013)

Before viewing the results, let's talk about the metrics. The first is cluster accuracy, which here is 81.2%: state of the art for the HAR dataset. The second is NMI (normalized mutual information), another label-based measure of cluster quality that is independent of the number of clusters; our NMI of 0.717 is likewise state of the art for this dataset. The last metric, ARI (adjusted Rand index), compares the true groupings with ours: a value of 1 means the groupings are identical, while a value of 0 means they agree no better than chance. Our value of 0.640 indicates that our predictions are largely, though not perfectly, in agreement with the truth.
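
For reference, all three metrics can be computed with scikit-learn and scipy. The sketch below shows a standard formulation of cluster accuracy (best cluster-to-label assignment via the Hungarian algorithm), NMI, and ARI; assess() presumably does something similar, though this is not necessarily the library's exact code:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def cluster_accuracy(y_true, y_pred):
    # count co-occurrences of each (cluster, label) pair, then find the
    # one-to-one cluster-to-label mapping that maximizes accuracy
    d = max(y_pred.max(), y_true.max()) + 1
    w = np.zeros((d, d), dtype=np.int64)
    for i in range(y_pred.size):
        w[y_pred[i], y_true[i]] += 1
    row, col = linear_sum_assignment(-w)
    return w[row, col].sum() / y_pred.size

acc = cluster_accuracy(y, harcluster.preds)
nmi = normalized_mutual_info_score(y, harcluster.preds)
ari = adjusted_rand_score(y, harcluster.preds)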

[Figures: N2D predicted clusters vs. actual clusters]

Extending

So far, this framework only includes the manifold clustering method the authors of the paper deemed best: UMAP with Gaussian mixture clustering. Let's say, however, that we want to try spectral clustering instead:

from sklearn.cluster import SpectralClustering
import umap

class UmapSpectral:
    def __init__(self, nclust,
                 umapdim = 2,
                 umapN = 10,
                 umapMd = float(0),
                 umapMetric = 'euclidean',
                 random_state = 0):
        self.nclust = nclust
        # change this bit to change the manifold learner
        self.manifoldInEmbedding = umap.UMAP(
            random_state = random_state,
            metric = umapMetric,
            n_components = umapdim,
            n_neighbors = umapN,
            min_dist = umapMd
        )
        # change this bit to change the clustering mechanism
        self.clusterManifold = SpectralClustering(
            n_clusters = nclust,
            affinity = 'nearest_neighbors',
            random_state = random_state
        )

    def predict(self, hl):
        # if you change the clustering method or the manifold learner,
        # you'll want to change this method too
        hle = self.manifoldInEmbedding.fit_transform(hl)
        y_pred = self.clusterManifold.fit_predict(hle)
        return y_pred

Now we can run and assess our new clustering method:

manifoldSC = UmapSpectral(6)
harcluster.predict(manifoldSC)
print(harcluster.assess(y))
# (0.40946, 0.42137, 0.14973)

This clearly did not go as well; however, it shows how easy it is to extend the library. We could also try swapping UMAP for Isomap, swapping the clustering method for k-means, or perhaps using a deep clustering technique, as sketched below.
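
For instance, a hypothetical Isomap + k-means variant following the same pattern (the class and parameter names here are our own, not part of n2d):

from sklearn.cluster import KMeans
from sklearn.manifold import Isomap

class IsomapKmeans:
    def __init__(self, nclust, dim = 2, n_neighbors = 10, random_state = 0):
        self.nclust = nclust
        # Isomap replaces UMAP as the manifold learner
        self.manifoldInEmbedding = Isomap(n_components = dim,
                                          n_neighbors = n_neighbors)
        # k-means replaces spectral clustering
        self.clusterManifold = KMeans(n_clusters = nclust,
                                      random_state = random_state)

    def predict(self, hl):
        hle = self.manifoldInEmbedding.fit_transform(hl)
        return self.clusterManifold.fit_predict(hle)

It plugs into harcluster.predict() exactly like the classes above.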

Roadmap

  • Package library
  • Implement other promising methods
  • Make assessment/visualization more extensible
  • Documentation?
  • Find an elegant way to deal with pre training weights
  • Package on Nix
  • Blog post?
