
Prototype-based Machine Learning on Distance Data

Copyright (C) 2019 - Benjamin Paassen
Machine Learning Research Group
Center of Excellence Cognitive Interaction Technology (CITEC)
Bielefeld University

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, see http://www.gnu.org/licenses/.

Introduction

This scikit-learn compatible Python3 library provides several algorithms to learn prototype models on distance data. At this time, the library features the following algorithms:

  • Relational neural gas (RNG) for clustering
  • Relational generalized learning vector quantization (RGLVQ) for classification
  • Median generalized learning vector quantization (MGLVQ) for classification

Refer to the QuickStart Guide for a note on how to use these models and to the Background section for more details on the algorithms.

Note that this library follows the scikit-learn convention for estimator interfaces.

If you intend to use this library in academic work, please cite the respective reference paper.

Installation

This package is available on pypi as proto_dist_ml. You can install it via

pip install proto_dist_ml

QuickStart Guide

For an example, we recommend taking a look at the demo in the notebook demo.ipynb. In general, all models in this library follow the scikit-learn convention, i.e. you need to perform the following steps (a code sketch follows the list):

  1. Instantiate your model, e.g. via model = proto_dist_ml.rng.RNG(K) where K is the number of prototypes.
  2. Fit your model to training data, e.g. via model.fit(D), where D is the matrix of pairwise distances between your training data points.
  3. Perform a prediction for test data, e.g. via model.predict(D), where D is the matrix of distances from the test to the training data points.
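As a minimal sketch of these three steps for relational neural gas (assuming Euclidean toy data and using scipy.spatial.distance.cdist to build the distance matrices; the constructor call and the fit/predict interface are taken from the steps above, the toy data and output handling are our own assumptions, and the authoritative example is demo.ipynb):

    import numpy as np
    from scipy.spatial.distance import cdist
    import proto_dist_ml.rng

    # toy data: two well-separated Gaussian blobs in the plane
    X_train = np.concatenate((np.random.randn(50, 2), np.random.randn(50, 2) + 5.))
    X_test  = np.concatenate((np.random.randn(10, 2), np.random.randn(10, 2) + 5.))

    # 1. instantiate the model with K = 2 prototypes
    model = proto_dist_ml.rng.RNG(2)

    # 2. fit on the pairwise distances between training points
    model.fit(cdist(X_train, X_train))

    # 3. predict cluster assignments from the test-to-training distances
    assignments = model.predict(cdist(X_test, X_train))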

Background

The basic idea of prototype models is that we can cluster and classify data by assigning them to the cluster/class of the closest prototype, where a prototype is a data point that represents the cluster/class well.

In the case of distance data, we cannot express a prototype in vectorial form but instead need to use an indirect form, namely a convex combination of existing data points. In other words, our $k$th prototype $w_k$ is defined as

\vec w_k = \sum_{i=1}^m \alpha_{k, i} \cdot \vec x_i
\qquad \text{where } \sum_{i=1}^m \alpha_{k, i} = 1
\text{ and } \alpha_{k, i} \geq 0 \quad \forall i

where $\vec x_1, \ldots, \vec x_m$ are the training data points and $\alpha_{k, 1}, \ldots, \alpha_{k, m}$ are the convex coefficients representing prototype $w_k$. Because the prototype is fully specified by the data and the convex coefficients, we no longer need any explicit form for $w_k$.

To cluster/classify new data, we now need to determine the distance between a data point $x$ and a prototype $w_k$. As it turns out, this distance can be expressed solely in terms of the convex coefficients and the data-to-data distances. In particular, we obtain:

d(x, w_k)^2 = \sum_{i=1}^m \alpha_{k, i} \cdot d(x, x_i)^2
- \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_{k, i} \cdot \alpha_{k, j} \cdot d(x_i, x_j)^2

In matrix-vector notation we obtain:

d(x, w_k)^2 = {\vec \alpha_k}^T \cdot \vec d^2
- \frac{1}{2} {\vec \alpha_k}^T \cdot D^2 \cdot \vec \alpha_k

where $\vec d^2$ is the vector of squared distances between $x$ and all training data points $x_i$ and where $D^2$ is the matrix of squared distances between the training data points.
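As a small NumPy sketch of this formula (an illustration with our own variable names, not library code):

    import numpy as np

    def relational_sq_dist(alpha_k, d2, D2):
        # alpha_k : (m,)   convex coefficients of prototype w_k
        # d2      : (m,)   squared distances d(x, x_i)^2 to all training points
        # D2      : (m, m) squared pairwise training distances d(x_i, x_j)^2
        return alpha_k @ d2 - 0.5 * alpha_k @ D2 @ alpha_k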

The main challenge for distance-based prototype learning is now to optimize the coefficients $\alpha_{k, i}$ according to some meaningful loss function. The loss function and its optimization differs between the algorithms. In more detail, we take the following approaches.

Relational Neural Gas

Relational neural gas (RNG; Hammer and Hasenfuss, 2007) is a clustering approach that tries to optimize the loss function

\sum_{i=1}^m \sum_{k=1}^K h_{i, k} \cdot d(x_i, w_k)^2

where $h_{i, k}$ quantifies how responsible prototype $w_k$ is for data point $x_i$. This term is calculated as follows:

h_{i, k} = \exp(-r_{i, k} / \lambda) \qquad \text{where } r_{i, k} = |\{ l | d(x_i, w_l) < d(x_i, w_k) \}|

In other words, $w_k$ is the $r_{i, k}$-closest prototype to data point $x_i$, and the lower a prototype's rank (i.e. the closer it is), the higher its responsibility for the data point. $\lambda$ is a scaling factor that controls how many prototypes are still taken into account. By default, we start with $\lambda = K / 2$ and then anneal $\lambda$ until $\lambda = 0.01$, i.e. only the closest prototype is considered. At that point, the loss above is equivalent to the $K$-means loss.

Given the current values for $h_{i, k}$, optimizing the convex coefficients $\alpha_{k, i}$ is possible in closed form. In particular, we obtain: $\alpha_{k, i} = h_{i, k} / \sum_j h_{j, k}$. The RNG training procedure thus consists of three steps which are iterated in each training epoch:

  1. Compute the responsibilities $h_{i, k}$.
  2. Compute the new convex coefficients $\alpha_{k, i}$.
  3. Decrease $\lambda$.
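The following NumPy sketch illustrates one such epoch in matrix form. It is an illustration of the update equations above, not the code in proto_dist_ml/rng.py; the annealing factor 0.9 is an arbitrary placeholder and ties in the ranking are ignored.

    import numpy as np

    def rng_epoch(D2, Alpha, lam):
        # D2    : (m, m) squared pairwise training distances
        # Alpha : (K, m) convex coefficients, one row per prototype
        # lam   : current neighborhood range lambda
        # squared data-to-prototype distances via the relational formula
        Dp = D2 @ Alpha.T - 0.5 * np.sum((Alpha @ D2) * Alpha, axis=1)
        # 1. responsibilities h_{i,k} from the rank of each prototype
        ranks = np.argsort(np.argsort(Dp, axis=1), axis=1)
        H = np.exp(-ranks / lam)
        # 2. new convex coefficients alpha_{k,i} = h_{i,k} / sum_j h_{j,k}
        Alpha = (H / H.sum(axis=0)).T
        # 3. anneal lambda
        return Alpha, lam * 0.9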

Relational Generalized Learning Vector Quantization

Relational generalized learning vector quantization (RGLVQ; Hammer, Hofmann, Schleif, and Zhu, 2014) is a classification approach which aims at optimizing the generalized learning vector quantization loss:

\sum_{i=1}^m \Phi\Big(\frac{d_i^+ - d_i^-}{d_i^+ + d_i^-}\Big)

where $d_i^+$ is the distance of data point $x_i$ to the closest prototype with the same label, $d_i^-$ is the distance of data point $x_i$ to the closest prototype with a different label, and $\Phi$ is a squashing function (such as tanh or the logistic function). Note that data point $x_i$ is correctly classified if and only if $d_i^+ - d_i^- < 0$. As such, the GLVQ loss can be regarded as a soft approximation of the classification error.

Note that this loss has the drawback that the data-to-prototype distances need to be strictly positive in order to guarantee a nonzero denominator. This excludes non-Euclidean distances (i.e. distances which do not correspond to an inner product), because these may yield negative squared data-to-prototype distances under the relational distance formula above.

We optimize this loss via L-BFGS, restricting the coefficients to be convex in each step. The gradient follows directly from the formula above and the distance formula above. For more details, refer to (Hammer et al., 2014).
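For illustration, the GLVQ loss can be computed from a matrix of data-to-prototype distances as in the following sketch (with $\Phi$ chosen as the identity; the variable names are our own, not the library's):

    import numpy as np

    def glvq_loss(Dp, y, proto_labels):
        # Dp           : (m, K) data-to-prototype distances
        # y            : (m,)   data point labels
        # proto_labels : (K,)   prototype labels
        same    = y[:, None] == proto_labels[None, :]
        d_plus  = np.where(same, Dp, np.inf).min(axis=1)   # closest same-label prototype
        d_minus = np.where(~same, Dp, np.inf).min(axis=1)  # closest other-label prototype
        # Phi is taken as the identity here
        return np.sum((d_plus - d_minus) / (d_plus + d_minus))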

Median Generalized Learning Vector Quantization

Median generalized learning vector quantization (MGLVQ; Nebel, Hammer, Frohberg, and Villmann, 2015) is a variant of GLVQ that restricts prototypes to be strictly data points, i.e. for each prototype $w_k$ there exists exactly one $i$, such that $\alpha_{k, i} = 1$ and every other coefficient is zero. This has two key advantages. First, it permits non-Euclidean and even asymmetric distances because we do not rely on an interpolation between data points. Second, it is more efficient during classification because we can compute the distances to the prototypes directly and do not need to use the relational distance formula above.

However, MGLVQ is also more challenging to train because we cannot apply a smooth gradient-based method but must instead resort to a discrete optimization scheme. In this toolbox, we optimize the GLVQ loss (see above) via greedy hill climbing, i.e. we look for a prototype-to-data-point re-assignment that reduces the loss and apply the first such move we find, until no improving move remains.
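A rough sketch of this hill-climbing scheme, where glvq_loss_for is a hypothetical helper that evaluates the GLVQ loss for a given assignment of prototypes to data points (not a function of this library):

    def hill_climb(proto_idx, K, m, glvq_loss_for):
        # proto_idx[k] is the index of the data point serving as prototype k
        improved = True
        while improved:
            improved = False
            current = glvq_loss_for(proto_idx)
            for k in range(K):
                for i in range(m):
                    candidate = list(proto_idx)
                    candidate[k] = i
                    if glvq_loss_for(candidate) < current:
                        proto_idx = candidate   # accept the first improving move
                        improved = True
                        break
                if improved:
                    break
        return proto_idx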

Contents

This library contains the following files.

  • demo.ipynb : A demo script illustrating how to use this library.
  • LICENSE.md : A copy of the GPLv3 license.
  • mglvq_test.py : A set of unit tests for mglvq.py.
  • proto_dist_ml/mglvq.py : An implementation of median generalized learning vector quantization.
  • proto_dist_ml/rglvq.py : An implementation of relational generalized learning vector quantization.
  • proto_dist_ml/rng.py : An implementation of relational neural gas.
  • README.md : This file.
  • rglvq_test.py : A set of unit tests for rglvq.py.
  • rng_test.py : A set of unit tests for rng.py.

Licensing

This library is licensed under the GNU General Public License Version 3.

Dependencies

This library depends on NumPy for matrix operations, on scikit-learn for the base interfaces and on SciPy for optimization.

Literature

  • Hammer, B., & Hasenfuss, A. (2007). Relational Neural Gas. Proceedings of the 30th Annual German Conference on AI (KI 2007), 190-204. doi:10.1007/978-3-540-74565-5_16
  • Hammer, B., Hofmann, D., Schleif, F., & Zhu, X. (2014). Learning vector quantization for (dis-)similarities. Neurocomputing, 131, 43-51. doi:10.1016/j.neucom.2013.05.054
  • Nebel, D., Hammer, B., Frohberg, K., & Villmann, T. (2015). Median variants of learning vector quantization for learning of dissimilarity data. Neurocomputing, 169, 295-305. doi:10.1016/j.neucom.2014.12.096
