Prototypebased Machine Learning on Distance Data.
Project description
Prototypebased Machine Learning on Distance Data
Copyright (C) 2019  Benjamin Paassen
Machine Learning Research Group
Center of Excellence Cognitive Interaction Technology (CITEC)
Bielefeld University
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, see http://www.gnu.org/licenses/.
Introduction
This scikitlearn compatible, Python3 library provides several algorithms to learn prototype models on distance data. At this time, this library features the following algorithms:
 Relational Neural Gas (Hammer and Hasenfuss, 2007) for clustering,
 Relational Generalized Learning Vector Quantization (Hammer, Hofmann, Schleif, and Zhu, 2014) for classification, and
 Median Generalized Learning Vector Quantization (Nebel, Hammer, Frohberg, and Villmann, 2015) for classification.
Refer to the Quickstart Guide for a note on how to use these models and refer to the Background section for more details on the algorithms.
Note that this library follows the
If you intend to use this library in academic work, please cite the respective reference paper.
Installation
This package is available on pypi
as proto_dist_ml
. You can install
it via
pip install proto_dist_ml
QuickStart Guide
For an example we recommend to take a look at the demo in the notebook
demo.ipynb
. In general, all models in this library follow the scikitlearn
convention, i.e. you need to perform the following steps:
 Instantiate your model, e.g. via
model = proto_dist_ml.rng.RNG(K)
whereK
is the number of prototypes.  Fit your model to training data, e.g. via
model.fit(D)
, whereD
is the matrix of pairwise distances between your training data points.  Perform a prediction for test data, e.g. via
model.predict(D)
, whereD
is the matrix of distances from the test to the training data points.
Background
The basic idea of prototype models is that we can cluster and classify data by assigning them to the cluster/class of the closest prototype, where a prototype is a data point that represents the cluster/class well.
In case of distance data, we can not express a prototype in vectorial form but
instead need to use an indirect form, namely a convex combination of existing
data points. In other words, our $k
$th prototype $w_k
$ is defined as
\vec w_k = \sum_{i=1}^m \alpha_{k, i} \cdot \vec x_i
\qquad \text{where } \sum_{i=1}^m \alpha_{k, i} = 1
\text{ and } \alpha_{k, i} \geq 0 \quad \forall i
where $\vec x_1, \ldots, \vec x_m
$ are the training data points and where
$\alpha_{k, 1}, \ldots, \alpha_{k, m}
$ are the convex coefficiants
representing prototype $w_k
$. Because the prototype is fully specified by
the data and the convex coefficients, we do not need any explicit form for
$w_k
$ anymore.
To cluster/classify new data, we now need to determine the distance between a
data point $x
$ and a prototpe $w_k
$. As it turns out, this distance can
also be expressed solely in terms of the convex coefficients and the
datatodata distances. In particular, we obtain:
d(x, w_k)^2 = \sum_{i=1}^m \alpha_{k, i} \cdot d(x, x_i)^2
 \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_{k, i} \cdot \alpha_{k, j} \cdot d(x_i, x_j)^2
In matrixvector notation we obtain:
d(x, w_k)^2 = {\vec \alpha_k}^T \cdot \vec d^2
 \frac{1}{2} {\vec \alpha_k}^T \cdot D^2 \cdot \vec \alpha_k
where $\vec d^2
$ the vector of distances between $x
$ and all training
data points $x_i
$ and where $D^2
$ is the distance matrix between the
training data points.
The main challenge for distancebased prototype learning is now to optimize
the coefficients $\alpha_{k, i}
$ according to some meaningful loss function.
The loss function and its optimization differs between the algorithms. In more
detail, we take the following approaches.
Relational Neural Gas
Relational neural gas (RNG; Hammer and Hasenfuss, 2007) is a clustering approach that tries to optimize the loss function
\sum_{i=1}^m \sum_{k=1}^K h_{i, k} \cdot d(x_i, w_k)^2
where $h_{i, k}
$ quantifies how responsible prototype $w_k
$ is for
data point $x_i
$. This term is calculated as follows:
h_{i, k} = \exp(r_{i, k} / \lambda) \qquad \text{where } r_{i, k} = \{ l  d(x_i, w_l) < d(x_i, w_k) \}
In other words, $w_k
$ is the $r_{i, k}
$closest prototype to data point
$x_i
$ and the lower ranked a prototype is (i.e. the closer it is), the higher
is its responsibility for the data point. $\lambda
$ is a scaling factor that
expresses how many prototypes are still considered. Per default, we start with
$\lambda = K / 2
$ and then anneal $\lambda
$ until $\lambda = 0.01
$, i.e.
only the closest prototype is considered. At that point, the loss above is
equivalent to the $K
$means loss.
Given the current values for $h_{i, k}$, optimizing the convex coefficients
$\alpha_{k, i}
$ is possible in closed form. In particular, we obtain:
$\alpha_{k, i} = h_{i, k} / \sum_j h_{j, k}
$. The RNG
training procedure thus consists of three steps which are iterated in each
training epoch:
 Compute the responsibilities $
h_{i, k}
$.  Compute the new convex coefficients $
\alpha_{k, i}
$.  Decrease $
\lambda
$.
Relational Generalized Learning Vector Quantization
Relational generalized learning vector quantization (RGLVQ; Hammer, Hofmann, Schleif, and Zhu, 2014) is a classification approach which aims at optimizing the generalized learning vector quantization loss:
\sum_{i=1}^m \Phi\Big(\frac{d_i^+  d_i^}{d_i^+ + d_i^}\Big)
where $d_i^+
$ is the distance of data point $x_i
$ to the closest prototype
with the same label, $d_i^
$ is the distance of data point $x_i
$ to the
closest prototype with a different label, and $\Phi
$ is a squashing
function (such as tanh or the logistic function).
Note that data point $x_i
$ is correctly classified if and only
if $d_i^+  d_i^ < 0
$. As such, the GLVQ loss can be regarded as a soft
approximation of the classification error.
Note that this loss has the drawback that distances need to be strictly positive in order to guarantee a nonzero denominator. This excludes nonEuclidean distances (i.e. distances which do not correspond to an inner product) because these may imply negative datatoprototype distances.
We optimize this loss via LBFGS, restricting the coefficients to be convex in each step. The gradient follows directly from the formula above and the distance formula above. For more details, refer to (Hammer et al., 2014).
Median Generalized Learning Vector Quantization
Median generalized learning vector quantization
(MGLVQ; Nebel, Hammer, Frohberg, and Villmann, 2015) is a variant
of GLVQ that restricts prototypes to be strictly data points, i.e. for each
prototype $w_k
$ there exists exactly one $i
$, such that $\alpha_{k, i} = 1
$
and every other coefficient is zero. This has two key advantages. First, it
permits nonEuclidean and even asymmetric distances because we do not rely on
an interpolation between data points. Second, it is more efficient during
classification because we can compute the distances to the prototypes directly
and do not need to use the relational distance formula above.
However, MGLVQ is also more challenging to train because we can not perform a smooth gradient method but instead must apply a discrete optimization scheme. In this toolbox, we optimize the GLVQ loss (see above) via greedy hill climbing, i.e. we try to find any prototypedatapoint combination that would reduce the loss and apply the first such combination we find until no such move exists anymore.
Contents
This library contains the following files.
demo.ipynb
: A demo script illustrating how to use this library.LICENSE.md
: A copy of the GPLv3 license.mglvq_test.py
: A set of unit tests formglvq.py
.proto_dist_ml/mglvq.py
: An implementation of median generalized learning vector quantization.proto_dist_ml/rglvq.py
: An implementation of relational generalized learning vector quantization.proto_dist_ml/rng.py
: An implementation of relational neural gas.README.md
: This file.rglvq_test.py
: A set of unit tests forrglvq.py
.rng_test.py
: A set of unit tests forrng.py
.
Licensing
This library is licensed under the GNU General Public License Version 3.
Dependencies
This library depends on NumPy for matrix operations, on scikitlearn for the base interfaces and on SciPy for optimization.
Literature
 Hammer, B. & Hasenfuss, A. (2007). Relational Neural Gas. Proceedings of the 30th Annual German Conference on AI (KI 2007), 190204. doi:10.1007/9783540745655_16. Link
 Hammer, B., Hofmann, D., Schleif, F., & Zhu, X. (2014). Learning vector quantization for (dis)similarities. Neurocomputing, 131, 4351. doi:10.1016/j.neucom.2013.05.054. Link
 Nebel, D., Hammer, B., Frohberg, K., & Villmann, T. (2015). Median variants of learning vector quantization for learning of dissimilarity data. Neurocomputing, 169, 295305. doi:10.1016/j.neucom.2014.12.096
Project details
Release history Release notifications  RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for proto_dist_ml1.0.1py3noneany.whl
Algorithm  Hash digest  

SHA256  865e6230ea16be6f817876de4c36181e87f087b68872bca5e7d0c165f2e94f1e 

MD5  cd3b98b0add015f0520ee1e186f69b40 

BLAKE2b256  98a71fb559b58caf104dda5ade55500c9e434103278a2f0f8237fd7b18ca59ef 