Python package for the GenSVM classifier

This is the documentation of the Python package for the GenSVM classifier, introduced in GenSVM: A Generalized Multiclass Support Vector Machine by Gerrit J.J. van den Burg and Patrick J.F. Groenen.

The source code of this package is available on GitHub at: https://github.com/GjjvdBurg/PyGenSVM.

Installation

Before GenSVM can be installed, a working NumPy installation is required. Please see the installation instructions for NumPy, then install GenSVM using the instructions below.

GenSVM can be easily installed through pip:

pip install gensvm

Citing

If you use this package in your research, please cite the paper, for instance using the following BibTeX entry:

@article{JMLR:v17:14-526,
  author  = {Gerrit J.J. van den Burg and Patrick J.F. Groenen},
  title   = {{GenSVM}: A Generalized Multiclass Support Vector Machine},
  journal = {Journal of Machine Learning Research},
  year    = {2016},
  volume  = {17},
  number  = {225},
  pages   = {1-42},
  url     = {http://jmlr.org/papers/v17/14-526.html}
}

Usage

The package contains two classes to fit the GenSVM model: GenSVM and GenSVMGridSearchCV. The first fits a single GenSVM model; the second fits a series of models for a parameter grid search. The interface to these classes is the same as that of classifiers in Scikit-Learn, so users familiar with Scikit-Learn should have no trouble using this package. Below we show some examples of using the GenSVM classifier and the GenSVMGridSearchCV class in practice.

In the examples we assume that we have loaded the iris dataset from Scikit-Learn as follows:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.preprocessing import maxabs_scale
>>> X, y = load_iris(return_X_y=True)
>>> X = maxabs_scale(X)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)

Note that we scale the data using the maxabs_scale function. This scales the columns of the data matrix to [-1, 1] without breaking sparsity. Scaling the dataset can have a significant effect on the computation time of GenSVM and is generally recommended for SVMs.
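
To make the scaling step concrete, here is a minimal NumPy sketch of column-wise max-abs scaling. The helper name maxabs_scale_columns and the small demo matrix are made up for illustration; for dense input the result should agree with Scikit-Learn's maxabs_scale, but this is not its implementation.

```python
import numpy as np

def maxabs_scale_columns(X):
    """Divide each column by its largest absolute value."""
    X = np.asarray(X, dtype=float)
    col_max = np.abs(X).max(axis=0)
    col_max[col_max == 0.0] = 1.0  # leave all-zero columns untouched
    return X / col_max

# Illustrative data only: the third column is entirely zero.
X_demo = np.array([[1.0, -4.0, 0.0],
                   [2.0,  0.0, 0.0],
                   [0.0,  2.0, 0.0]])
X_scaled = maxabs_scale_columns(X_demo)
print(X_scaled)
```

Because every entry is only divided by a positive constant, zero entries stay zero, which is why this kind of scaling does not break sparsity.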

Example 1: Fitting a single GenSVM model

Let’s start by fitting the most basic GenSVM model on the training data:

>>> from gensvm import GenSVM
>>> clf = GenSVM()
>>> clf.fit(X_train, y_train)
GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0,
kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05,
max_iter=100000000.0, p=1.0, random_state=None, verbose=0,
weights='unit')

With the model fitted, we can predict the test dataset:

>>> y_pred = clf.predict(X_test)

Next, we can compute a score for the predictions. The GenSVM class has a score method which computes the accuracy_score for the predictions. In the GenSVM paper, the adjusted Rand index is often used to compare performance. We illustrate both options below (your results may be different depending on the exact train/test split):

>>> clf.score(X_test, y_test)
1.0
>>> from sklearn.metrics import adjusted_rand_score
>>> adjusted_rand_score(y_test, y_pred)
1.0
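
The two scores measure different things, which the following sketch illustrates on hand-made labels (the label vectors are invented for this example and are not GenSVM output). Accuracy compares labels position by position, whereas the adjusted Rand index compares the groupings induced by the labels and is therefore unaffected by a consistent renaming of classes.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_same = [0, 0, 1, 1, 2, 2]     # identical labels
y_swapped = [1, 1, 0, 0, 2, 2]  # same grouping, but classes 0 and 1 are swapped

acc_same = accuracy_score(y_true, y_same)
acc_swapped = accuracy_score(y_true, y_swapped)
ari_same = adjusted_rand_score(y_true, y_same)
ari_swapped = adjusted_rand_score(y_true, y_swapped)

print(acc_same, acc_swapped)  # accuracy drops when labels are swapped
print(ari_same, ari_swapped)  # the adjusted Rand index does not
```

For a classifier this means a high adjusted Rand index alone does not guarantee that the predicted class names are correct, so it can be useful to report both.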

We can try this again with different model parameters. For instance, we can turn on verbosity and use the Euclidean norm in the GenSVM model by setting p = 2:

>>> clf2 = GenSVM(verbose=True, p=2)
>>> clf2.fit(X_train, y_train)
Starting main loop.
Dataset:
    n = 112
    m = 4
    K = 3
Parameters:
    kappa = 0.000000
    p = 2.000000
    lambda = 0.0000100000000000
    epsilon = 1e-06

iter = 0, L = 3.4499531579689533, Lbar = 7.3369415851139745, reldiff = 1.1266786095824437
...
Optimization finished, iter = 4046, loss = 0.0230726364692517, rel. diff. = 0.0000009998645783
Number of support vectors: 9
GenSVM(coef=0.0, degree=2.0, epsilon=1e-06, gamma='auto', kappa=0.0,
    kernel='linear', kernel_eigen_cutoff=1e-08, lmd=1e-05,
    max_iter=100000000.0, p=2, random_state=None, verbose=True,
    weights='unit')

For other parameters that can be tuned in the GenSVM model, see the documentation of the GenSVM class.

Example 2: Fitting a GenSVM model with a “warm start”

One of the key features of the GenSVM classifier is that training can be accelerated by using so-called “warm starts”. This way the optimization can be started in a location that is closer to the final solution than a random starting position would be. To support this, the fit method of the GenSVM class has an optional seed_V parameter. We’ll illustrate how this can be used below.

We start with a relatively large value for the epsilon parameter in the model. This is the stopping parameter that determines how long the optimization continues (and therefore how exact the fit is).

>>> clf1 = GenSVM(epsilon=1e-3)
>>> clf1.fit(X_train, y_train)
...
>>> clf1.n_iter_
163

The n_iter_ attribute tells us how many iterations the optimization ran. Now, we can use the solution of this model to start the training for the next model:

>>> clf2 = GenSVM(epsilon=1e-8)
>>> clf2.fit(X_train, y_train, seed_V=clf1.combined_coef_)
...
>>> clf2.n_iter_
3196

Compare this to a model with the same stopping parameter, but without the warm start:

>>> clf2.fit(X_train, y_train)
...
>>> clf2.n_iter_
3699

So we saved about 500 iterations! This effect will be especially significant with large datasets and when you try out many parameter configurations. Therefore, this technique is built into the GenSVMGridSearchCV class, which can be used to do a grid search over parameters.
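
The two-step warm start above can be extended to a chain of progressively stricter stopping values. The sketch below is one possible way to do this by hand; the helper fit_with_warm_starts and its make_model factory argument are hypothetical names introduced here, and this is not how GenSVMGridSearchCV is implemented internally. It relies only on the fit(..., seed_V=...) call and the combined_coef_ attribute shown above.

```python
def fit_with_warm_starts(make_model, X, y, epsilons):
    """Fit one model per epsilon, seeding each fit with the previous solution.

    make_model is a hypothetical factory that returns an unfitted model for a
    given epsilon, for example: lambda eps: GenSVM(epsilon=eps)
    """
    models = []
    seed = None
    for eps in sorted(epsilons, reverse=True):  # loosest tolerance first
        model = make_model(eps)
        if seed is None:
            model.fit(X, y)               # first fit: no warm start available
        else:
            model.fit(X, y, seed_V=seed)  # warm start from the previous solution
        seed = model.combined_coef_
        models.append(model)
    return models

# Possible usage with the data from the examples above:
#   from gensvm import GenSVM
#   models = fit_with_warm_starts(lambda eps: GenSVM(epsilon=eps),
#                                 X_train, y_train, [1e-3, 1e-6, 1e-8])
#   [m.n_iter_ for m in models]
```

Sorting the stopping values from loose to strict means each fit starts from a solution that is cheap to obtain and already close to the next one.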

Known Limitations

The following are known limitations that are on the roadmap for a future release of the package. If you need any of these features, please vote on them on the linked GitHub issues (this can make us add them sooner!).

  1. Support for sparse matrices. SciPy supports sparse matrices, as does the GenSVM C library. Getting them to work together requires some time. In the meantime, if you really want to use sparse data with GenSVM (this can lead to significant speedups!), check out the GenSVM C library.

  2. Specification of instance weights. Currently the package allows for two modes of instance weights: unit weights where each instance gets weight 1 and group weights where instances get weights inversely proportional to the size of their class. In the future, we want to allow the user to specify a vector of weights as well.

  3. Specification of class misclassification weights. Currently, incorrectly classifying an object from class A to class C is as bad as incorrectly classifying an object from class B to class C. Depending on the application, this may not be the desired effect. Adding class misclassification weights can solve this issue.
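
To illustrate the two instance-weight modes described in item 2, the following NumPy sketch computes unit weights and group weights for a small label vector. The function name instance_weights and the normalization n / (K * n_k) are assumptions made for this illustration; the text above only states that group weights are inversely proportional to class size, so the package may scale them differently.

```python
import numpy as np

def instance_weights(y, mode="unit"):
    """Return one weight per instance for the 'unit' or 'group' mode."""
    y = np.asarray(y)
    if mode == "unit":
        return np.ones(len(y))
    if mode == "group":
        classes, inverse, counts = np.unique(
            y, return_inverse=True, return_counts=True
        )
        # Assumed normalization: n / (K * n_k), so the weights sum to n.
        return len(y) / (len(classes) * counts[inverse])
    raise ValueError("mode must be 'unit' or 'group'")

# Illustrative labels: class sizes 3, 2 and 1.
y_demo = [0, 0, 0, 1, 1, 2]
w_unit = instance_weights(y_demo, "unit")
w_group = instance_weights(y_demo, "group")
print(w_unit)
print(w_group)
```

With group weights, instances from the smallest class receive the largest weight, so each class contributes equally to the total weight regardless of its size.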

Questions and Issues

If you have any questions or encounter any issues with using this package, please raise them on GitHub.

License

This package is licensed under the GNU General Public License version 3.

Copyright G.J.J. van den Burg, excluding the sections of the code that are explicitly marked to come from Scikit-Learn.
