sprm · PyPI

Sparse Partial Robust M Regression, including plot functions

These details have not been verified by PyPI

Project links

Homepage

Project description

.. role:: raw-html-m2r(raw)
   :format: html

sprm
====

Sparse partial robust M regression
==================================

Python code for Sparse Partial Robust M regression (SPRM) [1], a sparse and robust version of univariate partial least squares (PLS1).

Description
-----------

The method performs four tasks at the same time in a single, consistent estimate:

* *regression*\ : yields regression coefficients and predicts responses
* *dimension reduction*\ : calculates interpretable PLS-like components maximizing covariance to the predictand in a robust way
* *variable selection*\ : depending on the parameter settings, can yield highly sparse regression coefficients that contain exact zero elements
* *outlier detection and compensation*\ : yields a set of case weights in [0,1]. The lower the weight, the more outlying a case is. The estimate itself is outlier robust.
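
As an illustration of how exact zeros can arise in the variable selection step, the sketch below soft-thresholds a weighting vector. The rule used here (subtract ``eta`` times the largest absolute weight, clip at zero, restore the sign) is the usual sparse PLS thresholding and is an assumption about what the package does internally; ``soft_threshold`` is a hypothetical helper, not part of sprm.

```python
def soft_threshold(weights, eta):
    """Shrink a weighting vector and set its small entries to exactly zero.

    Assumed rule: |w_i| is reduced by eta * max|w|; entries that would
    become negative are set to 0.0, and the original sign is restored.
    """
    if not 0 <= eta < 1:
        raise ValueError("eta must lie in [0, 1)")
    cutoff = eta * max(abs(w) for w in weights)
    return [
        (abs(w) - cutoff) * (1 if w > 0 else -1) if abs(w) > cutoff else 0.0
        for w in weights
    ]


w = [0.9, -0.5, 0.1, -0.05]
sparse_w = soft_threshold(w, eta=0.5)  # the two smallest entries become 0.0
```

With ``eta=0`` the vector is returned unchanged; values of ``eta`` closer to 1 zero out more entries.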

The code is aligned with scikit-learn, so that tools such as GridSearchCV can be applied to it directly.

The repository contains

* The estimator (sprm.py)
* Plotting functionality based on Matplotlib (also in sprm.py)
* Robust data pre-processing (robcent.py)
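
To give an idea of what robust pre-processing means here, the following sketch centres a column by its median and scales it by the median absolute deviation (MAD). It uses only the standard library; ``robust_autoscale`` is a hypothetical helper, and the consistency factor 1.4826 is an assumption, so robcent.py may differ in detail.

```python
from statistics import median


def robust_autoscale(column, consistency=1.4826):
    """Centre by the median and scale by the MAD (illustrative helper only)."""
    loc = median(column)
    # MAD: median of absolute deviations from the median, times an assumed
    # consistency factor that makes it comparable to a standard deviation.
    sca = consistency * median(abs(v - loc) for v in column)
    if sca == 0:
        raise ValueError("MAD is zero; column cannot be scaled")
    return [(v - loc) / sca for v in column], loc, sca


# The outlier (100.0) barely influences the location and scale estimates.
scaled, loc, sca = robust_autoscale([1.0, 2.0, 3.0, 4.0, 100.0])
```

In contrast, mean and standard deviation would both be pulled strongly towards the outlying value.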

How to install
--------------

The package is distributed through PyPI; install it with:

.. code-block::

   pip install sprm

The SPRM estimator
==================

The main SPRM implementation yields a class with the following structure:

Dependencies
------------

* From <sklearn.base>: BaseEstimator,TransformerMixin,RegressorMixin
* From <sklearn.utils>: _BaseComposition
* copy
* From <scipy.stats>: norm,chi2
* numpy
* From <matplotlib>: pyplot

Parameters
----------

* eta: float. Sparsity parameter in [0,1)
* n_components: int > 1. Note that if applied on data, n_components shall take a value <= min(x_data.shape)
* fun: str, downweighting function. 'Hampel' (recommended), 'Fair' or 'Huber'
* probp1: float, probability cutoff for start of downweighting (e.g. 0.95)
* probp2: float, probability cutoff for start of steep downweighting (e.g. 0.975, only relevant if fun='Hampel')
* probp3: float, probability cutoff for start of outlier omission (e.g. 0.999, only relevant if fun='Hampel')
* centring: str, type of centring ('mean' or 'median', the latter recommended)
* scaling: str, type of scaling ('std','mad', the latter recommended, or 'None')
* verbose: boolean, specifying verbose mode
* maxit: int, maximal number of iterations in M algorithm
* tol: float, tolerance for convergence in M algorithm
* start_cutoff_mode: str, value 'specific' will set starting value cutoffs specific to X and y (preferred); any other value will set X and y starting cutoffs identically. The non-specific setting yields identical results to the SPRM R implementation available from `CRAN <https://cran.r-project.org/web/packages/sprm/index.html>`_.
* start_X_init: str, value 'pcapp' will include a PCA/broken stick projection to calculate the initial predictor block case weights; any other value will just calculate initial predictor block case weights based on Euclidean distances within that block. The latter is less stable for very flat data (p >> n).
* columns (def False): Either boolean or list. If False, no column names are supplied. If a list (must have length x_data.shape[1]), the column names of x_data supplied in this list will be printed in verbose mode
* copy (def True): boolean, whether to create deep copy of the data in the calculation process
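
The role of probp1, probp2 and probp3 can be sketched with a three-part Hampel weight function. The version below is a generic textbook form and assumes that the cutoffs are standard normal quantiles of the three probabilities; the package may derive its cutoffs differently (the dependencies list both norm and chi2), so treat ``hampel_weights`` as a hypothetical illustration.

```python
from statistics import NormalDist


def hampel_weights(distances, probp1=0.95, probp2=0.975, probp3=0.999):
    """Three-part redescending Hampel weights in [0, 1] (illustrative only)."""
    # Assumption: cutoffs are standard normal quantiles of the probabilities.
    q1, q2, q3 = (NormalDist().inv_cdf(p) for p in (probp1, probp2, probp3))
    weights = []
    for d in distances:
        d = abs(d)
        if d <= q1:        # regular case: full weight
            w = 1.0
        elif d <= q2:      # start of downweighting
            w = q1 / d
        elif d <= q3:      # steep downweighting, reaching 0 at q3
            w = q1 * (q3 - d) / ((q3 - q2) * d)
        else:              # outlier omission
            w = 0.0
        weights.append(w)
    return weights


w = hampel_weights([0.5, 1.8, 2.5, 4.0])
```

The four example distances fall into the four regions in turn, so the weights decrease from 1 to 0.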

Attributes
----------

* ``x_weights_``: X block PLS weighting vectors (usually denoted W)
* ``x_loadings_``: X block PLS loading vectors (usually denoted P)
* ``C_``: vector of inner relationship between response and latent variables block
* ``x_scores_``: X block PLS score vectors (usually denoted T)
* ``coef_``: vector of regression coefficients
* ``intercept_``: intercept
* ``coef_scaled_``: vector of scaled regression coefficients (when scaling option used)
* ``intercept_scaled_``: scaled intercept
* ``residuals_``: vector of regression residuals
* ``x_ev_``: X block explained variance per component
* ``y_ev_``: y block explained variance
* ``fitted_``: fitted response
* ``x_Rweights_``: X block SIMPLS style weighting vectors (usually denoted R)
* ``x_caseweights_``: X block case weights
* ``y_caseweights_``: y block case weights
* ``caseweights_``: combined case weights
* ``colret_``: names of variables retained in the sparse model
* ``x_loc_``: X block location estimate
* ``y_loc_``: y location estimate
* ``x_sca_``: X block scale estimate
* ``y_sca_``: y scale estimate

Methods
-------

* fit(X,y): fit model
* predict(X): make predictions based on fit
* transform(X): project X onto latent space
* weightnewx(X): calculate X case weights
* getattr(): get list of attributes
* setattr(**kwargs): set individual attribute of sprm object
* valscore(X,y,scoring): option to use weighted scoring function in cross-validation if scoring=weighted
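
The idea behind weighted scoring can be sketched as a case-weighted mean squared error, in which downweighted cases contribute less to the validation score. This is a minimal sketch under the assumption that weights are normalised by their sum; ``weighted_mse`` is a hypothetical helper and not necessarily the metric that valscore computes.

```python
def weighted_mse(y_true, y_pred, caseweights):
    """Mean squared error in which each case counts according to its weight."""
    total = sum(caseweights)
    if total == 0:
        raise ValueError("all case weights are zero")
    return sum(
        w * (yt - yp) ** 2 for yt, yp, w in zip(y_true, y_pred, caseweights)
    ) / total


# The third case is badly predicted but has weight 0, so it is ignored.
score = weighted_mse([1.0, 2.0, 10.0], [1.0, 2.0, 0.0], [1.0, 1.0, 0.0])
```

With all weights equal to 1 the function reduces to the ordinary mean squared error.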

Ancillary functions
--------------------

* snipls (class): sparse NIPALS regression (first described in: [2])
* Hampel: Hampel weight function
* Huber: Huber weight function
* Fair: Fair weight function
* brokenstick: broken stick rule to estimate number of relevant principal components
* robcent (class): robust centring and scaling

Example
-------

To run a toy example:

* Source packages and data:

.. code-block::

   import pandas as pd
   data = pd.read_csv("./Returns_shares.csv")
   columns = data.columns[2:8]
   data = data.values[:,2:8]
   # first five selected columns are predictors, the sixth is the response
   X = data[:,0:5]
   y = data[:,5]
   X0 = X.astype('float')
   y0 = y.astype('float')

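* Estimate and predict by SPRM:

.. code-block::

   from sprm import sprm
   # n_components=2, eta=.8; the remaining positional arguments follow the order of the parameter list above
   res_sprm = sprm.sprm(2,.8,'Hampel',.95,.975,.999,'median','mad',True,100,.01,'ally','xonly',columns,True)
   res_sprm.fit(X0[:2666],y0[:2666])
   res_sprm.predict(X0[2666:])      # predicted responses for the held-out cases
   res_sprm.transform(X0[2666:])    # scores of the held-out cases
   res_sprm.weightnewx(X0[2666:])   # X case weights of the held-out cases
   res_sprm.get_params()
   res_sprm.set_params(fun="Huber")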

* Cross-validate using GridSearchCV:

.. code-block::

   import numpy as np
   from sklearn.model_selection import GridSearchCV
   res_sprm_cv = GridSearchCV(sprm.sprm(), cv=10,
                              param_grid={"n_components": [1, 2, 3],
                                          "eta": np.arange(.1,.9,.05).tolist()})
   res_sprm_cv.fit(X0[:2666],y0[:2666])
   res_sprm_cv.best_params_

Plotting functionality
======================

The file sprm.py also contains a set of plot functions based on Matplotlib. The class plot contains plots for sprm objects, whereas the class plot_cv contains a plot for cross-validation.

Dependencies
------------

* pandas
* numpy
* matplotlib.pyplot
* for plotting cross-validation results: sklearn.model_selection.GridSearchCV

Parameters
----------

* res_sprm, sprm. An sprm class object that has been fit.
* colors, list of str entries. Only mandatory input. Elements determine colors as:

  * [0]: borders of pane
  * [1]: plot background
  * [2]: marker fill
  * [3]: diagonal line
  * [4]: marker contour, if different from fill
  * [5]: marker color for new cases, if applicable
  * [6]: marker color for harsh calibration outliers
  * [7]: marker color for harsh prediction outliers

* markers, a list of str entries. Elements determine markers for:

  * [0]: regular cases
  * [1]: moderate outliers
  * [2]: harsh outliers
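
One way to picture how the three marker types relate to case weights is sketched below. It assumes that a case weight of 1 marks a regular case, a weight strictly between 0 and 1 a moderate outlier, and a weight of 0 a harsh outlier; these thresholds and the helper ``outlier_class`` are illustrative assumptions, not taken from the package.

```python
def outlier_class(weight):
    """Classify a case by its case weight (assumed thresholds)."""
    if weight == 0:
        return "harsh"
    if weight < 1:
        return "moderate"
    return "regular"


markers = ['o', 'd', 'v']  # regular cases, moderate outliers, harsh outliers
marker_for = {
    "regular": markers[0],
    "moderate": markers[1],
    "harsh": markers[2],
}

caseweights = [1.0, 0.4, 0.0]
chosen = [marker_for[outlier_class(w)] for w in caseweights]
```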

Methods
-------

* ``plot_coeffs(entity="coef_",truncation=0,columns=[],title=[])``: Plot regression coefficients, loadings, etc. with the option to plot only the x% smallest and largest coefficients (truncation)
* plot_yyp(ytruev=[],Xn=[],label=[],names=[],namesv=[],title=[],legend_pos='lower right',onlyval=False): Plot y vs y predicted.
* plot_projections(Xn=[],label=[],components = [0,1],names=[],namesv=[],title=[],legend_pos='lower right',onlyval=False): Plot score space.
* plot_caseweights(Xn=[],label=[],names=[],namesv=[],title=[],legend_pos='lower right',onlyval=False,mode='overall'): Plot caseweights, with the option to plot 'x', 'y' or 'overall' case weights for cases used to train the model. For new cases, only 'x' weights can be plotted.
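
The truncation option of plot_coeffs can be pictured as selecting a fraction of the smallest and of the largest coefficients. The sketch below is a hypothetical helper that assumes "smallest" and "largest" refer to signed values and that at least one coefficient is kept at each end; the actual selection logic in the package may differ.

```python
def truncate_coefficients(coef, names, truncation):
    """Return (name, value) pairs for the `truncation` fraction of smallest
    and of largest coefficients, ordered from smallest to largest."""
    if not 0 < truncation <= 0.5:
        raise ValueError("truncation must lie in (0, 0.5]")
    order = sorted(range(len(coef)), key=lambda i: coef[i])
    k = max(1, round(truncation * len(coef)))
    # dict.fromkeys removes duplicates (possible for very short vectors)
    keep = list(dict.fromkeys(order[:k] + order[-k:]))
    return [(names[i], coef[i]) for i in keep]


coef = [0.4, -1.2, 0.05, 2.0, -0.3, 0.0, 0.9, -0.01, 0.2, 1.5]
names = ["x%d" % i for i in range(len(coef))]
selected = truncate_coefficients(coef, names, truncation=0.2)
```

With ten coefficients and ``truncation=0.2``, the two smallest and the two largest coefficients are kept.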

Remark
------

The latter three methods work both for cases that the model has been trained with (no additional input) and for new cases (requires Xn and, in the case of plot_yyp, ytruev), with the option to plot only the latter (option onlyval = True). All three functions have the option to plot case names if supplied as a list.

Ancillary classes
-----------------

* sprmplotcv has method eta_ncomp_contour(title) to plot sklearn GridSearchCV results
* ABline2D plots the first diagonal in y vs y predicted plots.

Example (continued)
--------------------

* initialize some values:

.. code-block::

   import numpy as np
   colors = ["white","#BBBBDD","#0000DD",'#1B75BC','#4D4D4F','orange','red','black']
   markers = ['o','d','v']
   label = ["AIG"]
   names = [str(i) for i in range(1,len(res_sprm.y)+1)]
   namesv = [str(i) for i in range(1,len(y0[2667:])+1)]

* run sprm.plot:

.. code-block::

   res_sprm_plot = sprm.plot(res_sprm,colors)

* plot coefficients:

.. code-block::

   res_sprm_plot.plot_coeffs(title="All AIG SPRM scaled b")
   res_sprm_plot.plot_coeffs(truncation=.05,columns=columns,title="5% smallest and largest AIG sprm b")

.. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_b.png
   :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_b.png
   :alt: AIG sprm regression coefficients

* plot y vs y predicted, training cases only:

.. code-block::

   res_sprm_plot.plot_yyp(label=label,title="AIG SPRM y vs. y predicted")
   res_sprm_plot.plot_yyp(label=label,names=names,title="AIG SPRM y vs. y predicted")

.. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_yyp_train.png
   :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_yyp_train.png
   :alt: AIG sprm y vs y predicted, training set

* plot y vs y predicted, including test cases:

.. code-block::

   res_sprm_plot.plot_yyp(ytruev=y0[2667:],Xn=X0[2667:],label=label,names=names,namesv=namesv,title="AIG SPRM y vs. y predicted")
   res_sprm_plot.plot_yyp(ytruev=y0[2667:],Xn=X0[2667:],label=label,title="AIG SPRM y vs. y predicted")

.. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_yyp_train_test.png
   :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_yyp_train_test.png
   :alt: AIG sprm y vs y predicted, training set

* plot y vs y predicted, only test set cases:

.. code-block::

   res_sprm_plot.plot_yyp(ytruev=y0[2667:],Xn=X0[2667:],label=label,title="AIG SPRM y vs. y predicted",onlyval=True)

* plot score space, options as above, with the second one shown here:

.. code-block::

   res_sprm_plot.plot_projections(Xn=X0[2667:],label=label,names=names,namesv=namesv,title="AIG SPRM score space, components 1 and 2")
   res_sprm_plot.plot_projections(Xn=X0[2667:],label=label,title="AIG SPRM score space, components 1 and 2")
   res_sprm_plot.plot_projections(Xn=X0[2667:],label=label,namesv=namesv,title="AIG SPRM score space, components 1 and 2",onlyval=True)

.. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_T12.png
   :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_T12.png
   :alt: AIG sprm score space

* plot caseweights, options as above, with the second one shown here:

.. code-block::

   res_sprm_plot.plot_caseweights(Xn=X0[2667:],label=label,names=names,namesv=namesv,title="AIG SPRM caseweights")
   res_sprm_plot.plot_caseweights(Xn=X0[2667:],label=label,title="AIG SPRM caseweights")
   res_sprm_plot.plot_caseweights(Xn=X0[2667:],label=label,namesv=namesv,title="AIG SPRM caseweights",onlyval=True)

.. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_caseweights.png
   :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_caseweights.png
   :alt: AIG sprm caseweights

* plot cross-validation results:

.. code-block::

   res_sprm_plot_cv = sprm.plot_cv(res_sprm_cv,colors)
   res_sprm_plot_cv.eta_ncomp_contour()
   res_sprm_plot_cv.cv_score_table_

.. image:: https://github.com/SvenSerneels/sprm/blob/master/AIG_CV.png
   :target: https://github.com/SvenSerneels/sprm/blob/master/AIG_CV.png
   :alt: AIG sprm CV results

References
----------

#. `Sparse partial robust M regression <https://www.sciencedirect.com/science/article/abs/pii/S0169743915002440>`_\ , Irene Hoffmann, Sven Serneels, Peter Filzmoser, Christophe Croux, Chemometrics and Intelligent Laboratory Systems, 149 (2015), 50-59.
#. `Sparse and robust PLS for binary classification <https://onlinelibrary.wiley.com/doi/abs/10.1002/cem.2775>`_\ , I. Hoffmann, P. Filzmoser, S. Serneels, K. Varmuza, Journal of Chemometrics, 30 (2016), 153-162.

Work to do
----------

* while the code is aligned with sklearn, it does not yet 100% follow the naming conventions therein
* optimize for speed
* manipulations in robcent can be written more elegantly
* suggestions always welcome

This version: 0.1.6 (Jul 31, 2018)
