Orthogonal Projection to Latent Structures

pyopls

Orthogonal Projection to Latent Structures in Python.

This package provides a scikit-learn-style transformer to perform OPLS. OPLS is a pre-processing method that removes variation from the descriptor variables that is orthogonal to the target variable (1).
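The filtering step described above can be pictured with a short NumPy sketch. It follows the single-target O-PLS iteration published by Trygg and Wold (1): compute the predictive weights from X and y, take the part of the loadings that is orthogonal to those weights, and deflate X by the resulting orthogonal component. The name `opls_filter` and its return values are illustrative assumptions and are not part of pyopls; in practice the package's own `OPLS` class should be used.

```python
import numpy as np


def opls_filter(X, y, n_components=1):
    """Remove y-orthogonal components from X (single-column y).

    Illustrative sketch of the O-PLS steps, not the pyopls implementation.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    X = X - X.mean(axis=0)  # center the descriptors
    y = y - y.mean()        # center the target

    # Predictive weights: the direction in X that covaries with y.
    w = X.T @ y
    w /= np.linalg.norm(w)

    orthogonal_scores = []
    for _ in range(n_components):
        t = X @ w                          # predictive scores
        p = X.T @ t / (t @ t)              # predictive loadings
        w_ortho = p - (w @ p) * w          # part of p orthogonal to w
        w_ortho /= np.linalg.norm(w_ortho)
        t_ortho = X @ w_ortho              # orthogonal scores
        p_ortho = X.T @ t_ortho / (t_ortho @ t_ortho)
        X = X - np.outer(t_ortho, p_ortho)  # deflate X
        orthogonal_scores.append(t_ortho)
    return X, np.column_stack(orthogonal_scores)


# Synthetic data, used only to exercise the sketch.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 5))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=30)
Z_demo, T_ortho_demo = opls_filter(X_demo, y_demo, n_components=2)
```

By construction, each removed score vector is orthogonal to the centered target, so the filtering leaves the covariance between X and y untouched.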

This package also provides a class to validate OPLS models. The filtering is evaluated with a 1-component PLS regression under cross-validation, and permutation tests (2) are run for the regression and classification metrics (by permuting the target) and for the feature PLS loadings (by permuting the features).
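As a rough illustration of the permutation test on the target, the sketch below shuffles y repeatedly, recomputes a metric each time, and reports the fraction of permutations that score at least as well as the unpermuted data. The helper names and the squared-correlation metric are stand-ins chosen for brevity (the validator works with cross-validated PLS metrics), and the convention of counting the observed value among the permutations is a common choice assumed here, not something confirmed from the pyopls source.

```python
import numpy as np


def permutation_p_value(metric, X, y, n_permutations=199, seed=0):
    """Permutation p-value for a metric where larger means better."""
    rng = np.random.default_rng(seed)
    observed = metric(X, y)
    permuted = np.array(
        [metric(X, rng.permutation(y)) for _ in range(n_permutations)]
    )
    # The observed value is counted as one of the permutations,
    # so the p-value can never be exactly zero.
    p_value = (np.sum(permuted >= observed) + 1) / (n_permutations + 1)
    return observed, p_value


def squared_correlation(x, y):
    """Stand-in metric: squared Pearson correlation of two vectors."""
    return np.corrcoef(x, y)[0, 1] ** 2


# Synthetic data with a strong x-y relationship.
rng = np.random.default_rng(1)
x_demo = rng.normal(size=40)
y_demo = x_demo + 0.1 * rng.normal(size=40)
observed, p_value = permutation_p_value(squared_correlation, x_demo, y_demo)
```

With 199 permutations the smallest attainable p-value is 1/200, which is why the number of permutations limits how significant a result can appear.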

Notes:

  • The implementation provided here is equivalent to that of the libPLS MATLAB library, which is a faithful recreation of Trygg and Wold's algorithm.
    • This package uses a different definition of R2X, however (see the note on score() below).
  • OPLS inherits sklearn.base.TransformerMixin (like sklearn.decomposition.PCA) but does not inherit sklearn.base.RegressorMixin because it is not a regressor like sklearn.cross_decomposition.PLSRegression. You can use the output of OPLS.transform() as an input to another regressor or classifier.
  • Like sklearn.cross_decomposition.PLSRegression, OPLS will center both X and Y before performing the algorithm. This makes centering by class in PLS-DA models unnecessary.
  • The score() function of OPLS returns the R2X score: the ratio of the variance in the transformed X to the variance in the original X. A lower score indicates that more orthogonal variance was removed.
  • OPLS only supports 1-column targets.
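The R2X definition in the notes above can be written out as a small NumPy sketch. Summing per-feature variances to get the total variance is an assumption about how the ratio is formed, and `r2x_ratio` is a hypothetical helper, not a pyopls function.

```python
import numpy as np


def r2x_ratio(X, Z):
    """Total variance of the filtered data Z over total variance of X.

    Assumes "variance" means the sum of per-feature variances.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    return np.sum(np.var(Z, axis=0)) / np.sum(np.var(X, axis=0))


# Synthetic data: remove one component from X by projection,
# which mimics what a single orthogonal component does.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
X_centered = X_demo - X_demo.mean(axis=0)
t = X_centered[:, 0]  # stand-in for an orthogonal score vector
Z_demo = X_centered - np.outer(t, X_centered.T @ t / (t @ t))
score = r2x_ratio(X_demo, Z_demo)
```

Because each removed component is a projection, the ratio stays between 0 and 1 and decreases as more components are removed.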

Examples

Perform OPLS and PLS-DA on wine dataset

OPLS-processed data require only 1 PLS component. Removing 4 orthogonal components with OPLS before a 1-component PLS-DA improves leave-one-out accuracy from 95% to 100% and DQ2 (3) from 0.76 to 0.84.

import pandas as pd
import numpy as np
from pyopls import OPLS
from sklearn.datasets import load_wine
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.metrics import r2_score, accuracy_score

wine_data = load_wine()
df = pd.DataFrame(wine_data['data'], columns=wine_data['feature_names'])
df['classification'] = wine_data['target']
df = df[df.classification.isin((0, 1))]  # keep only classes 0 and 1
target = df.classification.apply(lambda x: 1 if x else -1)  # discriminant for class 1 vs class 0
X = df[[c for c in df.columns if c != 'classification']]

# Remove 4 target-orthogonal components from X
opls = OPLS(4)
Z = opls.fit_transform(X, target)

# 1-component PLS-DA on the unprocessed data, leave-one-out cross-validation
pls = PLSRegression(1)
y_pred = cross_val_predict(pls, X, target, cv=LeaveOneOut())
q_squared = r2_score(target, y_pred)  # 0.733
dq_squared = r2_score(target, np.clip(y_pred, -1, 1))  # 0.759
accuracy = accuracy_score(target, np.sign(y_pred))  # 0.954

# The same model on the OPLS-processed data
processed_y_pred = cross_val_predict(pls, Z, target, cv=LeaveOneOut())
processed_q_squared = r2_score(target, processed_y_pred)  # 0.836
processed_dq_squared = r2_score(target, np.clip(processed_y_pred, -1, 1))  # 0.838
processed_accuracy = accuracy_score(target, np.sign(processed_y_pred))  # 1.0

r2_X = opls.score(X)  # 0.0000320 (most variance is removed)
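The difference between Q2 and the clipped score in the example above is easiest to see on a few hand-picked numbers. The sketch below mirrors the `np.clip` shortcut used in the example: predictions that overshoot the class labels of -1 and 1 are pulled back to the label before scoring, so confident correct predictions are not penalized. The values are made up for illustration, and this shortcut is not claimed to match every detail of the DQ2 definition in (3).

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


y_true = np.array([-1, -1, 1, 1])
y_pred = np.array([-1.5, -0.8, 0.9, 1.6])  # two predictions overshoot the labels

q2 = r_squared(y_true, y_pred)                    # 0.835
dq2 = r_squared(y_true, np.clip(y_pred, -1, 1))   # 0.9875
```

All four samples are classified correctly here, yet the unclipped score is pulled down by the two overshooting predictions.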

Validation

The fit() method of OPLSValidator will find the optimum number of components to remove, then evaluate the results on a 1-component sklearn.cross_decomposition.PLSRegression model. A permutation test is performed for each metric by permuting the target and for the PLS loadings by permuting the features.

The following snippet determines the best number of components to remove, then plots the ROC curves of the classifiers on processed and unprocessed data and the PLS scores against the OPLS orthogonal scores for the samples.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pyopls import OPLSValidator
from sklearn.datasets import load_wine
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict, LeaveOneOut
from sklearn.metrics import roc_curve, roc_auc_score

wine_data = load_wine()
df = pd.DataFrame(wine_data['data'], columns=wine_data['feature_names'])
df['classification'] = wine_data['target']
# Copy the subset so that columns can be added to it later without a pandas warning
df = df[df.classification.isin((0, 1))].copy()
target = df.classification.apply(lambda x: 1 if x else -1)  # discriminant for class 1 vs class 0
X = df[[c for c in df.columns if c != 'classification']]

validator = OPLSValidator(k=-1).fit(X, target)

Z = validator.opls_.transform(X)
pls = PLSRegression(1)
y_pred = cross_val_predict(pls, X, target, cv=LeaveOneOut())
processed_y_pred = cross_val_predict(pls, Z, target, cv=LeaveOneOut())

fpr, tpr, thresholds = roc_curve(target, y_pred)
roc_auc = roc_auc_score(target, y_pred)
proc_fpr, proc_tpr, proc_thresholds = roc_curve(target, processed_y_pred)
proc_roc_auc = roc_auc_score(target, processed_y_pred)

plt.figure(0)
plt.plot(fpr, tpr, lw=2, color='blue', label=f'Unprocessed (AUC={roc_auc:.4f})')
plt.plot(proc_fpr, proc_tpr, lw=2, color='red',
         label=f'{validator.n_components_}-component OPLS (AUC={proc_roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()

plt.figure(1)
df['t'] = validator.pls_.x_scores_[:, 0]  # predictive scores
df['t_ortho'] = validator.opls_.T_ortho_[:, 0]  # first orthogonal scores
pos_df = df[target == 1]
neg_df = df[target == -1]
plt.scatter(neg_df['t'], neg_df['t_ortho'], c='blue', label='Class 0')
plt.scatter(pos_df['t'], pos_df['t_ortho'], c='red', label='Class 1')
plt.title('PLS Scores')
# Axis labels match the plotted data: t on x, t_ortho on y
plt.xlabel('t')
plt.ylabel('t_ortho')
plt.legend(loc='lower right')
plt.show()

ROC Curve

[Figure: ROC curve]

Scores Plot

[Figure: scores plot]
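The component search described in the Validation section can be pictured as a cross-validated sweep over the number of removed components. The sketch below is one plausible way to do it in plain NumPy: for each candidate count, an O-PLS filter is fitted on the training folds only, applied to the held-out sample, and scored with a 1-component PLS regression. All function names are hypothetical, the data are synthetic, and the sketch does not reproduce the actual search or stopping rule of OPLSValidator.

```python
import numpy as np


def fit_opls(X, y, n_components):
    """Fit an O-PLS filter; returns (x_mean, orthogonal weights, orthogonal loadings)."""
    x_mean = X.mean(axis=0)
    Xc, yc = X - x_mean, y - y.mean()
    w = Xc.T @ yc
    w /= np.linalg.norm(w)
    W_ortho, P_ortho = [], []
    for _ in range(n_components):
        t = Xc @ w
        p = Xc.T @ t / (t @ t)
        w_o = p - (w @ p) * w
        w_o /= np.linalg.norm(w_o)
        t_o = Xc @ w_o
        p_o = Xc.T @ t_o / (t_o @ t_o)
        Xc = Xc - np.outer(t_o, p_o)
        W_ortho.append(w_o)
        P_ortho.append(p_o)
    return x_mean, W_ortho, P_ortho


def apply_opls(X, model):
    """Center X with the training mean and remove the fitted orthogonal components."""
    x_mean, W_ortho, P_ortho = model
    Z = X - x_mean
    for w_o, p_o in zip(W_ortho, P_ortho):
        Z = Z - np.outer(Z @ w_o, p_o)
    return Z


def pls1_predict(Z_train, y_train, Z_test):
    """1-component PLS regression on already-centered descriptors."""
    y_mean = y_train.mean()
    w = Z_train.T @ (y_train - y_mean)
    w /= np.linalg.norm(w)
    t = Z_train @ w
    b = t @ (y_train - y_mean) / (t @ t)
    return (Z_test @ w) * b + y_mean


def loo_q2(X, y, n_components):
    """Leave-one-out Q2 of 1-component PLS after removing n_components."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        model = fit_opls(X[train], y[train], n_components)
        Z_train = apply_opls(X[train], model)
        Z_test = apply_opls(X[i:i + 1], model)
        preds[i] = pls1_predict(Z_train, y[train], Z_test)[0]
    return 1 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)


# Synthetic data: one signal direction plus a strong y-orthogonal nuisance.
rng = np.random.default_rng(0)
signal, nuisance = rng.normal(size=30), rng.normal(size=30)
X_demo = 0.3 * rng.normal(size=(30, 6))
X_demo[:, 0] += signal + 3 * nuisance
X_demo[:, 1] += signal - 3 * nuisance
X_demo[:, 2] += 2 * nuisance
y_demo = signal + 0.1 * rng.normal(size=30)

scores = {n: loo_q2(X_demo, y_demo, n) for n in range(4)}
best_n_components = max(scores, key=scores.get)
```

Fitting the filter inside each fold matters: filtering the full data set first and cross-validating afterwards would let the held-out samples influence the filter and inflate the score.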

References

  1. Johan Trygg and Svante Wold. Orthogonal projections to latent structures (O-PLS). J. Chemometrics 2002; 16: 119-128. DOI: 10.1002/cem.695
  2. Eugene Edington and Patrick Onghena. "Calculating P-Values" in Randomization tests, 4th edition. New York: Chapman & Hall/CRC, 2007, pp. 33-53. DOI: 10.1201/9781420011814.
  3. Johan A. Westerhuis, Ewoud J. J. van Velzen, Huub C. J. Hoefsloot, Age K. Smilde. Discriminant Q-squared for improved discrimination in PLSDA models. Metabolomics 2008; 4: 293-296. DOI: 10.1007/s11306-008-0126-2

