Build a PLS regression model

This example illustrates how to build a PLS model using a benchmark NIR dataset.

```
%cd /Users/hdl/Documents/Study/MATLAB/libPLS/libPLS_dev  % first go to the
%                                                        % folder of libPLS
%**************************************************
%**********   NOTES  ******************************
%** MAKE SURE to go to your OWN libPLS folder  ****
%**************************************************

clear;
% Load your own dataset here so that X (spectra, samples x wavelengths)
% and y (the property to predict) are in the workspace.
```
```
close(gcf);
plot(X');               % display spectral data
xlabel('wavelength index');
ylabel('intensity');
```

Parameter setting:

```
A=6;                    % number of latent variables (LVs)
method='center';        % internal pretreatment method of X for building the PLS model
PLS=pls(X,y,A,method);  % command to build a model
```
```
PLS
```
```
PLS =

method: 'center'
X_pretreat: [80x700 double]
y_pretreat: [80x1 double]
regcoef_pretreat: [700x1 double]
regcoef_original_all: [701x6 double]
regcoef_original: [701x1 double]
X_scores: [80x6 double]
VIP: [1x700 double]
W: [700x6 double]
Wstar: [700x6 double]
y_fit: [80x1 double]
fitError: [80x1 double]
tpscores: [80x1 double]
SR: [1x700 double]
SST: 11.4295
SSR: 11.3476
SSE: 0.0820
RMSEF: 0.0320
R2: 0.9928

```

The pls.m function returns a structure, PLS, containing a list of components. Results interpretation:

```
regcoef_original: regression coefficients that link X and y
X_scores: scores of X
VIP: variable importance in projection, a criterion for assessing the importance of variables
RMSEF: root mean squared error of fitting
y_fit: fitted values of y
R2: percentage of explained variance of y
```
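As a quick sanity check, the reported RMSEF and R2 can be recomputed from the stored fitted values with plain MATLAB (a sketch, assuming X and y are still the data that were passed to pls.m):

```
res   = y - PLS.y_fit;                          % fitting residuals
RMSEF = sqrt(mean(res.^2));                     % root mean squared error of fitting
R2    = 1 - sum(res.^2)/sum((y-mean(y)).^2);    % explained variance of y
```

These values should agree with the RMSEF and R2 fields printed above.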

K-fold cross validation of PLS

This example illustrates how to perform K-fold cross validation of PLS models.

```
clear;
A=6;                          % number of LVs
K=5;                          % fold number for cross validations
method='center';
CV=plscv(X,y,A,K,method);
```
```
The 1th group finished.
The 2th group finished.
The 3th group finished.
The 4th group finished.
The 5th group finished.
```
```
close(gcf)
plot(CV.RMSECV)               % plot RMSECV values against the number of latent variables (LVs)
xlabel('NO. of LVs')          % add x label
```

```
CV
```
```
CV =

method: 'center'
Ypred: [80x6 double]
predError: [80x6 double]
RMSECV: [0.2972 0.2505 0.1782 0.0870 0.0538 0.0369]
Q2: [0.3816 0.5606 0.7777 0.9470 0.9798 0.9905]
RMSECV_min: 0.0369
Q2_max: 0.9905
optLV: 6
note: '*** The following is based on global min MSE + 1SD'
RMSECV_min_1SD: 0.0538
Q2_max_1SD: 0.9798
optLV_1SD: 5

```

The returned value CV is a structure with a list of components. Results interpretation:

```
RMSECV: root mean squared error of cross validation; the smaller, the better
Q2: same meaning as R2 but calculated from cross validation
optLV: the number of LVs that achieves the minimal RMSECV (the highest Q2)
```
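The cross-validated choice of LVs can then be fed back into pls.m to build the final model (a sketch reusing the functions from the first example):

```
A_opt = CV.optLV;               % LV count with minimal RMSECV
% A_opt = CV.optLV_1SD;         % alternative: the more parsimonious 1SD choice
PLS   = pls(X,y,A_opt,method);  % final model with the selected number of LVs
```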

Monte Carlo cross validation (MCCV) of PLS

This example illustrates how to perform MCCV for PLS modeling. Like K-fold CV, MCCV is another approach to cross validation.

```
clear;

% parameter settings
A=6;
method='center';
N=500;                          % number of Monte Carlo samplings
MCCV=plsmccv(X,y,A,method,N);   % run MCCV
```
```
The 100th sampling for MCCV finished.
The 200th sampling for MCCV finished.
The 300th sampling for MCCV finished.
The 400th sampling for MCCV finished.
The 500th sampling for MCCV finished.
```
```
close(gcf);
plot(MCCV.RMSECV);              % plot RMSECV values against the number of latent variables (LVs)
xlabel('NO. of LVs');
ylabel('RMSECV');
```

```
MCCV
```
```
MCCV =

method: 'center'
MC_para: [500 0.8000]
Ypred: [8000x6 double]
Ytrue: [8000x1 double]
RMSECV: [0.3023 0.2547 0.1792 0.0889 0.0577 0.0385]
RMSECV_min: 0.0385
Q2: [0.3625 0.5475 0.7761 0.9449 0.9768 0.9897]
Q2_max: 0.9897
optLV: 6
RMSEF: [500x6 double]
RMSEP: [500x6 double]
note: '****** The following is based on global min MSE + 1SD'
RMSECV_min_1SD: 0.0577
Q2_max_1SD: 0.9768
optLV_1SD: 5
time: 1.4960

```

MCCV is a structure. Results interpretation:

```
Ypred: predicted values
Ytrue: real values
RMSECV: root mean squared error of cross validation; the smaller, the better
Q2: same meaning as R2 but calculated from cross validation
```
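Because Ypred stores one column per number of LVs, a predicted-versus-measured scatter plot at the optimal LV count gives a quick visual check (a sketch using only the fields listed above):

```
close(gcf);
plot(MCCV.Ytrue, MCCV.Ypred(:,MCCV.optLV), '.');   % one point per held-out prediction
xlabel('measured y');
ylabel('predicted y');
```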

Double cross validation (DCV) of PLS

This example illustrates how to perform DCV for PLS modeling. Like K-fold CV, DCV is another way to perform cross validation.

```
clear;

% parameter settings
A=6;
k=5;                                  % fold number for cross validation
method='center';
N=50;                                 % number of Monte Carlo samplings
DCV=plsdcv(X,y,A,k,method,N);
```
```
DCV
```
```
DCV =

method: 'center'
RMSECV: 0.0369
nLV: [5x1 double]
RMSEP: [5x1 double]
predError: [80x1 double]

```

Outlier detection using the Monte Carlo sampling approach

This example illustrates the use of an outlier detection method based on Monte Carlo sampling (MCS).

```
clear;
A=6;                     % number of LVs
method='center';
N=1000;                  % number of Monte Carlo samplings
ratio = 0.7;             % ratio of samples drawn in each sampling
F=mcs(X,y,A,method,N,ratio);
```
```
The 100/1000th MCS finished.
The 200/1000th MCS finished.
The 300/1000th MCS finished.
The 400/1000th MCS finished.
The 500/1000th MCS finished.
The 600/1000th MCS finished.
The 700/1000th MCS finished.
The 800/1000th MCS finished.
The 900/1000th MCS finished.
The 1000/1000th MCS finished.
```
```
F
```
```
F =

pretreat: 'center'
MCS_parameter: [1000 0.7000]
predError: [1000x80 double]
MEAN: [1x80 double]
STD: [1x80 double]

```

Results interpretation:

```
predError: prediction errors of the samples in each sampling
MEAN: mean prediction error of each sample
STD: standard deviation of the prediction errors of each sample
```
```
close(gcf)
plotmcs(F)       % diagnostic plots
```

Note: samples with a high MEAN or a high STD are more likely to be outliers and should be considered for removal before modeling.
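One simple way to turn the diagnostic plot into a concrete list of suspects is to threshold MEAN and STD; the three-sigma cutoffs below are illustrative only, not part of libPLS:

```
cutM = mean(F.MEAN) + 3*std(F.MEAN);            % illustrative cutoff on mean error
cutS = mean(F.STD)  + 3*std(F.STD);             % illustrative cutoff on error spread
suspects = find(F.MEAN > cutM | F.STD > cutS)   % candidate outlier samples
```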

Variable selection using the CARS method

```
clear;
A=6;                      % number of LVs
fold=5;                   % fold number for cross validation
CARS=carspls(X,y,A,fold);
```
```
Sampling...
Evaluating...
Done...
```
```
CARS
```
```
CARS =

W: [700x50 double]
time: 0.4639
RMSECV: [1x50 double]
RMSECV_min: 2.9460e-04
Q2_max: 1.0000
iterOPT: 49
optLV: 2
ratio: [1x50 double]
vsel: [405 505]

```

Results interpretation:

```
optLV: number of LVs of the best model
vsel: selected variables (columns in X)
```
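The selected columns can be used directly to rebuild a reduced PLS model (a sketch reusing pls.m from the first example):

```
Xsel = X(:, CARS.vsel);                    % keep only the CARS-selected variables
PLS  = pls(Xsel, y, CARS.optLV, 'center'); % reduced model with the optimal LV count
```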
```
close(gcf)
plotcars(CARS); % diagnostic plots
```

Note: in this plot, the top and middle panels show how the number of selected variables and the RMSECV change with iterations. The bottom panel shows how the regression coefficient of each variable (each line corresponds to one variable) changes with iterations. The starred vertical line indicates the best model, i.e., the one with the lowest RMSECV.

Variable selection using moving window PLS (MWPLS)

```
clear;
A=6;
width=15;                           % window size
[WP,RMSEF]=mwpls(X,y,A,width);
```
```
close(gcf);
plot(WP,RMSEF);
xlabel('Window position');
ylabel('RMSEF');
```

Note: in this plot, regions with low RMSEF values are recommended for inclusion in a PLS model.

Variable selection using Monte Carlo Uninformative Variable Elimination (MCUVE)

```
clear;
A=6;                     % number of LVs
N=500;                   % number of Monte Carlo samplings
method='center';
UVE=mcuvepls(X,y,A,method,N);
```
```
The 100/500th sampling for MC-UVE finished.
The 200/500th sampling for MC-UVE finished.
The 300/500th sampling for MC-UVE finished.
The 400/500th sampling for MC-UVE finished.
The 500/500th sampling for MC-UVE finished.
```
```
UVE
```
```
UVE =

RI: [1x700 double]
SortedVariable: [1x700 double]
regcoef: [500x700 double]
MEAN: [1x700 double]
SD: [1x700 double]

```
```
close(gcf);
plot(abs(UVE.RI))
```

Results interpretation: RI is the reliability index of UVE, a measure of variable importance; the higher, the better.
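To go from the plot to an explicit ranking, the variables can be sorted by the magnitude of their reliability index (plain MATLAB; keeping the top 20 is an arbitrary choice):

```
[~, order] = sort(abs(UVE.RI), 'descend');   % most reliable variables first
top20 = order(1:20)                          % indices of the 20 highest-|RI| variables
```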

Variable selection using Random Frog

```
clear;
A=6;                     % number of LVs
N=10000;                 % number of samplings
method='center';
FROG=randomfrog_pls(X,y,A,method,N);
```
```
The 1000th sampling for random frog finished.
The 2000th sampling for random frog finished.
The 3000th sampling for random frog finished.
The 4000th sampling for random frog finished.
The 5000th sampling for random frog finished.
The 6000th sampling for random frog finished.
The 7000th sampling for random frog finished.
The 8000th sampling for random frog finished.
The 9000th sampling for random frog finished.
The 10000th sampling for random frog finished.
Elapsed time is 40.095188 seconds.
```
```
FROG
```
```
FROG =

N: 10000
Q: 2
model: [10000x700 double]
minutes: 0.6683
method: 'center'
Vrank: [1x700 double]
Vtop10: [505 405 506 400 408 233 235 249 248 515]
probability: [1x700 double]
nVar: [1x10000 double]
RMSEP: [1x10000 double]

```
```
close(gcf);
plot(FROG.probability)
xlabel('variable index');
ylabel('selection probability');
```

Results interpretation:

```
model: a matrix storing the variables selected at each iteration
probability: the probability of each variable being included in a final model; the larger, the better. This is a useful metric of variable importance.
```
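A variable subset can then be obtained by thresholding the selection probability; the 0.3 cutoff below is illustrative only:

```
sel  = find(FROG.probability > 0.3);   % variables selected in more than 30% of models
Xsel = X(:, sel);                      % reduced data matrix for a final PLS model
```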