Build a PLS-LDA model

This example illustrates how to build a PLS-LDA model using a demo metabolomics dataset.

%cd /Users/hdl/Documents/Study/MATLAB/libPLS/libPLS_dev  % first change to
%  the libPLS folder.
clear;
load DM2;                     % example data
A=3;                          % number of latent variables (LVs).
method='center';              % internal pretreatment method of X for building PLS-LDA model
PLSLDA=plslda(X,y,A,method);  % command to build a model
PLSLDA
PLSLDA = 

                method: 'center'
            scale_para: [2x40 double]
                 yreal: [96x1 double]
      regcoef_pretreat: [40x1 double]
                   VIP: [1x40 double]
                 Wstar: [40x3 double]
                     W: [40x3 double]
               Xscores: [96x3 double]
             Xloadings: [40x3 double]
                   R2X: [3x1 double]
                   R2Y: [3x1 double]
              tpScores: [96x1 double]
            tpLoadings: [40x1 double]
                    SR: [1x40 double]
                   LDA: '*********  LDA ***********'
        regcoef_lda_lv: [4x1 double]
    regcoef_lda_origin: [41x1 double]
                  yfit: [96x1 double]
                 error: 0.0833
           sensitivity: 0.9500
           specificity: 0.8611
                   AUC: 0.9435

PLSLDA is a structure array storing the model. Results interpretation:

     Xscores: scores of the X matrix
     regcoef_lda_origin: regression coefficients of the variables in X; the
     last element is the intercept.
     error: fitting error on the training data.
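
To make the use of these fields concrete, the following minimal sketch (plain
MATLAB, not a libPLS function) applies the fitted coefficients to the raw data
by hand. It assumes the intercept is the last element of regcoef_lda_origin,
as noted above, and that the two classes are coded as +1/-1.

Xaug=[X ones(size(X,1),1)];             % append a column of ones for the intercept
score=Xaug*PLSLDA.regcoef_lda_origin;   % discriminant score for each sample
yhat=sign(score);                       % predicted class labels, assuming +1/-1 coding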

Visualization of a PLS-LDA model with two latent variables

This example illustrates how to build and visualize a PLS-LDA model with two latent variables using a demo metabolomics dataset.

%cd /Users/hdl/Documents/Study/MATLAB/libPLS/libPLS_dev  % first change to
%  the libPLS folder.

clear;
load DM2;                       % example data
A=2;                            % number of latent variables (LVs).
method='center';                % internal pretreatment method of X for building PLS-LDA model
PLSLDA=plslda(X,y,A,method);    % command to build a model
close(gcf)
plotlda(PLSLDA,0,0,[1,2]) % plot samples in PLS component space
close(gcf)
plotlda(PLSLDA,0,1,[1,2]) % plot samples in PLS component space with separating line shown.

Visualization of a PLS-LDA model with three latent variables

This example illustrates how to build and visualize a PLS-LDA model with three latent variables using a demo metabolomics dataset.

%cd /Users/hdl/Documents/Study/MATLAB/libPLS/libPLS_dev  % first change to
%  the libPLS folder.
clear;
load DM2;                       % example data
A=3;                            % number of latent variables (LVs).
method='center';                % internal pretreatment method of X for building PLS-LDA model
PLSLDA=plslda(X,y,A,method);    % command to build a model
close(gcf)
plotlda(PLSLDA,0,0,[1,2,3])     % with no separating plane
close(gcf)
plotlda(PLSLDA,0,1,[1,2,3])     % with separating plane

K-fold cross validation of PLS-LDA

This example illustrates how to perform K-fold cross validation of PLS-LDA models.

load DM2;                       % example data
A=6;                            % number of LVs
K=5;                            % fold number for cross validations
method='center';
CV=plsldacv(X,y,A,K,method);
The 1th fold for PLS-LDA finished.
The 2th fold for PLS-LDA finished.
The 3th fold for PLS-LDA finished.
The 4th fold for PLS-LDA finished.
The 5th fold for PLS-LDA finished.
CV
CV = 

         method: 'center'
          Ypred: [96x6 double]
              y: [96x1 double]
          error: [0.2812 0.2188 0.1146 0.0833 0.0938 0.0729]
      error_min: 0.0729
          optLV: 6
    sensitivity: 0.9500
    specificity: 0.8889
            AUC: 0.9889

Results interpretation:

     error: cross-validated misclassification error at each number of LVs.
     The smaller, the better.
     error_min: the minimal cross-validated error.
     optLV: the number of LVs that achieves the minimal cross-validated error.
close(gcf)
plot(CV.error);                 % plot cross-validated error at each number of latent variables (LVs)
xlabel('No. of LVs')            % add X label
ylabel('error')                 % add Y label
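
Once the optimal number of LVs is known, the final model can be refitted on
the full training data. A minimal sketch, reusing the plslda call from above:

finalModel=plslda(X,y,CV.optLV,method); % refit with the CV-selected number of LVs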

Monte Carlo cross validation (MCCV) of PLS-LDA

This example illustrates how to perform MCCV for PLS-LDA modeling. Like K-fold CV, MCCV is another approach to cross validation.

clear;
load DM2;                           % example data
% parameter settings
A=6;                                % number of LVs
N=500;                              % number of Monte Carlo samplings
ratio=0.6;                          % ratio of samples used for training in each sampling
MCCV=plsldamccv(X,y,A,N,ratio);     % note: no pretreatment argument is passed, so
                                    % the default ('autoscaling') is used, as the output shows
MCCV
MCCV = 

       method: 'autoscaling'
        error: [0.1619 0.0524 0.0578 0.0631 0.0636 0.0656]
    error_min: 0.0524
        optLV: 2

Results interpretation:

     error: cross-validated error at each number of LVs
     error_min: the minimal cross-validated error
     optLV: the optimal number of LVs selected by MCCV
close(gcf);
plot(MCCV.error);                   % plot cross-validated error at each number of latent variables (LVs)
xlabel('No. of LVs');
ylabel('error');
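
As a small illustration (plain MATLAB, no libPLS helper involved), the
MCCV-selected optimum can be highlighted on the error curve:

plot(1:numel(MCCV.error),MCCV.error,'-o'); hold on;
plot(MCCV.optLV,MCCV.error_min,'r*','markersize',10); % star the optimal number of LVs
hold off;
xlabel('No. of LVs'); ylabel('error');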

Variable selection using the original CARS method described in our CARS paper.

clear
load DM2;                           % example data
A=6;                                % maximal number of LVs
fold=5;                             % number of folds for cross validation
N=50;                               % number of CARS sampling runs
CARS=carsplslda(X,y,A,fold,N);
CARS
CARS = 

       method: 'autoscaling'
            W: [50x40 double]
        error: [1x50 double]
    error_min: 0.0312
      iterOPT: 14
        optLV: 6
        ratio: [1x50 double]
         vsel: [18x1 double]

Results interpretation:

     optLV: number of LVs of the best model
     vsel: selected variables (columns in X)
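
A minimal sketch of how to rebuild a model on the selected variables; it
assumes plslda accepts the same pretreatment string reported in the CARS
output above:

Xsel=X(:,CARS.vsel);                                % keep only the CARS-selected columns of X
modelSel=plslda(Xsel,y,CARS.optLV,'autoscaling');   % pretreatment matching the CARS output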

Diagnostic plots

close(gcf)
plotcars_plslda(CARS);

% Notes: in this plot, the top and middle panels show how the number of
% selected variables and the cross-validated error change with iterations.
% The bottom panel shows how the regression coefficient of each variable
% (each line corresponds to a variable) changes with iterations. The starred
% vertical line marks the best model, i.e. the one with the lowest
% cross-validated error.

Variable selection using a simplified version of CARS

In this version the built-in randomized sampling component is removed, so each run of CARS gives the same results.

clear
load DM2;                           % example data
A=6;                                % maximal number of LVs
fold=5;                             % number of folds for cross validation
N=50;                               % number of CARS sampling runs
CARS=carsplslda(X,y,A,fold,N,'center',0);   % the final 0 turns off the randomized sampling
CARS
CARS = 

       method: 'center'
            W: [50x40 double]
        error: [1x50 double]
    error_min: 0.0312
      iterOPT: 15
        optLV: 4
        ratio: [1x50 double]
         vsel: [17x1 double]

Results interpretation:

     optLV: number of LVs of the best model
     vsel: selected variables (columns in X)
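
Since the randomized sampling is removed, repeated runs should be
reproducible. A quick sketch of a check, reusing the same call as above:

CARS2=carsplslda(X,y,A,fold,N,'center',0);  % run the simplified CARS a second time
isequal(CARS.vsel,CARS2.vsel)               % returns 1 (true): the runs are deterministic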
Diagnostic plots

close(gcf)
plotcars_plslda(CARS);

Notes: in this plot, the top and middle panels show how the number of selected variables and the cross-validated error change with iterations. The bottom panel shows how the regression coefficient of each variable (each line corresponds to a variable) changes with iterations. The starred vertical line marks the best model, i.e. the one with the lowest cross-validated error.

Variable selection using Random Frog

clear;
load DM2;                              % example data
A=6;                                   % maximal number of LVs
N=2000;                                % number of random frog samplings
method='center';
FROG=randomfrog_plslda(X,y,A,method,N);
The 1000th sampling for random frog finished.
The 2000th sampling for random frog finished.
Elapsed time is 25.414775 seconds.
FROG
FROG = 

              N: 2000
              Q: 2
        minutes: 0.4236
         method: 'center'
          Vrank: [1x40 double]
         Vtop10: [33 14 12 5 11 22 26 31 8 28]
    probability: [1x40 double]
           nVar: [1x2000 double]
          RMSEP: [1x2000 double]

Results interpretation:

    Vrank: variables ranked by importance. The first one is the most
    important.
    probability: the probability of each variable being included in a
    final model. The larger, the better. This is a useful metric for
    measuring variable importance.
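
For example, a cutoff on the selection probability can be used to pick a
variable subset; the 0.5 threshold below is a hypothetical choice for
illustration, not a libPLS default:

cutoff=0.5;                            % hypothetical threshold; tune for your data
vsel=find(FROG.probability>cutoff);    % indices of the retained variables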
close(gcf)
bar(FROG.probability)
xlabel('variable index');
ylabel('selection probability');

Variable selection using Subwindow Permutation Analysis (SPA)

clear;
load DM2;                           % example data
A=6;
K=3;                                % fold for cross validation
Q=10;                               % window size, i.e., the number of variables included in a sub-model
ratio=0.6;                          % ratio of samples to be used as training data for each sub-model
N=1000;
method='center';
SPA=spa(X,y,A,K,Q,N,ratio,method);
Elapsed time is 6.014934 seconds.
SPA
SPA = 

            method: 'center'
              time: 0.1002
                 N: 1000
             ratio: 0.6000
                 Q: 10
               nLV: [1000x1 double]
               NPE: [1000x1 double]
               PPE: [1000x40 double]
             DMEAN: [40x1 double]
               DSD: [40x1 double]
          interfer: [1x40 double]
                 p: [1x40 double]
              COSS: [1x40 double]
    RankedVariable: [1x40 double]

Results interpretation:

    p: p-value measuring the significance of each variable. Note that for
    variables whose inclusion would reduce model performance, the p-value
    is increased by one, so it is a pseudo p-value here.
    COSS: COnditional Synergistic Score; this is the key output of SPA
    and is used to rank variables.
%      COSS is calculated as COSS=-log10(pvalue), so a p-value of 0.05
%      corresponds to a COSS value of 1.3 and a p-value of 0.01 to a COSS of 2.
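
Following the COSS=-log10(pvalue) relation above, variables significant at
the 0.05 level can be picked directly; a minimal sketch:

vsel=find(SPA.COSS>-log10(0.05));   % COSS > 1.3 corresponds to p < 0.05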
close(gcf)
bar(SPA.COSS,'edgecolor','none','facecolor','b')
xlim([0 41])
xlabel('variable index');
ylabel('COSS');

Variable selection using PHADIA (Phase Diagram)

clear;
load DM2;                           % example data
A=6;
K=3;                                % fold for cross validation
Q=10;                               % window size, i.e., the number of variables included in a sub-model
N=1000;
method='center';
PHADIA=phadia(X,y,A,method,N,Q,K);
The 100/1000th sampling for variable selection finished.
The 200/1000th sampling for variable selection finished.
The 300/1000th sampling for variable selection finished.
The 400/1000th sampling for variable selection finished.
The 500/1000th sampling for variable selection finished.
The 600/1000th sampling for variable selection finished.
The 700/1000th sampling for variable selection finished.
The 800/1000th sampling for variable selection finished.
The 900/1000th sampling for variable selection finished.
The 1000/1000th sampling for variable selection finished.
***>>> Evaluating variable importance...
***>>> Evaluating finished.
***>>> PS: Variable Selection based on Model Population Analysis.
Elapsed time is 4.455757 seconds.
PHADIA
PHADIA = 

               method: 'center'
    parameter_meaning: 'N Q K'
           parameters: [1000 10 3]
          TimeMinutes: 0.0743
                model: {1x40 cell}
                  nLV: [1000x1 double]
            PredError: [1000x1 double]
                DMEAN: [1x40 double]
                  DSD: [1x40 double]
                    p: [1x40 double]

Results interpretation:

     p: p-value measuring the significance of each variable.
close(gcf);
plotphadia(PHADIA);                 % a 2D diagram for visualizing predictive ability of all variables.
     Notes:
     On the 2D plot, green dots indicate 'good' variables, inclusion of
     which would on average improve model performance; red dots indicate
     'bad' variables, inclusion of which would on average degrade model
     performance; and blue dots indicate noisy variables, inclusion of
     which has no significant effect on model performance.
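
As a rough sketch of how this grouping could be reproduced from the output
fields: the 0.05 cutoff and the sign convention for DMEAN (negative meaning
that including the variable reduces prediction error) are assumptions of this
sketch, not documented libPLS behavior.

alpha=0.05;                                   % assumed significance cutoff
good=find(PHADIA.p<alpha & PHADIA.DMEAN<0);   % significant; error decreases when included (assumed sign)
bad=find(PHADIA.p<alpha & PHADIA.DMEAN>0);    % significant; error increases when included (assumed sign)
noisy=find(PHADIA.p>=alpha);                  % no significant effect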