Contents
- Build a PLS-LDA model
- Visualization of a PLS-LDA model with two latent variables
- Visualization of a PLS-LDA model with three latent variables
- K-fold cross validation of PLS-LDA
- Monte Carlo cross validation (MCCV) of PLS-LDA
- Variable selection using the original CARS method described in our CARS paper.
- Variable selection using a simplified version of CARS
- Variable selection using Random Frog
- Variable selection using Subwindow Permutation Analysis (SPA)
- Variable selection using PHADIA (Phase Diagram)
Build a PLS-LDA model
This example illustrates how to build a PLS-LDA model using the demo metabolomics data.
%cd /Users/hdl/Documents/Study/MATLAB/libPLS/libPLS_dev   % first go to the folder of libPLS
clear;
load DM2;                      % example data
A=3;                           % number of latent variables (LVs)
method='center';               % internal pretreatment method of X for building the PLS-LDA model
PLSLDA=plslda(X,y,A,method);   % command to build a model
PLSLDA
PLSLDA = 
                method: 'center'
            scale_para: [2x40 double]
                 yreal: [96x1 double]
      regcoef_pretreat: [40x1 double]
                   VIP: [1x40 double]
                 Wstar: [40x3 double]
                     W: [40x3 double]
               Xscores: [96x3 double]
             Xloadings: [40x3 double]
                   R2X: [3x1 double]
                   R2Y: [3x1 double]
              tpScores: [96x1 double]
            tpLoadings: [40x1 double]
                    SR: [1x40 double]
                   LDA: '********* LDA ***********'
        regcoef_lda_lv: [4x1 double]
    regcoef_lda_origin: [41x1 double]
                  yfit: [96x1 double]
                 error: 0.0833
           sensitivity: 0.9500
           specificity: 0.8611
                   AUC: 0.9435
PLSLDA is a struct. Results interpretation:
Xscores: scores of the X matrix.
regcoef_lda_origin: regression coefficients of the variables in X; the last element is the intercept.
error: fitting error on the training data.
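The original-space coefficients can be used directly to predict new samples. Below is a minimal sketch, not a libPLS function; Xnew here simply reuses a few training rows as stand-in new data, and the sign rule assumes the two classes are coded with opposite signs in y, so verify the class coding in your own data.

b=PLSLDA.regcoef_lda_origin;    % [41x1]: 40 coefficients plus the intercept
Xnew=X(1:5,:);                  % stand-in for new samples with the same 40 variables
yhat=Xnew*b(1:end-1)+b(end);    % linear discriminant scores
sign(yhat)                      % predicted class labels by the sign of the score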
Visualization of a PLS-LDA model with two latent variables
This example illustrates how to visualize a PLS-LDA model with two latent variables using the demo metabolomics data.
%cd /Users/hdl/Documents/Study/MATLAB/libPLS/libPLS_dev   % first go to the folder of libPLS
clear;
load DM2;                      % example data
A=2;                           % number of latent variables (LVs)
method='center';               % internal pretreatment method of X for building the PLS-LDA model
PLSLDA=plslda(X,y,A,method);   % command to build a model
close(gcf)
plotlda(PLSLDA,0,0,[1,2]) % plot samples in PLS component space

close(gcf)
plotlda(PLSLDA,0,1,[1,2]) % plot samples in PLS component space with separating line shown.

Visualization of a PLS-LDA model with three latent variables
This example illustrates how to visualize a PLS-LDA model with three latent variables using the demo metabolomics data.
%cd /Users/hdl/Documents/Study/MATLAB/libPLS/libPLS_dev   % first go to the folder of libPLS
clear;
load DM2;                      % example data
A=3;                           % number of latent variables (LVs)
method='center';               % internal pretreatment method of X for building the PLS-LDA model
PLSLDA=plslda(X,y,A,method);   % command to build a model
close(gcf)
plotlda(PLSLDA,0,0,[1,2,3]) % with no separating plane

close(gcf)
plotlda(PLSLDA,0,1,[1,2,3]) % with separating plane

K-fold cross validation of PLS-LDA
Illustrate how to perform K-fold cross validation of PLS-LDA models
load DM2;          % example data
A=6;               % number of LVs
K=5;               % fold number for cross validation
method='center';
CV=plsldacv(X,y,A,K,method);
The 1th fold for PLS-LDA finished.
The 2th fold for PLS-LDA finished.
The 3th fold for PLS-LDA finished.
The 4th fold for PLS-LDA finished.
The 5th fold for PLS-LDA finished.
CV
CV = 
         method: 'center'
          Ypred: [96x6 double]
              y: [96x1 double]
          error: [0.2812 0.2188 0.1146 0.0833 0.0938 0.0729]
      error_min: 0.0729
          optLV: 6
    sensitivity: 0.9500
    specificity: 0.8889
            AUC: 0.9889
Results interpretation:
error: cross-validated classification errors at each number of LVs. The smaller, the better.
error_min: the minimal cross-validated error.
optLV: the number of LVs that achieves the minimal cross-validated error.
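A natural next step is to refit the model at the cross-validated optimum; a minimal sketch reusing the plslda call from the first section:

Aopt=CV.optLV;                   % 6 in the run above
FINAL=plslda(X,y,Aopt,method);   % refit at the optimal number of LVs
FINAL.error                      % fitting error of the refitted model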
close(gcf)
plot(CV.error);        % plot cross-validated errors at each number of latent variables (LVs)
xlabel('NO. of LVs')   % add X label
ylabel('error')        % add Y label

Monte Carlo cross validation (MCCV) of PLS-LDA
Illustrate how to perform MCCV for PLS-LDA modeling. Like K-fold CV, MCCV is another approach to cross validation.
clear;
load DM2;          % example data
% parameter settings
A=6;
method='center';   % note: not passed to plsldamccv below, so its default ('autoscaling') is used
N=500;             % number of Monte Carlo samplings
ratio=0.6;
MCCV=plsldamccv(X,y,A,N,ratio);
MCCV
MCCV = 
       method: 'autoscaling'
        error: [0.1619 0.0524 0.0578 0.0631 0.0636 0.0656]
    error_min: 0.0524
        optLV: 2
Results interpretation:
error: cross-validated errors at each number of LVs.
error_min: the minimal error.
optLV: the optimal number of LVs selected by MCCV.
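The K-fold and MCCV error curves can also be overlaid for comparison; a minimal sketch (plsldacv is re-run here because the clear at the top of this section removed the earlier CV result):

CV=plsldacv(X,y,A,5,'center');   % re-run 5-fold cross validation on the same data
figure; hold on
plot(CV.error,'b-o');            % K-fold CV errors
plot(MCCV.error,'r-s');          % MCCV errors
xlabel('NO. of LVs'); ylabel('error');
legend('5-fold CV','MCCV');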
close(gcf);
plot(MCCV.error);      % plot cross-validated errors at each number of latent variables (LVs)
xlabel('NO. of LVs');
ylabel('error');

Variable selection using the original CARS method described in our CARS paper.
clear
load DM2;     % example data
A=6;
fold=5;
N=50;
CARS=carsplslda(X,y,A,fold,N);
CARS
CARS = 
       method: 'autoscaling'
            W: [50x40 double]
        error: [1x50 double]
    error_min: 0.0312
      iterOPT: 14
        optLV: 6
        ratio: [1x50 double]
         vsel: [18x1 double]
Results interpretation:
optLV: number of LVs of the best model.
vsel: selected variables (columns in X).
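The selected variables can be used to build a reduced PLS-LDA model; a minimal sketch (the 'center' pretreatment is an illustrative choice here):

Xsel=X(:,CARS.vsel);                          % keep only the selected columns
REDUCED=plslda(Xsel,y,CARS.optLV,'center');   % refit PLS-LDA on the selection
REDUCED.error                                 % fitting error on training data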
Diagnostic plots
close(gcf)
plotcars_plslda(CARS);
% Notes: in this plot, the top and middle panels show how the number of
% selected variables and RMSECV change with iterations. The bottom panel
% shows how the regression coefficient of each variable (each line
% corresponds to a variable) changes with iterations. The starred vertical
% line indicates the best model, i.e. the one with the lowest RMSECV.

Variable selection using a simplified version of CARS
In this simplified version, the built-in randomized sampling component is removed, so each run of CARS gives the same results.
clear
load DM2;     % example data
A=6;
fold=5;
N=50;
CARS=carsplslda(X,y,A,fold,N,'center',0);
CARS
CARS = 
       method: 'center'
            W: [50x40 double]
        error: [1x50 double]
    error_min: 0.0312
      iterOPT: 15
        optLV: 4
        ratio: [1x50 double]
         vsel: [17x1 double]
Results interpretation:
optLV: number of LVs of the best model.
vsel: selected variables (columns in X).
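Since the randomized sampling component is removed, repeated runs should return identical selections; a minimal sketch to check this:

CARS1=carsplslda(X,y,A,fold,N,'center',0);
CARS2=carsplslda(X,y,A,fold,N,'center',0);
isequal(CARS1.vsel,CARS2.vsel)   % expected to print 1 (true) for this deterministic variant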
Diagnostic plots
close(gcf)
plotcars_plslda(CARS);

Notes: in this plot, the top and middle panels show how the number of selected variables and RMSECV change with iterations. The bottom panel shows how the regression coefficient of each variable (each line corresponds to a variable) changes with iterations. The starred vertical line indicates the best model with the lowest RMSECV.
Variable selection using Random Frog
clear;
load DM2;     % example data
A=6;
N=2000;
method='center';
FROG=randomfrog_plslda(X,y,A,method,N);
The 1000th sampling for random frog finished.
The 2000th sampling for random frog finished.
Elapsed time is 25.414775 seconds.
FROG
FROG = 
              N: 2000
              Q: 2
        minutes: 0.4236
         method: 'center'
          Vrank: [1x40 double]
         Vtop10: [33 14 12 5 11 22 26 31 8 28]
    probability: [1x40 double]
           nVar: [1x2000 double]
          RMSEP: [1x2000 double]
Results interpretation:
Vrank: variables ranked by importance; the first is the most important.
probability: the probability of each variable being included in a final model. The larger, the better; this is a useful metric of variable importance.
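A simple way to turn the probabilities into a selection is to apply a cutoff; a minimal sketch (the 0.5 threshold is an illustrative choice, not a libPLS default):

cutoff=0.5;                             % illustrative threshold; tune per dataset
vsel=find(FROG.probability>cutoff);     % indices of the selected columns of X
REDUCED=plslda(X(:,vsel),y,A,method);   % refit PLS-LDA, assuming at least A variables pass
REDUCED.error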
close(gcf)
bar(FROG.probability)
xlabel('variable index');
ylabel('selection probability');

Variable selection using Subwindow Permutation Analysis (SPA)
clear;
load DM2;          % example data
A=6;
K=3;               % fold for cross validation
Q=10;              % window size, i.e. the number of variables included in a sub-model
ratio=0.6;         % ratio of samples used as training data for each sub-model
N=1000;
method='center';
SPA=spa(X,y,A,K,Q,N,ratio,method);
Elapsed time is 6.014934 seconds.
SPA
SPA = 
            method: 'center'
              time: 0.1002
                 N: 1000
             ratio: 0.6000
                 Q: 10
               nLV: [1000x1 double]
               NPE: [1000x1 double]
               PPE: [1000x40 double]
             DMEAN: [40x1 double]
               DSD: [40x1 double]
          interfer: [1x40 double]
                 p: [1x40 double]
              COSS: [1x40 double]
    RankedVariable: [1x40 double]
Results interpretation:
p: p-value measuring the significance of each variable. Note that for variables whose inclusion would reduce model performance, the p-value is increased by one, so it is a pseudo p-value here.
COSS: COnditional Synergistic Score. This is the key output of SPA and is used to rank variables.
COSS is calculated as COSS = -log10(p-value), so a p-value of 0.05 corresponds to a COSS value of 1.3, and a p-value of 0.01 corresponds to a COSS of 2.
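To list the most informative variables, the COSS scores can simply be sorted; a minimal sketch:

[~,idx]=sort(SPA.COSS,'descend');   % rank variables by COSS, largest first
top10=idx(1:10)                     % the 10 highest-COSS variables
significant=find(SPA.p<0.05)        % variables significant at the 0.05 level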
close(gcf)
bar(SPA.COSS,'edgecolor','none','facecolor','b')
xlim([0 41])
xlabel('variable index');
ylabel('COSS');

Variable selection using PHADIA (Phase Diagram)
clear;
load DM2;          % example data
A=6;
K=3;               % fold for cross validation
Q=10;              % window size, i.e. the number of variables included in a sub-model
ratio=0.6;         % ratio of samples used as training data for each sub-model
N=1000;
method='center';
PHADIA=phadia(X,y,A,method,N,Q,K);
The 100/1000th sampling for variable selection finished.
The 200/1000th sampling for variable selection finished.
The 300/1000th sampling for variable selection finished.
The 400/1000th sampling for variable selection finished.
The 500/1000th sampling for variable selection finished.
The 600/1000th sampling for variable selection finished.
The 700/1000th sampling for variable selection finished.
The 800/1000th sampling for variable selection finished.
The 900/1000th sampling for variable selection finished.
The 1000/1000th sampling for variable selection finished.
***>>> Evaluating variable importance...
***>>> Evaluating finished.
***>>> PS: Variable Selection based on Model Population Analysis.
Elapsed time is 4.455757 seconds.
PHADIA
PHADIA = 
               method: 'center'
    parameter_meaning: 'N Q K'
           parameters: [1000 10 3]
          TimeMinutes: 0.0743
                model: {1x40 cell}
                  nLV: [1000x1 double]
            PredError: [1000x1 double]
                DMEAN: [1x40 double]
                  DSD: [1x40 double]
                    p: [1x40 double]
Results interpretation:
p: p-value measuring the significance of each variable.
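As with SPA, the p-values can be thresholded to obtain a selection; a minimal sketch (the 0.05 cutoff is an illustrative choice):

vsel=find(PHADIA.p<0.05);               % indices of significant variables
REDUCED=plslda(X(:,vsel),y,A,method);   % refit PLS-LDA on the selection
REDUCED.error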
close(gcf);
plotphadia(PHADIA); % a 2D diagram for visualizing predictive ability of all variables.

Notes: on the 2D plot, green dots indicate good variables, inclusion of which would on average increase model performance; red dots indicate 'bad' variables, inclusion of which would on average decrease model performance; blue dots indicate noisy variables, inclusion of which has no significant effect on model performance.