This library provides a set of easy-to-use functions for building partial least squares (PLS) regression (PLSR) and discriminant analysis (PLS-DA) models, as well as for evaluating their predictive performance. To help build reliable models, we have also implemented a number of commonly used outlier detection and variable selection methods that can be used to "clean" your data by removing potential outliers and keeping only a subset of selected variables.
The algorithms in the current version cover:
Partial Least Squares (PLS)
Linear Discriminant Analysis (LDA)
Leave-one-out cross validation (LOOCV)
K-fold cross validation
Double cross validation (DCV)
Monte Carlo cross validation (MCCV)
Repeated double cross validation (RDCV)
Using an independent test set
The Monte Carlo method (for outlier detection)
Variable Importance in Projection (VIP)
Target Projection (TP)
Uninformative Variable Elimination (UVE, also MC-UVE)
Competitive Adaptive Reweighted Sampling (CARS-PLS, CARS-PLSDA)
Random Frog (coupled with PLS or PLS-DA)
Subwindow Permutation Analysis (coupled with PLS-DA)
Moving Window Partial Least Squares (MWPLS)
libPLS is currently written in MATLAB 7.10.0 (R2010a). The source code of libPLS is freely available. The library is free for scientific use but not for commercial purposes.
It is assumed that you have a basic working knowledge of MATLAB. First, download the current version of libPLS and unzip it. Then, set your current working folder to "libPLS_1.0" or add "libPLS_1.0" to the MATLAB search path. After this, you are all set. The example data corn_m51.mat included in this library will be used to demonstrate the usage of the functions. To learn how to use a function in this library, you are encouraged to open its source file and see what the inputs and outputs mean.
First, load the example near infrared (NIR) data into the MATLAB workspace by typing the following in the COMMAND window, and press Enter:
load corn_m51;
You will then see two variables in the MATLAB workspace as shown above. One is X of size 80x700 (NIR spectra of 80 corn samples measured at 700 wavelength channels); the other is y of size 80x1 (moisture content of the 80 samples). To get a taste of this data, you may want to see what the NIR spectra look like. Running the following code will give you a plot of all 80 spectra, as shown in the right panel of the figure above:
plot(X');
set(gcf,'color','w');
xlabel('wavelength');
ylabel('intensity');
To build a calibration model that can predict a corn sample's moisture content from its spectrum, and to evaluate that model's performance, you first want to partition the data into a calibration set and a test set. There are actually 2^80-1 possible partitions in this case, and it is very hard to tell which partition is optimal. From a statistical perspective, we hope that the samples in the calibration set are "representative". Here, we use the well-established Kennard-Stone (KS) algorithm for data partitioning:
Rank=ks(X); %+++ Data partition using the Kennard-Stone algorithm
Xcal=X(Rank(1:60),:);
ycal=y(Rank(1:60),:);
Xtest=X(Rank(61:80),:);
ytest=y(Rank(61:80),:);
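For intuition, the KS algorithm starts from the two most mutually distant samples and then repeatedly adds the sample farthest from those already selected. A hedged NumPy sketch on random toy data (illustrative only, not the ks.m implementation):

```python
import numpy as np

def kennard_stone(X, n_select):
    """Rank samples by the Kennard-Stone criterion (illustrative sketch)."""
    # Pairwise Euclidean distances between all samples.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start from the two most distant samples.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select and remaining:
        # Add the sample whose nearest selected neighbour is farthest away.
        dmin = D[np.ix_(remaining, selected)].min(axis=1)
        nxt = remaining[int(np.argmax(dmin))]
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # toy data standing in for the spectra
rank = kennard_stone(X, 20)           # full ranking, analogous to Rank=ks(X)
cal_idx, test_idx = rank[:15], rank[15:]
```

Because the first-selected samples span the data space well, taking the head of the ranking as the calibration set tends to make it representative.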
As a result, you will get a calibration set and a test set, containing 60 and 20 samples, respectively. The optimal number of latent variables (LVs) needs to be determined before building a PLS model. There are various ways to do this. Here, 10-fold cross validation is used to give a reasonable choice (with the maximal number of LVs limited to 15):
CV=plscv(Xcal,ycal,15,10); % by default, 'center' is used for data pretreatment inside plscv.m
plot(CV.RMSECV,'bo-','linewidth',2);
xlabel('number of latent variables');
ylabel('RMSECV');
set(gcf,'color','w');
Running the above code produces a figure displaying RMSECV (Root Mean Squared Error of Cross Validation) against the number of LVs, as follows:
From this figure, it can be seen that the RMSECV value decreases quickly at first and then shows little change after 10 components. So, using 10 components to build a PLS model may be a reasonable choice. In addition, you can type "CV" in the command window to get the detailed output from cross validation, which may be useful for model assessment, as shown below:
CV = 
        method: 'center'
         Ypred: [60x15 double]
     predError: [60x15 double]
        RMSECV: [1x15 double]
    RMSECV_min: 0.0117
            Q2: [1x15 double]
        Q2_max: 0.9991
         optLV: 15
where optLV is the suggested optimal number of latent variables for PLS modeling, determined by the lowest RMSECV, and Q2_max is the corresponding Q2. Note that there are different ways of choosing the optimal number of PLS components. Choosing optLV as the number associated with the lowest RMSECV is not necessarily optimal, just a reasonable option.
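For readers who want to see the mechanics behind RMSECV: it is simply the root mean squared error of the pooled out-of-fold predictions. A hedged NumPy sketch of 10-fold cross validation on toy data (ordinary least squares stands in for PLS; this is not the plscv.m code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)   # toy linear data with noise

K = 10
folds = np.array_split(rng.permutation(n), K)           # random K-fold partition
yhat = np.empty(n)
for test_idx in folds:
    train = np.setdiff1d(np.arange(n), test_idx)
    # mean-centering ('center' pretreatment), estimated on the training fold only
    xm, ym = X[train].mean(axis=0), y[train].mean()
    b, *_ = np.linalg.lstsq(X[train] - xm, y[train] - ym, rcond=None)
    yhat[test_idx] = (X[test_idx] - xm) @ b + ym        # out-of-fold prediction

rmsecv = np.sqrt(np.mean((y - yhat) ** 2))
```

In plscv.m the inner model is PLS with 1..A components, so you get one RMSECV per number of LVs instead of a single value.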
Suppose that you choose 10 LVs (it is also reasonable if you choose 9 or 12 or 15), then you can build a PLS model by simply running:
PLS=pls(Xcal,ycal,10,'center'); %+++ Build a PLS model with 10 components, pretreat method is 'center'.
Type "PLS" in the command window and press Enter, and you will see a structure containing a number of fields that you may be interested in, such as scores (X_scores), loadings (X_loadings), weights (W), W* (Wstar), regression coefficients (regcoef_pretreat) for the pretreated data (X_pretreat, y_pretreat), regression coefficients (regcoef_original) for the original input data (Xcal, ycal), R2, fitting error (RMSEF), variable importance in projection (VIP), selectivity ratio (SR), etc.:
>> PLS
PLS = 
                  method: 'center'
              X_pretreat: [60x700 double]
              y_pretreat: [60x1 double]
        regcoef_pretreat: [700x1 double]
    regcoef_original_all: [701x10 double]
        regcoef_original: [701x1 double]
                X_scores: [60x10 double]
              X_loadings: [700x10 double]
                     VIP: [700x1 double]
                       W: [700x10 double]
                   Wstar: [700x10 double]
                   y_fit: [60x1 double]
                fitError: [60x1 double]
                tpscores: [60x1 double]
              tploadings: [700x1 double]
                      SR: [1x700 double]
                     SST: 9.3201
                     SSR: 9.3105
                     SSE: 0.0095
                   RMSEF: 0.0126
                      R2: 0.9990
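Among these fields, VIP follows the standard projection-based formula of Wold et al. A hedged NumPy sketch with hypothetical weights W, X-scores T, and y-loadings q (illustrative, not the library's implementation):

```python
import numpy as np

def vip(W, T, q):
    """VIP_j = sqrt( p * sum_a ssy_a * (w_ja / ||w_a||)^2 / sum_a ssy_a )."""
    p, A = W.shape
    ssy = (q ** 2) * np.sum(T ** 2, axis=0)   # y-variance explained by each component
    Wn = W / np.linalg.norm(W, axis=0)        # normalise each weight vector
    return np.sqrt(p * (Wn ** 2 @ ssy) / ssy.sum())

# hypothetical quantities for a 5-component model on 700 variables, 60 samples
rng = np.random.default_rng(7)
W = rng.normal(size=(700, 5))
T = rng.normal(size=(60, 5))
q = rng.normal(size=5)
v = vip(W, T, q)
```

By construction the mean of VIP^2 over variables equals 1, which is why VIP > 1 is the usual rule of thumb for flagging above-average importance.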
Generally, model validation can be done using either an independent test set or cross validation. Both have pros and cons; there is no free lunch. The test set generated above can be used to evaluate the built PLS model using the following commands:
[ypred,RMSEP]=plsval(PLS,Xtest,ytest); %+++ make predictions on the test set
figure;
plot(ytest,ypred,'.',ytest,ytest,'r-');
xlabel('experimental');
ylabel('predicted');
set(gcf,'color','w');
From the generated figure shown above, you can visually inspect the agreement between the experimental and predicted values. But you may also want a quantitative measure of the predictive performance. RMSEP (Root Mean Squared Error of Prediction) is such a metric. Type "RMSEP" in the command window, and you will see:
>> RMSEP
RMSEP =
    0.0152
which is, as expected, higher than the RMSEF value (0.0126) of the calibration data.
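RMSEP is computed with the same formula as RMSEF; only the sample set differs (fitted calibration samples versus held-out test samples). With hypothetical numbers:

```python
import numpy as np

# hypothetical experimental and predicted values for three test samples
ytest = np.array([9.8, 10.1, 10.4])
ypred = np.array([9.9, 10.0, 10.6])
rmsep = np.sqrt(np.mean((ytest - ypred) ** 2))   # ≈ 0.141
```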
When you do not have enough samples, setting aside some of them as a test set may not be a good choice, since it comes at the cost of a smaller calibration set that may contain less information than the whole data. In this case, a model can be evaluated through cross validation, which estimates model performance using all samples. The plscv.m function introduced above implements K-fold cross validation; when K=n, it reduces to LOOCV. Here, two other methods, MCCV and RDCV, will be introduced. The code for running MCCV is:
A=15;
method='center';
N=1000;     % number of Monte Carlo samplings
ratio=0.6;  % ratio of samples used as the sub-calibration set inside MCCV
MCCV=plsmccv(Xcal,ycal,A,method,N,ratio);
Type "MCCV" in the command window, and you will get:
>> MCCV
MCCV = 
        method: 'center'
       MC_para: [1000 0.6000]
        RMSECV: [1x15 double]
    RMSECV_min: 0.0148
            Q2: [1x15 double]
        Q2_max: 0.9986
         optLV: 15
         RMSEF: [1000x15 double]
         RMSEP: [1000x15 double]
          time: 23.3661
You can plot RMSECV or Q2 to see how model performance evolves as the number of latent variables increases. In the output of MCCV, we also provide the RMSEP values corresponding to the 1000 sub-test sets generated, one at each iteration of MCCV; it can be instructive to check the distribution of these RMSEP values.
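The distribution check is easy to do once you have one RMSEP per Monte Carlo split. A hedged NumPy sketch of the MCCV resampling loop on toy data (ordinary least squares stands in for PLS; this is not the plsmccv.m code):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)   # toy linear data

N, ratio = 200, 0.6            # fewer splits than the text's 1000, for speed
ncal = int(round(ratio * n))
rmsep = np.empty(N)
for i in range(N):
    perm = rng.permutation(n)
    cal, val = perm[:ncal], perm[ncal:]                 # random sub-calibration / sub-test split
    b, *_ = np.linalg.lstsq(X[cal], y[cal], rcond=None)
    rmsep[i] = np.sqrt(np.mean((y[val] - X[val] @ b) ** 2))
# rmsep now holds one value per split; inspect e.g. np.percentile(rmsep, [5, 50, 95])
```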
For RDCV, you can run the following codes:
A=15;
K=5;
method='center';
N=50;  % number of repeats
RDCV=plsrdcv(Xcal,ycal,A,K,method,N);
Take a look at the output in RDCV:
>> RDCV
RDCV = 
       method: 'center'
          nLV: [250x1 double]
    predError: [3000x1 double]
        RMSEP: [250x1 double]
The results from RDCV contain 250 (K*N) optimal LV values, 3000 (60*N, where 60 is the sample size) prediction errors, and 250 (K*N) RMSEP values. Looking at the distributions of these metrics can tell you a lot about model variability. We also implemented DCV in this package, but will not describe it here since it is a special case of RDCV in which the number of repeats is set to 1.
Real-world data may not be "clean". They may contain outliers that can make a model inaccurate or even useless. There are two types of methods for handling outliers: outlier detection and robust modeling. There are many methods for detecting outliers, such as leverage (h-values), Mahalanobis distance, etc. What we have implemented here is the Monte Carlo (MC) method. Run the following code:
%+++ Outlier detection
ycal_error=ycal;
ycal_error(1:3)=ycal_error(1:3)*1.2;  % manually make 3 outliers in y-values
MCS=mcs(Xcal,ycal_error,10,'center',1000,0.6)
figure;
plotmcs(MCS);
set(gcf,'color','w');
And you will obtain a diagnostic figure as below:
From this figure, it can be seen that the first three samples have a much larger mean prediction error than the rest, indicating their abnormality. Outliers, once confirmed, should be removed from the training data, but removing samples as outliers should always be done carefully.
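The idea behind the MC method can be sketched as: repeatedly split the data, predict the left-out samples, and accumulate each sample's prediction errors; samples with consistently large errors are outlier candidates. A hedged NumPy sketch on toy data (ordinary least squares stands in for PLS; this is not the mcs.m code):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 10 + 0.05 * rng.normal(size=n)  # moisture-like values around 10
y[:3] = y[:3] * 1.2                    # manufacture 3 y-outliers, as in the text

Xa = np.hstack([X, np.ones((n, 1))])   # add an intercept column
N, ratio = 500, 0.6
ncal = int(round(ratio * n))
err_sum = np.zeros(n)
err_cnt = np.zeros(n)
for _ in range(N):
    perm = rng.permutation(n)
    cal, val = perm[:ncal], perm[ncal:]
    b, *_ = np.linalg.lstsq(Xa[cal], y[cal], rcond=None)
    err_sum[val] += np.abs(y[val] - Xa[val] @ b)   # accumulate out-of-sample errors
    err_cnt[val] += 1
mean_err = err_sum / err_cnt           # large mean error flags an outlier candidate
```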
Variable selection is very important for building a parsimonious, high-performance model, especially in the "large p, small n" situation, where a large portion of the measured variables can be irrelevant to the property being modeled; such variables only interfere with modeling. The VIP, SR, and loadings already introduced in Section 3.1 can be used for variable selection. Here we will briefly introduce the Monte Carlo version of UVE (MCUVE), CARS, Random Frog, and MWPLS.
The following code is for running MCUVE:
A=15;
method='center';
N=1000;
ratio=0.75;
UVE=mcuvepls(Xcal,ycal,A,method,N,ratio);
plot(abs(UVE.RI),'linewidth',2);
xlabel('variable index');
ylabel('reliability index');
set(gcf,'color','w');
A figure displaying the reliability index (RI) of each variable will be generated, as follows. The RI value can be used to assess variable importance, and variables with low RI can therefore be eliminated. (Notice that in the original UVE method, noise variables were introduced as a reference for obtaining the threshold for eliminating variables; noise variables are not used in MCUVE.)
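The RI statistic is commonly computed as the mean of a variable's regression coefficient across the Monte Carlo sub-models divided by its standard deviation: stable, consistently non-zero coefficients give a large |RI|. A hedged NumPy sketch on toy data (ordinary least squares stands in for PLS; this is not the mcuvepls.m code):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 10
X = rng.normal(size=(n, p))
# only the first 3 variables carry information about y
y = X[:, :3] @ np.array([1.0, -2.0, 1.5]) + 0.1 * rng.normal(size=n)

N, ratio = 500, 0.75
ncal = int(round(ratio * n))
B = np.empty((N, p))                   # coefficients of each Monte Carlo sub-model
for i in range(N):
    cal = rng.permutation(n)[:ncal]
    B[i], *_ = np.linalg.lstsq(X[cal], y[cal], rcond=None)
RI = B.mean(axis=0) / B.std(axis=0)    # reliability index per variable
```

On this toy data, |RI| for the three informative variables dwarfs that of the noise variables, which is the behaviour the MCUVE plot makes visible.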
Competitive Adaptive Reweighted Sampling (CARS) is a "survival of the fittest"-based variable selection method, which usually singles out a small number of variables for modeling. Run the following code:
%+++ CARS-PLS for variable selection
A=18;  % A can be set to a large number
K=5;   % fold number of CV for model validation
N=50;  %+++ number of sampling iterations
CARS=carspls(Xcal,ycal,A,K,method,N);
plotcars(CARS);
set(gcf,'color','w');
You will have a figure as below:
In the lower panel of this figure, you can identify the optimal subset of variables as the one with the lowest RMSECV value, as indicated by the asterisk line. Type "CARS" in the command window, and you will get its detailed output as below, where 'vsel' stores the optimal variable set (notice that since CARS involves random sampling, your results may not be exactly the same as those shown below):
>> CARS
CARS = 
             W: [700x50 double]
          time: 3.1852
        RMSECV: [1x50 double]
    RMSECV_min: 2.9460e-04
        Q2_max: 1.0000
       iterOPT: 48
         optLV: 2
         ratio: [1x50 double]
          vsel: [405 505]
Random Frog borrows ideas from reversible jump MCMC and is very efficient for variable selection on high-dimensional data, since it uses only a small number of variables for modeling at each iteration. It outputs a selection probability for each variable, which describes that variable's importance. Random Frog also belongs to the category of model population analysis (MPA) approaches. Run the code below:
A=10;
method='center';
N=2000;  %+++ N is set small here only for demo; usually it is set to 10000 or larger
Q=2;     %+++ number of variables in the initial model
Frog=randomfrog_pls(Xcal,ycal,A,method,N,Q);
plot(Frog.probability);
xlabel('variable index');
ylabel('selection probability');
set(gcf,'color','w');
You will get a figure in which the selection probability of each variable is displayed. The larger the probability, the more important the corresponding variable.
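The probability itself is simple bookkeeping over the models visited during the iterations: the fraction of the N sampled models that contained each variable. A hedged NumPy sketch with a hypothetical trace of sampled models (this does not implement the reversible-jump search itself):

```python
import numpy as np

rng = np.random.default_rng(6)
p, N = 50, 2000
# hypothetical record of which variables each iteration's model contained
models = [rng.choice(p, size=int(rng.integers(2, 10)), replace=False) for _ in range(N)]

counts = np.zeros(p)
for m in models:
    counts[m] += 1                 # tally each appearance of a variable
probability = counts / N           # selection probability per variable
```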
Moving Window PLS is a method that selects wavelength bands rather than individual wavelengths. The following code is for running MWPLS:
%+++ moving window PLS
A=15;      % maximal number of LVs
width=15;  % window size
[WP,RMSEF]=mwpls(Xcal,ycal,A,width);
plot(WP,RMSEF);
xlabel('wavelength');
ylabel('RMSEF');
set(gcf,'color','w');
This will generate a figure as below:
In this figure, the fitting error of each moving window is shown as a function of wavelength (window position). Wavelength bands associated with low fitting errors can be visually identified and used for building a calibration model.
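The moving-window idea is straightforward to sketch: fit a model in each window of fixed width and record its fitting error. A hedged NumPy sketch on toy data with a known informative band (ordinary least squares stands in for PLS; this is not the mwpls.m code):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 100
X = rng.normal(size=(n, p))
# informative wavelength band at indices 40..54
y = X[:, 40:55] @ rng.normal(size=15) + 0.1 * rng.normal(size=n)

width = 15
rmsef = np.empty(p - width + 1)
for start in range(p - width + 1):
    Xw = X[:, start:start + width]                      # current window
    b, *_ = np.linalg.lstsq(Xw, y, rcond=None)
    rmsef[start] = np.sqrt(np.mean((y - Xw @ b) ** 2))  # in-window fitting error
best = int(np.argmin(rmsef))           # window position with the lowest fitting error
```

On this toy data the minimum of the rmsef curve sits over the informative band, mirroring how low-error regions are read off the MWPLS figure.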
1. Wold, S., M. Sjöström, and L. Eriksson, 2001. PLS-regression: a basic tool of chemometrics. Chemometr. Intell. Lab. 58 (2001) 109-130.
2. Kennard, R.W. and L.A. Stone, 1969. Computer aided design of experiments. Technometrics 11 (1969) 137-148.
3. Shao, J., 1993. Linear Model Selection by Cross-Validation. J. Am. Stat. Assoc. 88 (1993) 486-494.
4. Xu, Q.-S. and Y.-Z. Liang, 2001. Monte Carlo cross validation. Chemometr. Intell. Lab. 56 (2001) 1-11.
5. Filzmoser, P., B. Liebmann, and K. Varmuza, 2009. Repeated double cross validation. J. Chemometr. 23 (2009) 160-171.
6. Cao, D.-S., Y.-Z. Liang, Q.-S. Xu, H.-D. Li, and X. Chen, 2010. A New Strategy of Outlier Detection for QSAR/QSPR. J. Comput. Chem. 31 (2010) 592-602.
7. Centner, V., D.-L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, and C. Sterna, 1996. Elimination of Uninformative Variables for Multivariate Calibration. Anal. Chem. 68 (1996) 3851-3858.
8. Cai, W., Y. Li, and X. Shao, 2008. A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra. Chemometr. Intell. Lab. 90 (2008) 188-194.
9. Rajalahti, T., R. Arneberg, A.C. Kroksveen, M. Berle, K.-M. Myhr, and O.M. Kvalheim, 2009. Discriminating Variable Test and Selectivity Ratio Plot: Quantitative Tools for Interpretation and Variable (Biomarker) Selection in Complex Spectral or Chromatographic Profiles. Anal. Chem. 81 (2009) 2581-2590.
10. Li, H.-D., Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, 2009. Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Anal. Chim. Acta 648 (2009) 77-84.
11. Li, H.-D., Q.-S. Xu, and Y.-Z. Liang, 2012. Random Frog: an efficient reversible jump Markov Chain Monte Carlo-like approach for gene selection and disease classification. Anal. Chim. Acta 740 (2012) 20-26.
12. Jiang, J.-H., R.J. Berry, H.W. Siesler, and Y. Ozaki, 2002. Wavelength Interval Selection in Multicomponent Spectral Analysis by Moving Window Partial Least-Squares Regression with Applications to Mid-Infrared and Near-Infrared Spectroscopic Data. Anal. Chem. 74 (2002) 3555-3565.
13. Li, H.-D., Y.-Z. Liang, and Q.-S. Xu, 2010. Uncover the path from PCR to PLS via elastic component regression. Chemometr. Intell. Lab. 104 (2010) 341-346.
14. Li, H.-D., Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, 2009. Model population analysis for variable selection. J. Chemometr. 24 (2009) 418-423.
15. Li, H.-D., Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, 2012. Model population analysis and its applications in chemical and biological modeling. TrAC 38 (2012) 154-162.
Author: Hongdong Li (email@example.com). Advisor: Yizeng Liang (firstname.lastname@example.org), College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, PR China. If you have any comments or questions, please let us know.