CARS (Competitive Adaptive Reweighted Sampling) is a variable (feature) selection method, especially suitable for the p>>n setting and for data with high correlation (coupled with PLS). Here I will give an example to show how to use CARS and explain the meaning of its output.
Suppose that you have MATLAB installed (I am using the 2010a version on a Mac), have downloaded libPLS_1.95, and have added it to the search path.
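If the toolbox folder is not yet on your search path, something like this does the trick (a minimal sketch, assuming the libPLS_1.95 folder sits in your current directory):
%+++ add libPLS and its subfolders to the search path (adjust the location as needed)
addpath(genpath('libPLS_1.95'));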
First, let's get our data. Copy the code below into the command window and press ENTER:
%+++ load data
load('corn_m51.mat');
[n,p]=size(X);
Keep in mind that this data set has n=80 samples measured at p=700 wavelengths. The target variable to model is stored in the vector y. In this p>>n setting, ordinary least squares (OLS) cannot be used to build a model y=Xb. Why? Because in OLS, b is calculated as b=inv(X'*X)*X'*y (inv denotes matrix inverse, ' denotes transpose). Since p>n, X'*X is not of full rank and hence has NO inverse, so OLS fails in this situation. One type of solution is to use regularized methods such as ridge regression, or latent variable-based methods such as principal component regression (PCR) and partial least squares (PLS). Why not simply reduce p? A good idea! This motivates another type of solution, a well-known branch of statistics: variable selection, also called feature selection.
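If you want to see the rank deficiency for yourself, a one-line check in base MATLAB suffices:
%+++ X'*X is 700x700, but its rank is at most n=80, so it has no inverse
rank(X'*X) % returns a value no larger than 80, far below p=700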
Then, we split the data into a training set and a test set, which will be used for model training and validation, respectively. There are different ways to split data, such as the KS (Kennard-Stone) method and random partition. Here, I will randomly split the data using the code below:
%+++ split data into a training and test set
perm=randperm(n);
ktrain=perm(1:60); % 60 samples as training
ktest=perm(61:80); % 20 samples as test
Xtrain=X(ktrain,:);
Xtest=X(ktest,:);
ytrain=y(ktrain);
ytest=y(ktest);
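As an optional sanity check, you can verify that the two sets do not overlap. Also note that randperm produces a different split on each run, so the numbers you get below may differ slightly from mine:
%+++ optional sanity check: training and test indices must not overlap
isempty(intersect(ktrain,ktest)) % should return 1 (true)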
At this point, the data is ready! We can start selecting a subset of variables using the CARS method. One may ask: why not use the full spectral data? A good question to think about. Run the code below to select variables (wavelengths):
%+++ perform variable selection using only the training data
A=10; % maximum number of PLS components
fold=5; % fold for cross validation
method='center'; % for internal pretreatment
num=50; % number of iterations of CARS, 50 is default
CARS=carspls(Xtrain,ytrain,A,fold,method,num,0,0);
To see the output of CARS, type "CARS" in the command window and press ENTER; you will see:
>> CARS
CARS =
W: [700×50 double]
time: 2.8165
RMSECV: [1×50 double]
RMSECV_min: 3.0894e-04
Q2_max: 1.0000
iterOPT: 49
optLV: 2
ratio: [1×50 double]
vsel: [405 505]
Technically, CARS is a struct with several fields, such as optLV shown above. You can use CARS.RMSECV to fetch the RMSECV values, and likewise for the other fields (see the snippet after the list below). The meaning of each field of CARS is given here:
W: [700×50 double] % weight of each variable at each iteration
time: 2.8165 % running time, in seconds
RMSECV: [1×50 double] % Root Mean Squared Error of Cross Validation, one value per iteration
RMSECV_min: 3.0894e-04 % the minimum of RMSECV
Q2_max: 1.0000 % Q2 is the cross-validated R2
iterOPT: 49 % the optimal iteration number (corresponding to RMSECV_min)
optLV: 2 % the optimal number of latent variables of the PLS model at iterOPT
ratio: [1×50 double]
vsel: [405 505] % the selected variables. THIS IS WHAT WE WANT.
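For instance, you can inspect the cross-validation error over the 50 iterations and recover the optimal iteration yourself (a small sketch built only on the fields listed above):
%+++ RMSECV across the 50 CARS iterations
plot(1:num,CARS.RMSECV,'o-');
xlabel('CARS iteration');
ylabel('RMSECV');
[rmsecv_min,iter_opt]=min(CARS.RMSECV) % should agree with CARS.RMSECV_min and CARS.iterOPT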
Recall that this data contains p=700 variables. Here, only two out of the 700 were selected as useful by CARS: the 405th and the 505th. To understand the behavior of CARS, let's issue the following command:
plotcars(CARS)
which gives you a plot (Figure 1):
In Figure 1, the upper panel shows the number of variables kept by CARS at each iteration; the middle panel plots the RMSECV at each iteration, which shows that RMSECV decreases as the number of variables is reduced (interesting? why?); the bottom panel shows the regression coefficient of each of the 700 variables at each iteration, called the Regression Coefficient Paths (RCP), where each line denotes the RCP of one variable. Eliminated variables have an RCP ending at zero (thus you cannot see it); some variables have an RCP that first increases, then decreases and finally drops to zero (notice the iterations around 30); some variables survive all 50 iterations of CARS: these are the winners selected by CARS. Notice that two lines (exactly the 405th and 505th variables) clearly stand out!
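Assuming each column of CARS.W stores the variable weights of one iteration (which its 700×50 size suggests), the winners can also be read off directly: they are the variables with nonzero weight at the optimal iteration.
%+++ variables with nonzero weight at the optimal iteration
find(CARS.W(:,CARS.iterOPT)~=0)' % should reproduce CARS.vsel, i.e. [405 505]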
We have selected 2 variables. Now let's check how these two alone perform. The variable-reduced data is:
Xtrain_sub=Xtrain(:,[405 505]);
Xtest_sub=Xtest(:,[405 505]);
Then we build a model and test it:
model=pls(Xtrain_sub,ytrain,2);
ypred=plsval(model,Xtest_sub,ytest)
An intuitive way to check model performance is to plot predicted value against the experimental value:
plot(ytest,ypred,'b.');
xlabel('Experimental');
ylabel('Predicted');
This gives you a plot:
You also want a quantitative measure of model performance. Usually, you calculate the RMSEP (Root Mean Squared Error of Prediction):
RMSEP=sqrt(sumsqr(ypred-ytest)/length(ytest));
which is 3.0285e-04 in this case, very small (this data is very special).
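A side note: sumsqr comes from a separate MathWorks toolbox. If it is not available on your installation, the same value can be computed with base MATLAB only:
%+++ RMSEP with base MATLAB (no sumsqr needed)
RMSEP=sqrt(mean((ypred-ytest).^2));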
So far, we have selected a subset of variables (the 405th and 505th) using CARS and tested the performance of this subset on an independent test set. You may ask whether using a subset will improve model performance over a full-spectrum model. It will, as long as the full model contains information that is irrelevant to the target variable you are predicting. As a test case, you can build a model using the full training data, test it on the full test set, and compare it with the simpler model that uses only the two selected variables.
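A sketch of that comparison, reusing the pls/plsval calls from above (here I simply reuse A=10 components for the full model; ideally the number of components should itself be chosen by cross validation):
%+++ full-spectrum baseline model for comparison
model_full=pls(Xtrain,ytrain,A);
ypred_full=plsval(model_full,Xtest,ytest);
RMSEP_full=sqrt(mean((ypred_full-ytest).^2)) % compare with the two-variable RMSEP above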
Hi
how can I perform a singular partition on the data set?
And is there any code for Principal Component Regression regarding the same subject?
Do you mean singular value decomposition?
Yes, you can use ecr.m, which is a generalized latent variable regression. It has one parameter called alpha:
1, when alpha=0, it is principal component regression (PCR);
2, when alpha=1, it is PLS;
3, when 0<alpha<1, it gives a model in between PCR and PLS.
Dear Yizeng Liang
I wonder if you would be kind enough to answer the following question of mine.
In your paper entitled "Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration", published in Analytica Chimica Acta 648 (2009) 77-84, on page 78, section 2.4 (Exponentially decreasing function), it is stated that "……, (II) in the Nth sampling run, only two wavelengths are reserved such that we have rN = 2/p."
My question is: why is it that in the Nth sampling run, two wavelengths are reserved (and not, for instance, one or three wavelengths)? Please explain more.
With best wishes and regards
For multivariate regression models, I think 2 is the minimum number of variables for a model to be considered multivariate.
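For reference, the exponentially decreasing function of that section is r_i = a*exp(-k*i), where a and k are fixed by the two boundary conditions r_1 = 1 (all p wavelengths kept in the first run) and r_N = 2/p (two wavelengths kept in the last run). A small numerical check, using N=50 runs and p=700 as in the example above:
%+++ EDF of CARS: r_i = a*exp(-k*i) with r_1=1 and r_N=2/p
p=700; N=50; i=1:N;
k=log(p/2)/(N-1);
a=(p/2)^(1/(N-1));
r=a*exp(-k*i); % fraction of wavelengths kept at sampling run i
round(r([1 N])*p) % returns [700 2]: from all wavelengths down to two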
Dear Yizeng Liang
I will use your toolbox in my paper. Actually, I am a new user of MATLAB. I use The Unscrambler, CWS Senologic, etc. for spectral analyses. I would like to use the CARS method for variable selection in my study. I tried it on my data (n=50 samples, p=101 and 4 variables). CARS gives vsel as [1×14 double] or similar. I don't think there is a problem in my data set, because I analyzed the same data set in MATLAB using the SPA toolbox. Maybe this is a very simple problem, but I am a new MATLAB user and I don't have an idea about this case. If you could give me some support, I would be glad.
F
Hi,
I was trying to use the Kennard-Stone algorithm to partition my dataset into training and test sets, but the result showed that the y values of the test set were all of class 2 (the y values are classes 1 and 2). Why did this happen? Is there a way to solve it?
Use only X as the input of the KS algorithm; y is not considered in KS.
Hello! Thanks for this!
May I ask why you don't apply the CARS algorithm to the full dataset first, and only afterwards apply PLS to the training set with the selected variables?
Thanks!
If you do it this way, CARS will 'see' the important variables in the test set, thus leading to overfitting.
Thanks for your quick reply!
The reason I'm asking is that I'm applying the CARS algorithm to FTIR data, and for some responses I'm getting a lower RMSEP with full-spectrum PLSR (2700 variables!!) than with CARS-PLSR.
Is that possible? I expected that CARS-PLS would give a lower RMSEP in this case!
Thanks!
Any idea please?
Thanks!
Yes, very often variables selected by CARS give a lower RMSEP, but your case can still happen. Maybe you should check the data preprocessing, as well as whether there are outliers in your data.