CARS, Competitive Adaptive Reweighted Sampling, is a variable or feature selection methods, especially suitable for p>>n setting and data with high correlation (coupled with PLS). Here I will give an example to detail how to use CARS and the meaning of its output.
Suppose that you have MATLAB installed (I am using 2010a version on Mac), downloaded the libPLS_1.95 and have added libPLS_1.95 in the search path.
First, let’s get our data. Copy the codes below to the command window and press ENTER:
%+++ load data
Keep in mind that this data has n=80 samples which were measured on p=700 wavelengths. The target variable to model is stored in vector y. In this p>>n setting, ordinary least squares (OLS) can not be used to build a model y=Xb. Why? because b is calculated as b=inv(X’*X)X’y (inv indicates inverse, ‘ indicates transpose) in OLS. Due to p > n, X’*X is not full rank and hence do NOT have its inverse and therefore OLS fails in this situation. One type of solution is to use regularized methods such as ridge regression or latent variable-based methods such as principal component regression (PCR) and partial least squares (PLS). Why not simply reduce p? A good idea! This motivates another type of solution which is the well known branch of statistics: variable selection or called feature selection.
Then, we split the data into a training and a test set, which will be used for model training and validation, respectively. There are different ways to split data, such as the KS (Kennard-Stone) method and random partition. Here, I will randomly split the data using the codes below:
%+++ split data into a training and test set
ktrain=perm(1:60); % 60 samples as training
ktest=perm(61:80); % 20 samples as test
At this point, data is ready! We can start to select a subset of variables using the CARS method. One may ask why not use the full spectral data? A good question to think about. Run the codes below to select variables (wavelengths):
%+++ perform variable selection using only the training data
A=10; % maximum number of PLS components
fold=5; % fold for cross validation
method=’center’; % for internal pretreatment
num=50; % number of iterations of CARS, 50 is default
To see the output of CARS, type “CARS” in command window and press ENTER and you will see:
W: [700×50 double]
RMSECV: [1×50 double]
ratio: [1×50 double]
vsel: [405 505]
Technically, CARS is a structural data with different components like optLV as shown above. Your can use CARS.RMSECV to fetch the value of RMSECV. The meaning of each component of CARS is given here:
W: [700×50 double] % weight of each variable at each iteration.
RMSECV: [1×50 double] % Root Mean Squares Errors of Cross Validation, one for each iteration
RMSECV_min: 3.0894e-04 % the minimum of RMSECV
Q2_max: 1.0000 % Q2 is the cross validated R2.
iterOPT: 49 % the optimal iteration number (corresponding to the RMSECV_min)
optLV: 2 % the optimal number of Latent Variables of PLS model at iterOPT.
ratio: [1×50 double]
vsel: [405 505] % The selected variables. THIS IS WHAT WE WANT.
Recall that this data contains p=700 variables. Here only two out of the 700 were selected to be useful by CARS: the 405th and the 505th. To understand the behavior of CARS, let’s issue the following command:
which gives you a plot:
Figure 1. Output of CARS.
In Figure 1, the upper panel shows the number of variables kept by CARS at each iteration; the middle panel plots the the RMSECV at each iteration, which shows that RMSECV decreases with reduced number of variables (interesting? why?); the bottom panel provides the regression coefficients of each of the 700 variables at each iteration, called Regression Coefficient Paths (RCP) where each line denotes the RCP of one variable. Eliminated variables will have RCP ending with zero (thus you can not see it); some variables will have their RCP first increase and then decrease and finally drop to zero (notice iterations around 30); some variables survived along the whole 50 iterations of CARS and are winners and are selected by CARS. Notice that there two lines (just the 405th and 505th variables) clearly stand out!
We have selected 2 variables. Then let’s check out how only these two will perform. The variable-reduced data is:
The we build a model and test it:
An intuitive way to check model performance is to plot predicted value against the experimental value:
This gives you a plot:
You want to give a quantitative measurement of model performance. Usually, you calculate RMSEP (Root Mean Squared Errors of Prediction):
which is 3.0285e-04 in this case, very small (this data is very special).
So far, we have selected a subset of variables (405 and 505) using CARS and tested the performance of this subset on an independent test set. You may ask whether using a subset will improve model performance over the full-spectral model? It will as long as the full-model contains information that is irrelevant to the target variable you are predicting. As a test case, you can build a model using the full training data and tested it on the full test set and compares it to the simpler model with only two selected variables.