Join our mailing list to receive updates.

NEW: send your comments and questions to my blog.

Updates:

1. Elastic Component Regression (ECR) added in version 1.95.

This library provides a complete set of easy-to-use functions for building partial least squares regression (PLSR) and discriminant analysis (PLS-DA) models and for evaluating their predictive performance. To help build a reliable model, we also implemented a number of commonly used *outlier detection* and *variable selection* methods that can be used to *"clean"* your data by removing potential outliers and keeping only a subset of selected variables.

The algorithms in the current version cover:

| Types | Methods | Abbreviations | Codes | Notes |
|---|---|---|---|---|
| Data pretreatment | mean-centering | | pretreat.m | |
| | autoscaling | | pretreat.m | |
| | Orthogonal Projection to Latent Structures | OPLS | opls.m | |
| | Orthogonal Signal Correction of Tom Fearn | OSC | oscfearn.m | |
| | Orthogonal Signal Correction of Svante Wold | OSC | oscwold.m | |
| Sample partition | Kennard-Stone algorithm | KS | ks.m | |
| Model building | Partial Least Squares | PLS | pls.m | |
| | Linear Discriminant Analysis | LDA | ldapinv.m | |
| | Partial Least Squares-Linear Discriminant Analysis | PLS-DA | plslda.m | |
| | Elastic Component Regression | ECR | ecr.m | |
| Model assessment | leave-one-out cross validation | LOOCV | plscv.m, plsldacv.m | |
| | K-fold cross validation | K-fold CV | plscv.m, plsldacv.m, ecrcv.m | |
| | double cross validation | DCV | plsdcv.m, plsldadcv.m | |
| | Monte Carlo cross validation | MCCV | plsmccv.m, plsldamccv.m | |
| | repeated double cross validation | RDCV | plsrdcv.m, plsldardcv.m | |
| | Using an independent test set | | | |
| Outlier detection | The Monte Carlo method | | mcs.m | |
| Variable selection | Variable Importance in Projection | VIP | inside pls.m or plslda.m | |
| | Target Projection | TP | inside pls.m or plslda.m | |
| | Uninformative Variable Elimination | UVE | mcuvepls.m, mcuveplslda.m | |
| | Competitive Adaptive Reweighted Sampling | CARS | carspls.m, carsplalda.m | |
| | Random Frog | | randomfrog_pls.m, randomfrog_plslda.m | |
| | interval Random Frog | iRF | irf.m | |
| | Subwindow Permutation Analysis | SPA | spa.m | |
| | Moving Window Partial Least Squares | MWPLS | mwpls.m | |
| | the Phase Diagram algorithm | PHADIA | phadia.m | |
| | Iteratively Retain Informative Variables | IRIV | iriv.m | |
| | Variable Complementary Network | VCN | vcn.m | |
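To make two of the listed steps concrete, here is a minimal Python sketch of autoscaling followed by Kennard-Stone sample selection. It is not the code in `pretreat.m` or `ks.m`, and it does not use the libPLS API. The function names `autoscale` and `kennard_stone` and the toy data are assumptions made for illustration, and the sketch assumes Euclidean distance and at least two selected samples.

```python
import math
import statistics


def autoscale(X):
    """Center each column to mean 0 and scale it to unit sample standard deviation."""
    columns = list(zip(*X))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in X
    ]


def kennard_stone(X, n_select):
    """Return indices of n_select (>= 2) rows of X chosen by the Kennard-Stone rule."""
    n = len(X)
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select and remaining:
        # Pick the candidate whose nearest selected sample is farthest away.
        best = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: five samples, two variables on very different scales.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
X_scaled = autoscale(X)
calibration_idx = kennard_stone(X_scaled, 3)
print(calibration_idx)
```

Scaling first keeps the variable with the larger numeric range from dominating the distances, so the selected samples spread over both variables rather than only the second one.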

To build a credible model for a given chemical, biological, or clinical dataset, it helps to first gain insight into the data itself and then to report statistically stable results derived from a large number of sub-models built on that single dataset with the aid of Monte Carlo Sampling (MCS). We proposed the concept of Model Population Analysis (MPA), a general framework for designing new data analysis methods by statistically analyzing outputs of interest (regression coefficients, prediction errors, etc.) from a number of sub-models generated by introducing variation in the samples, the variables, or both. New methods can be developed by exploiting the output of interest in a novel manner. As described in the left figure, the output of a population of sub-models can be put into **four spaces: sample space, variable space, parameter space and model space**, which can serve as a guide for algorithm development.

The concept of MPA was originally proposed in J. Chemometr. 24 (2009) 418, and systematically elucidated and reviewed in TrAC 38 (2012) 154-162.

A series of MPA-based methods is available in the libPLS package, including:
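The MPA workflow can be sketched in a few lines of Python: draw many random subsets of one dataset, fit a sub-model on each, and summarize the distribution of an output of interest. The sketch below is an illustration under stated assumptions, not libPLS code. It uses a two-variable ordinary least-squares fit instead of PLS, synthetic data in which only `x1` carries signal, and a mean-to-standard-deviation ratio chosen purely as an example summary statistic. All names are hypothetical.

```python
import random
import statistics


def fit_two_variable_ols(rows):
    """Least-squares slopes (b1, b2) for y ~ x1 + x2 after mean-centering."""
    x1, x2, y = (list(col) for col in zip(*rows))
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * t for a, t in zip(c1, cy))
    s2y = sum(b * t for b, t in zip(c2, cy))
    # Solve the 2x2 normal equations in closed form.
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return b1, b2


rng = random.Random(0)

# Synthetic data (assumption): y depends on x1 only; x2 is pure noise.
data = []
for _ in range(60):
    x1 = rng.gauss(0, 1)
    x2 = rng.gauss(0, 1)
    y = 2.0 * x1 + rng.gauss(0, 0.5)
    data.append((x1, x2, y))

# Monte Carlo sampling: each sub-model sees a random 80% of the samples.
n_submodels = 500
coefs = [
    fit_two_variable_ols(rng.sample(data, k=int(0.8 * len(data))))
    for _ in range(n_submodels)
]

# Statistical analysis of the output of interest (the regression coefficients).
summary = {}
for name, values in zip(("x1", "x2"), zip(*coefs)):
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    summary[name] = (mean, sd, abs(mean) / sd)
    print(f"{name}: mean={mean:.3f} sd={sd:.3f} ratio={abs(mean) / sd:.1f}")
```

Here the population of coefficients lives in the parameter space, and comparing the two summaries suggests which variable is consistently informative across sub-models.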

- Subwindow Permutation Analysis: variable selection for classification models; outputs a variable-interaction-incorporated P-value for assessing the synergistic statistical importance of each variable; this P-value is minus-log10-transformed to a COSS score; another statistic, DMEAN, is also provided to assess the effect size of each variable in predictions.
- Margin Influence Analysis: variable selection specifically designed for support vector machines; outputs a variable-interaction-incorporated P-value for assessing the synergistic statistical importance of each variable; this P-value is minus-log10-transformed to a COSS score; another statistic, DMEAN, is also provided to assess the effect size of each variable in predictions.
- Random Frog: variable selection for both classification and regression; outputs a selection probability for each variable. In addition, the interval Random Frog method was developed for spectral data only.
- Monte Carlo Uninformative Variable Elimination: variable elimination for both classification and regression; outputs a reliability index for each variable.
- Variable Complementary Network: builds Variable Complementary Networks and simultaneously assesses variable importance; outputs a Total Complementary Information value for each variable for importance evaluation.
- The Phase Diagram method: variable selection and visualization for classification; outputs a diagnostic plot showing which variables are important, and a variable-interaction-incorporated P-value for assessing the synergistic statistical importance of each variable. This method is an extension of MIA and SPA.
- Iteratively Retain Informative Variables: variable selection in regression.
- The Monte Carlo method: outlier detection in regression; outputs a diagnostic plot showing which samples are likely to be outliers.
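As an illustration of the sample-space view behind the last item, the following Python sketch collects each sample's prediction errors over many random train/test splits and summarizes them by mean and standard deviation. It is a simplified sketch, not the code in `mcs.m`. It assumes that the diagnostic is built from per-sample prediction errors, uses a one-variable least-squares line instead of a PLS model, and plants one artificial outlier in synthetic data. All names are hypothetical.

```python
import random
import statistics


def fit_line(points):
    """Ordinary least-squares intercept and slope for (x, y) pairs."""
    xs, ys = zip(*points)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(1)

# Synthetic data (assumption): y = 3x + noise, with sample 15 shifted by +10.
data = [(0.1 * i, 3.0 * 0.1 * i + rng.gauss(0, 0.3)) for i in range(30)]
outlier_index = 15
data[outlier_index] = (data[outlier_index][0], data[outlier_index][1] + 10.0)

# Monte Carlo sampling: 300 random splits, 21 training and 9 test samples each.
errors = {i: [] for i in range(len(data))}
indices = list(range(len(data)))
for _ in range(300):
    rng.shuffle(indices)
    train, test = indices[:21], indices[21:]
    intercept, slope = fit_line([data[i] for i in train])
    for i in test:
        x, y = data[i]
        errors[i].append(y - (intercept + slope * x))

# Per-sample mean and standard deviation of the prediction errors.
diagnostics = {
    i: (statistics.fmean(e), statistics.stdev(e)) for i, e in errors.items()
}
suspect = max(diagnostics, key=lambda i: abs(diagnostics[i][0]))
print("most suspicious sample:", suspect)
```

Plotting the mean against the standard deviation for every sample would give a diagnostic plot in which samples far from the main cluster are candidate outliers.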

A systematic introduction to the MPA idea can be found in our presentation [PDF].

1. Wold, S., M. Sjöström, and L. Eriksson, 2001. PLS-regression: a basic tool of chemometrics. Chemometr. Intell. Lab. 58 (2001) 109-130.

2. Kennard, R.W. and L.A. Stone, 1969. Computer aided design of experiments. Technometrics 11 (1969) 137-148.

3. Shao, J., 1993. Linear Model Selection by Cross-Validation. J Am. Stat. Assoc. 88 (1993) 486-494.

4. Xu, Q.-S. and Y.-Z. Liang, 2001. Monte Carlo cross validation. Chemometr. Intell. Lab. 56 (2001) 1-11.

5. Filzmoser, P., B. Liebmann, and K. Varmuza, 2009. Repeated double cross validation. J Chemometr 23 (2009) 160-171.

6. Cao, D.S., Y.Z. Liang, Q.S. Xu, H.D. Li, and X. Chen, A New Strategy of Outlier Detection for QSAR/QSPR. J Comput Chem 31 592-602.

7. Centner, V., D.-L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, and C. Sterna, 1996. Elimination of Uninformative Variables for Multivariate Calibration. Anal. Chem. 68 (1996) 3851-3858.

8. Cai, W., Y. Li, and X. Shao, 2008. A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra. Chemometr. Intell. Lab. 90 (2008) 188-194.

9. Rajalahti, T., R. Arneberg, A.C. Kroksveen, M. Berle, K.-M. Myhr, and O.M. Kvalheim, 2009. Discriminating Variable Test and Selectivity Ratio Plot: Quantitative Tools for Interpretation and Variable (Biomarker) Selection in Complex Spectral or Chromatographic Profiles. Anal. Chem. 81 (2009) 2581-2590.

10. Li, H.-D., Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, 2009. Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Anal. Chim. Acta 648 (2009) 77-84.

11. Li, H.-D., Q.-S. Xu, and Y.-Z. Liang, 2012. Random Frog: an efficient reversible jump Markov Chain Monte Carlo-like approach for gene selection and disease classification. Anal Chim Acta 740 (2012) 20-26.

12. Jiang, J.-H., R.J. Berry, H.W. Siesler, and Y. Ozaki, 2002. Wavelength Interval Selection in Multicomponent Spectral Analysis by Moving Window Partial Least-Squares Regression with Applications to Mid-Infrared and Near-Infrared Spectroscopic Data. Anal. Chem. 74 (2002) 3555-3565.

13. Li, H.-D., Y.-Z. Liang, and Q.-S. Xu, 2010. Uncover the path from PCR to PLS via elastic component regression. Chemometr. Intell. Lab. 104 (2010) 341-346.

14. Li, H.-D., Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, 2009. Model population analysis for variable selection. J. Chemometr. 24 (2009) 418-423.

15. Li, H.-D., Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, 2012. Model population analysis and its applications in chemical and biological modeling. TrAC 38 (2012) 154-162.

16. Li H-D, Liang Y-Z, Xu Q-S et al. (2011) Recipe for Uncovering Predictive Genes using Support Vector Machines based on Model Population Analysis. IEEE/ACM T Comput Bi 8: 1633-1641.

17. YH Yun, HD Li et al, An efficient method of wavelength interval selection based on random frog for multivariate spectral calibration, Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 111, 2013, 31-36.

18. YH Yun, WT Wang et al, A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration, Analytica chimica acta 807, 2014, 36-43.

19. HD Li, QS Xu, YZ Liang, A phase diagram for gene selection and disease classification, bioRxiv, doi: 10.1101/002360.

20. HD Li, QS Xu, W Zhang, YZ Liang, (2012) Variable Complementary Network: a novel approach for identifying biomarkers and their mutual associations. Metabolomics 8, 1218-1226.

**How to cite?** If you use this library, please cite it as: *Li H.-D., Xu Q.-S., Liang Y.-Z. (2014) libPLS: An Integrated Library for Partial Least Squares Regression and Discriminant Analysis. PeerJ PrePrints 2:e190v1*. Source code is available at www.libpls.net.

If you have any questions, please drop me a line at lhdcsu@gmail.com.

libPLS: an Integrated Library for Partial Least Squares Regression and Discriminant Analysis. libPLS is under continuous development.