
libPLS: an Integrated Library for Partial Least Squares Regression and Linear Discriminant Analysis

Featuring Model Population Analysis (MPA) approaches


Join our mailing list to receive updates.

NEW: post your comments/questions on my blog.

Updates:

3. The libPLS paper is published in Chemom. Intell. Lab. Syst.
2. Custom priors are allowed in LDA to weight each class of samples in version 1.98.
1. Elastic Component Regression (ECR) added in version 1.95.

1. Overview

This library provides a complete set of easy-to-use functions for building partial least squares (PLS) regression (PLSR) and discriminant analysis (PLS-DA) models, as well as for evaluating their predictive performance. Towards building a reliable model, we also implemented a number of commonly used outlier detection and variable selection methods that can be used to "clean" your data by removing potential outliers and retaining only a subset of selected variables.

The algorithms in the current version cover:

Data pretreatment
    mean-centering: pretreat.m
    autoscaling: pretreat.m
    Orthogonal Projection to Latent Structures (OPLS): opls.m
    Orthogonal Signal Correction of Tom Fearn (OSC): oscfearn.m
    Orthogonal Signal Correction of Svante Wold (OSC): oscwold.m

Sample partition
    Kennard-Stone algorithm (KS): ks.m

Model building
    Partial Least Squares (PLS): pls.m
    Linear Discriminant Analysis (LDA): ldapinv.m
    Partial Least Squares-Linear Discriminant Analysis (PLS-DA): plslda.m
    Elastic Component Regression (ECR): ecr.m

Model assessment
    leave-one-out cross validation (LOOCV): plscv.m, plsldacv.m
    K-fold cross validation: plscv.m, plsldacv.m, ecrcv.m
    double cross validation (DCV): plsdcv.m, plsldadcv.m
    Monte Carlo cross validation (MCCV): plsmccv.m, plsldamccv.m
    using an independent test set

Outlier detection
    the Monte Carlo method: mcs.m

Variable selection
    Variable Importance in Projection (VIP): inside pls.m or plslda.m
    Target Projection (TP): inside pls.m or plslda.m
    Uninformative Variable Elimination (UVE): mcuvepls.m, mcuveplslda.m
    Competitive Adaptive Reweighted Sampling (CARS): carspls.m, carsplslda.m
    Random Frog: randomfrog_pls.m, randomfrog_plslda.m
    interval Random Frog (iRF): irf.m
    Subwindow Permutation Analysis (SPA): spa.m
    Moving Window Partial Least Squares (MWPLS): mwpls.m
    the Phase Diagram algorithm (PHADIA): phadia.m
    Iteratively Retain Informative Variables (IRIV): iriv.m
    Variable Complementary Network (VCN): vcn.m
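
To make the pretreatment and sample-partition entries above concrete, here is a short, self-contained MATLAB sketch of autoscaling and Kennard-Stone selection on toy data. It uses only base MATLAB; the corresponding libPLS routines (pretreat.m, ks.m) are the documented entry points, and their exact calling conventions may differ from this illustration.

    % A minimal, self-contained sketch of two entries from the list above:
    % autoscaling (the job of pretreat.m) and Kennard-Stone sample partitioning
    % (the job of ks.m). Toy data; the actual libPLS routines may differ.

    X = rand(80, 20);                          % toy data: 80 samples x 20 variables

    % --- autoscaling: center each column and scale it to unit variance ---
    mu    = mean(X, 1);
    sigma = std(X, 0, 1);
    sigma(sigma == 0) = 1;                     % guard against constant variables
    Xa = (X - mu) ./ sigma;                    % implicit expansion (R2016b or later)

    % --- Kennard-Stone: pick k samples that evenly cover the X-space ---
    k  = 20;
    d2 = sum(Xa.^2, 2);
    D  = sqrt(max(d2 + d2' - 2 * (Xa * Xa'), 0));   % pairwise Euclidean distances
    [~, idx] = max(D(:));                      % start from the two most distant samples
    [r, c]   = ind2sub(size(D), idx);
    sel = [r, c];
    while numel(sel) < k
        rest = setdiff(1:size(Xa, 1), sel);
        dmin = min(D(rest, sel), [], 2);       % distance to the nearest selected sample
        [~, j] = max(dmin);                    % farthest-from-the-set sample joins next
        sel(end + 1) = rest(j);                %#ok<AGROW>
    end
    calibration_idx = sel;                           % candidate calibration set
    test_idx = setdiff(1:size(X, 1), sel);           % remaining samples for testing

The resulting index vectors can then be passed to whichever model-building and assessment routines are used downstream.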

2. Model Population Analysis (MPA)

[Figure: schematic of Model Population Analysis (MPA)]

To build a credible model for given chemical, biological, or clinical data, it may help to first gain insight into the data itself before modeling and then to report the statistically stable results derived from a large number of sub-models built on a single dataset with the aid of Monte Carlo Sampling (MCS). We proposed a new concept, Model Population Analysis (MPA), which is a general framework for designing new data analysis methods by statistically analyzing outputs of interest (regression coefficients, prediction errors, etc.) from a large number of sub-models generated by introducing variation into the samples, the variables, or both. New methods can be developed by making full use of the parameter of interest in a novel manner. As illustrated in the figure above, the outputs of a population of sub-models can be organized into four spaces: the sample space, the variable space, the parameter space, and the model space, which can serve as a guide for algorithm development.

The concept of MPA was originally proposed in J. Chemometr. 24 (2009) 418 and systematically elucidated and reviewed in TrAC 38 (2012) 154-162.
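
As a concrete, purely illustrative sketch of this idea, the snippet below draws many Monte Carlo sub-samples, fits a PLS sub-model on each, and then studies the distribution of one output of interest, the prediction error on the left-out samples. It uses toy data, and plsregress from MATLAB's Statistics and Machine Learning Toolbox stands in for the library's own pls.m; it is not the libPLS implementation itself.

    % An MPA-style sketch: build many Monte Carlo sub-models and analyze the
    % DISTRIBUTION of one output of interest (here, the prediction error on the
    % left-out samples). plsregress is used as a stand-in for pls.m.

    rng(0);
    n = 100; p = 50; A = 5;                    % samples, variables, latent variables
    X = randn(n, p);
    y = X(:, 1:5) * ones(5, 1) + 0.1 * randn(n, 1);   % toy response

    nMC   = 500;                               % number of Monte Carlo sub-models
    ratio = 0.7;                               % fraction of samples per sub-model
    rmsep = zeros(nMC, 1);                     % collected output of interest

    for i = 1:nMC
        cal = randperm(n, round(ratio * n));   % random calibration subset
        val = setdiff(1:n, cal);               % left-out samples
        [~, ~, ~, ~, beta] = plsregress(X(cal, :), y(cal), A);
        yhat = [ones(numel(val), 1), X(val, :)] * beta;
        rmsep(i) = sqrt(mean((y(val) - yhat).^2));
    end

    % The MPA step: examine the whole distribution, not a single number
    fprintf('RMSEP over %d sub-models: mean %.3f, std %.3f\n', nMC, mean(rmsep), std(rmsep));
    histogram(rmsep);                          % stability of the sub-model population

Depending on which output is collected (prediction errors, regression coefficients, selected variables), the same loop yields information in the sample, variable, parameter, or model space.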

A series of MPA-based methods are available in the libPLS package, which include:

  1. Subwindow Permutation Analysis: variable selection for classification models; outputs a variable-interaction-incorporated P-value for assessing the synergistic statistical importance of each variable, which is minus log10-transformed into the COSS score; another statistic, DMEAN, is also provided to assess the effect size of each variable in predictions.
  2. Margin Influence Analysis: variable selection specifically designed for support vector machines; outputs a variable-interaction-incorporated P-value for assessing the synergistic statistical importance of each variable, which is minus log10-transformed into the COSS score; another statistic, DMEAN, is also provided to assess the effect size of each variable in predictions.
  3. Random Frog: variable selection for both classification and regression; outputs a selection probability for each variable. In addition, the interval Random Frog method was developed specifically for spectral data.
  4. Monte Carlo Uninformative Variable Elimination: variable elimination for both classification and regression; outputs a reliability index for each variable (a minimal sketch of this index follows this list).
  5. Variable Complementary Network: builds variable complementary networks and simultaneously assesses variable importance; outputs a Total Complementary Information score for each variable for importance evaluation.
  6. The Phase Diagram method: variable selection and visualization for classification; outputs a diagnostic plot showing which variables are important, together with a variable-interaction-incorporated P-value for assessing the synergistic statistical importance of each variable. This method is an extension of MIA and SPA.
  7. Iteratively Retains Informative Variables: variable selection for regression.
  8. The Monte Carlo method: outlier detection for regression; outputs a diagnostic plot showing which samples are likely to be outliers.
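
Below is a minimal sketch of the reliability index underlying Monte Carlo Uninformative Variable Elimination (item 4 above): regression coefficients are collected from many Monte Carlo sub-models, and variables whose coefficients are large and stable across the population score high. Again this uses toy data and plsregress as a stand-in for the PLS fit; the actual mcuvepls.m may differ in details such as scaling and the cut-off used to eliminate variables.

    % Sketch of the Monte Carlo UVE reliability index on toy data, in which
    % only the first 5 of 50 variables carry signal.

    rng(1);
    n = 100; p = 50; A = 5; nMC = 500; ratio = 0.7;
    X = randn(n, p);
    y = X(:, 1:5) * ones(5, 1) + 0.1 * randn(n, 1);

    B = zeros(nMC, p);                          % regression coefficients per sub-model
    for i = 1:nMC
        cal = randperm(n, round(ratio * n));    % random Monte Carlo sample subset
        [~, ~, ~, ~, beta] = plsregress(X(cal, :), y(cal), A);
        B(i, :) = beta(2:end)';                 % drop the intercept term
    end

    % Variables with large, stable coefficients across sub-models score high
    RI = abs(mean(B, 1)) ./ std(B, 0, 1);
    [~, order] = sort(RI, 'descend');
    disp('Top 10 variables by reliability index:');
    disp(order(1:10));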

A systematic introduction to the MPA idea can be found in our presentation [PDF].

3. References

1. Wold, S., M. Sjöström, and L. Eriksson, 2001. PLS-regression: a basic tool of chemometrics. Chemometr. Intell. Lab. Syst. 58 (2001) 109-130.
2. Kennard, R.W. and L.A. Stone, 1969. Computer aided design of experiments. Technometrics 11 (1969) 137-148.
3. Shao, J., 1993. Linear model selection by cross-validation. J. Am. Stat. Assoc. 88 (1993) 486-494.
4. Xu, Q.-S. and Y.-Z. Liang, 2001. Monte Carlo cross validation. Chemometr. Intell. Lab. Syst. 56 (2001) 1-11.
5. Filzmoser, P., B. Liebmann, and K. Varmuza, 2009. Repeated double cross validation. J. Chemometr. 23 (2009) 160-171.
6. Cao, D.-S., Y.-Z. Liang, Q.-S. Xu, H.-D. Li, and X. Chen, 2010. A new strategy of outlier detection for QSAR/QSPR. J. Comput. Chem. 31 (2010) 592-602.
7. Centner, V., D.-L. Massart, O.E. de Noord, S. de Jong, B.M. Vandeginste, and C. Sterna, 1996. Elimination of uninformative variables for multivariate calibration. Anal. Chem. 68 (1996) 3851-3858.
8. Cai, W., Y. Li, and X. Shao, 2008. A variable selection method based on uninformative variable elimination for multivariate calibration of near-infrared spectra. Chemometr. Intell. Lab. Syst. 90 (2008) 188-194.
9. Rajalahti, T., R. Arneberg, A.C. Kroksveen, M. Berle, K.-M. Myhr, and O.M. Kvalheim, 2009. Discriminating variable test and selectivity ratio plot: quantitative tools for interpretation and variable (biomarker) selection in complex spectral or chromatographic profiles. Anal. Chem. 81 (2009) 2581-2590.
10. Li, H.-D., Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, 2009. Key wavelengths screening using competitive adaptive reweighted sampling method for multivariate calibration. Anal. Chim. Acta 648 (2009) 77-84.
11. Li, H.-D., Q.-S. Xu, and Y.-Z. Liang, 2012. Random Frog: an efficient reversible jump Markov Chain Monte Carlo-like approach for gene selection and disease classification. Anal. Chim. Acta 740 (2012) 20-26.
12. Jiang, J.-H., R.J. Berry, H.W. Siesler, and Y. Ozaki, 2002. Wavelength interval selection in multicomponent spectral analysis by moving window partial least-squares regression with applications to mid-infrared and near-infrared spectroscopic data. Anal. Chem. 74 (2002) 3555-3565.
13. Li, H.-D., Y.-Z. Liang, and Q.-S. Xu, 2010. Uncover the path from PCR to PLS via elastic component regression. Chemometr. Intell. Lab. Syst. 104 (2010) 341-346.
14. Li, H.-D., Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, 2009. Model population analysis for variable selection. J. Chemometr. 24 (2009) 418-423.
15. Li, H.-D., Y.-Z. Liang, Q.-S. Xu, and D.-S. Cao, 2012. Model population analysis and its applications in chemical and biological modeling. TrAC 38 (2012) 154-162.
16. Li, H.-D., Y.-Z. Liang, Q.-S. Xu, et al., 2011. Recipe for uncovering predictive genes using support vector machines based on model population analysis. IEEE/ACM Trans. Comput. Biol. Bioinform. 8 (2011) 1633-1641.
17. Yun, Y.-H., H.-D. Li, et al., 2013. An efficient method of wavelength interval selection based on random frog for multivariate spectral calibration. Spectrochim. Acta Part A 111 (2013) 31-36.
18. Yun, Y.-H., W.-T. Wang, et al., 2014. A strategy that iteratively retains informative variables for selecting optimal variable subset in multivariate calibration. Anal. Chim. Acta 807 (2014) 36-43.
19. Li, H.-D., Q.-S. Xu, and Y.-Z. Liang. A phase diagram for gene selection and disease classification. bioRxiv, doi: 10.1101/002360.
20. Li, H.-D., Q.-S. Xu, W. Zhang, and Y.-Z. Liang, 2012. Variable Complementary Network: a novel approach for identifying biomarkers and their mutual associations. Metabolomics 8 (2012) 1218-1226.

How to cite?

If you use this library, please cite it as: Li H.-D., Xu Q.-S., Liang Y.-Z., libPLS: an integrated library for partial least squares regression and discriminant analysis. Chemom. Intell. Lab. Syst. 176 (2018) 34-43.


Contact

Please drop me a line at lhdcsu@gmail.com if you have any questions.




libPLS: an Integrated Library for Partial Least Squares Regression and Discriminant Analysis. libPLS is under continuous development.