Welcome to Cross Validated!
Approach 1
Have a look at Sections 7.10 and 7.11 of *The Elements of Statistical Learning*.
I think the basic idea is to calculate the uncertainty of the test results (e.g. RMSECV) for the different numbers of latent variables. That tells you which differences between the models are too small to be trusted as real differences.
Do not forget that choosing the number of latent variables from test results is a data-driven model optimization, so you need an outer validation loop (nested aka double cross validation) to measure the predictive performance of the model you obtain that way.
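For illustration, here is a minimal sketch of such an outer validation loop with scikit-learn. The data, fold counts, and range of latent variables are all placeholders I made up, not recommendations:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))                    # placeholder "spectra"
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=50)

# Inner loop: data-driven choice of the number of latent variables.
select_n_lv = GridSearchCV(
    PLSRegression(),
    param_grid={"n_components": range(1, 16)},
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)

# Outer loop: measures the performance of the *whole* selection procedure,
# not just of the finally chosen model.
outer_rmse = -cross_val_score(
    select_n_lv, X, y,
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
)
print(f"RMSE of the tuned model: {outer_rmse.mean():.3f} ± {outer_rmse.std():.3f}")
```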
I'd also suggest switching from LOO cross validation to iterated/repeated $k$-fold cross validation or some variant of out-of-bootstrap validation (see the book and the answers here on Cross Validated on that topic).
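Sticking with the toy data from the sketch above, the uncertainty per number of latent variables could be estimated e.g. with repeated $k$-fold cross validation (again just one possible illustration):

```python
from sklearn.model_selection import RepeatedKFold

rkf = RepeatedKFold(n_splits=5, n_repeats=20, random_state=0)
for n_lv in range(1, 11):
    rmse = -cross_val_score(
        PLSRegression(n_components=n_lv), X, y,
        scoring="neg_root_mean_squared_error", cv=rkf,
    )
    # Differences in mean RMSECV that are small compared to this spread
    # should not be trusted to be real.
    print(f"{n_lv:2d} LV: RMSECV = {rmse.mean():.3f} ± {rmse.std():.3f}")
```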
You can also directly bootstrap the RMSE = f(# latent variables) plot.
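A bare-bones version of that, reusing the toy data above and evaluating on the out-of-bootstrap samples (all numbers are again placeholders):

```python
n_boot, max_lv = 100, 10
curves = np.empty((n_boot, max_lv))
boot_rng = np.random.default_rng(1)

for b in range(n_boot):
    idx = boot_rng.integers(0, len(X), size=len(X))   # bootstrap resample
    oob = np.setdiff1d(np.arange(len(X)), idx)        # out-of-bootstrap rows
    for k in range(1, max_lv + 1):
        pred = PLSRegression(n_components=k).fit(X[idx], y[idx]).predict(X[oob])
        curves[b, k - 1] = np.sqrt(np.mean((y[oob] - pred.ravel()) ** 2))

# Each row of `curves` is one bootstrapped RMSE-vs-#LV curve; plotting all of
# them (or e.g. a 5th/95th percentile band) shows how stable the minimum is.
band = np.percentile(curves, [5, 95], axis=0)
```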
Approach 2
Here's a second approach that works very well for certain types of data: I work with spectroscopic data. Good spectra have high correlation between neighbouring measurement channels, so they look smooth in a parallel coordinate plot. For such data, I look at the X loadings. As with PCA loadings, the later PLS X loadings are usually noisier than the first ones, so I decide on the number of latent variables by looking at how noisy the loadings are. For the data I deal with, this usually leads to far fewer latent variables than RMSECV (at least without uncertainty calculations) suggests.
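In code, that inspection can be as simple as plotting the loadings and judging them by eye. A sketch, assuming spectra-like rows in X (with the random toy data above, all loadings will of course look noisy):

```python
import matplotlib.pyplot as plt

pls = PLSRegression(n_components=8).fit(X, y)   # deliberately generous

fig, axes = plt.subplots(8, 1, sharex=True, figsize=(6, 10))
for k, ax in enumerate(axes):
    ax.plot(pls.x_loadings_[:, k])              # k-th X loading vector
    ax.set_ylabel(f"LV {k + 1}")
axes[-1].set_xlabel("measurement channel")
plt.show()
# Keep the latent variables whose loadings still look like smooth,
# spectrum-like signal; cut off where they start to look like noise.
```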
Rule of Thumb
A rule of thumb I learned as a student, when I was first developing PLS models for industry: decide on a number of PLS latent variables the way you learnt in lectures (e.g. from the RMSE without uncertainty), then use 2 or 3 latent variables fewer than that number suggests.
In my experience, this rule of thumb worked not only for the UV/Vis data I had there, but also for other spectroscopic techniques.
Also, I find it very helpful to sit down and think about the application: what influencing factors do you expect, and how many components would those correspond to? Again, this is not applicable to all kinds of problems and applications, but where you can take this approach, it gives a reasonable starting point.
Edit: references for approach 2
I know of papers where we did it that way (for PCA rather than PLS, though), but IIRC we never showed the chosen loadings alongside some noisy loadings we didn't choose, and we did not really discuss the criterion in detail. However:
- Dochow, S.; Beleites, C.; Henkel, T.; Mayer, G.; Albert, J.; Clement, J.; Krafft, C. and Popp, J.: Quartz microfluidic chip for tumour cell identification by Raman spectroscopy in combination with optical traps. Anal Bioanal Chem, 2013, 405, 2743-2746
> [A] principal component analysis (PCA) model was calculated for the 21 background spectra and the first four principal components (without centring) were used to model these contributions. [...] Two further principal components did not have enough signal-to-noise ratio to warrant inclusion into the model.