
I'm doing a series of PLS analyses to test the contribution of a set of environmental variables to the invertebrate community of a river.

  • I am introducing 23 environmental variables into the model (2 of them are dummy variables)
  • The invertebrate community enters the PLS analysis as a biological metric (e.g. a diversity index)

My doubt is whether it is necessary to do an a priori selection of the variables before running the PLS analysis. I have read that PLS is unaffected by collinearity in the data, but couldn't multicollinear variables or dummy variables still affect the robustness of the analysis?

As you see, I am not an expert on statistics! So, should I do this variable selection by, for example, performing an RDA constraining the biological community (species) to the environmental variables?

amoeba

1 Answer


That depends, not only for PLS but also for almost any chemometric/machine learning method.

There is always a risk that a model gives a high "weight" to a variable which actually has no correlation with the responses. In other words, what you are measuring may have nothing to do with what you are observing, yet the two may behave together by chance.

Thus, more samples mean less risk. Looking at the correlation of each variable with the responses may help in some cases, yet I often find it useless. Moreover, which combination of variables yields the best model is a hard question to answer exactly, since the number of candidate models (one for each combination of variables) grows VERY fast as the number of features increases: with your 23 variables there are already 2^23 ≈ 8.4 million possible subsets.

To get the intuition, you may try introducing a few new variables into your data set that are completely random. Interestingly, PLS will probably assign some weight to them as well.
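As a minimal sketch of this point (my own illustration in Python with scikit-learn, not anything from the question's data; all sizes, names and coefficients below are made up): half of the predictors are pure noise, yet the fitted PLS model still gives them non-zero coefficients.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(0)
    n_samples = 40

    # 5 "real" predictors that actually drive the response
    X_real = rng.normal(size=(n_samples, 5))
    y = X_real @ np.array([1.0, -0.5, 0.8, 0.0, 0.3]) + rng.normal(scale=0.2, size=n_samples)

    # 5 completely random predictors with no relation to y
    X_noise = rng.normal(size=(n_samples, 5))
    X = np.hstack([X_real, X_noise])

    pls = PLSRegression(n_components=2).fit(X, y)

    # the noise predictors typically receive non-zero coefficients too
    print(pls.coef_.reshape(-1).round(3))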

So yes, prior to PLS you may try feature selection. You may also use, for instance, a genetic algorithm (GA) that employs PLS itself during the feature selection; a rough sketch follows below. It all comes down to your needs, your data, and your understanding of the data.
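To make the idea concrete, here is a rough sketch of such a GA-based selection (my own illustration under simplifying assumptions, not a reference implementation; `cv_rmse` and `ga_select` are hypothetical helper names, the population size, mutation rate and so on are arbitrary, and a recent scikit-learn is assumed for the `neg_root_mean_squared_error` scorer). The fitness of a feature subset is taken as 1/RMSE from cross-validated PLS, as discussed in the comments below.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import cross_val_score

    def cv_rmse(X, y, mask, n_components=2, cv=5):
        """CV root-mean-square error of a PLS model restricted to the selected features."""
        n_comp = min(n_components, int(mask.sum()))
        scores = cross_val_score(PLSRegression(n_components=n_comp),
                                 X[:, mask], y, cv=cv,
                                 scoring="neg_root_mean_squared_error")
        return -scores.mean()

    def ga_select(X, y, pop_size=20, n_generations=30, mutation_rate=0.05, seed=0):
        rng = np.random.default_rng(seed)
        n_features = X.shape[1]
        # start from random feature subsets ("genes")
        population = rng.random((pop_size, n_features)) < 0.5
        for _ in range(n_generations):
            fitness = np.array([1.0 / cv_rmse(X, y, ind) for ind in population])
            # parents are drawn with probability proportional to fitness
            # (biased towards fitter sets, but not fully deterministic)
            probs = fitness / fitness.sum()
            parents = rng.choice(pop_size, size=(pop_size, 2), p=probs)
            children = []
            for a, b in parents:
                # single-point cross-over changes only a portion of each "gene"
                cut = rng.integers(1, n_features)
                child = np.concatenate([population[a, :cut], population[b, cut:]])
                # occasional mutation flips individual features
                child = child ^ (rng.random(n_features) < mutation_rate)
                if not child.any():                 # keep at least one feature selected
                    child[rng.integers(n_features)] = True
                children.append(child)
            population = np.array(children)
        # return the fittest mask from the final generation
        return min(population, key=lambda ind: cv_rmse(X, y, ind))

The mask returned this way still has to be checked on data that was never touched during the search; see the Edit below.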

Edit: In the light of the discussion in the comments I would like to give further information. In my opinion, feature selection is a tricky business in general.

Removing some variables based on prior knowledge can be useful. If, for example, one variable mostly contains measurement noise, removing it might improve your model.

Use of feature selection algorithms before PLS can be useful too. The problem is that cross-validation errors and autoprediction errors can be deceiving. I have seen many cases where, as a feature selection algorithm selects fewer and fewer variables, the CV errors decrease along with the autoprediction errors, while in reality the models were overfitting most of the time.

For me, the way to go is to compare autoprediction errors, cross-validation errors and independent validation set predictions among the models. If the algorithm, such as the genetic algorithm mentioned in the comments, does not allow the calculation of comparable CV errors, then experimenting on that type of data is necessary before blindly relying on it.
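A minimal sketch of that comparison (again my own illustration; `select_features` stands for whatever selection procedure is used, e.g. the GA sketched above, and the split proportions are arbitrary). The important details are that the selection is repeated inside every CV fold and that the independent test set is never touched during selection or fitting.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import KFold, train_test_split
    from sklearn.metrics import mean_squared_error

    def rmse(y_true, y_pred):
        return float(np.sqrt(mean_squared_error(y_true, y_pred)))

    def compare_errors(X, y, select_features, n_components=2, seed=0):
        # keep an independent validation set aside from the very beginning
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=seed)

        # 1) autoprediction error: select, fit and predict on the same data
        mask = select_features(X_train, y_train)
        model = PLSRegression(n_components=n_components).fit(X_train[:, mask], y_train)
        auto_err = rmse(y_train, model.predict(X_train[:, mask]).ravel())

        # 2) cross-validation error, with the selection repeated inside every fold
        cv_pred = np.empty(len(y_train))
        for tr, te in KFold(n_splits=5, shuffle=True, random_state=seed).split(X_train):
            m = select_features(X_train[tr], y_train[tr])
            fold_model = PLSRegression(n_components=n_components).fit(X_train[tr][:, m], y_train[tr])
            cv_pred[te] = fold_model.predict(X_train[te][:, m]).ravel()
        cv_err = rmse(y_train, cv_pred)

        # 3) independent validation error on data never used for selection or fitting
        test_err = rmse(y_test, model.predict(X_test[:, mask]).ravel())
        return auto_err, cv_err, test_err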

gunakkoc
  • When you say "feature selection", do you mean feature selection based on X alone or based on the relationships (e.g. strength of correlation) between X and Y? If the latter, then it's usually considered to be a bad idea (http://stats.stackexchange.com/questions/20836)... – amoeba Jan 26 '17 at 09:34
  • I mean both. If one knows that a feature is not reliable (noise at a particular wavelength in a spectrum, for example), removing it might help, and I personally would give it a try. Feature selection algorithms like GA rely on both X and Y. Should I make it more clear in the answer? – gunakkoc Jan 26 '17 at 09:38
  • It's just that I would be suspicious of any feature-selection that is based on both X and Y (see the link I added to my previous comment). At least one should *always* put feature selection *inside* cross-validation; then I guess it's okay, and "feature-selection-strength" can be considered one of the hyperparameters to optimize. – amoeba Jan 26 '17 at 09:40
  • But I have no idea how it usually works for GA. Would be interested to know. – amoeba Jan 26 '17 at 09:41
  • While there are many variations of GA, all the ones I know use CV. It basically starts with many random sets of features, then shuffles and mixes them, making the sets with a lower RMS error from CV more likely to be chosen for the new sets. It tries to mimic Darwinian evolution, where the fitness is 1/RMSE, a gene is a set of features, and the shuffling part is similar to cross-over in meiosis. Each iteration on the original genes slowly increases their fitness, and the less fit ones are removed. Of course this is a small summary, which I hope explains the intuition as well as the existence of CV. – gunakkoc Jan 26 '17 at 09:53
  • But this means that GA is not "inside" CV? I mean, how can we estimate the out-of-sample performance of this whole procedure? – amoeba Jan 26 '17 at 10:00
  • There is an RMSE calculated via CV for the features selected by GA, but there is no CV for the GA itself; the CV is only for the variables selected BY the GA. I mean, it is possible, but removing n samples, running the GA and then looping is something I have never encountered. It doesn't sound feasible to me, but you have triggered my curiosity; I will give it a try in my spare time. – gunakkoc Jan 26 '17 at 10:12
  • To me, using GA on top of CV (i.e. using CV results to tinker with the model, re-run CV, tinker with the model again, etc.) sounds like a recipe for disaster. One can badly overfit. – amoeba Jan 26 '17 at 10:13
  • Perfect point! To me, what separates GA from the other methods of the type you mentioned is the embedded limitations. Limiting the max and min number of variables, the cross-over step that changes only a portion of the features of a "gene", and choosing the pairs of genes for cross-over in a biased but not totally systematic way prevent those problems. In fact, for spectral data, I usually achieve better results with GA combined with OLS than with any other method alone. – gunakkoc Jan 26 '17 at 10:23
  • That's interesting. But how do you know if the results are "better" or not if you do not have an outside CV loop? Do you use a test set that is untouched during this GA model fitting procedure? – amoeba Jan 26 '17 at 10:25
  • Exactly. If both the autopredictions and the independent data set predictions are good, then I use that model. Otherwise I move on to traditional methods. I don't repeat GA until I obtain good validation set predictions, of course. – gunakkoc Jan 26 '17 at 10:35
  • I see. I guess you could transfer some parts of this discussion into your answer... I will upvote (+1) in advance. – amoeba Jan 26 '17 at 10:37
  • Btw, I see your point, and it was bugging me for a while too: relying on autoprediction errors (which do not mean much) and validation errors (which may end up good just by chance), while lacking CV errors for comparison, felt less safe to me as well. As I observed consistently better models from GA, however, I started to use it more frequently. Finally, I will surely transfer this to the answer as much as I can. – gunakkoc Jan 26 '17 at 10:43
  • If you make the feature selection (even GA) a part of the fitting procedure and run CV on top of that, there is no way in which the CV errors could be deceiving. –  Mar 29 '18 at 07:07