
Coming from a machine learning background, I have long held the view that you should throw in all the variables and let regularization and cross-validation fight against over-fitting.

The reason I am posting this is a recent study using principal component regression (PCR). Intuitively, lowering the PCR parameter (the % of variance to keep) roughly amounts to removing variables. So a natural approach is to throw in all the variables and cross-validate over the PCR parameter.
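
A minimal sketch of what I mean, assuming scikit-learn (the data are a simulated stand-in for my actual study, and the grid of variance fractions is illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for the real data set.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

# PCR = standardize, project onto principal components, then ordinary least squares.
pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(svd_solver="full")),   # n_components is set by the grid below
    ("ols", LinearRegression()),
])

# Cross-validate the single PCR parameter: the fraction of variance to keep.
grid = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]},
    scoring="neg_mean_squared_error",
    cv=10,
)
grid.fit(X, y)
print(grid.best_params_, -grid.best_score_)
```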

However, this approach proved sub-optimal: a later experiment showed that removing certain variables beforehand improved prediction across the board and made it more stable across the PCR parameter (shifting the learning curve up). This phenomenon still baffles me today.
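
In rough outline, the later experiment was a comparison like the following (continuing the sketch above; the retained-column indices are placeholders, not the variables I actually dropped):

```python
from sklearn.model_selection import cross_val_score

keep = list(range(30))                     # hypothetical subset of retained columns
variance_grid = [0.5, 0.7, 0.8, 0.9, 0.95, 0.99]

# Same PCR pipeline, evaluated across the whole grid, with and without the
# hand-picked subset of variables.
for frac in variance_grid:
    pcr.set_params(pca__n_components=frac)
    full = -cross_val_score(pcr, X, y, cv=10,
                            scoring="neg_mean_squared_error").mean()
    reduced = -cross_val_score(pcr, X[:, keep], y, cv=10,
                               scoring="neg_mean_squared_error").mean()
    print(f"variance kept {frac:.2f}: CV MSE all={full:.1f}, subset={reduced:.1f}")
```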

Can anyone please comment on this, from a theoretical and applied perspective? In general, when you only care about prediction, would you consider variable selection?

  • An incredible amount of information is available on that topic on this site. Very short answer: removing variables in a way that uses supervised learning (i.e., that uses associations with $Y$) is not a very good idea. Increased accuracy is usually a mirage and will not withstand rigorous bootstrap validation. Consider sparse principal components and many other good approaches, especially penalized maximum likelihood estimation. – Frank Harrell Jun 29 '17 at 21:47
  • @FrankHarrell I guess I am more concerned with why removing variables that are not very correlated with $Y$ improves prediction even after thorough cross-validation. – denizen of the north Jun 29 '17 at 21:52
  • One example in a common application--multiple regression--is described at https://stats.stackexchange.com/a/34813/919. My analysis of how a regression could be significant, yet individual variables be not, concluded that one possible reason is that adding additional variables (not correlated with any of the explanatory or response variables) just puts "noise" into the data which "masks" the significance of the explanatory variables that really matter. In effect, the more variables you throw in, the likelier it is that you will find "significant" relationships by chance. – whuber Jun 29 '17 at 22:04
  • Incidentally, experience shows that regularization and cross validation are not a panacea: you still have to think hard about which variables to offer to these procedures and you will find that your choice of variables to work with can profoundly affect which variables ultimately work their way into the model. "Parsimony is valuable." – whuber Jun 29 '17 at 22:07
  • For some reason, I can't stop thinking about the digit recognition problem using images consisting of 400 (20*20) pixels. It is impossible that all pixels are relevant. However, a neural network trained on all pixels generates good results. There is no variable selection in this application, but the prediction results are still quite good. Would you perform variable selection in this case? – denizen of the north Jun 29 '17 at 22:15
  • Look at it in reverse: for training the model, would you embed the images within larger backgrounds (chosen in any fashion) on the theory that the increase in the number of pixels couldn't hurt? – whuber Jun 29 '17 at 22:21
  • @whuber You are right. Summing this up in statistical learning theory terms, including irrelevant variables increases generalization error, increasing the chance of over-fitting. Does it sound about right? – denizen of the north Jun 29 '17 at 22:27
  • Yes, that's a nice summary. – whuber Jun 29 '17 at 22:30
  • @whuber Well, isn't regularization designed solely for that purpose? – denizen of the north Jun 29 '17 at 22:34
  • Not so sure. That would be true if you knew a priori which variables are 'irrelevant'. Using associations with $Y$ to tell you which variables to exclude is akin to Maxwell's demon. It requires stealing information from the system to find out which features to use - information that might better be used for predictive accuracy. This is described in detail in my RMS course notes - see the link at http://www.fharrell.com/p/blog-page.html and refer to section 4.3.1. – Frank Harrell Jun 29 '17 at 22:35
  • I don't know PCR too well, but it is not the same as ridge regression: with ridge regression you penalise weights/inputs that have too small an effect on the error. PCR is not selecting variables, so the point is that PCR may be removing the wrong variables. So I believe the conflict you are having is that PCR ought to be doing better variable selection, whereas ridge regression/lasso are the 'better' methods. – seanv507 Jun 29 '17 at 22:36
  • The image recognition example is not so straightforward. AFAIK, most NNs use a ridge-regression-like penaliser (weight decay); in addition, you initialise the weights around zero and stop training when the validation error increases (so this stopped training is also regularisation). Furthermore, the pixels are correlated, so rather than needing variable selection to remove spurious pixels, they are all telling you the same thing... This is where ridge regression (or PCR) helps, by avoiding large, equal-and-opposite weights. Similarly for stopped training (since you stop before the weights get too big). – seanv507 Jun 29 '17 at 22:53
  • @seanv507 Great explanation. So it sounds like regularization and variable selection are solving the same problem: overfitting. – denizen of the north Jun 29 '17 at 23:19
  • No!! Variable selection is often a *cause* of overfitting. – Frank Harrell Jul 04 '17 at 14:06
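
To make the point in the comments above concrete, here is a small simulation sketch (scikit-learn assumed, names illustrative): with pure-noise data, screening variables by their association with $y$ on the full data and then cross-validating only the final model looks "predictive", while refitting the screening inside each CV fold does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n, p, k = 50, 1000, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)                 # y is independent of X: no real signal

# Leaky: screen on the full data, then cross-validate only the final model.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
top = np.argsort(corr)[-k:]
leaky = cross_val_score(LinearRegression(), X[:, top], y, cv=10, scoring="r2")

# Honest: the screening is refit inside every CV fold.
honest = cross_val_score(
    Pipeline([("screen", SelectKBest(f_regression, k=k)),
              ("ols", LinearRegression())]),
    X, y, cv=10, scoring="r2")

print("R^2, selection outside CV:", leaky.mean())   # typically clearly positive
print("R^2, selection inside CV :", honest.mean())  # typically near zero or negative
```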

0 Answers