
I have read three main reasons for standardising variables before fitting something such as Lasso regression:

1) Interpretability of coefficients.

2) Ability to rank the coefficient importance by the relative magnitude of post-shrinkage coefficient estimates.

3) No need for intercept.

But I am wondering about the most important point. Do we have reason to think that standardisation would improve the out-of-sample generalisation of the model? Also, I don't care if I don't need an intercept in my model; adding one doesn't hurt me.

Jase
  • Clarification: you seem to want to ask, "Provided that standardization is optional (one of the special cases where the results are not skewed by different magnitudes), then will standardization improve out-of-sample generalization?" Is this correct? – Drew75 Feb 13 '14 at 15:14
  • @Drew75 I prefer a breakdown of cases, e.g. does it help when the results are "skewed by different magnitudes", does it help when the results aren't skewed, et cetera; the best answer will cover different situations. – Jase Feb 13 '14 at 16:33
  • Then your question isn't about Lasso (because in general standardization is necessary before Lasso). It's more general. Perhaps change the title and the first sentence of the question. – Drew75 Feb 13 '14 at 18:25
  • @Drew: That's rather question-begging: Why's it necessary (when isn't it?)? What does it mean to skew the results (compared to what?)? I think the question's fine as it stands. – Scortchi - Reinstate Monica Feb 13 '14 at 20:46
  • @Drew75 My question is about Lasso. – Jase Feb 14 '14 at 03:31

3 Answers


Lasso regression puts a constraint on the size of the coefficients associated with each variable. However, the size of a coefficient depends on the magnitude, i.e. the scale, of the corresponding variable. It is therefore necessary to center and scale, or standardize, the variables.

One consequence of centering the variables is that there is no longer an intercept to estimate. This applies equally to ridge regression, by the way.
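For concreteness, here is a minimal sketch (scikit-learn; the simulated data, penalty value, and rescaling factor are made up for illustration) of how the penalty's effect depends on the units a predictor is measured in: rescaling one column changes which coefficients the lasso keeps, while standardizing first removes that dependence.

```
# Illustrative only: simulated data, arbitrary penalty value.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Express column 1 in "larger units" (e.g. metres -> kilometres): same information,
# but its coefficient would now have to be ~1000 times bigger to fit the same signal.
X_km = X.copy()
X_km[:, 1] /= 1000.0

alpha = 0.1
print(Lasso(alpha=alpha).fit(X, y).coef_)     # column 1 is kept (coefficient ~ 1.9)
print(Lasso(alpha=alpha).fit(X_km, y).coef_)  # column 1 is shrunk to exactly 0

# After standardizing, both versions give essentially the same coefficients again.
print(Lasso(alpha=alpha).fit(StandardScaler().fit_transform(X), y).coef_)
print(Lasso(alpha=alpha).fit(StandardScaler().fit_transform(X_km), y).coef_)
```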

Another good explanation is this post: Need for centering and standardizing data in regression

Drew75
  • This is either not an answer or an extremely indirect answer to my question. Please explain the link between your answer and out of sample generalisation (which was the question). – Jase Feb 13 '14 at 08:57
  • @Jase: It does address the *main* reason for standardization, which you omitted from your list: if you want to drop predictors with small coefficients (or otherwise use a penalty term depending on coefficient magnitude), you need to decide what counts as "small". Though standardization isn't mandatory before LASSO or other penalized regression methods, it's rarely the case that the original scales the predictors happen to be measured in are useful for this purpose. – Scortchi - Reinstate Monica Feb 13 '14 at 09:36
  • And the point about centring is that you don't usually want to drop or shrink the intercept. – Scortchi - Reinstate Monica Feb 13 '14 at 09:38
  • @Scortchi What do you mean by `drop predictors with small coefficient [estimates]`? Don't we just do a grid search on $\lambda$ to determine the optimal number of non-zero coefficients, and can't you drop the tiniest ones by simply pushing $\lambda$ up a bit? I have also yet to see how this relates to out-of-sample model generalization, which is what the question is about. – Jase Feb 13 '14 at 12:49
  • @Drew75 You saying "integral part of the estimation" doesn't actually mean much to me. I care about generalisation; do you have anything to say on this point (empirical studies or a theoretical result)? In addition, Tibshirani et al. say that you don't need to standardize in some circumstances even if the variables aren't $N(0,1)$, so clearly they don't agree with your view (read the glmnet notes). – Jase Feb 13 '14 at 12:52
  • @Jase: Yes, that's what I mean (assuming $\lambda$'s the shrinkage parameter). And whether a coefficient estimate's among the tiniest (however you choose $\lambda$) depends on whether it's measured in kilometres, micrometres, the number of standard deviations from its mean value in the sample, or some other unit. From a Bayesian viewpoint you're putting weakly informative priors over the true coefficient values, not uninformative ones. – Scortchi - Reinstate Monica Feb 13 '14 at 14:36
  • Very broadly, how much you shrink *overall* is going to affect generalization to random hold-out samples; the somewhat arbitrary decision how much to shrink *each* predictor relative to the others is going to affect generalization to new samples from similar populations, where the coefficients are a bit different, where the distribution of predictors isn't necessarily much like that in the training set, &c. (Of course your question deserves a more fully thought-out answer.) – Scortchi - Reinstate Monica Feb 13 '14 at 14:37
  • Are there any examples that **show that if you DON'T normalize, things go wrong**? – ArtificiallyIntelligence Oct 22 '18 at 17:32

The L1 penalty term is a sum of the absolute values of the coefficients. If the variables are all measured on different scales (in different units), then adding their coefficients together is not really meaningful, even though there is nothing mathematically wrong with it.

However, I don't see dummy/categorical variables suffering from this issue, and I think they need not be standardized; standardizing them may just reduce the interpretability of the variables.
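As a small illustration of keeping dummies on their original 0/1 scale, here is one way to do it with scikit-learn (the column names, simulated data, and penalty value below are made up): scale only the continuous columns and pass the dummies through unchanged.

```
# Illustrative only: made-up columns and an arbitrary penalty value.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "income":  rng.normal(50_000, 15_000, size=n),  # continuous, large scale
    "age":     rng.normal(40, 10, size=n),          # continuous, small scale
    "is_male": rng.integers(0, 2, size=n),          # 0/1 dummy, left as-is
})
y = 0.00005 * df["income"] + 0.1 * df["age"] + 0.5 * df["is_male"] + rng.normal(size=n)

pre = ColumnTransformer(
    [("scale", StandardScaler(), ["income", "age"])],
    remainder="passthrough",  # dummy columns pass through unscaled
)
model = Pipeline([("pre", pre), ("lasso", Lasso(alpha=0.05))]).fit(df, y)
# Coefficient order: scaled income, scaled age, then the untouched dummy.
print(model.named_steps["lasso"].coef_)
```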

Sumit Dhar

If by standardize you mean transforming all variables to z-scores (as is often the case), then you may want to consider that z-scoring a pre-scaled dataset can amplify noise. That is, variables with low variance may have their measurement noise amplified after z-scoring.
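A minimal simulation sketch of that point (NumPy; the signal and noise scales below are arbitrary): a variable whose observed variation is mostly measurement noise gets stretched to unit variance by z-scoring, so its noise ends up on the same scale as the genuinely informative predictors.

```
# Illustrative only: arbitrary signal and noise scales.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

signal_strong = rng.normal(0.0, 1.0, size=n)    # informative predictor, sd ~ 1
signal_weak   = rng.normal(0.0, 0.01, size=n)   # tiny true variation
noise         = rng.normal(0.0, 0.03, size=n)   # measurement noise of similar size
x_weak        = signal_weak + noise             # observed low-variance variable

def zscore(v):
    return (v - v.mean()) / v.std()

# Before z-scoring, x_weak barely varies, so it contributes little to any fit;
# afterwards its (mostly noise) variation is inflated to sd = 1, the same scale
# as signal_strong, even though roughly 90% of its variance is measurement error.
print(x_weak.std())          # ~0.03
print(zscore(x_weak).std())  # 1.0
```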