Consider linear regression with some regularization, e.g., find $x$ that minimizes $\|Ax - b\|^2 + \lambda\|x\|_1$.

Usually, the columns of $A$ are standardized to have zero mean and unit norm, while $b$ is centered to have zero mean. I want to check whether my understanding of the reasons for this standardizing and centering is correct.

By making the means of the columns of $A$ and of $b$ zero, we no longer need an intercept term; otherwise, the objective would have been $\|Ax - x_0\mathbf{1} - b\|^2 + \lambda\|x\|_1$, where $\mathbf{1}$ is the all-ones vector. By making the norms of the columns of $A$ equal to 1, we remove the possibility that a column of $A$ gets a small coefficient in $x$ merely because that column has a very large norm, which might lead us to conclude incorrectly that that column of $A$ doesn't "explain" $x$ well.
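To make the centering claim concrete, here is a minimal sketch using plain least squares with numpy (the data, coefficients, and noise level are made up for illustration): fitting raw data with an explicit intercept column recovers the same slopes as fitting the centered data with no intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3)) + 5.0    # columns with nonzero means
b = A @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

# Raw data, with an explicit intercept column of ones
A1 = np.column_stack([A, np.ones(len(A))])
coef_with_intercept = np.linalg.lstsq(A1, b, rcond=None)[0]

# Centered data, no intercept term at all
Ac = A - A.mean(axis=0)
bc = b - b.mean()
coef_centered = np.linalg.lstsq(Ac, bc, rcond=None)[0]

print(coef_with_intercept[:3])  # slopes on the raw data
print(coef_centered)            # same slopes; the intercept is gone
```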

This reasoning is not exactly rigorous, but intuitively, is that the right way to think about it?

1 Answer

You are correct about zeroing the means of the columns of $A$ and $b$.

However, as for adjusting the norms of the columns of $A$, consider what would happen if you started out with a column-normed $A$ and all the elements of $x$ were of roughly the same magnitude. Now multiply one column by, say, $10^{-6}$. The corresponding element of $x$ would, in an unregularized regression, be increased by a factor of $10^6$, since the fit has to compensate for the shrunken column to produce the same $Ax$. See what happens to the regularization term? It would, for all practical purposes, apply only to that one coefficient.
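A quick numerical illustration of that blow-up, using numpy only (the $10^{-6}$ factor, the fake data, and the true coefficients are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))
A /= np.linalg.norm(A, axis=0)      # unit-norm columns
b = A @ np.array([2.0, -1.5, 1.0]) + rng.normal(scale=0.05, size=100)

# Unregularized least squares on the normed A
x_before = np.linalg.lstsq(A, b, rcond=None)[0]

# Shrink one column by 1e-6 and refit
A_scaled = A.copy()
A_scaled[:, 0] *= 1e-6
x_after = np.linalg.lstsq(A_scaled, b, rcond=None)[0]

print(x_before)   # coefficients of roughly comparable magnitude
print(x_after)    # first coefficient is ~1e6 times larger
print(np.abs(x_after) / np.abs(x_after).sum())  # L1 mass is almost all on it
```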

By norming the columns of $A$ we put them, intuitively speaking, all on the same scale. Consequently, differences in the magnitudes of the elements of $x$ are directly related to the "wiggliness" of the explanatory function $Ax$, which is, loosely speaking, what the regularization tries to control. Without that normalization, a coefficient value of, e.g., 0.1 versus another of 10.0 would tell you, in the absence of knowledge about $A$, nothing about which coefficient was contributing the most to the "wiggliness" of $Ax$. (For a linear function like $Ax$, "wiggliness" is related to deviation from 0.)
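Putting the two preprocessing steps together, here is one possible recipe as a sketch (it uses scikit-learn's Lasso purely as a convenient $\ell_1$ solver; note that its objective scales the squared error by $1/(2n)$, so its alpha is not identical to the $\lambda$ above, and the value alpha=0.01 is arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
# Columns on wildly different scales
A = rng.normal(size=(100, 5)) * np.array([1.0, 10.0, 0.1, 1.0, 1.0])
b = A[:, 0] - 0.5 * A[:, 1] + rng.normal(scale=0.1, size=100)

# Center b; center and unit-norm the columns of A
Ac = A - A.mean(axis=0)
Ac /= np.linalg.norm(Ac, axis=0)
bc = b - b.mean()

# No intercept needed after centering
model = Lasso(alpha=0.01, fit_intercept=False).fit(Ac, bc)
print(model.coef_)   # coefficient magnitudes now comparable across columns
```

With the columns on a common scale, the $\ell_1$ penalty shrinks each coefficient on equal terms.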

To return to your explanation, if one column of $A$ has a very high norm, and for some reason gets a low coefficient in $x$, we would not conclude that the column of $A$ doesn't "explain" $x$ well. $A$ doesn't "explain" $x$ at all.

jbowman
  • Do you mean "$x$ does not 'explain' $A$ well", and "$x$ does not 'explain' $A$ at all"? $A$ is the data while $x$ is the model in this case. – user3813057 Jan 26 '18 at 03:12
  • @user3813057 - this was a question about regularization, and has nothing to do with explanatory power. $x$ would more usually be labeled $\beta$, $A$ would more usually be labeled $X$, and $b$ would be more usually labeled $y$. $x$ is not there to explain $A$ at all. – jbowman Jan 26 '18 at 23:37