I was just studying partial least squares regression, and I read that it is "not scale invariant". What does "scale invariant" mean, and why would partial least squares, or a regression methodology in general, not be "scale invariant"? Some examples to better illustrate exactly what this means would be greatly appreciated.
-
What is the origin of this "not scale invariant" claim? It doesn't appear in your Wikipedia reference. – whuber Sep 02 '20 at 16:02
-
@whuber Oops, I began doing research and forgot to add the original reference. It's from chapter **3.5.2 Partial Least Squares** of ESL, 2nd edition: https://web.stanford.edu/~hastie/ElemStatLearn/ – The Pointer Sep 02 '20 at 16:05
-
That's very helpful, because the preceding clause, "like principal components regression," tells us the authors mean "not scale invariant" in the same sense as PCA. The results of PCA differ when variables are individually rescaled, such as when converting from covariance to correlation. See https://stats.stackexchange.com/questions/53 for a discussion here on CV. Because PLS is based (conceptually) on a form of PCA applied to all the variables (response and explanatory), the issues with PCA apply *mutatis mutandis* to PLS. – whuber Sep 02 '20 at 16:10
-
@whuber Your explanation went over my head. Any chance you could post an answer that builds from a more basic level? – The Pointer Sep 02 '20 at 16:12
-
"Individually rescaled" means choosing different units of measurement for any of the columns, that's all. When the result of your analysis depends on something arbitrary like whether you express a temperature in degrees C or degrees F, you have to watch out! – whuber Sep 02 '20 at 17:10
-
@whuber What do you mean by "choosing different units of measurement for any of the columns"? Why does it matter whether we express a variable in, for instance, degrees celsius or degrees Fahrenheit? And I'm not sure that any of this actually explains what "scale invariant" means. – The Pointer Sep 02 '20 at 17:18
-
"Scale invariant" means the solution doesn't depend on the units of measurement you choose. Thus, units of measurement don't matter *except* with non-scale invariant procedures! – whuber Sep 02 '20 at 17:19
-
@whuber Ahh, ok, I see what you're saying. Hmm, that's weird, because one would assume that the units we use for variables are irrelevant to the statistical analysis; I mean, naively speaking, it just isn't obvious why this would matter. Why would some method of statistical analysis depend on the units we use (that is, why would some method of statistical analysis *not* be scale invariant)? According to this, it seems that PCA, PCR, and PLS are all scale invariant, right? And so our statistical analysis would produce different results depending on, say, whether a variable is in degrees C or degrees F? – The Pointer Sep 02 '20 at 17:23
-
The above comment should say "... PCA, PCR, and PLS are all **not** scale invariant ..." – The Pointer Sep 02 '20 at 17:35
-
The degrees C versus degrees F is a good example. I tend to use distances in "miles" versus "millimeters" for alliteration, but the point is the same. Why don't you write up an answer to your question now that you understand the principle? – EdM Sep 02 '20 at 17:40
-
@EdM I'm not sure that I have an "understanding" of the principle, per se; hence my further question in the comment above. I think that there are others who would be able to post a much more informative answer than me. – The Pointer Sep 02 '20 at 17:43
-
@whuber I was under the impression that scale invariant usually means invariant with respect to a dilation (a proper linear mapping, like $f(x) = kx$ for some constant $k$), such as the unit conversion from miles to millimeters that EdM suggested. The example of converting C to F is not a dilation, because it is an affine linear mapping like $f(x) = kx + b$ instead of a proper linear mapping. Invariance under affine linear mappings would imply both scale and shift invariance. – Eric Perkerson Sep 02 '20 at 20:55
-
@eric That is correct. I was using temperature only in order to clarify a mathematical concept that I clearly and correctly articulated earlier but which had not been understood. – whuber Sep 03 '20 at 14:11
-
@ericperkerson Feel free to post an answer. – The Pointer Sep 03 '20 at 14:43
3 Answers
Scale invariance means that rescaling any or all of the columns will not change the results - that is, multiplying or dividing all the values from any variable will not affect the model predictions (ref). As @ericperkerson mentioned, rescaling in this manner is known as dilation (ref). Scale invariance for metrics about contingency tables refers to rescaling rows as well as columns, though I don't believe it applies here (see the scaling property section here).
As to why PLSR is not scale invariant, I'm not completely certain, but I'll leave notes on what I've learned and possibly a better mathematician can clarify. Generally, regression with no regularisation (e.g. OLS) is scale invariant, and regularised regression (e.g. ridge regression) is not scale invariant, because the minimisers of the function change (ref).
Now, I can't see an explicit penalty term in PLSR, but it's constrained in a similar way to PCA. PCA chooses the axes of maximal variance - so if you rescale a variable, the variance relative to other variables can change (ref). PLSR tries to find the 'multidimensional direction in the X space that explains the maximum multidimensional variance direction in the Y space', hence rescaling an input can change the direction of maximum variance (ref).
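The OLS-versus-regularised contrast above can be checked numerically. Here is a minimal sketch (the synthetic data, the penalty value, and the closed-form solvers are my own illustration, not taken from the linked reference): rescaling one column leaves OLS predictions unchanged, but changes ridge predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

def ols(X, y):
    # Ordinary least squares via the normal equations
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam=10.0):
    # Ridge regression: the penalty lam * I is what breaks scale invariance
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Rescale the first column by k (think miles -> some smaller unit)
k = 1000.0
Xs = X.copy()
Xs[:, 0] *= k

# OLS: the coefficient absorbs the rescaling, so predictions are identical
print(np.allclose(X @ ols(X, y), Xs @ ols(Xs, y)))      # True
# Ridge: the penalty treats the rescaled column differently, predictions change
print(np.allclose(X @ ridge(X, y), Xs @ ridge(Xs, y)))  # False
```

The key point is visible in the algebra too: for OLS the fitted values are unchanged because the coefficient on the rescaled column simply shrinks by $1/k$, whereas the ridge penalty $\lambda \lVert \beta \rVert^2$ is not indifferent to that trade.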

-
Why would the minimisers of the function changing imply not being scale invariant? – The Pointer Sep 06 '20 at 23:29
-
To put it in univariate OLS terms, if the minimum sum of squares changes, the slope of the line found by the model will be different - so the result changes based on scaling. In OLS scaling won't affect the minimum sum of squares or the slope, but in regularised least squares it will. – Elenchus Sep 07 '20 at 00:45
-
"To put it in univariate OLS terms, if the minimum sum of squares changes, the slope of the line found by the model will be different - so the result changes based on scaling." Why does the minimum sum of squares changing mean that the slope of the line found by the model will be different? – The Pointer Sep 07 '20 at 00:50
-
That's how OLS works - it finds the line (or the hyperplane, in multivariate cases) with the minimum sum of squared distances to all the points in the data. From [wikipedia](https://en.wikipedia.org/wiki/Ordinary_least_squares), "the smaller the differences, the better the model fits the data" – Elenchus Sep 07 '20 at 00:59
-
Oh, yes, I see what you mean. But it isn't totally clear to me how scale invariance plays into this? – The Pointer Sep 07 '20 at 01:05
-
So looking at [the reference](https://roamanalytics.com/2016/11/17/translation-and-scaling-invariance-in-regression-models/), in regularised regression the minimisation is not just looking at the line, but the line and the penalty term. The penalty term is affected by scaling, so therefore the result of the minimisation function changes - and the coefficient of the line can change with that. That is, the optimum line constrained by the penalty term might be different when the penalty term can change based on scaling. Does that clear things up? – Elenchus Sep 07 '20 at 01:30
-
Yes, that makes sense! Thanks for the clarification. ericperkerson's introduction of the equations $f(x) = kx$ and $f(x) = kx + b$ really helps make everything more concrete. – The Pointer Sep 07 '20 at 01:59
Start with the technical meanings of "location" and "scale" with respect to a one-dimensional probability distribution. The NIST handbook says:
A probability distribution is characterized by location and scale parameters ... a location parameter simply shifts the graph left or right on the horizontal axis ... The effect of the scale parameter [with a value greater than 1] is to stretch out the graph ... The standard form of any distribution is the form that has location parameter zero and scale parameter one.
Think of a data sample as a collection of empirical probability distributions for each of the predictors and outcomes. For the example in a comment, temperatures expressed either as degrees F or degrees C, there is a transformation with respect to both location and scale. Transformation from degrees C to degrees F changes the numerical values of degrees by a factor of $\frac {9}{5}$ (along with a subsequent location change of 32 degrees F). The variance of temperature values thus also changes by a factor of $\frac{81}{25}$. By "stretching out the graph," a transformation of the scale of a predictor changes the numerical values for the predictor and for its variance. Nevertheless, the underlying physical reality is the same.
With standard multiple regression, a change in the units of a predictor can be counterbalanced by a corresponding change in the units of the regression coefficients. If temperature in degrees C is a predictor in a model and you switch from degrees C to degrees F then (along with altering the intercept appropriately) you multiply the regression coefficient for temperature by a factor of $\frac{5}{9}$ and the model is the same. In that sense, the modeling process is "scale invariant." Similarly, correlation coefficients are scale invariant as the calculation corrects for the scales of the variables.
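That counterbalancing can be verified directly. In this sketch (made-up data; the true slope and noise level are illustrative), the same temperatures are regressed on once in degrees C and once in degrees F: the slope changes by exactly the factor $\frac{5}{9}$ and the fitted values are identical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
temp_c = rng.uniform(-10, 35, size=n)            # temperatures in degrees C
y = 1.5 + 0.8 * temp_c + rng.normal(scale=0.3, size=n)

temp_f = temp_c * 9 / 5 + 32                     # the same temperatures in degrees F

def fit(x, y):
    # OLS with an intercept column
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_c = fit(temp_c, y)
b_f = fit(temp_f, y)

# The degrees-F slope is 5/9 of the degrees-C slope ...
print(np.isclose(b_f[1], b_c[1] * 5 / 9))                               # True
# ... and the fitted values agree exactly: it is the same model
print(np.allclose(b_c[0] + b_c[1] * temp_c, b_f[0] + b_f[1] * temp_f))  # True
```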
Regression modeling processes that differentially penalize predictors, in contrast, fundamentally depend on comparisons among the numerical values of the various predictors. That includes approaches like LASSO, ridge regression, principal components regression (PCR), and partial least squares (PLS). Say that both temperature and distance are predictors in a penalized model. In building the model you need to have a way to decide whether temperature or distance is relatively more important to weight in the model, yet all you have to work with is their numerical values. Those numerical comparisons between the temperature and distance predictor values will differ depending on whether temperature is expressed in degrees F or C, and on whether distances are expressed in miles or in millimeters. Such a modeling process is not scale invariant.
With respect to PCR and PLS, you can see this in the problems that they solve at each step, as expressed on page 81 of ESL, second edition:
... partial least squares seeks directions that have high variance [of predictors] and have high correlation with the response, in contrast to principal components regression which keys only on high variance... In particular, the $m$th principal component direction $v_m$ solves: $$ \operatorname{max}_\alpha \operatorname{Var}(\mathbf{X} \alpha) $$ $$ \text{subject to } \lVert \alpha \rVert =1,\: \alpha^T \mathbf{S} v_{\ell} =0, \: \ell =1,\dots,m−1,$$ where $\mathbf{S}$ is the sample covariance matrix of the [vectors of predictor values, indexed by $j$ for predictors] $\mathbf{x}_j$. The conditions $ \alpha^T \mathbf{S} v_{\ell} =0$ ensure that $\mathbf{z}_m = \mathbf{X} \alpha$ is uncorrelated with all the previous linear combinations $\mathbf{z}_{\ell} = \mathbf{X} v_{\ell}$. The $m$th PLS direction $\hat{\varphi}_m$ solves: $$\operatorname{max}_{\alpha} \operatorname{Corr}^2(\mathbf{y},\mathbf{X}\alpha)\operatorname{Var}(\mathbf{X} \alpha) $$ $$\text{subject to } \lVert \alpha \rVert =1,\: \alpha^T \mathbf{S} \hat{\varphi}_{\ell} =0,\: \ell=1,\dots,m−1.$$
Here, the unit-norm vector $\alpha$ is the relative weighting of the predictors that will be added to the model at that step. $\operatorname{Var}(\mathbf{X} \alpha)$ is the variance among the observations of that weighted sum of predictor values. If the scales of the predictor values are transformed, that variance and thus the model itself is fundamentally transformed in a way that can't be undone by a simple change of units of the regression coefficients. So these are not scale-invariant modeling procedures.
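A small numerical illustration of how that variance-maximising direction moves when one predictor is rescaled (synthetic data; this uses plain PCA, the first of the two problems quoted above, as a stand-in for the full PCA/PLS machinery): with two correlated, equal-variance predictors the first direction weights both roughly equally, but after multiplying one column by 100 it points almost entirely along the rescaled variable.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
# Two correlated predictors with (roughly) equal variance
z = rng.normal(size=n)
X = np.column_stack([z + 0.5 * rng.normal(size=n),
                     z + 0.5 * rng.normal(size=n)])

def first_pc(X):
    # First principal direction via SVD of the centered data
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    v = vt[0]
    return v / np.sign(v[0])   # fix the sign so directions are comparable

v_before = first_pc(X)
Xs = X.copy()
Xs[:, 0] *= 100.0              # rescale the first predictor only
v_after = first_pc(Xs)

print(np.round(v_before, 2))   # roughly [0.71, 0.71]: both variables weighted equally
print(np.round(v_after, 2))    # roughly [1.00, 0.01]: the rescaled variable dominates
```

No change of units on the coefficients can undo this: the *direction* itself, i.e. the relative weighting $\alpha$, has changed.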
The usual procedure to maintain equivalence among continuous-valued predictors for such modeling approaches is to transform them to zero mean and unit standard deviation before anything that requires comparisons among predictors. Categorical predictors require some thought in terms of how to put them into "equivalent" scales with respect to each other or to continuous predictors, particularly if there are more than 2 categories. See this page and its links for some discussion.
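That standardisation step can be sketched as follows (illustrative data and penalty value; ridge regression stands in for any of the penalised methods named above): once each column is reduced to zero mean and unit standard deviation, the fit no longer depends on the original units.

```python
import numpy as np

def standardize(X):
    # Transform each column to zero mean and unit standard deviation
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 2))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.2, size=n)

def ridge_fit_predict(X, y, lam=5.0):
    Z = standardize(X)
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
    return Z @ beta

Xs = X.copy()
Xs[:, 0] *= 1000.0   # change of units in the first predictor

# After standardization the penalised fit is unaffected by the units chosen
print(np.allclose(ridge_fit_predict(X, y), ridge_fit_predict(Xs, y)))  # True
```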

-
From *Introduction to Probability*, second edition, by Blitzstein and Hwang: "**Definition 5.2.5** (Location-scale transformation). Let $X$ be a random variable and $Y = \sigma X + \mu$, where $\sigma$ and $\mu$ are constants with $\sigma > 0$. Then we say that $Y$ has been obtained as a location-scale transformation of $X$. Here $\mu$ controls how the location is changed and $\sigma$ controls how the scale has changed." – The Pointer Sep 06 '20 at 19:47
-
I don't really find this answer informative. There's a lot of words, but, after reading it all, it still isn't clear what the concrete answer to my question is. I think ericperkerson's short answer was highly informative: – The Pointer Sep 06 '20 at 22:42
-
"I was under the impression that scale invariant usually means invariant with respect to a dilation (a proper linear mapping, like $f(x) = kx$ for some constant $k$), such as the unit conversion from miles to millimeters that EdM suggested. The example of converting C to F is not a dilation, because it is an affine linear mapping like $f(x) = kx + b$ instead of a proper linear mapping. Invariance under affine linear mappings would imply both scale and shift invariance." – The Pointer Sep 06 '20 at 22:42
I think the comment by user "ericperkerson" was short and highly informative:
I was under the impression that scale invariant usually means invariant with respect to a dilation (a proper linear mapping, like $f(x) = kx$ for some constant $k$), such as the unit conversion from miles to millimeters that EdM suggested. The example of converting C to F is not a dilation, because it is an affine linear mapping like $f(x) = kx + b$ instead of a proper linear mapping. Invariance under affine linear mappings would imply both scale and shift invariance.
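The dilation-versus-affine distinction in that comment can be made concrete with a quick check (synthetic data of my own): variance is invariant under a pure shift $f(x) = x + b$ but not under a dilation $f(x) = kx$, while correlation is invariant under both, i.e. under any affine map with positive slope.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(20.0, 5.0, size=300)

shifted = x + 32.0          # pure shift: f(x) = x + b
dilated = x * 1000.0        # pure dilation: f(x) = k * x

# Variance is shift invariant but not scale invariant
print(np.isclose(np.var(shifted), np.var(x)))            # True
print(np.isclose(np.var(dilated), np.var(x)))            # False

# Correlation with another variable is invariant under both kinds of map
y = 0.4 * x + rng.normal(size=300)
r = np.corrcoef(x, y)[0, 1]
print(np.isclose(np.corrcoef(shifted, y)[0, 1], r))      # True
print(np.isclose(np.corrcoef(dilated, y)[0, 1], r))      # True
```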
