
Say I am running a regression

$Y \sim X_1 + X_2$, where $X_1$ and $X_2$ have drastically different ranges. Empirically, will this cause any issues with my prediction of $Y$? If the answer is yes, how do I handle it?

CuriousMind
  • Please check the following thread on [When should you center your data & when should you standardize?](https://stats.stackexchange.com/questions/29781/when-should-you-center-your-data-when-should-you-standardize/29783). To answer your question, though: yes, it can cause issues, especially numerical ones, as the condition number of your design matrix can be very large (i.e., it can make your system of equations unstable and lead to spurious results); a sketch illustrating this follows these comments. – usεr11852 Mar 03 '15 at 21:22
  • @usεr11852 Any decent stats software standardizes variables internally, thereby avoiding the problem. Your warning certainly would apply to home-grown solutions. – whuber Mar 03 '15 at 22:01
  • @whuber: I think your comment is too optimistic. Please see my answer below. I have seen you using Mathematica for some graphs, software I do not have access to and which, to my knowledge, is not fully or even partially open-source like R or MATLAB, so I cannot check the code. Nevertheless, I will be surprised if they do not use Intel's MKL to do a raw QR decomposition for their LM fitting. – usεr11852 Mar 04 '15 at 01:48
  • Your title talks about dependent variables, but $Y$ is the dependent variable, and $X_1$ and $X_2$ are independent variables (I'm not a fan of the terms myself, but at least they're less confusing to others when we don't get them backwards). Please fix your title. – Glen_b Mar 04 '15 at 03:01
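To make the condition-number point from the comments concrete, here is a minimal sketch in base R (not part of the original thread; the variable names are invented for illustration). Two predictors with drastically different ranges inflate the condition number of the design matrix, and standardising brings it back down:

set.seed(1)
x1 <- runif(100)          # range roughly [0, 1]
x2 <- runif(100) * 1e6    # range roughly [0, 1e6]
X <- cbind(1, x1, x2)     # design matrix with an intercept column
kappa(X, exact = TRUE)                               # very large condition number
kappa(cbind(1, scale(x1), scale(x2)), exact = TRUE)  # modest after standardising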

1 Answer


While I believe that the reasons to centre/standardise your data should mostly be of a statistical nature (as discussed in detail in the link provided in my original comment), numerics do come into play.

Common implementations of linear-model regression are based on the QR decomposition, and if it fails and you do not notice, you are in trouble. See, for example, the following example based on the workhorse routines of two "decent stats software" packages, R and MATLAB: the weapons of choice for many Stats-ML-AI warriors out there (myself included :D).

set.seed(1234)
n <- 100
xx <- sort(runif(n))                            # predictor on [0, 1]
y <- 0.5*(xx-0.5) + (xx-0.5)^2 + rnorm(n)*0.1   # quadratic signal plus noise
x <- xx + 1001                                  # shift the predictor far from zero
(toymod <- lm(y ~ x + I(x^2)))  # Notice that we do not get a single warning!
Call:
lm(formula = y ~ x + I(x^2))

Coefficients:
(Intercept)            x       I(x^2)  
  -451.4685       0.4509           NA  

If one is careful and checks the model summary, one might suspect that something is off:

summary(toymod)

Call:
lm(formula = y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.20467 -0.08075 -0.00755  0.07144  0.30560 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -451.46848   43.00008   -10.5   <2e-16 ***
x              0.45088    0.04294    10.5   <2e-16 ***
I(x^2)              NA         NA      NA       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1191 on 98 degrees of freedom
Multiple R-squared:  0.5294,    Adjusted R-squared:  0.5246 
F-statistic: 110.3 on 1 and 98 DF,  p-value: < 2.2e-16

but then again one needs to know what one is looking for (i.e., be able to recognise that the model singularities come from the ill-conditioning of the design matrix $X$).
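One quick way to confirm the diagnosis, as a sketch using base R's `kappa()` (this check is not in the original answer): estimate the condition number of the design matrix that `lm` actually used.

kappa(model.matrix(toymod), exact = TRUE)  # astronomically large: the QR had no chance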

Essentially the same code will not fail in MATLAB:

rng(1234)
n = 100;
xx = linspace(0,1,n)';                          % predictor on [0, 1]
y = 0.5*(xx-0.5)+(xx-0.5).^2 + randn(n,1)*0.1;  % quadratic signal plus noise
x = xx+1001;                                    % shift the predictor far from zero
toymod = fitlm([x x.^2], y)
toymod =     
Linear regression model:
    y ~ 1 + x1 + x2    
Estimated Coefficients:
                    Estimate         SE         tStat       pValue  
                   __________    __________    _______    __________

    (Intercept)    1.2401e+06    1.2911e+05     9.6055    9.5382e-16
    x1                -2477.1        257.83    -9.6075    9.4446e-16
    x2                 1.2369       0.12872     9.6094    9.3519e-16    

Number of observations: 100, Error degrees of freedom: 97
Root Mean Squared Error: 0.0979
R-squared: 0.77,  Adjusted R-Squared 0.765
F-statistic vs. constant model: 162, p-value = 1.14e-31

but that is only because MATLAB uses a different LAPACK library. Clearly we can mess up MATLAB too, simply by increasing the magnitude of our $X$:

rng(1234)
n = 100;
xx = linspace(0,1,n)';
y = 0.5*(xx-0.5)+(xx-0.5).^2 + randn(n,1)*0.1;
x = xx+10001;                   % an even larger shift than before
toymod = fitlm([x x.^2], y);
Warning: Regression design matrix is rank deficient to within machine precision. 
> In TermsRegression>TermsRegression.checkDesignRank at 98
  In LinearModel.LinearModel>LinearModel.fit at 868
  In fitlm at 117 

Luckily, MATLAB makes a quite specific complaint about the design matrix, but other than that it lets you go on your merry way without any further issue.

As mentioned, we did not do anything insane. We simply forced `lm` and `fitlm`, on purpose, to use a model matrix $X$ whose condition number caused the QR decomposition to fail. Centring, standardising, or simply using another model altogether (a spline, for example) would have taken care of this numerically problematic situation.
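For completeness, a sketch of the centring fix in R (not part of the original answer), reusing `x` and `y` from the example above:

xc <- x - mean(x)                            # centre the troublesome predictor
toymod_c <- lm(y ~ xc + I(xc^2))             # same quadratic model, now well conditioned
coef(toymod_c)                               # both slope coefficients are now estimated
kappa(model.matrix(toymod_c), exact = TRUE)  # condition number back to a sane size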

usεr11852
  • Interesting work, thank you. I tried this out by standardizing `x^2` and using that in `lm`. Sure enough, `R` reported quantifiable results. But (1) they differ from those of Matlab and (2) `car::vif` reports VIFs around $10^8$. Standardizing *changed* the answer but it didn't solve the problem. Thus, you aren't really studying the effects of *ranges* of the variables: you are studying how these systems handle what is essentially *collinearity.* That makes the discrepancies understandable: the output will be extremely sensitive to details of the numerical algorithms. – whuber Mar 04 '15 at 02:05
  • @whuber: (0) Thanks. (1) R uses a non-standard way to initialise the Mersenne Twister, so comparing the results across the two packages is not supposed to work; I fix the seed for internal reproducibility only. (2) Calling it collinearity is misleading. The SVD of $X$ gives a smallest singular value of around $10^{-7}$; this is far from being numerically zero, and therefore this is not a collinear system. It is the ratio of the largest and smallest singular values we are looking at here, i.e. the condition number of the system, that messes things up. (cont.) – usεr11852 Mar 04 '15 at 02:39
  • Clearly the system behaves as if it were rank-deficient, but collinearity is not the root of the problem. Increasing `n` and standardising `x` should work and provide a correct fit. See for example: `xnew = scale(x); (toymod – usεr11852 Mar 04 '15 at 02:50
  • One test of collinearity is to look at the square of the residual standard error upon regressing one standardized variable against the other: if that mean squared error is of the size of floating-point error, you've got perfect collinearity for all practical purposes. The value I obtain in `R` is around $1.5\times 10^{-14}$, which is darn near zero: you've only got two significant (decimal) digits left to play with. (A sketch of this check appears after these comments.) – whuber Mar 04 '15 at 03:00
  • I do not disagree that "*the behaviour of the system is like it is rank-deficient*". Nevertheless, as shown, standardising will solve the original problem and provide reasonable estimates. I completely agree that if the MSE between any two variables is close to machine precision then they are collinear, but I fail to see how this applies to the standardised variables here: `mod_1 – usεr11852 Mar 04 '15 at 03:29
  • It sounds like you have gotten completely away from the question! It asks whether the ranges of independent variables will affect the regression. You talk about situations with extreme degrees of collinearity, which is a totally different issue. If instead we were to simulate the situation described in the question, nothing would go wrong. Consider, for instance, `n – whuber Mar 04 '15 at 15:35
  • *The fact that it fits a model does not make it the right model.* This brings us once more to the point I raised about numerics. Do the SVD of the matrix $X$ (`svd(model.matrix(lm(y ~ x + z)))$d == 0`) in the model you defined. The rank, for all intents and purposes, should be 1. When you have one singular value of the order of `e100` and two others at `e0`, that system is rank-deficient for any numerical purpose. We do not standardise at any point and therefore we fail. – usεr11852 Mar 05 '15 at 02:43
  • BTW, I see your point about the collinearity issue in my original code. You are right; nevertheless, these collinearities are by-products of the variables' magnitudes. If the magnitude differences were smaller, one would not notice them. – usεr11852 Mar 05 '15 at 02:47
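For concreteness, here are sketches (not part of the original exchange) of the two diagnostics debated in the comments above, reusing `x` and `y` from the R example in the answer. The first is whuber's collinearity check: regress one standardised variable on the other and treat a mean squared error near machine precision as practical collinearity. The second is usεr11852's condition-number check via the singular values of $X$:

xs <- as.vector(scale(x))          # standardised linear term
x2s <- as.vector(scale(x^2))       # standardised quadratic term
summary(lm(x2s ~ xs))$sigma^2      # ~1e-14: collinear for all practical purposes

d <- svd(model.matrix(lm(y ~ x + I(x^2))))$d  # singular values of the design matrix
max(d) / min(d)                               # the condition number that broke the QR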