Say I am running a regression $Y \sim X_1 + X_2$, where $X_1$ and $X_2$ have drastically different ranges. Empirically, will this cause any issues with my prediction of $Y$? If the answer is yes, how do I handle it?
While I believe that the reason to centre/standardise your data should mostly be of a statistical nature (as discussed in detail in the link provided in my original comment), numerics do come into play.
Common implementations of linear-model regression are based on the QR decomposition, and if it fails and you do not notice, you are in trouble. See for example the following example based on the workhorse routines of two "decent stats software" packages: R and MATLAB, the weapons of choice for many Stats-ML-AI warriors out there (myself included :D ).
set.seed(1234)
n <- 100
xx <- sort(runif(n))                           # predictor on [0, 1]
y <- 0.5*(xx-0.5) + (xx-0.5)^2 + rnorm(n)*0.1  # quadratic signal plus noise
x <- xx + 1001                                 # shift the predictor far from zero
(toymod <- lm(y ~ x + I(x^2)))                 # Notice that we do not get a single warning!
Call:
lm(formula = y ~ x + I(x^2))

Coefficients:
(Intercept)            x       I(x^2)
  -451.4685       0.4509           NA
If one is careful and checks the model summary, one might suspect something:
summary(toymod)
Call:
lm(formula = y ~ x + I(x^2))

Residuals:
     Min       1Q   Median       3Q      Max
-0.20467 -0.08075 -0.00755  0.07144  0.30560

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -451.46848   43.00008   -10.5   <2e-16 ***
x              0.45088    0.04294    10.5   <2e-16 ***
I(x^2)              NA         NA      NA       NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1191 on 98 degrees of freedom
Multiple R-squared:  0.5294,    Adjusted R-squared:  0.5246
F-statistic: 110.3 on 1 and 98 DF,  p-value: < 2.2e-16
but then again one needs to know what one is looking for (i.e. be able to understand that the model singularities come from the ill-conditioning of the design matrix $X$).
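If you would rather check for this directly than play detective with the summary, here is a minimal sketch in R (continuing the toy model above, using base R's kappa() and model.matrix()) that inspects the condition number of the design matrix:

## Estimate the condition number of the design matrix that lm() used.
## Values approaching 1/.Machine$double.eps (~1e16) mean the QR
## factorisation cannot reliably distinguish the columns.
kappa(model.matrix(toymod))

## For comparison, the design matrix built from the original xx on [0, 1]
## is perfectly well behaved:
kappa(model.matrix(lm(y ~ xx + I(xx^2))))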
The equivalent code will not fail in MATLAB:
rng(1234)
n = 100;
xx = linspace(0,1,n)';                           % predictor on [0, 1]
y = 0.5*(xx-0.5) + (xx-0.5).^2 + randn(n,1)*0.1;
x = xx + 1001;                                   % same shift as in the R example
toymod = fitlm([x x.^2], y)
toymod =

Linear regression model:
    y ~ 1 + x1 + x2

Estimated Coefficients:
                    Estimate          SE        tStat       pValue
                   __________    __________    _______    __________
    (Intercept)    1.2401e+06    1.2911e+05     9.6055    9.5382e-16
    x1                -2477.1        257.83    -9.6075    9.4446e-16
    x2                 1.2369       0.12872     9.6094    9.3519e-16

Number of observations: 100, Error degrees of freedom: 97
Root Mean Squared Error: 0.0979
R-squared: 0.77,  Adjusted R-Squared 0.765
F-statistic vs. constant model: 162, p-value = 1.14e-31
but that is only because MATLAB uses a different LAPACK library. Clearly we can mess up MATLAB too by simply increasing the magnitude of our $X$:
rng(1234)
n = 100;
xx = linspace(0,1,n)';
y = 0.5*(xx-0.5) + (xx-0.5).^2 + randn(n,1)*0.1;
x = xx + 10001;                                  % an order of magnitude larger shift
toymod = fitlm([x x.^2], y);
Warning: Regression design matrix is rank deficient to within machine precision.
> In TermsRegression>TermsRegression.checkDesignRank at 98
In LinearModel.LinearModel>LinearModel.fit at 868
In fitlm at 117
Luckily, MATLAB complains quite specifically about the design matrix used, but other than that it lets you go on your merry way without any further issue.
As mentioned, we did not do anything insane. We just forced lm and fitlm on purpose to use a model matrix $X$ whose condition number caused the QR decomposition to fail. Centring, standardising, or simply using another model altogether (a spline, for example) would have taken care of this numerically problematic situation.
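For completeness, here is a minimal sketch of the centring/standardising fix in R, continuing the toy example above (scale() is base R's centre-and-scale helper):

xs <- as.numeric(scale(x))           # centre to mean 0, scale to sd 1
(fixedmod <- lm(y ~ xs + I(xs^2)))   # both coefficients are now estimated
kappa(model.matrix(fixedmod))        # condition number back to a harmless size

Because xs and xs^2 are no longer almost perfectly collinear, the pivoted QR inside lm has no trouble, and the quadratic term is estimated instead of being silently dropped.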