
In regression, I thought that if you added a predictor unrelated to the criterion, R2 would stay the same. However, R2 increases non-trivially in the example below, even though the correlation between the response and predictor3 is virtually 0. What's going on?

More generally, can someone tell me what happens to R2 and betas under these conditions:

  1. Add a predictor unrelated to y and to any of the other predictors (I assume that R2 and the betas remain unchanged?)

  2. Add a predictor unrelated to y but related to the other predictors (apparently R2 goes up and the betas for the original predictors remain unchanged?)

R code:

# target correlation matrix among response, predictor1, predictor2, predictor3
R = matrix(c(1,   .80, .2,  0,
             .80, 1,   .7,  .3,
             .2,  .7,  1,   .3,
             0,   .3,  .3,  1), nrow = 4)
U = t(chol(R))          # lower-triangular Cholesky factor of R
nvars = dim(U)[1]
numobs = 100000
set.seed(1)
random.normal = matrix(rnorm(nvars * numobs, 0, 1), nrow = nvars, ncol = numobs)
X = U %*% random.normal # impose the target correlation structure on the normals
newX = t(X)
raw = as.data.frame(newX)
names(raw) = c("response", "predictor1", "predictor2", "predictor3")
cor(raw)                # sample correlations closely match the specified R

lm1<-lm(response ~ predictor1 + predictor2, data=raw)
lm2<-lm(response ~ predictor1 + predictor2 + predictor3, data=raw)

summary(lm1)
summary(lm2)
dlr1234
  • Could you tell us what you mean by "unrelated to"? Your code, because it appears to work by generating random values, isn't necessarily generating variables that are uncorrelated with each other. – whuber Nov 05 '17 at 16:27
  • It generates data given a particular correlation matrix R. If you look at cor(raw) right before the models, you can see the correlation matrix for the data. The actual correlation matrix very closely approximates the specified matrix R. So it's not the case that I'm just adding a randomly generated predictor - it's a predictor I specify to be uncorrelated with y. – dlr1234 Nov 05 '17 at 16:30
  • See https://stats.stackexchange.com/questions/167827/why-is-sum-of-squared-residuals-non-increasing-when-adding-explanatory-variable/167832#167832 (could be a duplicate). – kjetil b halvorsen Nov 05 '17 at 16:40
    "Very closely approximates" is not the same as "exactly equal." It's not apparent what you're trying to do with your code. I don't see how it constructs any predictor that produces *data* guaranteed to be uncorrelated with the response. – whuber Nov 05 '17 at 17:07
  • Even if you add a variable that is pure noise, some small increase in the estimated $R^2$ can happen. That is why basing the model on $R^2$ is problematic. – Michael R. Chernick Nov 05 '17 at 19:04

2 Answers


Note that the residual vector can have a component that is perpendicular to the response vector; therefore, zero correlation between an additional variable and the response is not sufficient to leave the fit unchanged.

The additional variable in the models can decrease the residuals, increasing the $R^2$, even when it does not correlate with the response...

...by correlating with the difference between the response and the fit based on the earlier variables.

Simple example:

$$\begin{bmatrix}1\\0 \end{bmatrix} = a_1 \begin{bmatrix}1\\1 \end{bmatrix} + a_2 \begin{bmatrix}0\\1 \end{bmatrix} + \epsilon $$

which clearly shows how a model variable that is perpendicular to the response can still play a role in the model by correlating with the difference between the response and other model vectors.
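
A quick numerical check of this example in R (a minimal sketch; the vectors are the ones above, the names y, x1, x2 are mine):

y  <- c(1, 0)
x1 <- c(1, 1)
x2 <- c(0, 1)
sum(y * x2)                  # 0: x2 is perpendicular to the response y
resid(lm(y ~ 0 + x1))        # residuals 0.5, -0.5 with x1 alone
resid(lm(y ~ 0 + x1 + x2))   # residuals 0, 0 once x2 is added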

But indeed, it is different when the additional variable is perpendicular to both the response and the other variables:

$$\begin{bmatrix}1\\0\\0 \end{bmatrix} = a_1 \begin{bmatrix}1\\1 \\0\end{bmatrix} + a_2 \begin{bmatrix}0\\0\\1 \end{bmatrix} + \epsilon$$

In that case it can't be correlated with the previous residual term either, since a sum of vectors that are each perpendicular to a vector $x$ is also perpendicular to $x$.
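
The same kind of check for the perpendicular case (again a minimal sketch with my own names):

y  <- c(1, 0, 0)
x1 <- c(1, 1, 0)
x2 <- c(0, 0, 1)
resid(lm(y ~ 0 + x1))        # residuals 0.5, -0.5, 0
resid(lm(y ~ 0 + x1 + x2))   # unchanged: x2 is perpendicular to the residual as well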

Sextus Empiricus

I'm not an expert in Cholesky decomposition, but it appears that the supposedly uncorrelated variable still retains some structure related to the response after predictor1 and predictor2 are accounted for.

cor(resid(lm1),raw$predictor3)
[1] -0.5435588

If you instead generate predictor3 as independent random noise, you'll find that the R-squared value does not change:

raw$predictor3 = runif(length(raw$predictor3),0,1)
lm1<-lm(response ~ predictor1 + predictor2, data=raw)
lm2<-lm(response ~ predictor1 + predictor2 + predictor3, data=raw)

summary(lm1)['r.squared']$r.squared
[1] 0.8948916

summary(lm2)['r.squared']$r.squared
[1] 0.8948917

I'd love to know why...

HEITZ
  • When you generate a new regressor randomly, almost surely its residuals (with respect to the other regressors) will have nonzero correlation with the residuals of the response: that's all that's going on here. The only way you can fail to increase $R^2$ is by guaranteeing the new regressor is *exactly* orthogonal to the (previous) residuals. – whuber Nov 05 '17 at 18:29
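
A quick numerical check of that last point, reusing raw and lm1 from the code above (predictor4 and lm3 are names I made up for the sketch):

z <- rnorm(nrow(raw))
r <- resid(lm1)
# strip out the component of z along the residuals, so the new regressor is exactly orthogonal to them
raw$predictor4 <- z - r * sum(z * r) / sum(r * r)
lm3 <- lm(response ~ predictor1 + predictor2 + predictor4, data = raw)
summary(lm1)$r.squared
summary(lm3)$r.squared       # matches summary(lm1)$r.squared up to floating-point error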