
Is it possible for the variance explained by, say, X predictors to be lower than that of a model containing (X−m) predictors, where the (X−m) variables are a subset of the X variables? I know that the more predictors, the higher the variance explained. I would expect the model with the X predictors to have at least as much variance explained as the model with the (X−m) predictors, if not more.

In the specific case, the model is a linear regression model with a continuous response; the independent variables are age, gender, and several genotype variables that take the values 0, 1, or 2. What varies between the two models is the number of genotype variables.

Any thoughts?

D_C
  • It seems that you have found a case in which this occurred. Please show the details of the study (nature of predictor and outcome variables, structure of the regression models) and the reports of the results of the regression models. – EdM Sep 28 '19 at 16:57
  • EdM I edited my question. Thank you for your comment. – D_C Sep 28 '19 at 17:08
    That helps, but it will still be hard to say what's going on without seeing some sample results. In particular, it's important to see the results on "variance explained." For example, the adjusted $R^2$ reported by some software is adjusted downward by the number of predictors in the models and could lead to an apparent decrease in "variance explained" if you use that measure. Please edit your question again to show results from two models having different numbers of genotype variables, including what you are using for a measure of "variance explained." – EdM Sep 28 '19 at 17:27

1 Answer


Adding a single predictor to a prior model should not decrease the fraction of variance explained in an ordinary least-squares regression or ANOVA. Recognizing that ANOVA is equivalent to linear regression, recall that the coefficient of determination in a linear regression, $R^2$, is the fraction of variance explained by the model. As the objective of the regression is to minimize the unexplained variance (or the residual sum of squares, $SS_{res}$), the Wikipedia entry notes:

Minimizing $SS_{res}$ is equivalent to maximizing $R^2$. When the extra variable is included, the data always have the option of giving it an estimated coefficient of zero, leaving the predicted values and the $R^2$ unchanged. The only way that the optimization problem will give a non-zero coefficient is if doing so improves the $R^2$.
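You can check this non-decreasing property numerically. Here's a minimal sketch with simulated data loosely matching your setup (age, gender, one genotype variable coded 0/1/2); all variable names and coefficient values are made up for illustration. The genotype variable is pure noise here, yet $R^2$ still cannot drop when it is added:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Simulated predictors: age, gender, and one genotype variable (0, 1, or 2)
age = rng.uniform(20, 70, n)
gender = rng.integers(0, 2, n).astype(float)
geno = rng.integers(0, 3, n).astype(float)
# Outcome depends only on age and gender; geno carries no signal
y = 0.5 * age + 2.0 * gender + rng.normal(0, 5, n)

def r_squared(X, y):
    """Fraction of variance explained by an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - resid @ resid / ((y - y.mean()) ** 2).sum()

r2_small = r_squared(np.column_stack([age, gender]), y)
r2_big = r_squared(np.column_stack([age, gender, geno]), y)
# Plain R^2 never decreases when a predictor is added (up to rounding)
assert r2_big >= r2_small - 1e-10
```

Because the fit can always assign the new predictor a coefficient of (essentially) zero, `r2_big` matches or exceeds `r2_small` no matter what `geno` contains.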

I see a couple of ways that you might appear to see a decrease in $R^2$ when a predictor is added to a model.

First, the adjusted $R^2$ reduces the reported value below the actual fraction of variance explained, as a function of the number of predictors and the number of cases. That is an attempt to account for the above-noted non-decreasing nature of $R^2$ as the number of predictors increases. Again, as the Wikipedia entry puts it:

Unlike $R^2$, the adjusted $R^2$ increases only when the increase in $R^2$ (due to the inclusion of a new explanatory variable) is more than one would expect to see by chance.

Thus the adjusted $R^2$ can decrease as you add predictors.
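To make the distinction concrete, here's a small sketch using the usual adjustment formula $\bar R^2 = 1 - (1 - R^2)\frac{n-1}{n-p-1}$ on simulated data (names and coefficients are made up). The added predictor is an exact duplicate of an existing column, an extreme case of a predictor carrying no new information: the plain $R^2$ is unchanged, while the adjusted $R^2$ drops.

```python
import numpy as np

def r_squared(X, y):
    """Fraction of variance explained by an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - resid @ resid / ((y - y.mean()) ** 2).sum()

def adj_r_squared(X, y):
    """Adjusted R^2: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n, p = X.shape
    return 1 - (1 - r_squared(X, y)) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(1)
n = 40
age = rng.uniform(20, 70, n)
geno = rng.integers(0, 3, n).astype(float)
y = 0.3 * age + geno + rng.normal(0, 2, n)

X_small = np.column_stack([age, geno])
X_big = np.column_stack([age, geno, geno])  # duplicate column: no new information

# Plain R^2 is identical; adjusted R^2 is strictly lower for the bigger model
assert abs(r_squared(X_big, y) - r_squared(X_small, y)) < 1e-9
assert adj_r_squared(X_big, y) < adj_r_squared(X_small, y)
```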

Second, regression software often silently removes cases that do not have complete data for all the predictor and outcome variables. If you add a predictor that has missing values for some of the cases handled by the smaller model, it's possible to get a lower $R^2$ due to the loss of cases, particularly if the lost cases were fit very well by the smaller model.
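This second mechanism can also be demonstrated with a deliberately contrived toy dataset (all values handcrafted for illustration): four cases lie exactly on $y = x$, and the added predictor `z` is missing for exactly those cases. After listwise deletion, only the poorly-fit cases remain, and the "bigger" model reports a lower $R^2$ than the smaller model did on the full data:

```python
import numpy as np

def r_squared(X, y):
    """Fraction of variance explained by an OLS fit with intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - resid @ resid / ((y - y.mean()) ** 2).sum()

# Four cases fit perfectly by x alone; four cases where x is constant
# (so x explains nothing among them) and y just scatters around 10
x = np.array([0., 1., 2., 3., 10., 10., 10., 10.])
y = np.array([0., 1., 2., 3., 8., 12., 9., 11.])
# The added predictor z is missing exactly for the well-fit cases
z = np.array([np.nan, np.nan, np.nan, np.nan, 0., 0., 1., 1.])

r2_all = r_squared(x[:, None], y)  # smaller model, all 8 cases
keep = ~np.isnan(z)                # listwise (complete-case) deletion
r2_cc = r_squared(np.column_stack([x[keep], z[keep]]), y[keep])
# r2_all is positive, while r2_cc is zero: adding z "lowered" R^2
assert r2_all > r2_cc
```

Nothing here contradicts the algebra above: the two $R^2$ values are computed on different subsets of cases, which is exactly why comparing them is misleading.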

EdM