Multiple regression - conceptual questions

Question

Problem statement:

I'm working on a multiple regression after running an RCT, to confirm treatment effectiveness and quantify effect size. Initially, when I regressed the dependent variable on some treatment dummy variables, I got significant results, although my Pseudo $R^2$ wasn't great (0.3%).

When I added some control variables to 'disentangle effects' - the Pseudo $R^2$ went up (1.3%), which is great. Dummy variables remained significant. 2 of the control variables weren't significant, 2 were.

When I started adding interaction terms (20-30), practically all coefficients, including those for my dummy variables, lost significance. However, my Pseudo $R^2$ went up to ~2%.

I'm trying to find the point at which I stop and decide that I've found my ideal model, but this mass of inputs is messing me up, so I want to disentangle my own conceptual misunderstandings.

.....

Question(s):

Why does $R^2$ continue to increase (and remain significant) with more terms added, even if practically all my coefficients are nonsignificant? What does this mean?
If multiple regression is about 'disentangling effects' - why would terms which were originally significant, become insignificant after adding more control variables? Shouldn't they 'hold their ground' since they are significant (or insignificant)?
At which point do I stop - and decide that a model is best? The one with 40 nonsignificant coefficients but the highest $R^2$, or the parsimonious model with fewer coefficients (4 sig, 2 non-sig), with a lower $R^2$?

score 4 · Answer 1 · answered Feb 10 '22 at 14:53

Timm provided a good answer; I just want to add a few points.

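Q2: In multiple regression, the correlation between predictors is taken into account during parameter estimation: each coefficient reflects only the part of a predictor's association with the outcome that the other predictors do not already account for. It is therefore not uncommon for a predictor to become insignificant after control variables are added, because it explains no variance beyond what the controls already cover. Adding many correlated terms (such as 20-30 interactions) also inflates the standard errors of the coefficients they overlap with.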
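To make the Q2 point concrete, here is a minimal sketch on simulated data. Everything in it is an assumption for illustration: the variable names (`predictor`, `control`), the data-generating process, and the use of plain OLS rather than whatever model produced the pseudo $R^2$ in the question. The outcome depends only on `control`, while `predictor` is strongly correlated with `control`.

```python
import numpy as np

def ols_t_stats(X, y):
    """Return (coefficients, standard errors, t statistics) for an OLS fit of y on X."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)         # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)    # classical OLS covariance matrix
    se = np.sqrt(np.diag(cov))
    return beta, se, beta / se

# Simulated data (hypothetical): y is driven by `control` only.
rng = np.random.default_rng(1)
n = 500
control = rng.normal(size=n)
predictor = control + 0.3 * rng.normal(size=n)   # highly correlated with control
y = control + rng.normal(size=n)

ones = np.ones(n)
_, se_alone, t_alone = ols_t_stats(np.column_stack([ones, predictor]), y)
_, se_joint, t_joint = ols_t_stats(np.column_stack([ones, predictor, control]), y)

# Index 1 is the coefficient on `predictor` in both fits.
print(f"predictor alone:        t = {t_alone[1]:.2f}, se = {se_alone[1]:.3f}")
print(f"predictor with control: t = {t_joint[1]:.2f}, se = {se_joint[1]:.3f}")
```

On its own, `predictor` picks up the effect of the omitted `control` and looks strongly significant. Once `control` enters the model, the t statistic for `predictor` shrinks and its standard error grows, because little independent variation in `predictor` is left.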

Q3: As already mentioned, you should consider theoretical underpinnings. Apart from this, just choosing the model with the highest R² is definitely not good practice, because R² never decreases when you add additional predictors. Apart from regularization methods, you could use selection strategies such as forward or backward selection, if you don't have too many predictors.
https://quantifyinghealth.com/stepwise-selection/

Frank Gallagher

Thank you for this response. Re Q3: I'm comparing nested models (since I want to see whether adding more predictors helps), so AIC/BIC may not work well in my context. Do you think I'd need to add predictors 1-by-1 and do Likelihood ratio tests to find the best model? (I've around 8 demographic dummy variables and am thinking of dropping interactions since I have no strong basis - so it's not too straining) – lionclw Feb 11 '22 at 14:06
You could do, e.g., forward selection with the 8 control variables (at each step, add the single variable that increases R² the most). As a stopping rule, you could run an F-test to check whether the increase in R² from adding another variable is significant, and stop if it isn't. Try to think about what your aim really is: if your controls have a theoretical underpinning, I would leave them in the model, because in an RCT, in my opinion, being able to say that a treatment is effective after controlling for "XYZ" is more important than finding the best regression model. – Frank Gallagher Feb 11 '22 at 15:14
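The forward-selection idea from the comment above could be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: it uses plain OLS, the function name `forward_select` and the simulated variables are made up, and the stopping threshold 3.84 is the large-sample 5% critical value for an F-test with one numerator degree of freedom (an exact test would use the F distribution with the actual residual degrees of freedom). The treatment dummy sits in the base model and is never dropped.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def forward_select(base, candidates, y, f_crit=3.84):
    """Greedy forward selection with a partial F-test as the stopping rule.

    base:       (n, k) array of columns that always stay in the model
                (here: intercept and treatment dummy)
    candidates: dict mapping a name to a 1-D array of length n
    f_crit:     assumed critical value (3.84 is roughly the 5% level for
                one added regressor when n is large)
    """
    X = base.copy()
    remaining = dict(candidates)
    chosen = []
    n = len(y)
    while remaining:
        ssr_old = ssr(X, y)
        # The candidate with the smallest SSR gives the largest increase in R².
        trials = {name: ssr(np.column_stack([X, col]), y)
                  for name, col in remaining.items()}
        best = min(trials, key=trials.get)
        ssr_new = trials[best]
        df_resid = n - (X.shape[1] + 1)      # residual df of the larger model
        f_stat = (ssr_old - ssr_new) / (ssr_new / df_resid)
        if f_stat < f_crit:
            break                            # increase in R² is not significant
        X = np.column_stack([X, remaining.pop(best)])
        chosen.append(best)
    return chosen

# Simulated data (hypothetical): only "age" truly matters among the controls.
rng = np.random.default_rng(2)
n = 500
treat = rng.integers(0, 2, size=n).astype(float)
controls = {"age": rng.normal(size=n),
            "noise_a": rng.normal(size=n),
            "noise_b": rng.normal(size=n)}
y = 0.5 * treat + 1.0 * controls["age"] + rng.normal(size=n)

base = np.column_stack([np.ones(n), treat])
chosen = forward_select(base, controls, y)
print(chosen)
```

Keep in mind that p-values reported for a model found this way are optimistic, since the same data were used to pick the variables. That is one more reason to prefer controls chosen on theoretical grounds.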

timm · Answer 2 · 2022-02-11T10:42:08.797

First, I recommend thinking about the "causal story" that produces your response variable rather than just adding regressors to your model. You probably have some theory in mind that explains the phenomenon you are interested in. The relevant literature should point you to the control variables to include in your model. Directed acyclic graphs (DAGs) are a tool that might help you find the "causal story".

To your questions:

Why does the $R^2$ increase? It is a property of $R^2$ that it cannot decrease when you add further regressors to your model. Just have a look at the formula; it is a good exercise to check that property. Also check the adjusted $R^2$, which penalizes adding further variables to your model.
To your second question: adding 'irrelevant' variables can be harmful. See this post here: Ref1. Note: you should not leave out a variable just because its parameter estimate turns out to be insignificant, especially if it's a legitimate control proposed by some theory.
At which point do I stop - and decide that a model is best? I would not tackle the problem from this perspective. As I wrote at the beginning, think about the "causal story".
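The first point can be checked numerically. With $R^2 = 1 - SSR/SST$, adding a regressor cannot raise the minimized $SSR$, because the old fit is still available by setting the new coefficient to zero. The sketch below is only an illustration under assumptions: it uses plain OLS rather than the pseudo $R^2$ from the question, and the data and names (`treat`, `r_squared`) are simulated and hypothetical. It adds 30 pure-noise regressors one at a time and records both measures.

```python
import numpy as np

def r_squared(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit of y on X.

    X must already contain the intercept column, so k counts it as well.
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ssr = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ssr / sst
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    return r2, adj

# Simulated RCT-like data (hypothetical): one treatment dummy with a real effect.
rng = np.random.default_rng(0)
n = 200
treat = rng.integers(0, 2, size=n).astype(float)
y = 0.5 * treat + rng.normal(size=n)

X = np.column_stack([np.ones(n), treat])
results = [r_squared(X, y)]
for _ in range(30):
    X = np.column_stack([X, rng.normal(size=n)])   # add a pure-noise regressor
    results.append(r_squared(X, y))

r2_path = [r2 for r2, _ in results]
adj_path = [adj for _, adj in results]
print(f"R^2:          {r2_path[0]:.3f} -> {r2_path[-1]:.3f}")
print(f"adjusted R^2: {adj_path[0]:.3f} -> {adj_path[-1]:.3f}")
```

$R^2$ creeps upward even though every added column is noise, which is why it is a poor criterion for choosing between nested models. The adjusted $R^2$ is never larger than $R^2$ and does not reward noise in the same way.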

For further research: read about regularization - especially the papers of Belloni et al. are helpful. This is a little bit more advanced, though.
