4

For example, if I have the dependent variable Height and independent variables Age, Weight, and Shoe Size, how do I answer the following questions:

  1. Controlling for Age and Weight, is Shoe Size related to Height?
  2. Allowing for Age and Weight, is Shoe Size a useful predictor for Height?

Is question #2 just another way of asking question #1?

If so then, how do I find out if they're related (or that Shoe Size is a useful predictor)?

I tried:

Let $y=\beta_0+\beta_1x_1 + \beta_2x_2 + \beta_3x_3$

with

$x_1$ = Age, $x_2$ = Weight, and $x_3$ = Shoe Size

Then I should test the null hypothesis $\beta_1=\beta_2=0$ with $\alpha = 0.05$. If the p-value is less than $\alpha$, I can reject the null hypothesis, and thus, Shoe Size is NOT related to Height. Is this correct?

kjetil b halvorsen
sucksatmath
  • Both questions are about shoe size ($x_3$), so it is not clear why you are testing the joint hypothesis that age and weight effects are zero. You should be focused on interpreting the statistical significance, sign, and magnitude of $\beta_3$. – dimitriy Nov 08 '17 at 22:18
  • I think most would argue that 2. is not well defined. Some may even argue 1 is not well defined. There's a range of interpretations here, but clarifying 2 would be the first step. "Allowing" something seems to imply either *not* adjusting AND *adjusting*. – AdamO Jul 29 '21 at 20:26

3 Answers

2

Controlling used to mean literally controlling a variable in an experiment. For instance, you would control the temperature and pressure while measuring the absorption of light by gases: you set temperature and pressure at different levels, then measure the absorption.

These days, controlling is figurative in observational studies. For instance, suppose you're regressing students' income on race while controlling for the income of their parents. Here, controlling means that you added the parents' income variable to the regression. The idea is similar: you want to measure the sensitivity of interest while isolating it from other factors that may affect the dependent variable.

Aksakal
  • "you want to measure the sensitivity of interest while isolated from other factors that may impact the dependent variable." This only works for a limited set of causal assumptions, & in many causal relationships adding a potential confounding variable—say, one that is causally prior to the IV, and which is associated with the DV—actually *introduces biases* into the estimate of causal effect. "Adding predictors" is no replacement for formal causal reasoning. See Greenland, S., Pearl, J., and Robins, J. M. (1999). Causal diagrams for epidemiologic research. *Epidemiology*, 10(1):37–48. – Alexis Nov 08 '17 at 22:57
  • @Alexis establishing causality in observational studies is very difficult, because unlike in experiments you don't really control anything in a regression. I doubt that the OP is asking about this. – Aksakal Nov 09 '17 at 00:30
  • Look at partial correlation – HEITZ Nov 09 '17 at 07:02
  • @Aksakal $U_{1}$ and $U_{2}$ are unobserved variables. $L$ causes risk of $E$, and $D$. $U_{1}$ causes risk of $L$ and risk of $E$. $U_{2}$ causes risk of $L$ and risk of $D$. The question of interest is whether $E$ causes risk of $D$. In such a circumstance—real world examples of which abound—including $L$ in a multiple regression (or stratifying by $L$) produces a biased causal estimate of the effect of $E$ on $D$. If the OP is unconcerned with cause, why do the research? – Alexis Nov 09 '17 at 18:59
  • @Alexis OP seems to be operating at the stats 101 level, where "controlling for a variable" has a very specific and narrow meaning, which is unrelated to causality. In intro stats courses students are shown the SAS output and expected to explain it using phrases like "controlling for age and weight, the height increases by $\hat\beta$ for a unit change in shoe size." Obviously, it is assumed that the regression was set up properly in the first place. – Aksakal Nov 09 '17 at 19:17
1

I'm going to offer a more pedantic answer.

What does it mean to control for a variable?

I think many people might say that including a variable in a regression model is "controlling" for that variable, but Jennifer Hill offers some good reasons NOT to say that. The word "control" should mean that we are, literally, in control of that variable, much in the same way that we can control experiments with a control group. So to control for age and weight would mean that we could somehow hold weight constant and vary age, or vice versa. For obvious reasons, we can't do that. Instead, we adjust for weight and age, and the words adjust and control are used (erroneously, in my opinion) interchangeably.

Is question #2 just another way of asking question #1?

Not strictly. In the former question, we could simply estimate the relationship between shoe size and height. If that relationship is very weak, then shoe size would not be a good predictor of height, though it might be related.

how do I find out if they're related (or that Shoe Size is a useful predictor)?

That might involve determining whether the reduction in loss when using shoe size versus not using shoe size to predict height is big enough. Frank Harrell writes about this in one of his blog posts here.

Demetri Pananos
0

In my opinion, your approach to question 2 will help you to answer question 1.

That is, you should test:

$\beta_1=\beta_2=\beta_3=0$ with $\alpha=0.05$

This means you'll test whether the coefficients on your three variables are jointly statistically different from zero.

For question 2, you'll just check whether the $t$ value associated with $\beta_3$ is statistically significant, i.e., whether the variable should be part of your equation or not. I'd proceed with question 2 first; if $\beta_3$ has a significant $t$ value, then answer question 1. Anyway, I hope someone else can help clarify this.

develarist