The intuitive meaning of "controlling for a variable" seems clear enough -- it means we want to hold the controlled variable's effect fixed so that we can isolate the effect of the remaining, "uncontrolled" variables. (Correct me if I am wrong...)
But my question is: how is this actually implemented? My current understanding is as follows. Say we want to study how $weight$ (a numeric explanatory variable) and $type$ (a categorical explanatory variable that can only be $A$ or $B$) relate to $size$ (the response variable) in mice, using linear regression as our model. We have eight observations:
## ID size type weight
## 1 1.9 A 2.4
## 2 3.0 A 3.5
## 3 2.9 A 4.4
## 4 3.7 A 4.9
## 5 2.8 B 1.7
## 6 3.3 B 2.8
## 7 3.9 B 3.2
## 8 4.8 B 3.9
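For completeness, here is roughly how this toy data set can be set up in R (the data frame name df is just what I used; the values are the eight observations above):

df <- data.frame(
  size   = c(1.9, 3.0, 2.9, 3.7, 2.8, 3.3, 3.9, 4.8),
  type   = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  weight = c(2.4, 3.5, 4.4, 4.9, 1.7, 2.8, 3.2, 3.9)
)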
Suppose we want to study the relation between $weight$ and $size$ alone, controlling for $type$. Then we do the following: we use one-hot encoding / dummy variables (machine-learning-speak) or a design matrix (statistics-speak) (correct me if I am wrong; it seems to me that all three terms mean more or less the same thing) and convert the observation matrix to the following:
## ID size type.A type.B weight
## 1 1.9 1 0 2.4
## 2 3.0 1 0 3.5
## 3 2.9 1 0 4.4
## 4 3.7 1 0 4.9
## 5 2.8 1 1 1.7
## 6 3.3 1 1 2.8
## 7 3.9 1 1 3.2
## 8 4.8 1 1 3.9
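A sketch of one way to build these two columns in R, matching the table above (note that, as displayed, type.A is 1 in every row and type.B flags the type-B mice; the column names are just what I chose):

# Add the coded columns to df exactly as shown in the table above
df$type.A <- rep(1, nrow(df))              # 1 for every mouse
df$type.B <- ifelse(df$type == "B", 1, 0)  # 1 only for type-B mice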
We then run the linear regression (for the purpose of this example, I deliberately disabled the intercept):
##
## Call:
## lm(formula = size ~ 0 + type.A + type.B + weight, data = df)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 0.05455 0.34562 -0.41623 0.01607 -0.01753 -0.32646 -0.02062 0.36461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type.A 0.08052 0.52744 0.153 0.88463
## type.B 1.48685 0.26023 5.714 0.00230 **
## weight 0.73539 0.13194 5.574 0.00256 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3275 on 5 degrees of freedom
## Multiple R-squared: 0.9942, Adjusted R-squared: 0.9906
## F-statistic: 283.3 on 3 and 5 DF, p-value: 5.316e-06
According to the above result, we can draw the conclusion that, "controlling for the variable $type$", there is a statistically significant positive association between $weight$ and $size$ (coefficient $= 0.73539$, $p$-value $< 0.05$).
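For reference, a minimal sketch of the call that produced the output above, together with what I believe is an equivalent fit that lets R code the factor itself (with the default treatment coding, the intercept plays the role of type.A and typeB plays the role of type.B, so the three estimates should come out the same):

fit_controlled <- lm(size ~ 0 + type.A + type.B + weight, data = df)
summary(fit_controlled)

# Equivalent fit using the factor directly (R's default treatment coding)
fit_factor <- lm(size ~ type + weight, data = df)
summary(fit_factor)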
On the other hand, if we just regress $size$ on $weight$ alone, we get the following:
##
## Call:
## lm(formula = size ~ weight, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01568 -0.45927 -0.01793 0.33862 1.29724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9763 1.0270 1.924 0.103
## weight 0.3914 0.2941 1.331 0.232
##
## Residual standard error: 0.8203 on 6 degrees of freedom
## Multiple R-squared: 0.2279, Adjusted R-squared: 0.09925
## F-statistic: 1.771 on 1 and 6 DF, p-value: 0.2316
This time, $weight$'s $p$-value is $0.232 > 0.05$ and the intercept's $p$-value is $0.103 > 0.05$, implying that neither the intercept nor $weight$ comes out as a statistically significant factor in determining the $size$ of a mouse.
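To make the contrast concrete, here is a short sketch that pulls the two weight slopes out side by side (assuming the objects defined above; fit_simple is just my name for the second model):

fit_simple <- lm(size ~ weight, data = df)

# Slope of weight with vs. without controlling for type
# (roughly 0.735 vs. 0.391, per the two outputs above)
coef(fit_controlled)["weight"]
coef(fit_simple)["weight"]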
Is the above understanding correct?
Credit: the toy dataset is from a StatQuest video: https://www.youtube.com/watch?v=Hrr2anyK_5s