The intuitive meaning of "controlling for a variable" seems clear enough -- it means we want to hold the controlled variable's effect fixed so that we can isolate the effect of the remaining, "uncontrolled" variables. (Correct me if I am wrong...)
But my question is: how is this actually implemented? My current understanding is as follows. Say we want to study how $weight$ (a numeric explanatory variable) and $type$ (a categorical explanatory variable that can only be $A$ or $B$) relate to $size$ (the response variable) in mice, using linear regression as our model. We have eight observations:
## ID size type weight
## 1 1.9 A 2.4
## 2 3.0 A 3.5
## 3 2.9 A 4.4
## 4 3.7 A 4.9
## 5 2.8 B 1.7
## 6 3.3 B 2.8
## 7 3.9 B 3.2
## 8 4.8 B 3.9
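For completeness, here is roughly how this toy data set can be set up in R (the data frame name df is just what I used; the values are the eight observations above):

df <- data.frame(
  size   = c(1.9, 3.0, 2.9, 3.7, 2.8, 3.3, 3.9, 4.8),
  type   = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  weight = c(2.4, 3.5, 4.4, 4.9, 1.7, 2.8, 3.2, 3.9)
)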
Suppose we want to study the relation between $weight$ and $size$ alone, controlling for $type$. Then we do the following: we use one-hot encoding / dummy variables (machine-learning-speak) or a design matrix (statistics-speak) (correct me if I am wrong; it seems to me that all three terms mean more or less the same thing) and convert the observation matrix to the following:
## ID size type.A type.B weight
## 1 1.9 1 0 2.4
## 2 3.0 1 0 3.5
## 3 2.9 1 0 4.4
## 4 3.7 1 0 4.9
## 5 2.8 1 1 1.7
## 6 3.3 1 1 2.8
## 7 3.9 1 1 3.2
## 8 4.8 1 1 3.9
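A sketch of one way to build these two columns in R, matching the table above (note that, as displayed, type.A is 1 in every row and type.B flags the type-B mice; the column names are just what I chose):

# Add the coded columns to df exactly as shown in the table above
df$type.A <- rep(1, nrow(df))              # 1 for every mouse
df$type.B <- ifelse(df$type == "B", 1, 0)  # 1 only for type-B mice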
We then run the linear regression (for the purpose of this example, I deliberately disabled the intercept):
##
## Call:
## lm(formula = size ~ 0 + type.A + type.B + weight, data = df)
##
## Residuals:
## 1 2 3 4 5 6 7 8
## 0.05455 0.34562 -0.41623 0.01607 -0.01753 -0.32646 -0.02062 0.36461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## type.A 0.08052 0.52744 0.153 0.88463
## type.B 1.48685 0.26023 5.714 0.00230 **
## weight 0.73539 0.13194 5.574 0.00256 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3275 on 5 degrees of freedom
## Multiple R-squared: 0.9942, Adjusted R-squared: 0.9906
## F-statistic: 283.3 on 3 and 5 DF, p-value: 5.316e-06
According to the above result, we can draw the conclusion that, "controlling for the variable $type$", there is a statistically significant positive association between $weight$ and $size$ (coefficient $= 0.73539$, $p$-value $< 0.05$).
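For reference, a minimal sketch of the call that produced the output above, together with what I believe is an equivalent fit that lets R code the factor itself (with the default treatment coding, the intercept plays the role of type.A and typeB plays the role of type.B, so the three estimates should come out the same):

fit_controlled <- lm(size ~ 0 + type.A + type.B + weight, data = df)
summary(fit_controlled)

# Equivalent fit using the factor directly (R's default treatment coding)
fit_factor <- lm(size ~ type + weight, data = df)
summary(fit_factor)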
On the other hand, if we just regress $size$ on $weight$ alone, we get the following:
##
## Call:
## lm(formula = size ~ weight, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01568 -0.45927 -0.01793 0.33862 1.29724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9763 1.0270 1.924 0.103
## weight 0.3914 0.2941 1.331 0.232
##
## Residual standard error: 0.8203 on 6 degrees of freedom
## Multiple R-squared: 0.2279, Adjusted R-squared: 0.09925
## F-statistic: 1.771 on 1 and 6 DF, p-value: 0.2316
This time, $weight$'s $p$-value is $0.232 > 0.05$ and the intercept's $p$-value is $0.103 > 0.05$, implying that neither the intercept nor $weight$ comes out as a statistically significant factor in determining the $size$ of a mouse.
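To make the contrast concrete, here is a short sketch that pulls the two weight slopes out side by side (assuming the objects defined above; fit_simple is just my name for the second model):

fit_simple <- lm(size ~ weight, data = df)

# Slope of weight with vs. without controlling for type
# (roughly 0.735 vs. 0.391, per the two outputs above)
coef(fit_controlled)["weight"]
coef(fit_simple)["weight"]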
Is the above understanding correct?
Credit: the toy dataset is from a StatQuest video: https://www.youtube.com/watch?v=Hrr2anyK_5s