
I learned from @amoeba's comment on this question that:

> Also, the difference between "building separate models" and "using a categorical variable" is not clear to me. `activity ~ condition + species + condition*species` - this uses species as a categorical variable, but this is fully equivalent to a separate regression `activity ~ condition` for each species separately.

I did a numerical example and observed that they are the same (by checking the norm of the difference between the two prediction vectors). But how can this be proved mathematically? I think it is related to the design matrix and the dummy coding of the categorical variable.


PS: here is my numerical experiment

# code to show that building a separate model for each cut group is
# equivalent to using the formula price ~ carat + cut + carat * cut

# -------------------------------------------------------------------------
# make data, 3 classes, only 1 feature
# -------------------------------------------------------------------------
library(dplyr)
d=ggplot2::diamonds
d=d[,c("cut","carat","price")]
d=subset(d,cut %in% c("Very Good","Premium","Ideal"))
# use an unordered factor so lm() applies dummy (treatment) coding rather
# than the polynomial contrasts used for ordered factors; factor() also
# drops the unused levels of cut
d$cut=factor(d$cut, ordered=F)
d

# -------------------------------------------------------------------------
# method 1, build 3 models for each class by using subsets
# -------------------------------------------------------------------------

s1=subset(d,cut=="Very Good")
fit1=lm(price ~ carat,s1)
p_s1=predict(fit1,s1)

s2=subset(d,cut=="Premium")
fit2=lm(price ~ carat,s2)
p_s2=predict(fit2,s2)

s3=subset(d,cut=="Ideal")
fit3=lm(price ~ carat,s3)
p_s3=predict(fit3,s3)

# -------------------------------------------------------------------------
# method 2, build 3 models by using interaction in formula
# -------------------------------------------------------------------------

# note: carat * cut already expands to carat + cut + carat:cut, so the
# extra main-effect terms in this formula are redundant but harmless
fit0=lm(price ~ carat + cut + carat * cut,d)
p0=predict(fit0,d)

# -------------------------------------------------------------------------
# show they are the same
# -------------------------------------------------------------------------
# sort both versions identically so the rows align, then compare predictions
d_ext=cbind(d,p=p0)
d2_ext=rbind(cbind(s1,p=p_s1),cbind(s2,p=p_s2),cbind(s3,p=p_s3))

d_ext=arrange(d_ext,cut,carat,price)
d2_ext=arrange(d2_ext,cut,carat,price)

norm(as.matrix(d_ext$p - d2_ext$p)) # ~0, i.e. the predictions agree up to numerical error
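
One way to see why this works is to inspect the design matrix of the interaction model directly (a sketch; model.matrix() is base R, and the column names in the comments assume "Very Good" ends up as the reference level of cut):

# -------------------------------------------------------------------------
# optional: inspect the design matrix of the interaction model
# -------------------------------------------------------------------------
X=model.matrix(fit0)
colnames(X)
# expected: "(Intercept)" "carat" "cutPremium" "cutIdeal"
#           "carat:cutPremium" "carat:cutIdeal"
# a "Very Good" row is non-zero only in (Intercept) and carat; a "Premium"
# row additionally in cutPremium and carat:cutPremium; and so on. After a
# change of basis the least-squares problem therefore splits into one
# independent problem per cut group.
head(X)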
  • I am not familiar with R, but can lm give you the design matrix? I would think you could look at its sparsity structure (i.e. non-zeros) to investigate. (Dummy coding is like one-hot, right? So no zeros will come out of dot products due to +/- cancellations, so you can ignore sign?) – GeoMatt22 Apr 30 '17 at 03:15
  • @GeoMatt22 I tried looking at the design matrix and the dummy coding; I can see why they are the same, but I am wondering whether there is an elegant proof with matrix algebra. – Haitao Du May 01 '17 at 01:53
  • OK. I was not sure if the "automagic" re-coding R does under the hood was obscuring this. (I don't have much R experience, so not sure what can be queried.) – GeoMatt22 May 01 '17 at 02:05
  • This question is answered at https://stats.stackexchange.com/questions/13112. My answer there demonstrates mathematically that the t-tests to compare the main effects differ between the models. At the end of an answer to a related question I also discuss this issue briefly: https://stats.stackexchange.com/a/12809/919. No algebra is needed at all: when you write down the two models, you see that the two-regression model has an extra (identifiable) parameter, which ought to make everything clear. – whuber May 26 '17 at 17:00
  • Thanks to @whuber for this comment in chat: Separate linear regressions will result in different estimates of the variance of the error terms. A single regression with an interaction will produce a single ("pooled") estimate of the error terms. In so doing, it's likely to produce slightly different coefficient estimates, too. Thus, your choice of how to proceed should be determined in large part by your assumptions about homoscedasticity and any evidence to the contrary. It's really just a generalization of the distinction between t-tests assuming equal or unequal variances. – Haitao Du May 26 '17 at 17:41
  • @whuber I understand that the error variance will be different and so everything dependent on it (like t-tests for individual parameters) can differ too. But how can coefficient estimates differ? In the example quoted by hxd1011 in the beginning of this Q, we are talking about $y=ax+b$ for each of the $i=1\ldots k$ groups separately vs. $y=a_i x + b_i$ for all groups together; shouldn't this produce exactly equivalent coefficients? – amoeba May 28 '17 at 10:38
  • @Amoeba You are correct: in the case of least squares regression, the objective function is the same in both cases and so the coefficient estimates must be the same. – whuber May 28 '17 at 13:47
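
whuber's last point is easy to verify numerically with the fits above (a sketch; the coefficient names assume "Very Good" is the reference level of cut, so the Premium group's intercept and slope come from adding the dummy and interaction terms):

# -------------------------------------------------------------------------
# optional: recover the per-group coefficients from the pooled fit
# -------------------------------------------------------------------------
b=coef(fit0)
rbind(
  from_pooled=c(b["(Intercept)"]+b["cutPremium"], b["carat"]+b["carat:cutPremium"]),
  separate=coef(fit2)  # the Premium-only fit; the two rows should match
)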

1 Answer


It is not quite the same. Why it is almost the same can be seen by writing down the regression equations for each case and noting that (with suitable relabeling) the same terms describe the mean response in each subgroup.
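
Concretely, for one covariate $x$ and groups $i = 1, \ldots, k$ (a sketch using treatment coding with group 1 as the reference):

$$\text{separate models:}\quad y = a_i + b_i x + \varepsilon,\qquad \varepsilon \sim \mathcal{N}(0, \sigma_i^2)\ \text{within group } i,$$

$$\text{interaction model:}\quad y = \alpha + \beta x + \gamma_i + \delta_i x + \varepsilon,\qquad \gamma_1 = \delta_1 = 0,\quad \varepsilon \sim \mathcal{N}(0, \sigma^2).$$

Matching $a_i = \alpha + \gamma_i$ and $b_i = \beta + \delta_i$ shows the two mean structures are identical; the separate models, however, estimate $k$ error variances $\sigma_1^2, \ldots, \sigma_k^2$ where the interaction model estimates a single pooled $\sigma^2$.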

Why it is not quite the same: in one case you assume (and estimate) a single error variance across all subgroups, while the separate models each estimate their own variance.
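
This is easy to check in the question's own example (a sketch; sigma() extracts the residual standard error from an lm fit):

sigma(fit0)                              # one pooled estimate of the error SD
c(sigma(fit1), sigma(fit2), sigma(fit3)) # three separate, generally different estimates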

Björn
  • +1 Could you tell me more about why they are not the same? The predicted values are identical (as verified by the norm of the difference). – Haitao Du May 01 '17 at 01:52
  • @hxd1011 I think the answer was implying that the prediction [*variance*](https://en.wikipedia.org/wiki/Prediction_interval#Regression_analysis) would differ, though the prediction means would agree (is your R code checking both? see the sketch after these comments). – GeoMatt22 May 01 '17 at 01:58
  • @GeoMatt22 Thanks! This helped me. I come from "machine learning", where making predictions is all there is, without thinking much about point estimates and variances. – Haitao Du May 01 '17 at 02:17
  • @hxd1011 OK. For ML the "variance" can still come into play for model selection or hyperparameter tuning though, no? (e.g. here, could per-class CV perform differently than aggregated interaction-model CV?) – GeoMatt22 May 01 '17 at 02:30
  • @GeoMatt22 Are you saying the two approaches will have different positions on the bias-variance curve? See figure 6 in section 4.4 of this [link](http://scott.fortmann-roe.com/docs/BiasVariance.html). – Haitao Du May 01 '17 at 02:49
  • @hxd1011 Even more basic: the training data will differ, no? (all data together, vs. separate data per class) Without getting bogged down in details, I really only meant that the "variance" (i.e. training error, RMS residuals) is not just ignored in ML. – GeoMatt22 May 01 '17 at 03:09
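
As a sketch of GeoMatt22's point above: the fitted means agree, but prediction intervals differ because they are built from different error-variance estimates (predict.lm with interval = "prediction" is base R):

# same point predictions, different interval widths for the Premium group
head(predict(fit0, s2, interval="prediction"))  # uses the pooled sigma estimate
head(predict(fit2, s2, interval="prediction"))  # uses the Premium-only sigma estimate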