If the confidence intervals do not overlap, the estimates are significantly different; if they do overlap, the estimates may still be significantly different, so overlap by itself is inconclusive. With this in mind, also note that splitting the dataset on the binary variable is not a great idea, because you lose statistical power. The problem gets worse the more unbalanced the groups are (many more 0s than 1s, or vice versa) and the more the two groups' variances differ; a simulation at the end of this answer illustrates the power loss.
Let's look at a simple example where the variable is statistically significant, yet splitting on it yields overlapping confidence intervals:
set.seed(9876)
N <- 1000
x <- rbinom(N, 1, 0.2)          # binary predictor, roughly 20% ones
y <- 0 + x + rnorm(N, 0, 6)     # true group difference of 1, with large noise
dt <- data.frame(y = y, x = x)  # name the columns explicitly; data.frame(y, as.factor(x)) would mangle the x column's name
m0 <- lm(y ~ x, data = dt)      # model fitted to all of the data
summary(m0)
This produces:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0967 0.2116 -0.457 0.6478
x 1.0216 0.4767 2.143 0.0324 *
So x is significant at the 5% level.
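Since x is binary, the t-test on its coefficient is equivalent to a pooled two-sample t-test, which you can run directly as a check (var.equal = TRUE requests the pooled version that matches the regression):
t.test(y ~ x, data = dt, var.equal = TRUE)  # reproduces the t value and p-value from summary(m0)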
Now we split the data, as suggested in the OP, and form a confidence interval for the mean of the response in each group:
dt1 <- subset(dt, x == 0)      # group with x = 0
dt2 <- subset(dt, x == 1)      # group with x = 1
m1 <- lm(y ~ 1, data = dt1)    # intercept-only model: estimates the group mean
m2 <- lm(y ~ 1, data = dt2)
lapply(list(m1, m2), confint)  # 95% confidence interval for each group mean
which produces:
[[1]]
2.5 % 97.5 %
(Intercept) -0.5180074 0.3246138
[[2]]
2.5 % 97.5 %
(Intercept) 0.1338219 1.715985
These intervals overlap, so the naive split-and-compare approach finds no significant difference between the group means, even though the full model just told us the difference is significant at the 5% level. Note that overlapping confidence intervals do not let us conclude that the means are equal; checking for overlap amounts to a much more conservative, lower-power procedure than testing the difference directly.
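A more direct check is to form a confidence interval for the difference itself, which the full model already provides:
confint(m0)  # the "x" row is the 95% CI for the difference in group means
From the estimate and standard error printed above, this interval is roughly (0.09, 1.96): it excludes zero, in agreement with the significant t-test, even though the two group-wise intervals overlap.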
So, it is better to use the t-test from the model that includes all the data.
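If you want to see the power loss directly, here is a small simulation sketch; the effect size, noise level, and replication count are arbitrary choices that mirror the example above:
# Compare the power of the full-model t-test with the "CIs must not overlap" rule.
# The settings (effect = 1, sd = 6, nsim = 2000) are illustrative assumptions.
set.seed(1)
nsim <- 2000
sig_model <- sig_overlap <- logical(nsim)
for (i in seq_len(nsim)) {
  x <- rbinom(1000, 1, 0.2)
  y <- x + rnorm(1000, 0, 6)
  sig_model[i] <- summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"] < 0.05
  ci0 <- confint(lm(y[x == 0] ~ 1))  # CI for the mean of group 0
  ci1 <- confint(lm(y[x == 1] ~ 1))  # CI for the mean of group 1
  sig_overlap[i] <- ci0[1] > ci1[2] || ci1[1] > ci0[2]  # "significant" only if the CIs are disjoint
}
mean(sig_model)    # power of the regression t-test
mean(sig_overlap)  # power of the non-overlap rule
The non-overlap rule should reject far less often, because requiring two 95% intervals to be disjoint is a much stricter criterion than testing the difference at the 5% level, and the gap grows as the groups become more unbalanced.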