
I'm finding unexpected results, and I'm unsure whether I made a mistake somewhere.

In an OLS regression on the full sample, the slope of the independent variable of interest is 0.03 and significant. However, when I divide the sample into two roughly 50/50 subsets and run OLS on each, both regressions are again significant, but the slopes are 0.08 and 0.07. This feels counterintuitive: I don't understand how the slope can be larger in both subsets than in the full sample. Instead, I would expect the full-sample slope to lie between the two subset slopes.

Does anyone know whether these findings could both be valid?

WPMB
  • Please provide more detailed and precise descriptors. For example, you say that you find the independent variable of interest is 0.03 and significant. It is assumed that you are referring to the slope (partial slope) coefficient for the model. – Gregg H Apr 02 '18 at 13:26
  • I'm sorry, you're right. The slope coefficient is 0.03 when I regress on the full sample. When I split the sample into two subsets, the slope coefficient is 0.08 and 0.07. All coefficients are significant. – WPMB Apr 02 '18 at 13:32
  • This falls under what's called _Simpson's paradox_. – Firebug Apr 02 '18 at 14:56
  • This sounds like an instance of Simpson's Paradox. [Here's](https://stats.stackexchange.com/questions/316319/does-simpsons-paradox-cover-all-instances-of-reversal-from-a-hidden-variable/318481#318481) a great SE discussion where Carlos gently corrected my misunderstanding. As for what to do, it sounds like an incidental discovery, so it should be reported as such, with your *main* analysis being the 0.03 finding. – AdamO Apr 02 '18 at 15:05
  • My post at https://stats.stackexchange.com/a/13317/919 illustrates what happens to least-squares slopes when you break the regression into subsets according to values of a regressor. Looking at it might help your intuition. – whuber Apr 02 '18 at 15:16
  • Was this a random division, or did you divide according to some feature? – Acccumulation Apr 02 '18 at 21:43

2 Answers


This is a very common scenario when you split your data into groups that differ systematically. Here's an example:

set.seed(4218)
N <- 100
group <- rep(1:2, each = N %/% 2)

# Both groups share the same true slope, sqrt(.2) ~ 0.45
x <- rnorm(N)
y <- sqrt(.2) * x + sqrt(.8) * rnorm(N)

# Shift group 2 along x so the groups differ systematically
x[group == 2] <- x[group == 2] + 5
splitByGroup <- split(cbind.data.frame(x = x, y = y), group)

# Fit the pooled model and one model per group
modelAll <- lm(y ~ x)
modelG1 <- lm(y ~ x, data = splitByGroup[[1]])
modelG2 <- lm(y ~ x, data = splitByGroup[[2]])

# Plot the points and the three fitted lines
plot(y ~ x, col = group + 1)
abline(coef = modelAll$coef, col = 4, lwd = 2)
abline(coef = modelG1$coef, col = 2, lwd = 2)
abline(coef = modelG2$coef, col = 3, lwd = 2)
legend("topleft", col = 2:4, lwd = 2,
    legend = paste("Slope for", c("group 1", "group 2", "both groups"), "=",
        round(c(modelG1$coef[2], modelG2$coef[2], modelAll$coef[2]), 3)))

[Figure: scatterplot of both groups, with the two steeper within-group regression lines and the flatter all-data line; the legend reports the three slopes]

Here, the overall slope is attenuated because we've failed to account for group. If we include group in our model, however, we can recover a weighted average of the within-group slopes (as you intuited):

# Adjusting for group recovers a slope close to the within-group slopes
modelAll2 <- lm(y ~ x + group)
summary(modelAll2)$coef

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  1.3079814 0.45867941  2.851624 5.315169e-03
x            0.3730811 0.08017352  4.653420 1.034285e-05
group       -1.5352505 0.42212272 -3.636977 4.440556e-04
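One quick way to compare the estimates, using the fitted models above:

# Marginal, group-adjusted, and within-group slope estimates side by side
c(pooled   = unname(coef(modelAll)[2]),
  adjusted = unname(coef(modelAll2)[2]),
  group1   = unname(coef(modelG1)[2]),
  group2   = unname(coef(modelG2)[2]))

With this seed, the adjusted slope (0.373 above) should land near the two within-group estimates, while the pooled slope is pulled toward zero by the 5-unit shift in x between the groups.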
Richard Border

This can happen when different subsets of the data have different distributions. It is hard to describe without a graph, but imagine a small cloud of points in the upper-left corner of a square, shaped like a cigar tilted so that it suggests a positive correlation. Now imagine a duplicate cloud of points in the lower-right corner of the square, far enough away that the two clouds don't really overlap.

If you take the regression for each cloud separately, you will obtain a positive slope. However, if you combine the data sets into one (and ignore the grouping structure), then the aggregate regression will have a negative slope.
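For concreteness, here is a minimal R simulation of that picture (the cloud locations, slope, and noise level are made-up values, chosen only so the clouds sit in opposite corners):

set.seed(1)
n <- 50
# Cloud 1: upper-left corner, cigar tilted to give a positive slope
x1 <- rnorm(n, mean = -3)
y1 <-  3 + 0.8 * (x1 + 3) + rnorm(n, sd = 0.5)
# Cloud 2: same shape, lower-right corner
x2 <- rnorm(n, mean = 3)
y2 <- -3 + 0.8 * (x2 - 3) + rnorm(n, sd = 0.5)

coef(lm(y1 ~ x1))[2]                 # positive slope within cloud 1
coef(lm(y2 ~ x2))[2]                 # positive slope within cloud 2
coef(lm(c(y1, y2) ~ c(x1, x2)))[2]   # negative slope for the pooled data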

This example is more extreme than what you have described (the sign even reverses), but the same general mechanism could explain what you are observing.

Gregg H