
Suppose I have 2 clusters of data: $\{(Y_{1i}, X_{1i})\}_{i=1}^{n_1}$ and $\{(Y_{2i}, X_{2i})\}_{i=1}^{n_2}$, and I'm interested in running a simple linear regression on each cluster.

I assume that

$$Y_{1i} = \beta_{10} + \beta_{11}X_{1i}+\epsilon_{1i}$$

$$Y_{2i} = \beta_{20} + \beta_{21}X_{2i}+\epsilon_{2i},$$

where $\epsilon_{1i}, \epsilon_{2i}$ have mean 0 given $X$. To estimate the intercept and slope coefficients, I can minimize the empirical squared error in the two clusters separately:

$$\operatorname*{arg\,min}_{\beta_{10}, \beta_{11}} \frac{1}{n_1}\sum_{i=1}^{n_1}(Y_{1i} - \beta_{10}-\beta_{11}X_{1i})^2$$ $$\operatorname*{arg\,min}_{\beta_{20}, \beta_{21}} \frac{1}{n_2}\sum_{i=1}^{n_2}(Y_{2i} - \beta_{20}-\beta_{21}X_{2i})^2$$

Now suppose I assume that the intercept and slope coefficients are identical between the two clusters, i.e., $\beta_{10} = \beta_{20} = \beta_0$ and $\beta_{11} = \beta_{21} = \beta_1$. Is this equivalent to running a single linear regression model on the pooled data? That is, I would minimize $$\operatorname*{arg\,min}_{\beta_{0}, \beta_{1}} \frac{1}{n_1 + n_2}\sum_{i=1}^{n_1 + n_2}(Y_{i} - \beta_{0}-\beta_{1}X_{i})^2$$
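For concreteness, here is how I would compute the three fits numerically (a minimal numpy sketch; the simulated data, seed, and true coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two clusters from the same true line (beta0 = 1, beta1 = 2)
# with equal error variance -- purely illustrative values.
n1, n2 = 50, 70
x1 = rng.normal(size=n1)
x2 = rng.normal(size=n2)
y1 = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n1)
y2 = 1.0 + 2.0 * x2 + rng.normal(scale=0.5, size=n2)

def ols(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_clu1 = ols(x1, y1)  # no-pooling fit, cluster 1
b_clu2 = ols(x2, y2)  # no-pooling fit, cluster 2
b_pool = ols(np.concatenate([x1, x2]), np.concatenate([y1, y2]))  # pooled fit

print(b_clu1, b_clu2, b_pool)
```

Even with identical true coefficients, the three sets of estimates differ by sampling variation, which is part of what I am trying to understand.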

Adrian
  • You are assuming that the two datasets can be described by the same model and then you are asking if you can fit them to the same model? – J. Delaney Feb 27 '22 at 12:21
  • Your question is poorly worded. The error terms are not specified completely. What are their variances? If they have different variances, then the MLEs of the pooled intercept and slope coefficients will not be obtained by minimizing the shown least-squares criterion. – user277126 Feb 28 '22 at 01:12
  • You need to further assume that the variances of the two error terms are also the same if you want to use OLS on the pooled data. – Dayne Feb 28 '22 at 07:01
  • Agree with @Dayne's point. One of the assumptions of the Gauss-Markov theorem (which says the OLS estimators have the lowest variance amongst linear unbiased estimators) is that the error terms all have the same variance, so if the clusters have different variances that violates it. – David Veitch Feb 28 '22 at 17:02
  • Very closely related: https://stats.stackexchange.com/questions/533857, https://stats.stackexchange.com/questions/12797, and https://stats.stackexchange.com/questions/13112. There are others. – whuber Mar 03 '22 at 22:39
  • Your question is ambiguous and has two opposite and contradictory answers depending on how the ambiguity is interpreted. By "intercept and slope coefficients" do you mean what you literally wrote or do you mean their *estimates*? – whuber Mar 03 '22 at 23:46
  • @whuber do you refer to the ambiguity about the potential difference between the distributions of the $\epsilon_{1i}$ and $\epsilon_{2i}$? – Sextus Empiricus Mar 04 '22 at 09:49
  • @Sextus I mean the ambiguity between "coefficient" and "estimate," often expressed as the distinction between $\beta_i$ and $\hat\beta_i.$ The question uses the former notation but the answers (when I wrote that comment) interpret it as if it were the latter notation. – whuber Mar 04 '22 at 13:32
  • @whuber I have difficulties interpreting the alternative *"Now suppose I assume that the intercept and slope coefficients are identical between the two clusters, i.e., $\hat\beta_{10} = \hat\beta_{20} = \hat\beta_0$ and $\hat\beta_{11} = \hat\beta_{21} = \hat\beta_1$"*. But I would get something like the two estimates being similar in the sample distribution $\hat\beta_{10} \sim \hat\beta_{20}$. – Sextus Empiricus Mar 04 '22 at 13:40
  • @Sextus Yes, I have trouble with that too, because it leads to trivialities. But, as I wrote, that appears to be the dominant interpretation among the posted answers. – whuber Mar 04 '22 at 13:41
  • @whuber I see now what you mean. We could have a situation in which the $\beta$ are different, but we make a single estimate $\hat\beta$ for multiple (different) parameters. (This is not how I interpreted the other answers; I was thinking more of the problems in the comment by user277126, and thought that you were indirectly referring to those. But I see now that this alternative is also interesting.) – Sextus Empiricus Mar 04 '22 at 13:55
  • To me, this question reads roughly "is optimization in the pooled model equivalent to optimization in the unpooled model subject to the constraint that the parameters are equal across clusters?" Seems reasonable for a beginner. OP stipulates two "clusters" in the data and focuses on estimation under different assumptions. The other interpretation is the one that surprises me; that there even are "true coeffs" at all seems a strong assumption, unless it is known that the data were simulated from exactly such a linear model. – Gianni Mar 05 '22 at 15:17

3 Answers


Short answer: yes.*

The first model you describe is a "no pooling" model where coefficients are treated independently. The second is a "complete pooling" model. [1]

You can rewrite the no-pooling model with a single expression: $\hat{y}_i = \mathbb{1}[c=1](\beta_{10} + \beta_{11} x_i) + \mathbb{1}[c=2](\beta_{20} + \beta_{21} x_i)$. [2]

Fixing the no-pooling betas to be equal, $\hat{y}_i = \mathbb{1}[c=1](\beta_0 + \beta_1 x_i) + \mathbb{1}[c=2](\beta_0 + \beta_1 x_i)$, which reduces to just $\hat{y}_i = \beta_0 + \beta_1 x_i$. [3]
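To make the reduction concrete, here is a small numpy sketch (the simulated data are hypothetical) showing that the no-pooling model can be fit as one regression with indicator columns, and that tying the coefficients collapses that design to the pooled design $[1, x]$:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 40, 60
x = rng.normal(size=n1 + n2)
c = np.repeat([1, 2], [n1, n2])  # cluster labels
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n1 + n2)

i1 = (c == 1).astype(float)  # indicator 1[c=1]
i2 = (c == 2).astype(float)  # indicator 1[c=2]

# No-pooling design: separate intercept and slope per cluster.
X_np = np.column_stack([i1, i1 * x, i2, i2 * x])
beta_np, *_ = np.linalg.lstsq(X_np, y, rcond=None)

# Imposing beta_10 = beta_20 and beta_11 = beta_21 merges the indicator
# columns: since i1 + i2 == 1 everywhere, the design reduces to [1, x].
X_pool = np.column_stack([i1 + i2, (i1 + i2) * x])
beta_pool, *_ = np.linalg.lstsq(X_pool, y, rcond=None)

assert np.allclose(X_pool[:, 0], 1.0)  # merged intercept column is all ones
print(beta_np, beta_pool)
```

Because the no-pooling design is block-structured, its first two coefficients also reproduce the separate cluster-1 regression exactly.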

* Edit: As other commenters have pointed out, your pooled model also implicitly assumes homoscedasticity. I read that as an accidental omission in your description, but without that assumption, your pooled model expression is indeed no longer correct.


[1] There are also "partial pooling" models that jointly estimate shared and independent terms for different data groupings / "clusters". While you haven't asked about that explicitly, partial pooling might be interesting for you to look into for your problem.

[2] In case you're not familiar, $\mathbb{1}$ is the indicator function. In the notation I'm using, $\mathbb{1}[c=1]$ is 1 when $c=1$ and $0$ otherwise.

[3] You can also see from writing the models out like this that there are further options: for example, you can allow any combination of the intercept, the slope, and the error term variance to vary with group.

Gianni
  • The "yes" answer requires interpreting the question as positing that the coefficient *estimates* in the two clusters are identical, whereas--although that might be the intention of the questioner--what the question actually supposes, explicitly, is that the *true coefficient values* are the same. At the very least, then, we would expect the three sets of estimates (two individual and pooled models) to differ due to random variation. – whuber Mar 03 '22 at 23:45
  • "Now suppose I assume that the intercept and slope coefficients are identical [...]" clearly indicates a modeling assumption to me. It's no way obviously, let alone "explicitly", about true coefficient values. In fact, the initial, explicit supposition "Suppose I have 2 clusters of data [...]" indicates that the "true" coefficient values are not necessarily the same. The context of the question further indicates interest in estimation difference under different models, not about model-data mismatch. While the question could be clearer, I think my interpretation is on mark. – Gianni Mar 05 '22 at 14:35

Is this equivalent to running a single linear regression model on the pooled data?

You are already pooling data when you sum over the observations within a single cluster. The equation

$$Y_{1i} = \beta_{10} + \beta_{11}X_{1i}+\epsilon_{1i}$$

can be seen as $n_1$ different clusters, each of size one:

$$Y_{1,1} = \beta_{10} + \beta_{11}X_{1,1}+\epsilon_{1,1} \\ Y_{1,2} = \beta_{10} + \beta_{11}X_{1,2}+\epsilon_{1,2} \\ Y_{1,3} = \beta_{10} + \beta_{11}X_{1,3}+\epsilon_{1,3} \\ \vdots \\ \vdots \\ Y_{1,n_1} = \beta_{10} + \beta_{11}X_{1,n_1}+\epsilon_{1,n_1} \\$$

Now you have $n_1 + n_2$ different clusters

$$Y_{1,1} = \beta_{0} + \beta_{1}X_{1,1}+\epsilon_{1,1} \\ Y_{1,2} = \beta_{0} + \beta_{1}X_{1,2}+\epsilon_{1,2} \\ Y_{1,3} = \beta_{0} + \beta_{1}X_{1,3}+\epsilon_{1,3} \\ \vdots \\ \vdots \\ Y_{1,n_1} = \beta_{0} + \beta_{1}X_{1,n_1}+\epsilon_{1,n_1} \\ \, \\ Y_{2,1} = \beta_{0} + \beta_{1}X_{2,1}+\epsilon_{2,1} \\ Y_{2,2} = \beta_{0} + \beta_{1}X_{2,2}+\epsilon_{2,2} \\ Y_{2,3} = \beta_{0} + \beta_{1}X_{2,3}+\epsilon_{2,3} \\ \vdots \\ \vdots \\ Y_{2,n_2} = \beta_{0} + \beta_{1}X_{2,n_2}+\epsilon_{2,n_2} \\$$

If the $\epsilon_{1,i}$ and $\epsilon_{2,i}$ are independent and have the same distribution*, then this is equivalent to a single cluster of $n_1 + n_2$ variables.

However, it is not equivalent when the $\epsilon_{1,i}$ and $\epsilon_{2,i}$ have different distributions/variances. In that case, you will perform some sort of weighted sum.

See How to combine two measurements of the same quantity with different confidences in order to obtain a single value and confidence. With the method in that link, if we estimate the variances of the two pools to be equal (up to scaling by the factors $X^TX$, $n_1$ and $n_2$), then the method reduces to running a single linear regression model.
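As a rough numerical sketch of the unequal-variance case (hypothetical simulated data; the per-observation error scales are assumed known), ordinary pooled least squares and the variance-weighted fit give different estimates:

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2 = 100, 100
x = rng.normal(size=n1 + n2)
sigma = np.repeat([0.2, 2.0], [n1, n2])  # cluster 2 is much noisier
y = 1.0 + 2.0 * x + rng.normal(scale=sigma)

X = np.column_stack([np.ones_like(x), x])

# Ordinary (unweighted) pooled least squares.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted least squares with weights 1/sigma^2: rescale each row so the
# errors have unit variance, then run ordinary least squares.
w = 1.0 / sigma
beta_wls, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)

print(beta_ols, beta_wls)
```

The weighted fit leans heavily on the low-noise cluster, so its estimates are typically much closer to the true line than the unweighted pooled fit.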


*Or, less strictly, if they have the same variance. You might be thinking of least squares regression without the $\epsilon$ being normally distributed, caring only about the variance.

Sextus Empiricus

This depends on exactly what you mean by "equivalent" in: "Is this equivalent to running a single linear regression model on the pooled data?"

If you mean whether you would get the same results from running the regression on the pooled data, then the answer is yes.

jmcbon
    It's not equivalent: the estimated error variance in the pooled model will differ from the estimates in either individual model. As a result, all p-values for tests will be different. Moreover, because of random variation, all three sets of parameter estimates will differ, too. Thus, it's important to state specifically *what* parts of the model you believe constitute "same results" and the sense in which they are the "same." – whuber Mar 03 '22 at 23:42
  • True. It definitely comes down to what is meant by equivalent/same. I was just thinking about the coefficients being the same. – jmcbon Mar 04 '22 at 04:36