Suppose two samples A and B of size N = 20 are taken from a population of pairs $(Y_i, X_i)$, and separate OLS regressions are calculated from each sample for the model: $$Y_i=\beta_1+\beta_2X_i+\varepsilon_i$$

Considering, say, the slope coefficient, this yields two estimates $\hat\beta_{2A}$ and $\hat\beta_{2B}$ with associated standard errors $s_{2A}$ and $s_{2B}$. It is then possible to apply a t-test of a null hypothesis that I provisionally state as follows:

$H_0$: The difference between the mean of the sampling distribution of the estimate of $\beta_2$ associated with sample A and the mean of the sampling distribution of the estimate of $\beta_2$ associated with sample B is less than $\Delta$.

Here $\Delta$ is the size of difference that would be of practical concern given the purpose of the model (I've been guided here by the answers to this question).

Since the standard errors are likely to differ, the appropriate test appears to be Welch’s t-test, the test statistic being (assuming $\hat\beta_{2A} > \hat\beta_{2B}$):

$$t=\frac{\hat\beta_{2A}-\hat\beta_{2B}-\Delta}{\sqrt{s_{2A}^2+s_{2B}^2}}$$
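
As a rough illustration of the calculation, here is a minimal Python sketch using only the standard library. The function names, the simulated population in `draw_sample`, and the value of `delta` are all hypothetical. The degrees-of-freedom formula is the usual Welch-Satterthwaite approximation, applied on the assumption that each standard error carries $N-2$ degrees of freedom; the question itself does not specify this.

```python
import math
import random


def ols_slope_and_se(x, y):
    """Slope estimate and its standard error for y = b1 + b2*x + e (simple OLS)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Residual variance uses n - 2 degrees of freedom (two fitted coefficients).
    return slope, math.sqrt(rss / (n - 2) / sxx)


def slope_difference_t(b_a, s_a, b_b, s_b, delta):
    """t statistic as written in the question; assumes b_a > b_b."""
    return (b_a - b_b - delta) / math.sqrt(s_a ** 2 + s_b ** 2)


def welch_df(s_a, s_b, n_a, n_b):
    """Welch-Satterthwaite df, assuming n - 2 df behind each standard error."""
    return (s_a ** 2 + s_b ** 2) ** 2 / (
        s_a ** 4 / (n_a - 2) + s_b ** 4 / (n_b - 2)
    )


def draw_sample(rng, n=20):
    """Hypothetical population: Y = 1 + 2*X + noise, X uniform on [0, 10]."""
    x = [rng.uniform(0, 10) for _ in range(n)]
    y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]
    return x, y


if __name__ == "__main__":
    rng = random.Random(0)
    # Fit both samples, then label the one with the larger slope as A.
    fits = sorted((ols_slope_and_se(*draw_sample(rng)) for _ in range(2)), reverse=True)
    (b_a, s_a), (b_b, s_b) = fits
    print("t  =", slope_difference_t(b_a, s_a, b_b, s_b, delta=0.1))
    print("df =", welch_df(s_a, s_b, 20, 20))
```

Because both simulated samples come from the same population here, a large value of $t$ should be rare; that is the situation the question describes.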

I can see how to do the calculation, and see in general terms that if the null hypothesis were to be rejected, that might suggest departures from the assumptions of the classical linear regression model or non-randomness in sample selection. However, I am puzzled as to what exactly is tested by such a t-test.

Question: How could my formulation of $H_0$ be improved to indicate more precisely what this application of a t-test actually tests? As formulated, it seems to assume that it makes sense to refer to the sampling distribution of a particular sample. This seems wrong because a sampling distribution (of a statistic such as the mean) is a property of repeated samples, not of just one sample. If the two samples had been drawn from distinct sub-populations, then it would make sense to refer to the sampling distributions of the means for each sub-population. But this is not the case here: both samples are drawn from the whole population.

Adam Bailey
  • Division by $N-2$ in $t$ is incorrect: the standard error of $\hat{\beta}_{2A} - \hat{\beta}_{2B}$ is just $\sqrt{s^2_{2A} + s^2_{2B}}.$ Your formulation of $H_0$ otherwise looks fine: it's hard to imagine what else would go into a good answer apart from a discussion of the assumptions underlying OLS, which would be redundant in light of many such threads on this site. – whuber Oct 14 '13 at 19:54
  • @whuber Thanks, I've edited the question to remove the division by $N-2$ and hopefully clarify why I perceive a problem with $H_0$. Also I realise there was a possible ambiguity in my formulation of $H_0$ which I've now removed. – Adam Bailey Oct 15 '13 at 09:38
  • This question looks very uncommon. Under the usual assumptions, the estimators of the regression parameters are unbiased. Thus the means of their sampling distributions are just the parameters themselves. So an equivalent formulation of your null hypothesis is $H_0: |\beta_{2A}-\beta_{2B}| \le \Delta$. This hypothesis could be tested via a confidence interval for the true interaction with a grouping variable (A vs. B) using a single regression. – Michael M Oct 15 '13 at 09:56
  • @MichaelMayer Could you explain please what $\beta_{2A}$ means in your formulation? $\hat\beta_{2A}$ has a clear meaning here - it's the estimated value, based on sample A, of the population coefficient $\beta_2$. It can't be an estimate of $\beta_2$ for a sub-population as sample A is drawn from the whole population. There is no sub-population coefficient here for $\beta_{2A}$ to represent. – Adam Bailey Oct 16 '13 at 07:29
  • $\beta_{2A}$ and $\beta_{2B}$ represent the expected values of the estimators $\hat \beta_{2A}$ and $\hat \beta_{2B}$, i.e. the means of the sampling distributions of those estimators. If A and B were sampled in a quite similar manner, we would have $H_0: |\beta_{2A} - \beta_{2B}| \le \Delta$. If you can reject this $H_0$ based on your data, you could be confident that the two samples were not sampled in a similar manner. Does this make sense? – Michael M Oct 16 '13 at 09:15
  • @MichaelMayer Thank you. So in words, the null hypothesis should be this: the difference between the mean of the values of $\hat\beta_{2A}$ over all possible samples obtained in the manner of sample A and the mean of the values of $\hat\beta_{2B}$ over all possible samples obtained in the manner of sample B is no more than $\Delta$? – Adam Bailey Oct 17 '13 at 07:03
  • Yes. I have to confess though that this is equivalent to the null in your post :-). – Michael M Oct 17 '13 at 15:53
  • @MichaelMayer No need to "confess"! You've helped me a lot by leading me to a precise interpretation of my vague "associated with". – Adam Bailey Oct 18 '13 at 08:01
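
The single-regression approach suggested in the comments can be written out as follows. This is only a sketch under assumptions not stated in the thread: $D_i$ is a hypothetical indicator equal to 1 for observations in sample B and 0 for those in sample A, and the error variance is taken to be the same in both samples.

$$Y_i=\beta_1+\beta_2X_i+\gamma_1D_i+\gamma_2D_iX_i+\varepsilon_i$$

Because every term is interacted with $D_i$, the pooled fit reproduces the two separate fits, so $\hat\gamma_2=\hat\beta_{2B}-\hat\beta_{2A}$. Under the common-variance assumption its standard error is

$$\operatorname{se}(\hat\gamma_2)=\hat\sigma\sqrt{\frac{1}{S_{xx,A}}+\frac{1}{S_{xx,B}}},\qquad \hat\sigma^2=\frac{RSS_A+RSS_B}{N_A+N_B-4}$$

where $S_{xx,A}$ and $S_{xx,B}$ are the within-sample sums of squared deviations of $X_i$ from its sample mean, and $RSS_A$, $RSS_B$ are the residual sums of squares from the separate fits. A confidence interval $\hat\gamma_2\pm t_{1-\alpha/2,\,N_A+N_B-4}\,\operatorname{se}(\hat\gamma_2)$ lying entirely outside $[-\Delta,\Delta]$ would then count as evidence against $H_0: |\beta_{2A}-\beta_{2B}|\le\Delta$. Note that this pools the residual variance, so it is not the same calculation as the Welch statistic in the question, which allows the two standard errors to differ.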
