In this post, one of the answers provides the following information about the assumptions of linear regression in the case of random design (as opposed to fixed design):

The usual regression model is $Y=X\beta+\varepsilon$ and the assumptions are:

  • $E[\varepsilon|X]=0$
  • Homoscedasticity, $E[\varepsilon^2|X]=\sigma^2$
  • No serial correlation, $E[\varepsilon_i\varepsilon_j|X]=0$
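
For reference, here is how I read the last two conditions together. This is a sketch under my own assumptions: I take the last condition to be the conditional expectation of the product $\varepsilon_i\varepsilon_j$ for every pair $i \neq j$ (the quoted post does not state the quantifier), and I write $\varepsilon = (\varepsilon_1,\dots,\varepsilon_n)^\top$ for the vector of errors and $I_n$ for the $n \times n$ identity matrix, neither of which appears in that post: $$ E[\varepsilon_i\varepsilon_j|X] = \begin{cases} \sigma^2, & i = j \\ 0, & i \neq j \end{cases} \quad \Longleftrightarrow \quad E[\varepsilon\varepsilon^\top|X] = \sigma^2 I_n. $$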

Note that there is a subscript on the error terms in the last assumption. Does this mean this assumption is not relevant to the population variables, but only to the sample data?

To make it clearer what I'm asking, consider the linear regression model with respect to population variables vs sample variables. Denote by $X^p$, $Y^p$, and $\varepsilon^p$ the population variables for which the regression model holds: $$ Y^p = f(X^p) + \varepsilon^p. $$

Consider drawing $n$ samples from the population: $(X_1^s,Y_1^s),(X_2^s,Y_2^s),\dots,(X_n^s,Y_n^s)$. The linear regression model applied to the samples is $$ Y_i^s = f(X_i^s) + \varepsilon_i^s, \quad \quad i=1,2,\dots,n, $$ where the errors $\varepsilon_i^s$ all have the same CDF as the population error $\varepsilon^p$. In the samples we have multiple error observations $\varepsilon_i^s$, so the 'no serial correlation' assumption makes sense. But for the population model we have just a single random variable for the error, $\varepsilon^p$, so it seems we can't speak of 'no serial correlation' with respect to the population?
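
To make the setup concrete, here is a minimal simulation sketch of what I have in mind. Everything specific in it is my own assumption for illustration: the linear form $f(x) = 2 + 3x$, standard normal regressors, and standard normal errors drawn independently of the regressors (so the conditional and unconditional second moments of the errors coincide). It repeatedly draws $n$ observations and estimates $E[\varepsilon_i\varepsilon_j]$ for every pair $i,j$.

```python
import numpy as np

rng = np.random.default_rng(0)


def draw_sample(n, rng):
    """Draw n observations (x_i, y_i, eps_i) from a hypothetical population."""
    x = rng.normal(size=n)
    eps = rng.normal(scale=1.0, size=n)  # i.i.d. errors, independent of x (assumed)
    y = 2.0 + 3.0 * x + eps              # hypothetical f(x) = 2 + 3x
    return x, y, eps


n, reps = 5, 20000
errors = np.empty((reps, n))
for r in range(reps):
    _, _, errors[r] = draw_sample(n, rng)

# Entry (i, j) estimates E[eps_i * eps_j] across the repeated draws.
second_moments = errors.T @ errors / reps
print(np.round(second_moments, 2))
```

Under these assumptions the estimated matrix should be close to $\sigma^2 I_n$ with $\sigma^2 = 1$: diagonal entries near 1 and off-diagonal entries near 0.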

I have assumed that we can frame the linear regression model with respect to both the sample data and the population from which the sample was drawn. However, if I have made some fundamental mistake in drawing a distinction between the regression model applied to the population vs the sample, please let me know. I am still getting to grips with the details of statistical models.

Richard Hardy
ManUtdBloke
  • The sense of "serial correlation" is ambiguous here: does it mean serially *in the order the data were collected,* as it often does in such circumstances, or does it mean serially *in an order established by one of the regressor variables,* as it does in time series analysis (for instance)? The latter would have a clear meaning in the population but the former (obviously) would not. – whuber Sep 25 '20 at 14:32
  • "The usual regression model is Y=Xβ+ε and the assumptions are...": the "usual assumption" for large-sample inference in the linear regression model does not require the absence of serial correlation (not at all, really), or conditional homoskedasticity for that matter. "Does this mean this assumption is not relevant to the population variables, but only to the sample data?": the stated condition, $E[\varepsilon_i\varepsilon_j|X]=0$, is a population assumption. – Michael Sep 26 '20 at 01:31
  • @whuber I am interested in standard linear regression, not time series analysis. – ManUtdBloke Sep 28 '20 at 08:55
  • @Michael Serial correlation and conditional homoskedasticity are listed as required 'assumptions for observational research' on page 532 of Encyclopedia of Research Design by Neil J. Salkind. – ManUtdBloke Sep 28 '20 at 08:58
  • @Michael It is hard to see how $E[\varepsilon_i\varepsilon_j|X] = 0$ is a population assumption when we have subscripts on the error terms, which signify multiple different error random variables, whereas there is only one random variable for the error associated with the population? – ManUtdBloke Sep 28 '20 at 09:06
  • @ManUtdBloke, if you can index the elements of the sample by $i$ and $j$, why would you think you cannot index the elements of the population accordingly? – Richard Hardy Sep 28 '20 at 10:05
  • The reference to "serial correlation" refers to a post in which this term is misused. Because the implicit quantifiers in that post are *all* distinct pairs $i,j,$ this assumption has nothing to do with serial correlation: it simply means that no two errors are correlated. – whuber Sep 28 '20 at 13:07
  • @Michael Because for the samples we have distinct instances for which the errors can be indexed by $i=1,2,\dots,n$. But there is no notion of distinct instances for the population: we have a single population with a single error variable (which doesn't have a subscript). – ManUtdBloke Sep 29 '20 at 16:01
  • @whuber The serial correlation assumption which features in that post and which I have used here is taken directly from p. 532 of Salkind's book 'Encyclopedia of Research Design' and he does indeed call this assumption 'No serial correlation'. – ManUtdBloke Sep 29 '20 at 16:03
  • @ManUtdBloke, the indexing is for elements of a single sample, not for different samples. Therefore, if you can index a sample, you can index a population just as well. – Richard Hardy Sep 29 '20 at 16:20
  • In that case his language is incorrect! Serial correlation is clearly and authoritatively defined in most time series textbooks, where it concerns the structure of correlations among a set of random variables that are *ordered along a line.* That's literally what "serial" means. Indeed, in the multiple regression setting (which Salkind appears to be considering on that page) there is no natural ordering of the data, making the term "serial" meaningless and therefore superfluous. – whuber Sep 29 '20 at 19:07
  • @RichardHardy Regression is based on drawing $n$ samples from a single population. The indexing refers to each sample, e.g. $Y_i = f(X_i) + \varepsilon_i$ for $i=1,2,\dots,n$ is stating that this relationship holds for each of the $n$ samples. The first sample corresponds to $(X_1,Y_1,\varepsilon_1)$, the second corresponds to $(X_2,Y_2,\varepsilon_2)$, and so on. – ManUtdBloke Oct 06 '20 at 14:16
  • @whuber Thanks for the clarification, it is hard to know what literature to trust yet as I'm still only in the process of transitioning into statistics. – ManUtdBloke Oct 06 '20 at 14:18
  • The notion of sample in statistics is different from the one you are using. A sample is a subset of $n$ elements of the population, and usually $n>1$. If you have a dataset with $n$ observations, this is your sample. – Richard Hardy Oct 06 '20 at 14:30
  • @RichardHardy Ah ok, I can see where the misunderstanding arose now. – ManUtdBloke Oct 07 '20 at 11:33
  • Great, that is a relief! – Richard Hardy Oct 07 '20 at 12:12