What happens if I use OLS in a multiple regression but the sample is not random?

Question

I know that a few assumptions must be satisfied in order to use OLS estimators in linear regression. However, it is not clear to me what would happen if I used OLS in a multiple regression without having a random sample, so that (Xi, Yi) would not be iid. What sort of problems might I face?

A good place to look is [this](http://econweb.ucsd.edu/~jhamilto/hp.pdf) paper. Section 6 to end (the first five sections are not that relevant to your question) — user603, Oct 31 '16 at 16:34
@user603 I think my econometrics background is too limited to understand the contents of that paper. Is there nothing simpler? — bobo55, Oct 31 '16 at 17:02
@jchaykow, how would you define model bias? I know what a biased parameter estimator is, but not quite sure about a biased model. — Richard Hardy, Oct 31 '16 at 18:11
@RichardHardy a model that is trained on biased data will overfit a certain non-representative subset of the overall population. — conv3d, Oct 31 '16 at 20:02
@jchaykow, *biased data* is yet another notion without a common definition, but let it be. I got your point. — Richard Hardy, Oct 31 '16 at 20:22
@RichardHardy I see what you mean, I'm being loose with my language I think. — conv3d, Oct 31 '16 at 21:19

score 2 · Answer 1 · answered Oct 31 '16 at 17:04


First, OLS is nothing more than an algorithm for fitting a linear model of the form $$ y = \mathbf{X\beta} + \epsilon $$ In other words, you are positing that the phenomenon $y$ is a linear function of the variables $\mathbf{X}$, plus some additively separable disturbance term.

If this is a good assumption, then there is some true, constant $\mathbf{\beta}$, and you apply some estimator -- such as OLS -- to estimate what it is.

If your sample is non-random in the sense that there is some correlation between your $\mathbf{X}$'s and your error term, then the OLS estimate $\mathbf{\hat\beta}$ will not be equal in expectation to the true $\mathbf{\beta}$. That is to say, it is biased.

In other words, if you were to take many, many such samples of $\mathbf{X}$ and $y$, your average $\mathbf{\hat\beta}$ would not equal $\mathbf{\beta}$.
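The "many samples" thought experiment can be sketched as a small simulation. This is only an illustration under assumed values: the true slope of 2, the sample sizes, the unobserved factor `u`, and the helper `ols_slope` are all made up here and are not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = 2.0          # assumed true slope (hypothetical value)
n, n_reps = 500, 2000    # sample size and number of repeated samples

def ols_slope(x, y):
    """Slope from an OLS fit of y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

slopes_exog, slopes_endog = [], []
for _ in range(n_reps):
    u = rng.normal(size=n)              # unobserved factor, part of the error term
    noise = rng.normal(size=n)
    x_exog = rng.normal(size=n)         # regressor unrelated to u
    x_endog = rng.normal(size=n) + u    # regressor correlated with u
    # In both cases the disturbance is u + noise; only x's relation to u differs.
    slopes_exog.append(ols_slope(x_exog, beta_true * x_exog + u + noise))
    slopes_endog.append(ols_slope(x_endog, beta_true * x_endog + u + noise))

mean_exog = float(np.mean(slopes_exog))
mean_endog = float(np.mean(slopes_endog))
print(f"mean slope, X uncorrelated with error: {mean_exog:.3f}")
print(f"mean slope, X correlated with error:   {mean_endog:.3f}")
```

With these assumed settings the first average should sit close to the true slope, while the second should be pulled away from it by roughly cov(x, error) / var(x).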


generic_user



I think non i.i.d. does not imply dependence (let alone correlation) between $X$ and $\epsilon$, although it does not prohibit it either. E.g. you could have serially correlated errors that are independent of $X$ and still call it non-i.i.d., no? – Richard Hardy Oct 31 '16 at 17:23
Yeah, that's right. But the title of the question says "non-random sample." It is possible that OP has longitudinal data, and just needs to correct the standard errors. The question is somewhat unclear. – generic_user Oct 31 '16 at 17:56
I will try to explain better. I would like to know, theoretically, what would happen if I tried to estimate by OLS with a non-random sample, violating the iid assumption. I think that if iid is satisfied, then the expected value of $\hat\beta$ would be equal to the real $\beta$, and otherwise it would not. But I am not sure. – bobo55 Oct 31 '16 at 18:04
@bobo55, take a look at the OLS assumptions to make sure you got those correctly. See e.g. [this thread](http://stats.stackexchange.com/questions/16381/what-is-a-complete-list-of-the-usual-assumptions-for-linear-regression/16460#16460), especially the first answer. – Richard Hardy Oct 31 '16 at 18:07
It seems to me that I can say that, if I have a non-random sample, the OLS estimator would be biased and inconsistent. In particular, it is not true that $E[\hat\beta] = \beta$. – bobo55 Oct 31 '16 at 19:05

Clarify what you mean by nonrandom. If you mean that you've got repeated observations of individuals, and those observations are autocorrelated over time, your estimate of beta will still be unbiased, though inference is complicated. If your sample is nonrandom because some unobserved factor is correlated with both your X's and y, your estimate of beta will be biased. – generic_user Oct 31 '16 at 20:17
Likewise, if your data exhibit heteroskedasticity, your estimates will still be unbiased, just inefficient. – generic_user Oct 31 '16 at 20:17
You're right. By nonrandom, I don't mean repeated observations but observations that share some characteristic, e.g. individuals taller than 1.80 m. – bobo55 Oct 31 '16 at 21:15
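The distinction drawn in the comments above (autocorrelated or heteroskedastic errors leave the estimate unbiased as long as the errors are unrelated to X) can be checked with a rough simulation. Everything specific here is an assumption for illustration: the true slope of 2, the AR(1) coefficient of 0.8, the particular form of heteroskedasticity, and the helper names.

```python
import numpy as np

rng = np.random.default_rng(1)
beta_true = 2.0          # assumed true slope (hypothetical value)
n, n_reps = 200, 2000

def ols_slope(x, y):
    """Slope from an OLS fit of y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

def ar1_errors(size, rho=0.8):
    """Serially correlated errors: e[t] = rho * e[t-1] + shock[t]."""
    shocks = rng.normal(size=size)
    e = np.empty(size)
    e[0] = shocks[0]
    for t in range(1, size):
        e[t] = rho * e[t - 1] + shocks[t]
    return e

slopes_ar1, slopes_het = [], []
for _ in range(n_reps):
    x = rng.normal(size=n)
    e_ar1 = ar1_errors(n)                            # non-iid, but drawn independently of x
    e_het = rng.normal(size=n) * (0.5 + np.abs(x))   # variance depends on x, mean stays zero
    slopes_ar1.append(ols_slope(x, beta_true * x + e_ar1))
    slopes_het.append(ols_slope(x, beta_true * x + e_het))

mean_ar1 = float(np.mean(slopes_ar1))
mean_het = float(np.mean(slopes_het))
print(f"mean slope, autocorrelated errors:   {mean_ar1:.3f}")
print(f"mean slope, heteroskedastic errors:  {mean_het:.3f}")
```

In both cases the average slope should land near the assumed true value. What changes is the spread of the estimates, which is why the usual standard errors, rather than the point estimate, are what need correcting.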

score 0 · Answer 2 · answered Jun 23 '20 at 17:20

When the sample is not random, you have to consider whether the way you got the sample introduced bias. That is, the way data was gathered IRL can affect the extent to which the sample is representative of the population.

For example, say you want to predict who someone is going to vote for based on their media habits. You get the data by asking your friends. The problem is that your friends are probably not going to be representative of the population at large.

Why? One reason could be that we tend to become friends with people who share similar media preferences (maybe you became friends partly because you both love the same YouTube channel). Another could be that friends tend to have the same socioeconomic status, and socioeconomic status affects which types of media are consumed.

In this case, when you do your OLS, your regression coefficient will reflect your friends, but it's very hard to say whether it reflects the population at large. If you're only interested in your friends, that's fine. If you want to generalize, you're in trouble.

According to Mercer et al. (2017), for non-random samples you basically have to consider whether your sample reflects the population in terms of confounders.

For example, if all your friends have the same gender as you, sampling your friends is going to be a problem because gender likely affects media habits.

But an all male sample might not be a problem. E.g. if you're testing a new pill for erectile dysfunction, you're probably OK going with an all male sample. Basically, it depends on the theoretical knowledge of your field.
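A toy simulation can make the two cases concrete. It is a sketch under loud assumptions: the binary trait `group`, the "media habit" score `x`, the outcome scores, and all slopes are invented for illustration, and the trait is assumed to matter by changing how strongly `x` relates to the outcome (one of several ways a single-group sample could mislead).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

def ols_slope(x, y):
    """Slope from an OLS fit of y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

group = rng.integers(0, 2, size=n)   # hypothetical binary trait, e.g. gender
x = rng.normal(size=n)               # hypothetical media-habit score
noise = rng.normal(size=n)

# Case A (assumed): the trait matters, the slope is 1 in group 0 and 3 in group 1.
y_differs = np.where(group == 1, 3.0, 1.0) * x + noise
# Case B (assumed): the trait is irrelevant, the slope is 2 in both groups.
y_same = 2.0 * x + noise

only_g1 = group == 1   # a "friends" sample drawn from one group only

slope_pop_a = ols_slope(x, y_differs)
slope_g1_a = ols_slope(x[only_g1], y_differs[only_g1])
slope_pop_b = ols_slope(x, y_same)
slope_g1_b = ols_slope(x[only_g1], y_same[only_g1])

print(f"Case A: population {slope_pop_a:.2f}, one group only {slope_g1_a:.2f}")
print(f"Case B: population {slope_pop_b:.2f}, one group only {slope_g1_b:.2f}")
```

In Case A the single-group coefficient describes that group but not the population. In Case B the single-group sample does no harm. Which case applies is not something the data from one group can tell you, which is where field knowledge comes in.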

When we start to appeal to the theoretical knowledge of our field, we are moving out of statistics and into the world of causal inference (see e.g. Pearl's *The Book of Why*).

Random sampling (of the population) is a way to not have to deal with any of this. With random sampling (from the population) you can say "I know the way I gathered the data didn't introduce bias, because it was at random".

Randomization protects you from the known, the unknown, and the unknown unknown sources of bias.
