
I thought that the sample distribution was an approximation of the distribution of the underlying phenomenon.

But then the book says:

> We will denote the sample size by $n$ ($n \le N$) and the values of the sample members by $X_1, X_2, \dots , X_n$. It is important to realize that each $X_i$ is a random variable. In particular, $X_i$ is not the same as $x_i$: $X_i$ is the value of the $i$-th member of the sample, which is random, and $x_i$ is that of the $i$-th member of the population, which is fixed.

I don't understand this distinction. I thought that the $x_i$ should also be considered random; after all, they are all realizations of an underlying probability distribution. So even the population mean $\mu = \frac 1N\sum x_i$ must be seen as a random variable.

Then I realized we were talking about different experiments (i.e., the $x_i$ are considered random when the population is created, so to speak, but are considered constant and fixed in the context of the survey we are performing).

Take a look at $Var \ \bar X = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right)$. If $n = N$, it implies $Var \ \bar X = 0$; that is, if we interview the whole population we will find that $\bar X$ is really a constant ($= \mu$). This brings me to the question:
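A quick numerical check of this finite-population correction (a sketch in Python; the population values below are arbitrary, made-up constants, not anything from the book):

```python
import random
import statistics

random.seed(0)

# A fixed, arbitrary finite population: here the x_i are constants.
N = 50
population = [random.gauss(170, 10) for _ in range(N)]

mu = statistics.fmean(population)
# Population variance sigma^2 with denominator N, as in the formula above.
sigma2 = sum((x - mu) ** 2 for x in population) / N

n = 10
# Theoretical Var(X bar) including the finite-population correction.
theoretical = (sigma2 / n) * (N - n) / (N - 1)

# Empirical variance of X bar over many surveys, each drawn WITHOUT replacement.
means = [statistics.fmean(random.sample(population, n)) for _ in range(200_000)]
empirical = statistics.variance(means)

print(theoretical, empirical)  # the two should be close
```

Setting `n = N` makes every sample the whole population, so `means` is constant and the empirical variance is $0$, matching the formula.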

The sample distribution then is the distribution of what? Apparently it isn't the distribution of the underlying phenomenon, but rather the distribution that arises from taking the $x_i$ as given realizations and drawing a random $n$ of them, without asking where the $x_i$ come from.

So the sample distribution must be used only as a measure of how accurately $\bar X$ estimates $\mu$, but neither $\bar X$ nor $\mu$ can be seen as an estimate of what we really want, namely $E(Y)$, where the distribution of $Y$ (the underlying distribution for the whole population) is what we care about.

I suppose one can do pretty much the same reasoning and conclude that $\mu$ is an unbiased estimator of $E(Y)$, and maybe try to derive its variance, but then it's not clear to me how to connect all of this with what we actually have (the $X_i$).

Also, I think this point should be made more explicit (if it's correct, that is), because it was a source of confusion for me.

Ant
  • Please have a look at this recent question: http://stats.stackexchange.com/questions/141416 I don’t get your point about $\overline X$... – Elvis Mar 14 '15 at 15:39
  • @Elvis Okay, your answer seems to confirm my point about the sample distribution. My problem is that it's not clear to me how to go from $\bar X$ to $E(Y)$. How do we infer the *true* mean of the underlying probability distribution? To continue your example, at birth every Parisian will have a certain pre-determined height, so his height can be regarded as a random variable ($Y$) with a certain probability distribution. Shouldn't we care about this probability distribution (and about $E(Y)$) rather than the population mean? – Ant Mar 14 '15 at 15:44
  • We might want another example as height is not pre-determined at birth... but let’s admit it. If it were the case, there would be no difference between the pre-determined height at birth and the measured height at adult age. Sampling adults and measuring them is a good way to make inference about the distribution of [pre-determined] height. – Elvis Mar 14 '15 at 15:50
  • @Elvis Right. But this inference is a different one than the one we make when we talk about sampling distributions, right? So how one would infer the distribution of pre-determined height by sampling adults? – Ant Mar 14 '15 at 15:52
  • I think your (legitimate) problem lies in the fact that in this kind of example the total population is finite. When you sample from a finite population, there is some randomness in the sample but the total population is fixed. It is always a problem to tell something like *the distribution of the height of male Parisians is $\mathcal N(175,15^2)$* — which is a continuous distribution, while the true distribution in this experiment is a discrete one. – Elvis Mar 14 '15 at 15:57
  • @Elvis I see. So you're saying that telling something about the underlying distribution (is very difficult / it's a completely different topic). Is that correct? – Ant Mar 14 '15 at 16:00
  • Yes, this use of a normal distribution has to be seen as a *useful* approximation, nothing more. In many practical cases we approximate discrete distributions by (simple) continuous distributions. If you want to know exactly the *true* distribution of the Parisians’ height, you need to measure all Parisians. If you satisfy yourself with an approximation by a normal distribution, or by a mixture of normals... a few hundred (say) individuals will be enough to provide an excellent approximation. – Elvis Mar 14 '15 at 16:03
  • @Elvis I'm sorry to bother you this much, but if I measure the height of all Parisians, how do I infer their *true* distribution? I think this is my main problem. Is there an answer to this question? (Of course making a histogram kinda works, but I am more interested in the theoretical side.) – Ant Mar 14 '15 at 16:06
  • Oh, ok, I got your question. Let us denote by $h_1, \dots, h_N$ the measured heights. There are two distributions to consider: 1) the distribution of the height $H$ of a (uniformly) random Parisian. For each $x$, denote by $n(x)$ the number of $h_i$’s equal to $x$. The distribution of $H$ is $\mathbb P(H = x) = n(x)/N$. – Elvis Mar 14 '15 at 16:10
  • and 2) the distribution of "all Parisians’ heights". You measured it as the point $(h_1, \dots, h_N)$ in $\mathbb R^N$, or rather in a more complicated space where the order of the heights does not matter. As the distribution is a fixed point, you can consider it to be a Dirac distribution: the point you measured, with probability $1$. – Elvis Mar 14 '15 at 16:14
  • Because this is not a real random experiment. If you measure all inhabitants twice, you will find the same point twice... – Elvis Mar 14 '15 at 16:16
  • @Elvis Right. But again, it doesn't tell you about the true distribution of height, does it? Let me ask you this question: my wife is about to give birth and we would like to know the expected height of our son. We are also interested in the variance of this height. One very natural way to answer would be, respectively, the population mean and the population variance. I am trying to understand why the population mean and population variance can be regarded as useful (and unbiased?) estimates of the expected height and variance. Thank you for your time! :-) – Ant Mar 14 '15 at 16:20
  • The experiment in this case would be "giving birth" – Ant Mar 14 '15 at 16:22
  • OK! You can consider that the size of the individuals is taken in an infinite continuous distribution, and that the (single) random experiment considered is "giving birth to all Parisians". Then all tools of statistics apply! You don’t have access to the *true* infinite continuous distribution, but — for example — if you assume it has finite mean and variance (which, regarding human heights, is a fairly reasonable assumption), the central limit theorem tells you lots about how the (unknown) expected value of this distribution is estimated by a sample. – Elvis Mar 14 '15 at 16:50
  • @Elvis Uhm.. Okay I think I am starting to understand, I'm also re-reading the comments.. If you want to make a short answer summarizing what you've written in the comments I'll gladly accept it! :) – Ant Mar 14 '15 at 16:51
  • I’ll try later tonight (CET)! – Elvis Mar 14 '15 at 17:03
  • what book is this? – Glen_b Mar 15 '15 at 14:58
  • @Glen_b The book is Rice, J.A., *Mathematical Statistics and Data Analysis*. Why? :-) – Ant Mar 15 '15 at 15:03
  • The definitions of $X_i$ as a sample value and $x_i$ as a population value are quite unusual. I wanted to check the original if I could and maybe see if some additional context threw more light on it. – Glen_b Mar 15 '15 at 15:06
  • Specifically, since the population is larger than the sample, $i$ would have to mean something different in the two expressions. More usually, $X_i$ and $x_i$ are connected. – Glen_b Mar 15 '15 at 15:21
  • @Glen_b Ah, I see. I am pretty sure they are not but if you find the book and I am wrong please tell me! :-) – Ant Mar 15 '15 at 16:33
  • I found the quote itself from google books (looks like what you have is correct) but it didn't give me any wider context than exactly what you quoted. – Glen_b Mar 15 '15 at 16:39
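The first distribution Elvis describes in the comments (the height $H$ of a uniformly random Parisian, with $\mathbb P(H = x) = n(x)/N$) can be sketched with toy numbers (the heights below are made up for illustration):

```python
from collections import Counter
from fractions import Fraction

# Toy measured heights h_1..h_N (hypothetical values).
heights = [170, 175, 175, 180, 170, 175]
N = len(heights)

# n(x) = number of h_i equal to x; P(H = x) = n(x)/N.
counts = Counter(heights)
dist = {x: Fraction(nx, N) for x, nx in counts.items()}

print(dist)  # e.g. 175 occurs 3 times out of 6, so P(H = 175) = 1/2
```

This is exactly the (discrete) empirical distribution of the finite population; it is fixed once the population is fixed.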

1 Answer


So apparently, after discussion... the question is about the fact that the values taken by a parameter in the population being sampled come themselves from "somewhere". A classical way to deal with this is to decide that these values are drawn from an unknown distribution, which is not physically realized in a larger population; it is rather a distribution of potential values.

Then all tools of statistics apply: we don’t have access to the true infinite continuous distribution, but — for example — if we assume it has finite mean and variance (which is often a reasonable assumption), the central limit theorem tells us a lot about how the (unknown) expected value of this distribution is estimated by a sample.
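This two-level picture can be illustrated with a small simulation (a sketch; the normal superpopulation and all the numbers are assumptions for illustration, not anything from the question): first realize a finite population from a continuous distribution (one "giving birth to all Parisians" experiment), then run a survey on it.

```python
import random
import statistics

random.seed(1)

# Superpopulation model (assumed): heights drawn from a continuous
# distribution with E(Y) = 175 and standard deviation 15.
EY = 175.0

# One realization of the finite population of size N.
N = 100_000
population = [random.gauss(EY, 15) for _ in range(N)]
mu = statistics.fmean(population)  # the fixed finite-population mean

# A survey: sample n individuals without replacement.
n = 500
xbar = statistics.fmean(random.sample(population, n))

# xbar estimates mu (survey-sampling view), and mu is itself close to
# E(Y) (superpopulation view), so xbar also estimates E(Y).
print(abs(xbar - mu), abs(mu - EY))
```

Both gaps are small here, which is the sense in which $\bar X$ and $\mu$ connect to the quantity $E(Y)$ that the question asks about.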

Elvis