I feel I am pretty good on the mathematical basis of CLT and sampling distributions.

HOWEVER:

While sources like OpenIntro Statistics (Diez et al., 2019) are fairly straightforward about their actual use of sampling distributions, I am unable to find a single instance of a statistical analysis where a sampling distribution is built and used as the underlying data. There are plenty of computational demonstrations that the CLT and/or sampling distributions do what they are supposed to do, but nothing to the extent of "We have this data, we build a sampling distribution, we do these tests on it".

The questions are as follows:

  • Are sampling distributions of the mean/proportion/etc. used in real life, or are they just a basis for the assumption of normality?

           - For example, if you have data for the entire population and it is not normally distributed (say, very bimodal), do you build a sampling distribution for the statistic of interest and work with it, or do non-parametric statistics become the only choice?
  • If sampling distributions ARE used in real-life analyses, how are the sample sizes and the number of iterations selected? What stops me from taking a million large samples and reducing variance to the point where p-values become minuscule?
Pitouille
random_guy
  • When you say sampling distribution, this normally refers to the distribution of some statistic under sampling. Do you instead mean some assumed distribution for a sample? – Glen_b Oct 29 '19 at 05:50
  • @Glen_b, thank you for pointing this out. My inquiry is about the sampling distributions of some statistic (mean, proportion etc.) – random_guy Oct 29 '19 at 05:54
  • You don't normally use sampling distributions "as underlying data" (that's what caused my previous question). You generally have one sample, and so one observation of a statistic like a mean or a correlation. There are some exceptions - e.g. you'll sometimes see biologists collect several observations and then average them and then analyze those averages. I try to dissuade them, but to little avail. Instead you more typically use a model for the population distribution from which the sample was drawn to identify a sampling distribution. ...ctd – Glen_b Oct 29 '19 at 06:50
  • ctd ... E.g. a psychologist might use an ExGaussian(/ExpGaussian) [model for reaction times](https://en.wikipedia.org/wiki/Exponentially_modified_Gaussian_distribution). Or we might use a lognormal distribution for household incomes or a Pareto distribution of firm sizes or an exponential distribution for interevent times or a Poisson distribution for some kinds of counts. Then the distribution of some sampling statistic becomes relevant (e.g. with exponential waiting times, the mean of an i.i.d sample of waiting times will have a gamma distribution.) – Glen_b Oct 29 '19 at 06:53
  • @Glen_b. Regarding the model of the population distribution, what happens when I have the data for the entire population (in my case, data on every single employee in a particular company) and the population is not normally distributed? Do I assume the existence of some normally distributed super-population and run parametric tests on the bimodal blob of data, or am I left with non-parametric statistics? – random_guy Oct 29 '19 at 07:05
  • Inference is for figuring out things about a population when you have a random sample from it. If you have the population *about which you want to find things out*, you have no need for inference, you already have all the population information. [Beware, however - often when people have *a* census of some population, it's not exactly the one they're trying to say something about. Very frequently people stop thinking about what they wanted to be able to say something about the moment they have nothing more they can sample.] – Glen_b Oct 29 '19 at 07:12
  • @Glen_b, you have inferred the essence of my question better than I was able to express it. Your comment puts an end to a week of headaches. Thank you. Setting to Answered as BruceET did technically answer my question. – random_guy Oct 29 '19 at 07:18

1 Answer

Here are a couple of frequent uses of sampling distributions in applied statistics. My guess is that you may have seen one or both of them, possibly without realizing that sampling distributions are involved.

Suppose you have $n$ observations $X_1,X_2,\dots,X_n$ from a normal population $\mathsf{Norm}(\mu, \sigma),$ with $\mu$ and $\sigma$ unknown, and compute the sample mean $\bar X$ and the sample variance $S^2.$ Then $\bar X$ estimates the population mean $\mu$ and $S^2$ estimates the population variance $\sigma^2.$

One useful sampling distribution for confidence intervals and tests of hypothesis about $\mu$ is that $\frac{\bar X - \mu}{S/\sqrt{n}} \sim \mathsf{T}(n-1),$ Student's t distribution with $n-1$ degrees of freedom.

Another useful sampling distribution for confidence intervals and tests of hypothesis about $\sigma$ and $\sigma^2$ is that $\frac{(n-1)S^2}{\sigma^2} \sim \mathsf{Chisq}(n-1),$ the chi-squared distribution with $n-1$ degrees of freedom.
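Both facts can be checked by simulation. The following is only a rough sketch using the Python standard library: the population values `mu = 50` and `sigma = 4`, the number of replications, and the seed are arbitrary assumptions chosen for illustration, and the helper names are hypothetical. It repeatedly draws normal samples of size $n = 15$, computes the two statistics above, and compares their empirical quantiles with the usual table values for $\mathsf{T}(14)$ and $\mathsf{Chisq}(14).$

```python
import math
import random

def simulate_pivots(n=15, mu=50.0, sigma=4.0, reps=40_000, seed=1):
    """Draw `reps` normal samples of size n; return the t and chi-squared statistics.

    mu, sigma, reps and seed are illustrative assumptions, not values from the post.
    """
    rng = random.Random(seed)
    t_stats, chi_stats = [], []
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(x) / n
        s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)  # sample variance S^2
        t_stats.append((xbar - mu) / math.sqrt(s2 / n))   # should follow T(n-1)
        chi_stats.append((n - 1) * s2 / sigma ** 2)       # should follow Chisq(n-1)
    return t_stats, chi_stats

def empirical_quantile(values, p):
    """Simple order-statistic estimate of the p-th quantile."""
    ordered = sorted(values)
    return ordered[int(p * (len(ordered) - 1))]

t_stats, chi_stats = simulate_pivots()

# Table values for 14 degrees of freedom: about 2.145, 5.629 and 26.119.
print("t, 0.975 quantile:    ", empirical_quantile(t_stats, 0.975))
print("chisq, 0.025 quantile:", empirical_quantile(chi_stats, 0.025))
print("chisq, 0.975 quantile:", empirical_quantile(chi_stats, 0.975))
```

The simulated quantiles should land close to the table values, up to Monte Carlo error. In practice nobody runs this simulation for an analysis; the known distributions are used directly with the single sample at hand.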

For example, if a sample of size $n = 15$ from a normal population has $\bar X = 53.2$ and $S^2 = 17.3,$ then one can use the first of these sampling distributions to find the 95% confidence interval $(50.9,\,55.5)$ for $\mu.$ [Computation using Minitab statistical software.]

One-Sample T 

 N   Mean  StDev  SE Mean      95% CI
15  53.20   4.16     1.07  (50.90, 55.50)
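The same interval can be reproduced by hand from the summary statistics. This is a minimal sketch, not the Minitab procedure itself: the critical value `2.1448` is the 0.975 quantile of $\mathsf{T}(14)$ taken from a standard t table rather than computed, and `t_interval` is a hypothetical helper name.

```python
import math

n, xbar, s2 = 15, 53.2, 17.3
t_crit = 2.1448  # 0.975 quantile of T(14); table value, assumed rather than computed

def t_interval(xbar, s2, n, t_crit):
    """Two-sided CI for mu based on (Xbar - mu) / (S / sqrt(n)) ~ T(n-1)."""
    se = math.sqrt(s2 / n)      # standard error of the mean
    margin = t_crit * se
    return xbar - margin, xbar + margin, se

lo, hi, se = t_interval(xbar, s2, n, t_crit)
print(f"SE Mean = {se:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# prints: SE Mean = 1.07, 95% CI = (50.90, 55.50)
```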

Also, one can use the second of these sampling distributions to find the 95% confidence intervals $(3.05, 6.56)$ for $\sigma$ and $(9.3, 43.0)$ for $\sigma^2.$

CI for One Variance 

Method

The chi-square method is only for the normal distribution.

Statistics

 N  StDev  Variance
15   4.16      17.3

95% Confidence Intervals

               CI for        CI for
Method          StDev       Variance
Chi-Square  (3.05, 6.56)  (9.3, 43.0)
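These intervals can also be recomputed from the summary statistics. Again this is just a sketch: the two chi-squared critical values for 14 degrees of freedom are standard table values that are assumed here, not computed, and `variance_interval` is a hypothetical helper name.

```python
import math

n, s2 = 15, 17.3
chi2_lower = 5.6287   # 0.025 quantile of Chisq(14); table value, assumed
chi2_upper = 26.1189  # 0.975 quantile of Chisq(14); table value, assumed

def variance_interval(s2, n, chi2_lower, chi2_upper):
    """Two-sided CI for sigma^2 based on (n-1) S^2 / sigma^2 ~ Chisq(n-1)."""
    ss = (n - 1) * s2
    # Dividing by the larger quantile gives the lower limit, and vice versa.
    return ss / chi2_upper, ss / chi2_lower

var_lo, var_hi = variance_interval(s2, n, chi2_lower, chi2_upper)
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)  # CI for sigma via square roots

print(f"95% CI for variance: ({var_lo:.1f}, {var_hi:.1f})")
print(f"95% CI for StDev:    ({sd_lo:.2f}, {sd_hi:.2f})")
# prints: 95% CI for variance: (9.3, 43.0)
#         95% CI for StDev:    (3.05, 6.56)
```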
BruceET
  • Thank you. With regard to your second paragraph: what if I have the data for the entire population (e.g. all employees in a company) and I know for sure that the distribution is not normal? Do I treat my data as part of an assumed super-population and continue with parametric methods, or should I start looking for non-parametric alternatives? – random_guy Oct 29 '19 at 06:53
  • @random_guy if your data is the population, you don't need things like confidence intervals, hypothesis tests, etc. You know the estimated results for certain; there's no uncertainty. See https://stats.stackexchange.com/q/2628/35989 – Tim Oct 29 '19 at 20:56
  • What Tim wrote is correct, though I do tend to be skeptical about claims of having the entire population. – Dave Sep 02 '21 at 16:23