[Figure: the population distribution]

I have a strange distribution in my population. I created this distribution for the purpose of my question, but let's pretend we don't know much about it.

Anyway, the random variable takes 6 values, from 0 to 5. The value 5 has a frequency of 25%.

But now let's get to the problem.

I wanted to calculate the z score for X>4.

What I did:

I took a sample of size 200 from the population. I calculated the sample mean, which was 2.61, and the sample standard deviation, which was 1.79.

Then I calculated the z score using the central limit theorem formula:

zscore = (4-2.61)/(1.79/square root of 200)
zscore = 10.92

I am surprised by such a big z score. How can I interpret it? As far as I understand, it tells me that the value 5 is 10.92 standard deviations away, so according to the central limit theorem it has practically no probability of occurring. But if we look at the original population, it occurs approximately 25% of the time.

Stenga
  • The distribution of the mean is much more concentrated than the one of a single observation. – Michael M Sep 29 '19 at 16:57
  • Use your intuition: just what fraction of all samples of size 200 from this population will have means $5$ or larger? (It's actually easy to compute an exact answer if you like.) – whuber Sep 29 '19 at 16:57
  • @MichaelM Sorry, what do you mean? – Stenga Sep 29 '19 at 17:03
  • @whuber Does the z score from the central limit theorem apply to the mean value? – Stenga Sep 29 '19 at 17:04
  • I think your statement: `I wanted to calculate the z score for X>4.` may be the core of the problem. Your Z-score would use the CLT to approximate $P(\bar X_{200} > 4),$ which is very small indeed (about $4.6 \times 10^{-28}$). By contrast, for any one observation $X_i,$ one has $P(X_i > 4) = 0.25.$ – BruceET Sep 29 '19 at 20:39
  • See https://stats.stackexchange.com/questions/3734. – whuber Sep 30 '19 at 14:16
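
To make whuber's hint concrete, here is a worked calculation. It assumes the 200 observations are independent draws and that 5 is the largest possible value, with $P(X = 5) = 0.25,$ as described in the question. Under those assumptions the sample mean can reach $5$ only if every one of the 200 observations equals $5$:

$$P(\bar X_{200} \ge 5) = P(X_1 = 5, \dots, X_{200} = 5) = 0.25^{200} \approx 3.9 \times 10^{-121}.$$

So a sample mean that large is essentially impossible, even though a single observation equals $5$ a quarter of the time.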

1 Answer

Comment continued: Here is a simulation of sample means $A = \bar X_{200}$ from 100,000 samples of size 200 from your population.

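set.seed(929)
# 100,000 sample means, each from a sample of size 200 drawn with replacement
a = replicate( 10^5, 
  mean(sample(c(1,2,3,5), 200, replace=TRUE, prob=c(.2,.05,.5,.25))) ) 
mean(a > 4)   # proportion of simulated sample means exceeding 4
[1] 0
hist(a, probability=TRUE, breaks=20, xlim=c(0,5), col="skyblue2")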

There was not even one instance, among the 100,000 samples, of a sample mean exceeding $4.$ Here is a histogram of the simulated sample means.

[Figure: histogram of the simulated sample means]
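
As a rough sketch of why the two numbers differ so much, the following compares a z-score for a single observation with the z-score for the sample mean. It uses only the rounded summary statistics quoted in the question (mean 2.61, standard deviation 1.79, $n = 200$). The variable names are illustrative, not from the original post.

```r
# Rounded summary statistics reported in the question.
n         <- 200
x_bar     <- 2.61
s         <- 1.79
threshold <- 4

# z-score for a single observation: distance in standard deviations of X.
z_single <- (threshold - x_bar) / s

# z-score for the sample mean: distance in standard errors, s / sqrt(n).
z_mean <- (threshold - x_bar) / (s / sqrt(n))

# Normal-approximation upper tail for the mean of 200 observations.
p_mean <- pnorm(z_mean, lower.tail = FALSE)

print(c(z_single = z_single, z_mean = z_mean, p_mean = p_mean))
```

With these rounded inputs, `z_mean` comes out near 10.98 rather than 10.92. The small gap is presumably due to rounding of the mean and standard deviation in the question. Either way, the large z-score describes how unlikely it is for the *average* of 200 observations to exceed $4,$ not how unlikely a single value of $5$ is. For one observation from this discrete population, a normal approximation is not needed: $P(X > 4) = 0.25$ can be read directly from the distribution.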

BruceET