
The problem gives 50 observations and asks you to check whether the data come from a normal distribution.

For instance, if these are your observations, how would you go about testing whether they come from a normal distribution?

−1.28 −1.22 −0.45 −0.35 0.72
−0.32 −0.80 −1.66 1.39 0.38
−1.38 −1.26 0.49 −0.14 −0.85
2.33 −0.34 −1.96 −0.64 −1.32
−1.14 0.64 3.44 −1.67 0.85
0.41 −0.01 0.67 −1.13 −0.41
−0.49 0.36 −1.24 −0.04 −0.11
1.05 0.04 0.76 0.61 −2.04
0.35 2.82 −0.46 −0.63 −1.61
0.64 0.56 −0.11 0.13 −1.81

The professor said something about transforming the data to a uniform distribution and then using the 4 intervals (0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1) and seeing how many values fall within those intervals, but I'm not sure how to go about doing that.

  • Well, you could convert to a uniform (via the probability integral transform) if $\mu$ and $\sigma$ are specified. If estimated from sample statistics, the result is not actually uniform. – Glen_b Sep 03 '14 at 06:15
  • Is this a case where you're testing for a standard normal? Where do these numbers arise from (are they standardized residuals, for example)? Why are you testing for normality, and of all things, why a chi-squared test? The power is quite low; with 50 observations it has to look terrible to get a rejection. – Glen_b Sep 03 '14 at 06:16
  • This problem is from the Schervish 4th edition probability and statistics textbook. The question asks: carry out a $\chi^2$ test of goodness-of-fit by dividing the real line into five intervals, each of which has probability 0.2 under the standard normal distribution. http://i.imgur.com/wjIZ4dC.png So should I just use the sample mean and sample standard deviation to carry out the calculations? – user52760 Sep 03 '14 at 06:37
  • No, you should not use the sample mean and sd. You should have included this information (though not the image, since no doubt the book is under copyright) in your question. What's the name of the text? Why do your numbers not match the numbers in the text you linked? – Glen_b Sep 03 '14 at 06:41
  • Probability and Statistics (4th Edition) by Morris H. DeGroot and Mark J. Schervish. That is an example in the section that seems similar to the problem. – user52760 Sep 03 '14 at 06:43
  • The image you posted shows how to obtain intervals for a given (population, not sample) mean and standard deviation. – Glen_b Sep 03 '14 at 06:46
  • Yes. I was wondering if I could do the same thing with this problem using the sample mean and standard deviation, but as you stated, you cannot. So I'm stuck as to what to do instead. – user52760 Sep 03 '14 at 06:49
  • How are you presently finding normal probabilities, critical values or p-values? Tables? Computer program? Something else? – Glen_b Sep 03 '14 at 07:56
  • Does the question mention testing for a standard normal? Or just general normality? – Glen_b Sep 03 '14 at 10:30
  • The question does not mention testing for a standard normal. It's just raw data. Apparently the textbook wants you to calculate the MLEs and use those to create the intervals and carry out the calculations. I guess the statistic still has a chi-square distribution even with the MLEs. – user52760 Sep 03 '14 at 21:14
  • No, it really doesn't become chi-squared in that case; this is quite well known (and you don't even need to be able to follow the reasoning, since it's easily seen via simulation). Or rather, it's chi-square if you used the grouped data not the original data to estimate the parameters, but even your boundaries are based on parameter estimates. I'll add some details on what *can* be done. – Glen_b Sep 03 '14 at 22:31
  • I've added some details now and gone through calculations, with discussion and suggestions. – Glen_b Sep 03 '14 at 23:56
  • Thank you for the thoughtful and thorough response. I really appreciate it. – user52760 Sep 04 '14 at 08:34

1 Answer


There are several issues here that should be mentioned:

i) chi-squared tests of goodness of fit tend to have very low power; there are much better alternatives.

ii) if you don't have a completely specified distribution, but the parameters are estimated efficiently on the full data set (rather than the grouped data) - for example, if we're looking at residuals in an ANOVA - then the chi-square statistic doesn't actually have a chi-square distribution.

iii) if the point is to assess the assumptions of some other procedure, formal goodness of fit testing is not really answering the right question.


If it's made sufficiently clear what we're testing - (a) raw data or residuals, and (b) if raw data whether it's with specified parameters or estimated parameters - then I can describe how to go about it.

  1. Population parameters are specified in advance

    If the parameters are all specified in advance, then you divide the data into bins. Your linked extract indicated one way to do this: divide the real line into equal-probability bins via the inverse cdf.

    $F^{-1}(p)$ ($\,=\Phi^{-1}(p)$ for a standard normal) cuts off $p$ of the probability to its left. So by taking $p_i = 0.25,0.5,0.75$ you'd get 4 bins. In this case (n=50) slightly more bins would be better (say $k=6$, or perhaps $8$), but the idea of taking regions so that you get equal or near equal expected counts is good.

    Then count the number of observations in each bin, $O_i, i=1,2,...,k$. The expected count $E_i=np_i$ (which $=n/k$ when the proportions are equal).

    Then you compute $\sum_{i=1}^k \frac{(O_i-E_i)^2}{E_i}$ and this will have an approximately chi-squared distribution with $k-1$ degrees of freedom; you lose 1 because the expected total is set equal to the observed total. (So with 4 groups, that's 3 d.f.) A short R sketch of this case appears just after this list.

  2. Population parameters are (efficiently) estimated from the grouped data

    In this case, you don't have the raw data - it's already been grouped before the parameters were estimated (such estimation via MLE is often not simple - EM is sometimes used - but minimum chi-square estimation will work quite well for this purpose and may be easier to carry out). A minimal sketch of minimum chi-square estimation also appears just after this list.

    The $p_i$ and hence expected values are calculated based on those estimated parameters.

    Calculation then proceeds much as before, except you lose one additional degree of freedom per parameter estimated. So in this case, with 4 groups and 2 parameters, you'd have 1 d.f. left.

  3. Parameters are efficiently estimated on the ungrouped data

    This includes the case where the data have been standardized by the sample estimates. This works much as case 2 above, except that you no longer have an asymptotic chi-square distribution for the test statistic under the null hypothesis.

    This has long been known, but many elementary treatments of the chi-squared goodness of fit test seem completely ignorant of it. See the excellent discussion here for some illustration and further explanation.

    However, the distribution is bounded between $\chi^2_{k-1}$ and $\chi^2_{k-p-1}$ where $p$ parameters are estimated, so you can get bounds on p-values. Alternatively, you can perform simulation to work out either critical values or p-values.

    If the group boundaries are estimated from the data, this apparently doesn't affect the asymptotic distribution (so if $n$ is large you can pretty much ignore that, though if you're basing your null distribution off simulation, you'd be able to include any effect it might have, even in small samples).
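
To make case 1 concrete, here's a minimal R sketch of the fully specified version (testing against a standard normal - which is also what the "transform to uniform, then count in the intervals 0-0.25, 0.25-0.5, 0.5-0.75, 0.75-1" suggestion in the question amounts to). It assumes the data are already in y:

# Case 1 sketch: chi-square GoF test against a completely specified N(0,1)
breaks <- qnorm((0:4)/4)                  # -Inf, -0.674, 0, 0.674, Inf
Oi     <- table(cut(y, breaks))           # observed counts in the 4 bins
Ei     <- length(y)/4                     # expected count per bin, n*p_i = 12.5
stat   <- sum((Oi - Ei)^2/Ei)             # chi-square statistic
pchisq(stat, df = 3, lower.tail = FALSE)  # k-1 = 3 d.f.; nothing estimated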

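For case 2, here's a minimal sketch of minimum chi-square estimation from grouped normal data; the bin boundaries and counts below are hypothetical, chosen only to show the mechanics:

# Case 2 sketch: minimum chi-square estimation of (mu, sigma) from grouped counts
b  <- c(-Inf, -1, 0, 1, Inf)        # fixed bin boundaries (assumed known here)
Oi <- c(14, 11, 14, 11)             # grouped counts (illustrative values)
n  <- sum(Oi)
chisq.fn <- function(theta) {       # theta = (mu, log(sigma))
  p <- diff(pnorm(b, mean = theta[1], sd = exp(theta[2])))  # bin probabilities
  sum((Oi - n*p)^2/(n*p))           # chi-square statistic at these parameters
}
fit <- optim(c(0, 0), chisq.fn)     # minimize the statistic over (mu, log(sigma))
c(mu = fit$par[1], sigma = exp(fit$par[2]))  # parameter estimates
fit$value    # minimized statistic; compare to chi-square with k-p-1 = 1 d.f.

(The log-sigma parametrization simply keeps the standard deviation positive during the optimization.)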

So now, to your data. I assume we're in situation 3. I'll pull your data into R and do the calculations step-by-step (in the way one might do if hand calculation were ever necessary). I scanned your data into y:
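
That is, entering the values row by row:

y <- c(-1.28, -1.22, -0.45, -0.35,  0.72, -0.32, -0.80, -1.66,  1.39,  0.38,
       -1.38, -1.26,  0.49, -0.14, -0.85,  2.33, -0.34, -1.96, -0.64, -1.32,
       -1.14,  0.64,  3.44, -1.67,  0.85,  0.41, -0.01,  0.67, -1.13, -0.41,
       -0.49,  0.36, -1.24, -0.04, -0.11,  1.05,  0.04,  0.76,  0.61, -2.04,
        0.35,  2.82, -0.46, -0.63, -1.61,  0.64,  0.56, -0.11,  0.13, -1.81)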

> mean(y)
[1] -0.1646
> sd(y)
[1] 1.16723

Find bin boundaries (proceeds as in the example in your linked document):

> mean(y)+sd(y)*qnorm((0:4)/4)
[1]       -Inf -0.9518845 -0.1646000  0.6226845        Inf

Here -Inf (=$-\infty$) is the left 'limit' of the lower bin and Inf ($\infty$) is the right limit of the upper bin.

Count how many in each bin:

> table(cut(y,mean(y)+sd(y)*qnorm((0:4)/4)))

  (-Inf,-0.952] (-0.952,-0.165]  (-0.165,0.623]    (0.623, Inf] 
             14              11              14              11 

Let's store those counts:

> Oi = table(cut(y,mean(y)+sd(y)*qnorm((0:4)/4)))

The expected number in each bin, $E_i=np_i$ is $n/k$ = 50/4 = 12.5.

> Ei = 50/4

Calculate the statistic:

> sum((Oi-Ei)^2/Ei)
[1] 0.72

and this has a distribution function that lies between a $\chi^2_1$ and a $\chi^2_3$. There's no need to proceed: the statistic (0.72) is smaller than the expected value (1) of a $\chi^2_1$, so it simply won't reject - the p-value is already more than 0.3. But let's calculate anyway:

> pchisq(sum((Oi-Ei)^2/Ei),1,lower.tail=FALSE)
[1] 0.3961439
> pchisq(sum((Oi-Ei)^2/Ei),3,lower.tail=FALSE)
[1] 0.86849

So the p-value is above 0.40 but below 0.87.

However, since we could only be sure of rejecting if the larger number were below the significance level, we can just plug the $O_i$'s into a chi-square test:

> chisq.test(Oi)

        Chi-squared test for given probabilities

data:  Oi
X-squared = 0.72, df = 3, p-value = 0.8685

We get the upper bound on the p-value. If that were below your significance level, you could reject.

--

If we did simulation:

> res = replicate(100000,{x=rnorm(50);
+                      Oi = table(cut(x,mean(x)+sd(x)*qnorm((0:4)/4)));
+                      sum((Oi-Ei)^2/Ei)})   # Ei = 12.5 from above still applies
> sum(res>=0.72)/length(res)
[1] 0.76827

We get a p-value of 0.77. (This only takes a couple of seconds.)


If you must test normality, a Shapiro-Wilk test would be better:

> shapiro.test(y)

        Shapiro-Wilk normality test

data:  y
W = 0.9448, p-value = 0.02092

But in many cases, a Q-Q plot is probably of more use than a formal hypothesis test:

[Normal Q-Q plot of the sample against standard normal quantiles]

because we can assess the kind and extent of the deviation from normality fairly readily - the left tail is "short", the right tail is somewhat "long" ... i.e. the data are mildly right skew. [Would this amount of skewness matter? For things like t-tests or regression or ANOVA, not very much -- but in any of those cases, the assumption relates to the conditional distribution, not the marginal, so you must be careful to check the right quantities.]
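
A plot along those lines can be produced in base R with something like:

> qqnorm(y)
> qqline(y)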

Glen_b