4

I was hoping to demonstrate the other day that a given data set is normally distributed, and a chi-squared test seemed appropriate. I made my null hypothesis that the data set was normally distributed, and calculated a chi-squared value and thus a p-value of about 0.5. This is well above any sane significance level, and thus I fail to reject the null hypothesis. Job done, right?

But I want to look a bit more closely at that p-value of 0.5. I'm told that this means that, if the population underlying my data set was indeed normally distributed, this would be the probability that I observed the data in question. But what if I had calculated a p-value of, say, 0.2? That's still a way off any sensible significance level, but it's also far from 0.5. Would the case for the normality of the data be a bit weaker if the p-value was only 0.2? What about if it was 0.9?


The context for the above question was this: I'm trying to work out how much the sizes of potatoes will vary when all of them have been harvested from a single field. So I did the following:

  • I gathered the data for all the potatoes harvested from a specific field.
  • I carried out a chi-squared test to examine the normality of the data, obtaining the p-value of $\approx 0.5$ mentioned above.
  • I calculated a coefficient of variation ($\approx 4.5\%$) for the data.
  • I made a hypothesis, to be tested against data from other fields, that 95% of the potatoes in a given field will fall in the size range $[0.91\mu, 1.09\mu]$, where $\mu$ is the mean size for that field (see the arithmetic sketched below).
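
For the record, the arithmetic behind that last range, assuming normality and the usual 95% rule:

$$
\mu \pm 1.96\,\sigma \;=\; \mu \pm 1.96 \times 0.045\,\mu \;\approx\; \mu \pm 0.09\,\mu \;=\; [0.91\mu,\ 1.09\mu].
$$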

Have I committed any grave sins against statistics in the above reasoning?

Tom Hosker • 267 • 1 • 7 • 3
  • A chi-square test for normality is far from being a good test; it's pretty much the worst test on offer, as it requires binning the data, which suppresses detail, especially in the tails of a distribution. A better test might be a Shapiro-Wilk test or a Doornik-Hansen test. Better yet is to look at, and show, a normal quantile plot (normal probability plot, normal scores plot, probit plot). The Catch-22 is that such plots are a little hard to interpret without experience, but help is available, and such a plot lets your reader learn about the data. – Nick Cox Aug 29 '20 at 11:45

2 Answers

8

... [0.5] would be the probability that I observed the data in question

isn't quite right. It's really

the probability of a value of the test statistic at least as extreme as the one you observed

if the null hypothesis held; see the Wikipedia page on p-values, for example.

If the null hypothesis holds, then the p-values of your statistic have a uniform distribution on $[0, 1]$ (for a continuous test statistic). A p-value of 0.2 just means that you'd see a more extreme statistic 20% of the time under the null hypothesis; a p-value of 0.9 means you'd see a more extreme value 90% of the time.
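
To see the uniformity claim in action, here is a minimal simulation sketch (assuming Python with NumPy and SciPy; a Shapiro-Wilk test stands in for the chi-squared test, since it needs no binning choices):

```python
# Simulate many samples from a normal population, test each for normality,
# and check that the resulting p-values look uniform on [0, 1].
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n = 10_000, 50
pvals = np.array([stats.shapiro(rng.normal(loc=100, scale=4.5, size=n)).pvalue
                  for _ in range(n_sims)])

# Under the null, about 20% of p-values should fall below 0.2, and so on.
for cut in (0.2, 0.5, 0.9):
    print(f"P(p < {cut}) ~= {np.mean(pvals < cut):.3f}")
```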

The question you seem to be addressing has to do with the distribution of the p-values of the statistic under a specific alternative hypothesis. That's the basis of performing power calculations. Yes, you might be more interested in exploring an alternative hypothesis in future work if you found a p-value of 0.1 than if you found one of 0.9. But there's still a risk of 10% that you'd be chasing nothing. Think of statistical tests as guarding against fooling yourself into seeing something that isn't really there.
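As a hedged illustration, the same kind of simulation under one specific alternative (here, mildly skewed lognormal data; the choice is arbitrary) shows how often the test would catch the departure at a given significance level, i.e. its power:

```python
# Distribution of p-values when the null is false: the fraction below
# alpha estimates the test's power against this particular alternative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n, alpha = 2_000, 50, 0.05
pvals = np.array([stats.shapiro(rng.lognormal(mean=0, sigma=0.5, size=n)).pvalue
                  for _ in range(n_sims)])
print(f"Estimated power at alpha = {alpha}: {np.mean(pvals < alpha):.2f}")
```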

Two cautions here.

First, you aren't making a "case for normality" with this type of analysis. You're just failing to make a case against normality. That's an important distinction. You might just have too few cases to see a difference from normality.

Second, with large enough sample sizes in the real world you will almost always find "significant" deviations from normality. What matters is whether deviations from normality are large enough to make a practical difference for a particular application.
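A quick sketch of that effect: a t distribution with 30 degrees of freedom is practically indistinguishable from a normal, yet a large enough sample will flag it (scipy.stats.normaltest, the D'Agostino-Pearson test, is just one convenient choice here):

```python
# A trivial deviation from normality becomes "significant" at large n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.standard_t(df=30, size=100_000)  # excess kurtosis ~0.23: nearly normal
print(stats.normaltest(x).pvalue)        # typically far below 0.05
```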

With respect to your situation, you have decided that your data were close enough to normal for your particular application and proceeded accordingly. That's fine, so far as it goes, although note that you have implicitly assumed the SD for a field to be proportional to its mean (that's what a constant coefficient of variation implies). Also, be warned that estimates and tests about what's going on at the extreme tails of a distribution can be difficult in the best of circumstances, and they can be very sensitive to deviations from the hypothesized distribution. So don't be surprised if, in practice, you find more or fewer than 5% of potatoes outside your estimated interval.
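
To illustrate that tail sensitivity: a heavy-tailed t distribution rescaled to unit SD agrees with the normal almost exactly at the usual 95% cutoff, yet disagrees badly further out (the choice of t(5) is mine, purely for illustration):

```python
# Two unit-SD distributions nearly agree at +/-1.96 SD but diverge
# sharply in the far tails.
from math import sqrt
from scipy import stats

df = 5
scale = sqrt(df / (df - 2))  # SD of t(df) is sqrt(df/(df-2)); rescale to unit SD

for k in (1.96, 3.0):
    p_norm = 2 * stats.norm.sf(k)
    p_t = 2 * stats.t.sf(k * scale, df)
    print(f"P(|X| > {k} SD): normal {p_norm:.4f} vs rescaled t(5) {p_t:.4f}")
```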

EdM • 57,766 • 7 • 66 • 187
  • Thank you very much for your excellent and insightful answer. I've just added some context to my original question, and would be delighted if you were able to comment further. – Tom Hosker Aug 29 '20 at 11:00
  • One other thing: why is it that 'If the null hypothesis holds, then your statistic has a **uniform** distribution', when I used a **normal** distribution to generate the values in the "expected" column of my chi-squared test? – Tom Hosker Aug 29 '20 at 11:03
  • (+1) Excellent answer. As you say, non-rejection can also mean that the parent distribution is not normal but the sample size is not large enough to detect that. An extreme counter-example is a sample size of 2, which can never supply evidence against a normal distribution, but the problem is common for sample sizes of order 10 or 20. – Nick Cox Aug 29 '20 at 11:47
0

The null hypothesis of the chi-square goodness-of-fit test here is that the data are normally distributed (you are testing for a normal distribution, but you can test against other distribution types too). If you choose 0.05 as your alpha level, then at p < 0.05 you reject the null hypothesis, because the test statistic falls in the upper 5% tail of the chi-square distribution: $\chi^2 > \chi^2_{1-\alpha,\,k-c}$, where $\alpha$ is the significance level, $k$ is the number of bins, and $c$ is the number of estimated parameters plus one.[1] "The p-value is a probability computed assuming the null hypothesis is true, that the test statistic would take a value as extreme or more extreme than that actually observed."[2] When that chance is low, you can be confident in rejecting the assumption of normality. If you instead worked with p < 0.20 as your cutoff, the corresponding level would be 20%; an alpha of 0.05 is the common choice.
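For concreteness, a minimal sketch of that recipe on made-up data (the bin count, the equal-probability binning, and the sample itself are all my own choices for illustration):

```python
# Chi-square goodness-of-fit test for normality, by hand.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sizes = rng.normal(loc=100, scale=4.5, size=200)  # stand-in "potato sizes"

k = 10                                            # number of bins
mu, sd = sizes.mean(), sizes.std(ddof=1)          # 2 estimated parameters

# Equal-probability bins under the fitted normal: k - 1 interior cut points.
cuts = stats.norm.ppf(np.arange(1, k) / k, loc=mu, scale=sd)
observed = np.bincount(np.digitize(sizes, cuts), minlength=k)
expected = np.full(k, len(sizes) / k)

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = k - 3                                       # k - c, with c = 2 + 1
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {stats.chi2.sf(chi2, dof):.3f}")
```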

There is a related pair of concepts, the Type I and Type II error rates: a Type I error is the rejection of a true null hypothesis, and a Type II error is the non-rejection of a false null hypothesis.


  1. Chi-Square Goodness-of-Fit Test, NIST/SEMATECH e-Handbook of Statistical Methods: https://www.itl.nist.gov/div898/handbook/eda/section3/eda35f.htm
  2. Penn State STAT 504 course notes: https://online.stat.psu.edu/stat504/node/20/

Also: "The area under the curve between 0 and a particular chi-square value is a cumulative probability associated with that chi-square value." http://stattrek.com/probability-distributions/chi-square.aspx
Kyle • 273 • 1 • 10
  • You've improved your post. The piece of mathematics in the middle needs editing still to use subscripts. Definitions of $\alpha, k, c$ would be good. – Nick Cox Aug 29 '20 at 15:28
  • The sentence has the same meaning as the previous version. I just quoted a university so I wouldn't be bothered anymore. "The p-value is a probability computed assuming the null hypothesis is true, that the test statistic would take a value as extreme or more extreme than that actually observed." vs "...which corresponds to the chance of data with this test statistic level actually being normally distributed." Have a nice day. – Kyle Aug 29 '20 at 17:36