I'm working on developing a physics lab about radioactive decay, and in analyzing sample data I've taken, I ran into a statistics issue that surprised me.

It is well known that the number of decays per unit time by a radioactive source is Poisson distributed. The way the lab works is that students count the number of decays per time window, and then repeat this many, many times. Then they bin their data by the number of counts, and do a $\chi^2$ goodness-of-fit test with one parameter estimated (the mean) to check whether the null hypothesis (the data are drawn from a Poisson distribution with the estimated mean) holds. Hopefully they'll get a large p-value and conclude that physics indeed works (yay).
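
(To make the setup concrete, here is a minimal sketch of the analysis in Python with numpy/scipy. The rate of ~80 counts per minute, the 200 one-minute windows, and the one-bin-per-integer choice are placeholders, not my real data.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder values (not real data): ~80 counts per one-minute window,
# repeated for 200 windows
true_rate, n_windows = 80, 200
counts = rng.poisson(true_rate, size=n_windows)

mu_hat = counts.mean()                       # the one estimated parameter

# One bin per integer count, from the smallest to the largest observed value
values = np.arange(counts.min(), counts.max() + 1)
observed = np.array([(counts == v).sum() for v in values])
expected = n_windows * stats.poisson.pmf(values, mu_hat)

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = len(values) - 1 - 1                    # bins - 1 - (parameters estimated)
p_value = stats.chi2.sf(chi2, dof)
print(chi2, dof, p_value)
```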

I noticed that the way I binned my data had a large effect on the p-value. For example, if I chose lots of very small bins (e.g. a separate bin for each integer: 78 counts/min, 79 counts/min, etc.), I got a small p-value and would have had to reject the null hypothesis. If, however, I binned my data into fewer bins (e.g. using the number of bins given by Sturges' rule, $1+\log_2(N)$), I got a much larger p-value and did NOT reject the null hypothesis.

Looking at my data, it looks extremely Poisson-distributed (it lines up almost perfectly with my expected counts per minute). That said, there are a few counts in bins very far away from the mean. That means that when computing the $\chi^2$ statistic with very small bins, I have a few terms like $$\frac{(\text{Observed}-\text{Expected})^2}{\text{Expected}} = \frac{(1-0.05)^2}{0.05}=18.05,$$ which leads to a high $\chi^2$ statistic and thus a low p-value. As expected, the problem goes away for larger bin widths, since the expected count never gets that low.
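
(The usual fix, as I understand it, is to pool bins until every expected count clears some threshold, conventionally 5. A rough sketch, reusing `observed`, `expected`, `np`, and `stats` from the block above, and merging inward from the tails since that is where the Poisson expected counts get tiny:)

```python
def pool_tails(observed, expected, min_expected=5.0):
    """Merge bins inward from each tail until every bin's expected
    count is at least min_expected (the usual rule-of-thumb threshold)."""
    obs, exp = list(observed), list(expected)
    while len(exp) > 2 and exp[0] < min_expected:     # pool the left tail
        exp[1] += exp[0]; obs[1] += obs[0]
        del exp[0], obs[0]
    while len(exp) > 2 and exp[-1] < min_expected:    # pool the right tail
        exp[-2] += exp[-1]; obs[-2] += obs[-1]
        del exp[-1], obs[-1]
    return np.array(obs), np.array(exp)

obs_pooled, exp_pooled = pool_tails(observed, expected)
chi2_pooled = ((obs_pooled - exp_pooled) ** 2 / exp_pooled).sum()
dof_pooled = len(obs_pooled) - 1 - 1                  # again minus 1 for the estimated mean
print(chi2_pooled, dof_pooled, stats.chi2.sf(chi2_pooled, dof_pooled))
```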

Questions:

Is there a good rule of thumb for choosing bin sizes when doing a $\chi^2$ GOF test?

Is this discrepancy between outcomes for different bin sizes something that I should have known about*, or is it indicative of some larger problem in my proposed data analysis?

- Thank you

*(I took a stats class in undergrad, but it's not my area of expertise.)

Bunji
  • Seems like a sensitivity and specificity issue, i.e. you're getting type-II errors because your measurements are too specific. – Jay Schyler Raadt Dec 23 '17 at 21:52
  • Sorry -- can you elaborate on that? What does "too specific" mean? – Bunji Dec 23 '17 at 21:54
  • A measurement that is too specific will produce type-II errors, but one that is too sensitive will produce type-I errors. For example, a very specific cutoff for an IQ test could leave a child with an IQ of 70.1 not qualifying for special education whereas a child with an IQ of 69.9 does. This would be a type-II error, where the null hypothesis "this child does not qualify" is falsely not rejected. Thus, a more sensitive measurement is needed, a bigger net, although too big a net might cause a type-I error where the null hypothesis is falsely rejected. – Jay Schyler Raadt Dec 23 '17 at 22:03
  • Sure, sure -- that makes sense. Are you saying that maybe the choice of bin size is a proxy for sensitivity/specificity somehow? – Bunji Dec 23 '17 at 22:06
  • I do not know, that's why I commented and I did not give you an answer to your question. – Jay Schyler Raadt Dec 23 '17 at 22:10
  • 1. The chi-square approximation can be quite poor if you have small expected values -- but you don't have to have constant bin-width either (as long as you're not choosing it with reference to the values of the observed counts). 2. "*Hopefully they'll get a large p-value and conclude that physics indeed works (yay).*" -- I expect you already know, but it should be made clear: failure to reject the null doesn't confirm that the null is true; it suggests that any deviation from Poisson wasn't large enough to reliably detect. ... ctd – Glen_b Dec 24 '17 at 09:36
  • ctd ... 3. If you know something about likely alternatives - e.g. if you know that deviations from the Poisson should be smooth - then there are tests with better power than the chi-squared test. – Glen_b Dec 24 '17 at 09:41
  • 1. If I stick with constant widths, is it OK to start the bins at the smallest count and end at the largest, or is that bad? 2. Of course! This, in many cases, will be the students' first exposure to both p-values and hypothesis testing. As such, there's a fine line I need to walk between technical correctness and seeing the forest for the trees. I want to make sure I can make a fairly plug-n-chug stats portion of the lab so that they can focus on the physics. But I _will_ try to be more careful with my language. 3. On a similar note, would there be any that are as simple to implement here? – Bunji Dec 24 '17 at 14:56
  • @Glen_b Given that the expected $\mu$ is so high, wouldn't an Anderson-Darling test be OK (with a simulated distribution, of course), if not somewhat over-conservative? For the OP to see counts of ~80, realistically $\mu$ is at least 40-45... – usεr11852 Apr 22 '19 at 15:45
  • At https://stats.stackexchange.com/a/17148/919 I list the requirements for constructing bins in the $\chi^2$ test, some of which are often overlooked with potentially disastrous results (as shown by example). This, then, establishes a minimum answer to your question. Tautologically, the *number* of bins (to be established independently of the data!) should be small enough to assure the distribution of the statistic under the null hypothesis is sufficiently well approximated by a $\chi^2$ distribution. – whuber Apr 23 '19 at 03:41
  • @usεr11852 With appropriate adjustment for both discreteness and estimation, quite possibly that would have good power across a variety of situations, though as with many omnibus tests there are also some issues with the Anderson-Darling (bias of the test, for example against light tails, may be anticipated). Ideally, one identifies classes of alternatives of particular interest and chooses a test which focuses power there, though this is not always possible, of course. – Glen_b Apr 23 '19 at 06:24
  • OK, thank you all for your attention to this. @whuber, your answer to the other question is incredible. Would you, then, say that the answer to my first question is basically just "no" -- there is no good rule of thumb at this level? – Bunji Apr 23 '19 at 20:19
  • There are many considerations. I think there may be some useful rules of thumb. For instance, I have usually been successful by guessing what the distribution of counts will be and creating bins expected to have approximately equal counts of 5 or more each; but it's rare to need more than 20 bins. Sometimes I'm looking for discrepancies within particular ranges, such as the distributional tails, and so within those ranges I might create narrower bins in order to detect detailed differences. – whuber Apr 23 '19 at 21:14
  • @whuber ok very cool -- thanks! – Bunji Apr 24 '19 at 01:41
  • @whuber By the way, since your latest comment is closest to the sort of answer I was looking for (i.e. fairly accessible to a non-stats audience), if you want to write it as an answer, I will happily give you the bounty. – Bunji Apr 25 '19 at 15:10
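
(For concreteness, here is a minimal sketch of the equal-expected-count binning whuber describes in the comments above: cut the fitted Poisson at its quantiles so each bin carries roughly the same expected probability. All numbers are placeholders, and the target of 10 bins is arbitrary.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder numbers again, not real data
true_rate, n_windows, n_bins = 80, 200, 10
counts = rng.poisson(true_rate, size=n_windows)
mu_hat = counts.mean()                       # the one estimated parameter

# Interior bin edges at the Poisson quantiles, so every bin carries roughly
# equal expected probability (duplicates removed because counts are discrete)
qs = np.linspace(0, 1, n_bins + 1)[1:-1]
edges = np.unique(stats.poisson.ppf(qs, mu_hat)).astype(int)

# Expected probability per bin: (-inf, e1], (e1, e2], ..., (e_k, inf)
cdf = stats.poisson.cdf(edges, mu_hat)
probs = np.diff(np.concatenate(([0.0], cdf, [1.0])))
expected = n_windows * probs

# Observed counts in the same bins: bin i collects counts in (edges[i-1], edges[i]]
bin_index = np.searchsorted(edges, counts, side="left")
observed = np.bincount(bin_index, minlength=len(probs))

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = len(probs) - 1 - 1                     # minus one more for the estimated mean
print(chi2, dof, stats.chi2.sf(chi2, dof))
```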

1 Answer


Is this discrepancy between outcomes for different bin sizes something that I should have known about*, or is indicative of some larger problem in my proposed data analysis?

The binning of the radioactive decay sample set is a red herring here. The real problem originates from the fact that the chi-square test (like other hypothesis-testing frameworks) is highly sensitive to sample size. As the sample size grows, deviations that are only a tiny fraction of the expected counts are enough to produce a large statistic, so with a very large sample we may find small p-values and "statistical significance" for discrepancies that are small and uninteresting. Conversely, a reasonably strong association may not come up as significant if the sample size is small.
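
To illustrate the sample-size point, here is a rough sketch with made-up numbers (not the OP's data): counts are drawn from a distribution that deviates only slightly from a Poisson (the rate jitters a little around 80), and the same fixed-bin chi-square test is run at increasing sample sizes. The deviation is essentially invisible at small N but yields a vanishing p-value at large N.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Fixed bins, set up once in advance: the deciles of a Poisson(80)
mu0 = 80.0
edges = np.unique(stats.poisson.ppf(np.linspace(0.1, 0.9, 9), mu0)).astype(int)
probs = np.diff(np.concatenate(([0.0], stats.poisson.cdf(edges, mu0), [1.0])))

def p_value(counts):
    """Pearson chi-square GOF p-value against Poisson(mu0) on the fixed bins."""
    observed = np.bincount(np.searchsorted(edges, counts, side="left"),
                           minlength=len(probs))
    expected = counts.size * probs
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return stats.chi2.sf(chi2, len(probs) - 1)   # no parameter estimated here

def sample(n):
    """Counts that deviate only slightly from Poisson(80): the rate jitters."""
    lam = np.clip(rng.normal(mu0, 3.0, size=n), 1.0, None)
    return rng.poisson(lam)

for n in (100, 1_000, 10_000, 100_000):
    print(n, p_value(sample(n)))
```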

Is there a good rule of thumb for choosing bin sizes when doing a $\chi^2$ GOF test?

The answer seems to be that one should not aim to find the "right" N (I am not sure it is doable, but it would be great if someone else chipped in to contradict me), but rather to look beyond p-values alone when N is high. This seems to be a good paper on the subject: Too Big to Fail: Large Samples and the p-Value Problem.

P.S. There are alternatives to the $\chi^2$ test, such as Cramér's V and the G-test; however, you will still hit the same issue of large N leading to small p-values.
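
For what it's worth, scipy exposes both tests on already-binned counts: `scipy.stats.chisquare` for Pearson's $\chi^2$ and `scipy.stats.power_divergence` with `lambda_="log-likelihood"` for the G-test. The binned counts below are made-up numbers just to show the calls; `ddof=1` accounts for one estimated parameter, as in the GOF test discussed above.

```python
import numpy as np
from scipy import stats

# Toy binned data (made-up numbers): ten roughly equal-probability bins
observed = np.array([18, 22, 25, 19, 21, 20, 17, 23, 20, 15])
expected = np.full(10, observed.sum() / 10)

# Pearson chi-square and the G-test on the same bins; ddof=1 reduces the
# degrees of freedom by one for the estimated mean
print(stats.chisquare(observed, expected, ddof=1))
print(stats.power_divergence(observed, expected, ddof=1, lambda_="log-likelihood"))
```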

Zhubarb