
I have a sequence of bytes. My null hypothesis is that each byte is drawn independently from a uniform distribution. What is the correct statistical tool for testing such a hypothesis?

My first thought was Pearson's chi-squared test. However, apparently that only works reliably when the expected frequency in each cell is reasonably large (depending on who you ask, a minimum expected frequency of ten or more). Given that each byte has 256 possible values, you're going to need a lot of bytes!
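For concreteness, here is the kind of test I have in mind, as a rough sketch in Python with NumPy/SciPy (my actual application isn't tied to Python; the function name is just for illustration):

```python
import numpy as np
from scipy.stats import chisquare

def chi2_uniformity_test(data: bytes):
    """Pearson chi-squared goodness-of-fit test of the byte counts
    against a uniform distribution over the 256 possible values."""
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    # With f_exp omitted, SciPy uses equal expected counts, i.e. len(data)/256
    # in every cell -- which is exactly the uniform null hypothesis.
    return chisquare(counts)

# Something plainly non-random should produce a vanishingly small p-value.
print(chi2_uniformity_test(b"AAAA" * 1000))
```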

Wikipedia suggests trying Fisher's exact test instead, but only explains how to do that for a $2\times2$ table. I don't really understand how to apply it here.

Is a chi-squared test a good idea? Are there tools which will work with less data? What other approaches can I try?

I'm less interested in detecting perfect randomness, and more interested in detecting things which are plainly and obviously not random.

Note: I'm looking for an algorithm that a machine can perform. Answers such as "draw a graph and eyeball it" don't help me. Humans are already very good at seeing non-random patterns anyway; my difficulty is teaching a machine to do this automatically.

MathematicalOrchid
  • It depends on which kinds of alternatives you want to give the most power to. For example, are the 256 possibilities regarded as just 256 separate categories, with no particular ordering, or are they 'ordered' in the sense that a value of '37' is regarded as *closer* to '36' than it is to '93' (I'm writing the byte values as decimal values)? Looking at it another way, is each bit exactly equally important, or are some bits more important? – Glen_b Jun 16 '13 at 09:59
  • @Glen_b 256 distinct categories, no particular ordering. Two bytes are simply _equal_ or _not equal_. In my particular application, that's all that matters. – MathematicalOrchid Jun 16 '13 at 10:31
  • 2
  • Then without any clear identification of preferred alternatives, the chi-square is pretty much as good as you can do (if your sample sizes are large you might look at a G-test I guess, but I don't think it matters). But the number of observations required will depend on how big a deviation from uniformity you care about - it may not need to be as large as all that. How big a sample can you take? – Glen_b Jun 16 '13 at 11:33
  • @Glen_b In some cases I'm limited to as little as 1,000 bytes. That's not really enough for chi-square (expect ~4 per bin). Perhaps I could try counting individual _bits_ instead to see if that deviates from 50/50... – MathematicalOrchid Jun 16 '13 at 12:14
  • Actually, quite a few people use an expected value of 5 per bin as the lower limit. Performing exact tests on such data is a hard problem to solve in any reasonable time. The computational problems are intense. StatXact is a program that has developed relatively fast algorithms for this problem. – Peter Flom Jun 16 '13 at 12:33
  • "*That's not really enough for chi-square*" --- well, actually, with equal expected numbers, that's plenty - but even if it wasn't enough for the chi-square approximation to work, all that would do is change the *null distribution*. That's a solvable problem; for example, by simulation, or by well constructed exact methods. Your much bigger problem is going to be lack of power, not lack of ability to carry out the test. – Glen_b Jun 16 '13 at 13:07
  • http://en.wikipedia.org/wiki/Diehard_tests – whuber Jun 16 '13 at 14:17
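Following up on Glen_b's point about simulating the null distribution when the asymptotic chi-squared approximation is in doubt: a rough sketch of what that might look like (again assuming Python/NumPy as the environment; the function name and the 10,000-replicate default are purely illustrative):

```python
import numpy as np

def chi2_stat(counts, expected):
    """Pearson chi-squared statistic for equal expected counts."""
    return ((counts - expected) ** 2 / expected).sum()

def simulated_p_value(data: bytes, n_sim: int = 10_000, seed: int = 0):
    """Chi-squared test of byte uniformity with a Monte Carlo null
    distribution, for samples (e.g. ~1,000 bytes) too small to trust
    the asymptotic chi-squared approximation."""
    rng = np.random.default_rng(seed)
    obs = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    n = int(obs.sum())
    expected = n / 256.0
    observed_stat = chi2_stat(obs, expected)

    # Draw n_sim samples of the same size from a genuinely uniform byte
    # source and count how often the statistic is at least as extreme.
    hits = sum(
        chi2_stat(np.bincount(rng.integers(0, 256, size=n), minlength=256),
                  expected) >= observed_stat
        for _ in range(n_sim)
    )
    # The +1 terms give a valid (slightly conservative) Monte Carlo p-value.
    return (hits + 1) / (n_sim + 1)
```

As noted above, this only fixes the null-distribution issue; with roughly 4 expected per cell the bigger limitation is still power against modest deviations from uniformity.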

0 Answers