I'm working on a research project for one of my professors. He wants to know whether a variable that takes on different values over a long period of time (say 1,000,000 different values) is i.i.d., and he would like me to design an independence test for it.

I've been learning about independence tests such as chi-square and McNemar, and I've seen a number of research papers about specific cases, but none seem to fit this one. What I'm most hung up on is that in all the examples I see, you test one variable against another. I suppose I could treat all 1,000,000 instances of this variable as different random variables and construct a test that way, but I assume there has to be a better way.

I'd appreciate it if you could point me in the right direction and recommend some good textbooks or other reference materials I can refer to! Thanks.

  • It may be that you don't need an independence test, but rather a change-point test for your time-distributed variable. – Match Maker EE Aug 09 '20 at 14:51
  • With a simple expedient, you don't need to design any tests (which is a fraught and huge subject: read Knuth's *Art of Computer Programming* for the details). Instead, you can capitalize on what the experts have learned simply by applying an empirical probability integral transform to create a series of *uniformly* distributed values and running a standard test of a pseudorandom number generator, such as the Diehard suite. – whuber Aug 09 '20 at 14:54
  • Start out with some simpler methods: plotting, computing the autocorrelation function, ... – kjetil b halvorsen Aug 09 '20 at 19:32

1 Answer

Here is some 'low-hanging fruit': a case where autocorrelation plots and perhaps a runs test reveal a marked lack of IID behavior in a sequence. (See @kjetilbhalvorsen's comment.)

Data from the late 1970s show that eruptions of Old Faithful geyser in Yellowstone National Park were of short (0) or long (1) duration (less or more than 3 min in length), approximately according to a 2-state Markov chain with no occurrences of two short eruptions in a row. Over the long run, the proportion of long eruptions is about 69%. The R code below simulates 2000 eruptions x according to this Markov chain.

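set.seed(2020)
m = 2000;  x = numeric(m);  x[1] = 0     # chain length; start with a short eruption
a = 1;  b = 0.44                         # a = P(short -> long), b = P(long -> short)
for (i in 2:m) {
  if (x[i-1]==0) x[i] = rbinom(1,1,a)    # after a short eruption, the next is always long
  else           x[i] = rbinom(1,1,1-b)  # after a long eruption, long again w.p. 0.56
  }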
mean(x==1)
[1] 0.7005
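
The 69% long-run figure follows from the transition probabilities used in the simulation. As a minimal sketch (the names `P`, `pi_long` and `Pn` are mine, chosen for illustration), the stationary probability of a long eruption in a two-state chain is $a/(a+b)$, and a high power of the transition matrix should give the same value:

```r
a <- 1      # P(short -> long), as in the simulation above
b <- 0.44   # P(long -> short)

# Transition matrix; rows = current state, columns = next state
# (state order: 0 = short, 1 = long)
P <- matrix(c(1 - a, a,
              b,     1 - b), nrow = 2, byrow = TRUE)

# Closed-form stationary probability of state 1 (long eruption)
pi_long <- a / (a + b)

# Numerical check: every row of P^50 approaches the stationary distribution
Pn <- diag(2)
for (k in 1:50) Pn <- Pn %*% P

print(round(pi_long, 4))
print(round(Pn, 4))
```

The simulated proportion `mean(x==1)` should land near this value for a chain of length 2000.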

By contrast, the sequence y has 2000 independent Bernoulli observations with success probability $p=0.7.$

set.seed(809)
y = rbinom(2000, 1, .7)

ACF plots show significant autocorrelations at the first few lags (bars outside the dotted bounds) for the Old Faithful chain (left), with the lag-1 autocorrelation the strongest. The Markov dependence "decays" after a few steps.

By contrast, there are no significant autocorrelations for the IID Bernoulli observations.

[Figure: ACF plots for the Old Faithful chain (left) and the IID Bernoulli sequence (right)]

par(mfrow=c(1,2))
 acf(x, main="Old Faithful")
 acf(y, main="Bernoulli")
par(mfrow=c(1,1))
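
To read the same information as numbers rather than from a plot, here is a hedged sketch. It re-simulates both sequences so that it runs on its own, compares the sample autocorrelations with the approximate 95% bounds $\pm 1.96/\sqrt{n}$ that `acf` draws, and adds a Ljung-Box test via `Box.test`, which is my addition and not part of the analysis above. The helper `sim_chain` and the other names are illustrative. For a two-state chain the lag-$k$ autocorrelation is $(1-a-b)^k$, here $(-0.44)^k$.

```r
# Simulate a 2-state Markov chain (hypothetical helper mirroring the loop above)
sim_chain <- function(m, a, b) {
  x <- numeric(m)                            # starts in state 0 (short eruption)
  for (i in 2:m) {
    p1 <- if (x[i - 1] == 0) a else 1 - b    # P(next state is 1)
    x[i] <- rbinom(1, 1, p1)
  }
  x
}

set.seed(2020)
x <- sim_chain(2000, a = 1, b = 0.44)
set.seed(809)
y <- rbinom(2000, 1, 0.7)

lags  <- 1:5
bound <- qnorm(0.975) / sqrt(length(x))      # approximate 95% bound
acf_x <- as.numeric(acf(x, lag.max = 5, plot = FALSE)$acf)[-1]  # drop lag 0
acf_y <- as.numeric(acf(y, lag.max = 5, plot = FALSE)$acf)[-1]
theory_x <- (1 - 1 - 0.44)^lags              # (1 - a - b)^k for the chain

print(round(rbind(theory_x, acf_x, acf_y), 3))
print(round(bound, 3))

# Portmanteau tests of "no autocorrelation up to lag 5"
lb_x <- Box.test(x, lag = 5, type = "Ljung-Box")
lb_y <- Box.test(y, lag = 5, type = "Ljung-Box")
print(c(x = lb_x$p.value, y = lb_y$p.value))
```

Sample autocorrelations whose absolute value exceeds `bound` correspond to the bars that cross the dotted lines in the plots.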

Here is a link to a recent discussion of runs tests on this site.
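
For a concrete starting point, here is a minimal sketch of a Wald-Wolfowitz runs test for a 0/1 sequence, using the usual normal approximation. The function `runs_test` is a hypothetical helper written for this illustration, not a base-R function, and the two sequences are re-simulated so the example stands alone.

```r
# Wald-Wolfowitz runs test for a binary (0/1) sequence, normal approximation
runs_test <- function(s) {
  n1   <- sum(s == 1)
  n0   <- sum(s == 0)
  n    <- n0 + n1
  runs <- length(rle(s)$lengths)               # observed number of runs
  mu   <- 2 * n0 * n1 / n + 1                  # expected number of runs under IID
  v    <- 2 * n0 * n1 * (2 * n0 * n1 - n) / (n^2 * (n - 1))  # variance under IID
  z    <- (runs - mu) / sqrt(v)
  list(runs = runs, expected = mu, z = z, p.value = 2 * pnorm(-abs(z)))
}

# Re-simulate the Markov chain and the IID Bernoulli sequence
set.seed(2020)
x <- numeric(2000)
for (i in 2:2000) x[i] <- rbinom(1, 1, if (x[i - 1] == 0) 1 else 0.56)
set.seed(809)
y <- rbinom(2000, 1, 0.7)

str(runs_test(x))   # chain: short eruptions never repeat, so runs are plentiful
str(runs_test(y))   # IID Bernoulli sequence, for comparison
```

A large positive `z` means the sequence switches state more often than an IID sequence with the same proportions of 0s and 1s would; a large negative `z` means it switches less often.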

Note: The ACF plot for lengths of Old Faithful eruptions is similar to one in Suess (2010) p146, Springer.

BruceET