
How would you test or check that sampling is IID (Independent and Identically Distributed)? Note that I do not mean Gaussian and Identically Distributed, just IID.

An idea that comes to my mind is to repeatedly split the sample into two sub-samples of equal size, perform the Kolmogorov-Smirnov test on each split, and check that the distribution of the p-values is uniform.
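A minimal sketch of this split-and-test idea, using only the Python standard library. The function names (`ks_two_sample`, `split_ks_pvalues`), the sample size, and the number of splits are illustrative assumptions, and the p-value uses the asymptotic Kolmogorov distribution, which is only an approximation for finite samples.

```python
import math
import random


def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic with an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both empirical CDFs past x, then compare them.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:
        # The Kolmogorov tail probability is essentially 1 this close to zero.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


def split_ks_pvalues(sample, n_splits=200, seed=0):
    """Randomly split the sample into two halves n_splits times; collect p-values."""
    rng = random.Random(seed)
    half = len(sample) // 2
    pvals = []
    for _ in range(n_splits):
        shuffled = sample[:]
        rng.shuffle(shuffled)
        _, p = ks_two_sample(shuffled[:half], shuffled[half:2 * half])
        pvals.append(p)
    return pvals


# Illustrative data: 200 simulated standard normal draws.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]
pvals = split_ks_pvalues(data)
print("smallest p-value:", round(min(pvals), 3))
print("share of p-values below 0.05:", sum(p < 0.05 for p in pvals) / len(pvals))
```

Note that every split reuses the same observations, so the p-values are not independent of one another, and because the splits are random the sketch makes no use of the order in which the data were collected.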

Any comment on that approach, or any other suggestion, is welcome.

Clarification after starting bounty: I am looking for a general test that can be applied to non time series data.

Ferdi
gui11aume

2 Answers


What you conclude about whether data is IID comes from outside information, not from the data itself. You as the scientist need to determine whether it is reasonable to assume the data is IID, based on how the data was collected and other outside information.

Consider some examples.

Scenario 1: We generate a set of data independently from a single distribution that happens to be a mixture of 2 normals.

Scenario 2: We first generate a gender variable from a binomial distribution, then within males and females we independently generate data from a normal distribution (but the normals are different for males and females), then we delete or lose the gender information.

In scenario 1 the data is IID, and in scenario 2 the data is clearly not identically distributed (different distributions for males and females). But the two scenarios are indistinguishable from the data alone; you have to know how the data was generated to tell them apart.
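A small simulation can illustrate why the two scenarios look the same once the gender label is gone. This is only a sketch: the mixing proportion and the two normal distributions are made-up values, and the helper names are hypothetical. Scenario 1 is drawn by inverting the CDF of the single mixture distribution; scenario 2 draws a label first and then discards it.

```python
import random
from statistics import NormalDist, mean, stdev

# Hypothetical parameters, chosen only for illustration.
P_MALE = 0.5
MALE = NormalDist(mu=178.0, sigma=7.0)
FEMALE = NormalDist(mu=165.0, sigma=6.0)


def mixture_cdf(x):
    """CDF of the two-component normal mixture."""
    return P_MALE * MALE.cdf(x) + (1.0 - P_MALE) * FEMALE.cdf(x)


def draw_scenario_1(rng):
    """One draw from the single mixture distribution, by bisection on its CDF."""
    u = rng.random()
    lo, hi = 100.0, 250.0  # bracket far wider than either component
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if mixture_cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def draw_scenario_2(rng):
    """Draw a gender label, draw from that gender's normal, then drop the label."""
    dist = MALE if rng.random() < P_MALE else FEMALE
    return rng.gauss(dist.mean, dist.stdev)


rng = random.Random(42)
n = 5000
s1 = [draw_scenario_1(rng) for _ in range(n)]
s2 = [draw_scenario_2(rng) for _ in range(n)]
print("scenario 1: mean", round(mean(s1), 2), "sd", round(stdev(s1), 2))
print("scenario 2: mean", round(mean(s2), 2), "sd", round(stdev(s2), 2))
```

Both samples come from the same marginal distribution, so summaries of the values alone should agree up to sampling noise.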

Scenario 3: I take a simple random sample of people living in my city and administer a survey and analyse the results to make inferences about all people in the city.

Scenario 4: I take a simple random sample of people living in my city and administer a survey and analyze the results to make inferences about all people in the country.

In scenario 3 the subjects would be considered independent (simple random sample of the population of interest), but in scenario 4 they would not be considered independent, because they were selected from a small subset of the population of interest and the geographic closeness would likely impose dependence. But the 2 datasets are identical; it is the way that we intend to use the data that determines whether they are independent or dependent in this case.

So there is no way, using only the data, to show that the data is IID. Plots and other diagnostics can reveal some types of non-IID behavior, but their absence does not guarantee that the data is IID. You can also compare against more specific assumptions (IID normal is easier to disprove than just IID). Any test can still only rule IID out; failing to reject never proves that the data is IID.

Decisions about whether you are willing to assume that IID conditions hold need to be made based on the science of how the data was collected, how it relates to other information, and how it will be used.

Edits:

Here is another set of examples, this time for non-identical distributions.

Scenario 5: the data is residuals from a regression where there is heteroscedasticity (the variances are not equal).

Scenario 6: the data is from a mixture of normals with mean 0 but different variances.

In scenario 5 we can clearly see that the residuals are not identically distributed if we plot the residuals against fitted values or other variables (predictors, or potential predictors), but the residuals themselves (without the outside info) would be indistinguishable from scenario 6.
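As a rough numerical stand-in for the residual plot, the sketch below simulates both scenarios. The variance function `0.5 + 2x`, the sample size, and the use of a correlation between absolute residuals and the predictor are all assumptions made for illustration, not part of the answer.

```python
import math
import random
from statistics import stdev


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)
n = 2000

# Scenario 5 (simulated): the residual spread grows with a recorded predictor x.
x = [rng.random() for _ in range(n)]
resid_5 = [rng.gauss(0.0, 0.5 + 2.0 * xi) for xi in x]

# Scenario 6 (simulated): the same scale mixture of mean-0 normals,
# but the variable driving the spread is never recorded.
resid_6 = [rng.gauss(0.0, 0.5 + 2.0 * rng.random()) for _ in range(n)]

# With the outside variable, the changing spread shows up clearly.
corr = pearson([abs(r) for r in resid_5], x)
print("corr(|residual|, x) in scenario 5:", round(corr, 3))

# Without it, only the residual values remain, and they look alike.
print("sd of residuals:", round(stdev(resid_5), 3), "vs", round(stdev(resid_6), 3))
```

The heteroscedasticity is only detectable here because `x` is available; the two lists of residuals on their own have the same distribution.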

Greg Snow
  • The first part of this answer, in particular, seems a little bit confused (or confusing) to me. Being iid is a well-defined *mathematical property* of a *finite set of random variables*. Your scenarios 1 and 2 *are identical* if the random variables in the second case are obtained "after losing the gender information". They're iid in both cases! – cardinal May 27 '12 at 01:15
  • @cardinal, so do you agree that the data in scenario 2 is not identically distributed before losing the gender information? So we would have a case where they are not identical, but the only way to tell the difference is to use information outside of the variable being looked at (gender in this case). Yes, being IID is a well-defined mathematical property, but so is being an integer. Can you test whether the data point 3. is an integer stored as a floating point number or a continuous value that has been rounded, without outside information about where it came from? – Greg Snow May 27 '12 at 03:34
  • So what you are saying is that there might exist some additional information contained in variables $Z$ so that marginally $X_i \perp X_j, i\neq j$, but $X_i|Z$ may no longer be independent of $X_j|Z$. In the first case, $Z$ is the vector of gender labels; in the second case, $Z$ is the design information. I think that's a good observation. – StasK May 27 '12 at 04:44
  • Greg, I understand perfectly what you're trying to say in your answer. I am simply pointing out that your first two scenarios could be adjusted to make the point better. For the purposes of testing the iid property of $X_1,\ldots,X_n$, it is irrelevant that there exists a set of random variables $Z_i$ such that $X_i|Z_i$ are not iid. In fact, in your scenario 1, such random variables will exist too (on a typical probability space). You're driving towards a point regarding the bigger picture (and an important one); but the example doesn't fit, in my opinion. – cardinal May 27 '12 at 11:55
  • (+50) Thank you for this great insight! @cardinal what example would you give instead of scenario 1? – gui11aume May 28 '12 at 11:42
  • GregSnow I don't completely agree with your assertion. It may be that you know that data come from a sequence of identically distributed random variables. You don't know exactly what model generated it. It could be that they are independently generated or alternatively came from a stationary time series. To decide which is the case, suppose that you know that the identical distribution is normal. Then both possibilities fall under the category of a stationary sequence, and it will be iid if and only if all the nonzero lag autocorrelations are 0. It is perfectly reasonable to test to see if the correla – Michael R. Chernick May 27 '12 at 02:27
  • But all of what you say above uses information about how the data was collected/generated, not just the data itself. And even if we have data that supports that there is no time series autocorrelation, that does not tell us anything about spatial correlation or other types of non-independence. Can we really test for every possible type of dependence and get meaningful results? Or should we use information about how the data was collected to guide which tests are most likely to be meaningful? – Greg Snow May 27 '12 at 06:26
  • I agree that there is no test that fits all. I think I made that point. It seems that testing for spatial correlation vs independence is different from testing whether a time series is correlated or not. It is the OP that was looking for a single test, not me. What I disagreed with you about was that the issue could not be posed in a way that you could test for it. You seem to say that we could only answer the issue based on our knowledge of the data generation mechanism. – Michael R. Chernick May 27 '12 at 12:39
  • Now suppose for example that we use a random selection procedure. Should we take it for granted that it functions properly? People often test gambling or lottery devices to see if they are working in a "random" way. This is done mainly by testing frequencies of outcomes but correlations can be tested too. – Michael R. Chernick May 27 '12 at 12:39
  • In many cases the available outside information about the data allows us to say exactly whether the sample is IID or not. But in some cases this information is not enough to say so with confidence, and then we can try to find an appropriate randomness test (such as the runs test) and use it. And Greg Snow clearly showed that it is impossible to say whether the sample is IID when we don't have any outside information about the data (i.e. when we just have an array of numbers and that's all). – Rodvi Sep 18 '21 at 08:03

If the data have an index ordering, you can use white noise tests for time series. Essentially that means testing that the autocorrelations at all nonzero lags are 0. This handles the independence part.

I think your approach mainly addresses the identically distributed part of the assumption, and I see some problems with it. You need a lot of splits to get enough p-values to test for uniformity, but then each K-S test loses power. If you are using splits that overlap on parts of the data set, the tests will be correlated. With a small number of splits the test of uniformity lacks power; with many splits the uniformity test may be powerful, but the individual K-S tests would not be. Also, it seems that this approach won't help detect dependence between variables.

@gui11aume I am not sure what you are asking for with a general test for non-time series. Spatial data provide one form of non-time series data; there, the variogram is a function that might be looked at. For one-dimensional sequences I don't see much difference between sequences ordered by time and sequences ordered any other way. An autocorrelation function can still be defined and tested. When you say that you want to test independence in sampling, I think you have an order in which the samples are collected. So I think all the 1-dimensional cases work the same way.
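The white-noise check described in this answer can be sketched as follows. The answer does not name a specific test; the Ljung-Box statistic used here is one common choice and is an assumption of this sketch, as are the function names, the choice of 10 lags, and the simulated series.

```python
import random


def sample_acf(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    n = len(series)
    m = sum(series) / n
    dev = [v - m for v in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(1, max_lag + 1)]


def ljung_box_q(series, max_lag):
    """Ljung-Box statistic Q = n(n+2) * sum_k r_k^2 / (n - k)."""
    n = len(series)
    acf = sample_acf(series, max_lag)
    return n * (n + 2) * sum(r * r / (n - k) for k, r in enumerate(acf, start=1))


# 95% quantile of the chi-square distribution with 10 degrees of freedom,
# the approximate null distribution of Q when 10 lags are used.
CHI2_95_DF10 = 18.307

rng = random.Random(3)
n = 1000

# Simulated independent draws.
white = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Simulated dependent series: AR(1) with coefficient 0.6.
ar1 = [0.0]
for _ in range(n - 1):
    ar1.append(0.6 * ar1[-1] + rng.gauss(0.0, 1.0))

for name, s in (("independent draws", white), ("AR(1)", ar1)):
    q = ljung_box_q(s, 10)
    verdict = "reject" if q > CHI2_95_DF10 else "do not reject"
    print(name, "Q =", round(q, 1), "->", verdict, "white noise at the 5% level")
```

Only the first 10 lags are examined, so dependence that appears only at longer lags would go unnoticed by this check.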

JJJ
Michael R. Chernick
  • (+1) since this is what I was thinking but Re: "If the data have an index ordering you can use white noise tests for time series. Essentially that means testing that the autocorrelations at all non zero lags are 0." - this logic only applies when you're dealing with a stationary time series, right? Otherwise, you could get misleading results about the lagged correlations. For example, what if only the "later" part of the time series was autocorrelated? – Macro May 18 '12 at 12:27
  • @Macro I thought that was what you had in mind based on your question to the OP. But I didn't think it was necessary to wait for his response to point this out. It applies when you are looking for independence. But I understand your point. In practice you only check the first k lags. If the series was stationary the correlations would decline with k but not so for nonstationary series. So at least in theory you would miss the correlation at large lags for a nonstationary series. – Michael R. Chernick May 18 '12 at 12:39
  • Well, for a non-stationary time series it may not even make sense to look at the autocorrelation as a function of lag. If ${\rm cor}(y_{t}, y_{s}) = f(s,t)$ and $f(s,t)$ is not a function of only $|s-t|$, then all sorts of weird things can happen by pretending it is. I'm really just asking if you have any ideas for the case where you know the time series is not stationary. – Macro May 18 '12 at 12:45
  • Thanks for your answer Michael! You are right: in case the data is a time series, checking the auto-correlation is the best approach. As for your criticism of the split K-S approach, you also have a point. So, we are still left with no test in the general (non time series) case it seems. – gui11aume May 18 '12 at 12:53
  • If a time series is not stationary, it is obviously not IID either. – gui11aume May 18 '12 at 13:06
  • True gui11aume. I guess I was fixating on the first "I" in IID. – Macro May 18 '12 at 13:14
  • @Macro After giving this some further thought, I don't think we should fixate on nonstationary alternatives either. Under the null hypothesis the series is white noise, which is trivially stationary. The big problem from the practical viewpoint, I think, is that you can't test infinitely many autocorrelations, and the number of lags to look at depends on the length of the series. Suppose we have the stationary process X(t)=e(t) +a e(t-60) with e(t) a white noise sequence and |a|<1. – Michael R. Chernick May 18 '12 at 14:35
  • The only non-zero autocorrelation (apart from lag 0) is at lag 60. If the time series has length 55, we can't even observe two points 60 lags apart, so we can't check whether the lag 60 correlation is 0 or not. If the length of the series is 65, we can estimate the lag 60 correlation, but based on only 5 lag 60 pairs. So the variance of the estimate is large and we won't have power to detect this non-zero correlation. – Michael R. Chernick May 18 '12 at 14:35
  • Somebody retracted an upvote. Was it for what I added to the answer or something else? – Michael R. Chernick May 26 '12 at 21:49
  • @Macro Regarding the autocorrelation function: it is true that nonstationarity means the correlation of the time series at time points s and t can depend on more than just their separation. But there are nonstationary time series where the mean function changes with time while the correlation function depends only on the time lag. I think a linear function of time with additive white noise, and a sine function plus additive white noise, are such examples. – Michael R. Chernick May 27 '12 at 02:49
  • @MichaelChernick, I thought stationarity also required a constant mean. – Macro May 27 '12 at 03:02
  • @Macro Yes, my example is intended to be an example of a nonstationary process where the autocorrelation nevertheless depends only on the time lag. – Michael R. Chernick May 27 '12 at 04:46
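The lag-60 example from the comments above, X(t)=e(t) +a e(t-60), can be checked with a short simulation. This is a sketch under assumptions: the value a = 0.5, the series length of 20000, and the helper names are chosen for illustration only. For this process the lag-60 autocorrelation works out to a/(1+a^2) from the definition.

```python
import random


def n_pairs(length, lag):
    """Number of observation pairs that are exactly `lag` steps apart."""
    return max(0, length - lag)


def lag_autocorr(series, lag):
    """Sample autocorrelation at a single lag."""
    n = len(series)
    m = sum(series) / n
    dev = [v - m for v in series]
    c0 = sum(d * d for d in dev)
    return sum(dev[t] * dev[t + lag] for t in range(n - lag)) / c0


def simulate(length, a, lag, rng):
    """Simulate X(t) = e(t) + a * e(t - lag) with standard normal white noise e."""
    e = [rng.gauss(0.0, 1.0) for _ in range(length + lag)]
    return [e[t + lag] + a * e[t] for t in range(length)]


# Pairs available at lag 60 for series of length 55 and 65.
print(n_pairs(55, 60), n_pairs(65, 60))  # prints: 0 5

rng = random.Random(11)
a = 0.5  # illustrative value with |a| < 1
long_series = simulate(20000, a, 60, rng)
print("lag 60 estimate:", round(lag_autocorr(long_series, 60), 3),
      "theory:", a / (1 + a * a))
print("lag 120 estimate:", round(lag_autocorr(long_series, 120), 3))
```

With a long series the lag-60 correlation is easy to estimate; with only 5 usable pairs, as in the length-65 case, the same estimate would be far too noisy to be useful.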