I have data that is non-IID, and I want to estimate whether the dependence is bad enough to have a noticeable effect on a fitted classifier. I don't think the exact model type will matter in this case, but for argument's sake let's say I'm using elastic-net logistic regression.
Related questions:

- On the importance of the i.i.d. assumption in statistical learning
- Regularized Logistic Regression: Lasso vs. Ridge vs. Elastic Net
- Predictive Modeling - Should we care about mixed modeling?
Ideally, I would like to be able to compare the fitted model from the non-IID data set to a "comparable" or "similar" IID data set. So I'm thinking I could just simulate such a data set, fit the model on both the fake data and the real data, and compare the fits. But that raises a few questions:
- Is there a formal or rigorous definition of "similarity" that makes sense in this case? I certainly know a dissimilar data set when I see one, but it's hard to quantify exactly how I know.
An R package can help here: spgs ("Statistical Patterns in Genomic Sequences") by Hart and Martínez, which could be extended to work with arbitrary symbolic sequences, not only DNA sequences.

It provides easy-to-use functions:
- diid.disturbance - Construct Feasible Random Noise Generating a Bernoulli Process: produces a sequence of random noise which would generate an observed sequence of finite symbols, provided that the sequence of symbols results from a Bernoulli process.
- diid.test - A Test for a Bernoulli Scheme (IID Sequence): tests whether or not a data series constitutes a Bernoulli scheme, that is, an independent and identically distributed (IID) sequence of symbols, by inferring the sequence of IID U(0,1) random noise that might have generated it.
| Category | Function | Test |
| --- | --- | --- |
| Uniformity | ks.unif.test | Kolmogorov-Smirnov test for uniform$(0,1)$ data |
| Uniformity | chisq.unif.test | Pearson's chi-squared test for discrete uniform data |
| Independence | lb.test | Ljung-Box $Q$ test for uncorrelated data |
| Independence | diffsign.test | signed difference test of independence |
| Independence | turningpoint.test | turning point test of independence |
| Independence | rank.test | rank test of independence |
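If you want a feel for what one of these tests does without installing the package, the turning point test is simple enough to sketch from scratch. Below is a hypothetical standalone Python version (not the package's code), using the standard normal approximation for the turning point count: under IID data the count $T$ has mean $2(n-2)/3$ and variance $(16n-29)/90$.

```python
import numpy as np
from math import erf, sqrt

def turningpoint_test(x):
    """Turning point test of independence for a numeric sequence.

    Counts local maxima and minima; under an IID null the count is
    approximately normal with mean 2(n-2)/3 and variance (16n-29)/90.
    Returns the z-statistic and a two-sided normal p-value.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    mid, left, right = x[1:-1], x[:-2], x[2:]
    # A turning point: strictly greater or strictly smaller than both neighbours.
    t = np.sum(((mid > left) & (mid > right)) | ((mid < left) & (mid < right)))
    mean = 2.0 * (n - 2) / 3.0
    var = (16.0 * n - 29.0) / 90.0
    z = (t - mean) / sqrt(var)
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return z, p

rng = np.random.default_rng(0)
iid = rng.normal(size=2000)
# A strongly dependent AR(1) series for contrast.
ar1 = np.zeros(2000)
for i in range(1, 2000):
    ar1[i] = 0.9 * ar1[i - 1] + rng.normal()

print(turningpoint_test(iid))  # typically not rejected
print(turningpoint_test(ar1))  # smooth series has too few turning points; rejected
```

A smooth, positively correlated series has far fewer turning points than an IID one, so the z-statistic goes strongly negative.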
Search for a library that exactly meets your requirements or modify one that nearly does.
Also check out the NIST SP 800-22 statistical test suite, which can be used to test your data.
- Is there a straightforward way to generate an IID dataset from a non-IID data set that otherwise preserves some structure from the joint distribution of features?
One option is to create something like a K-distribution by compounding existing distributions to meet your requirements, then use something like Mathematica's DistributionFitTest to check the generated IID data.
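A simpler option, sketched below with hypothetical data: resample whole rows i.i.d. (with replacement) from your real data set. Row-to-row dependence is destroyed, while the joint distribution of the features within a row is preserved exactly, since you are drawing from the empirical joint distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical non-IID data: each row drifts slowly from the previous one,
# so rows are strongly serially dependent.
n, p = 1000, 3
X = np.zeros((n, p))
for i in range(1, n):
    X[i] = 0.95 * X[i - 1] + rng.normal(size=p)

# IID surrogate: draw whole rows with replacement from the empirical joint
# distribution. Within-row feature structure is kept; the row ordering
# (and hence the serial dependence) is destroyed.
X_iid = X[rng.integers(0, n, size=n)]

def lag1(v):
    # Lag-1 autocorrelation of a single column.
    return np.corrcoef(v[:-1], v[1:])[0, 1]

print([round(lag1(X[:, j]), 2) for j in range(p)])      # near 0.95: dependent
print([round(lag1(X_iid[:, j]), 2) for j in range(p)])  # near 0: serially independent
```

A plain random permutation of the rows achieves the same effect while keeping every observation exactly once, if you prefer that to resampling.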
- Is this an X-Y problem? Is there a better way to evaluate the effect of data dependence on my estimates?
A tiny bit X-Y, maybe. The question sat for a year and a half without an answer; you were asked a few clarifying questions and answered one, but I have no complaints or questions.
Test with both generated random models (non-IID and IID; easy with the software suggested above, though a different package may offer more power) and with your own data, and with combinations of the two. Build a set of "ensemble algorithms" that don't blow up.
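One cheap way to quantify "bad enough" is to simulate it. For AR(1)-style dependence with correlation $\rho$, the variance of a sample mean is inflated by roughly $(1+\rho)/(1-\rho)$ relative to IID data of the same size, which translates directly into a smaller effective sample size. The Monte Carlo sketch below (illustrative only, not specific to elastic net) checks that:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, rho = 200, 4000, 0.8

def ar1_sample(n, rho, rng):
    # Stationary AR(1) series with unit marginal variance.
    x = np.empty(n)
    x[0] = rng.normal()
    for i in range(1, n):
        x[i] = rho * x[i - 1] + np.sqrt(1 - rho**2) * rng.normal()
    return x

means_iid = np.array([rng.normal(size=n).mean() for _ in range(reps)])
means_ar1 = np.array([ar1_sample(n, rho, rng).mean() for _ in range(reps)])

inflation = means_ar1.var() / means_iid.var()
# Theory for AR(1): Var(mean) is inflated by about (1+rho)/(1-rho) = 9,
# i.e. 200 dependent observations carry roughly as much information
# about the mean as ~22 IID ones.
print(round(inflation, 1))
```

If the analogous inflation for your estimand is close to 1, the dependence is probably ignorable; if it is large, your standard errors (and anything downstream of them) are badly optimistic.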
- For a purely predictive task, does non-IID data even make a difference as long as the cross-validation procedure is constructed correctly? This answer suggests the answer is "no": Predictive Modeling - Should we care about mixed modeling?.
That answer requested further information and links; moreover, it was not an absolute "no".
See also:
"Choosing the Correct Statistical Test" (Derived from Dr. Leeper's work) - General guidelines for choosing a statistical analysis and links showing how-to using SAS, Stata, SPSS and R.
Choosing a statistical test, Data Analysis Resource Center, and GraphPad Analysis Checklist.
I might have time to return for another edit and improve this answer.