I have data that is non-IID, and I want to estimate whether the dependence is bad enough to have a noticeable effect on a fitted classifier. I don't think the exact model type will matter in this case, but for argument's sake let's say I'm using elastic-net logistic regression.
Related questions:

- On the importance of the i.i.d. assumption in statistical learning
- Regularized Logistic Regression: Lasso vs. Ridge vs. Elastic Net
- Predictive Modeling - Should we care about mixed modeling?
Ideally, I would like to be able to compare the fitted model from the non-IID data set to a "comparable" or "similar" IID data set. So I'm thinking I could just simulate such a data set, fit the model on both the fake data and the real data, and compare the fits. But that raises a few questions:
- Is there a formal or rigorous definition of "similarity" that makes sense in this case? I certainly know a dissimilar data set when I see one, but it's hard to quantify exactly how I know.
An R package can help here: spgs ("Statistical Patterns in Genomic Sequences") by Hart and Martínez, which could be extended to work with arbitrary symbolic sequences, not only DNA sequences.

It provides easy-to-use functions:
- diid.disturbance - Construct Feasible Random Noise Generating a Bernoulli Process: produces a sequence of random noise which would generate an observed sequence of finite symbols, provided that the sequence of symbols results from a Bernoulli process.
- diid.test - A Test for a Bernoulli Scheme (IID Sequence): tests whether or not a data series constitutes a Bernoulli scheme, that is, an independent and identically distributed (IID) sequence of symbols, by inferring the sequence of IID U(0,1) random noise that might have generated it.
| Category | Function | Test |
| --- | --- | --- |
| Uniformity | ks.unif.test | Kolmogorov-Smirnov test for uniform$(0,1)$ data |
| Uniformity | chisq.unif.test | Pearson's chi-squared test for discrete uniform data |
| Independence | lb.test | Ljung-Box $Q$ test for uncorrelated data |
| Independence | diffsign.test | signed difference test of independence |
| Independence | turningpoint.test | turning point test of independence |
| Independence | rank.test | rank test of independence |
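If you want a feel for what one of these tests does without installing the package, the turning point test is simple enough to sketch from scratch. Below is a hypothetical standalone Python version (not the package's code), using the standard normal approximation for the turning point count: under IID data the count $T$ has mean $2(n-2)/3$ and variance $(16n-29)/90$.

```python
import numpy as np
from math import erf, sqrt

def turningpoint_test(x):
    """Turning point test of independence for a numeric sequence.

    Counts local maxima and minima; under an IID null the count is
    approximately normal with mean 2(n-2)/3 and variance (16n-29)/90.
    Returns the z-statistic and a two-sided normal p-value.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    mid, left, right = x[1:-1], x[:-2], x[2:]
    # A turning point: strictly greater or strictly smaller than both neighbours.
    t = np.sum(((mid > left) & (mid > right)) | ((mid < left) & (mid < right)))
    mean = 2.0 * (n - 2) / 3.0
    var = (16.0 * n - 29.0) / 90.0
    z = (t - mean) / sqrt(var)
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return z, p

rng = np.random.default_rng(0)
iid = rng.normal(size=2000)
# A strongly dependent AR(1) series for contrast.
ar1 = np.zeros(2000)
for i in range(1, 2000):
    ar1[i] = 0.9 * ar1[i - 1] + rng.normal()

print(turningpoint_test(iid))  # typically not rejected
print(turningpoint_test(ar1))  # smooth series has too few turning points; rejected
```

A smooth, positively correlated series has far fewer turning points than an IID one, so the z-statistic goes strongly negative.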
Search for a library that exactly meets your requirements or modify one that nearly does.
Also check out the NIST SP 800-22 statistical test suite, which can be used to test your data.
- Is there a straightforward way to generate an IID dataset from a non-IID data set that otherwise preserves some structure from the joint distribution of features?
One option is to create something like a K-distribution by compounding existing distributions to meet your requirements, then use something like Mathematica's DistributionFitTest to check the generated IID data.
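A simpler option, sketched below with hypothetical data: resample whole rows i.i.d. (with replacement) from your real data set. Row-to-row dependence is destroyed, while the joint distribution of the features within a row is preserved exactly, since you are drawing from the empirical joint distribution:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical non-IID data: each row drifts slowly from the previous one,
# so rows are strongly serially dependent.
n, p = 1000, 3
X = np.zeros((n, p))
for i in range(1, n):
    X[i] = 0.95 * X[i - 1] + rng.normal(size=p)

# IID surrogate: draw whole rows with replacement from the empirical joint
# distribution. Within-row feature structure is kept; the row ordering
# (and hence the serial dependence) is destroyed.
X_iid = X[rng.integers(0, n, size=n)]

def lag1(v):
    # Lag-1 autocorrelation of a single column.
    return np.corrcoef(v[:-1], v[1:])[0, 1]

print([round(lag1(X[:, j]), 2) for j in range(p)])      # near 0.95: dependent
print([round(lag1(X_iid[:, j]), 2) for j in range(p)])  # near 0: serially independent
```

A plain random permutation of the rows achieves the same effect while keeping every observation exactly once, if you prefer that to resampling.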
- Is this an X-Y problem? Is there a better way to evaluate the effect of data dependence on my estimates?
A tiny bit X-Y, maybe. The question sat for a year and a half without an answer; you were asked a few clarifying questions and answered one, but I have no complaints or questions.
Test with both generated random models (non-IID and IID; easy with the software suggested above, though a different package may offer more power) and with your own data, and with combinations of the two. Build a set of "ensemble algorithms" that don't blow up.
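One cheap way to quantify "bad enough" is to simulate it. For AR(1)-style dependence with correlation $\rho$, the variance of a sample mean is inflated by roughly $(1+\rho)/(1-\rho)$ relative to IID data of the same size, which translates directly into a smaller effective sample size. The Monte Carlo sketch below (illustrative only, not specific to elastic net) checks that:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, rho = 200, 4000, 0.8

def ar1_sample(n, rho, rng):
    # Stationary AR(1) series with unit marginal variance.
    x = np.empty(n)
    x[0] = rng.normal()
    for i in range(1, n):
        x[i] = rho * x[i - 1] + np.sqrt(1 - rho**2) * rng.normal()
    return x

means_iid = np.array([rng.normal(size=n).mean() for _ in range(reps)])
means_ar1 = np.array([ar1_sample(n, rho, rng).mean() for _ in range(reps)])

inflation = means_ar1.var() / means_iid.var()
# Theory for AR(1): Var(mean) is inflated by about (1+rho)/(1-rho) = 9,
# i.e. 200 dependent observations carry roughly as much information
# about the mean as ~22 IID ones.
print(round(inflation, 1))
```

If the analogous inflation for your estimand is close to 1, the dependence is probably ignorable; if it is large, your standard errors (and anything downstream of them) are badly optimistic.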
- For a purely predictive task, does non-IID data even make a difference as long as the cross-validation procedure is constructed correctly? This answer suggests the answer is "no": Predictive Modeling - Should we care about mixed modeling?.
That answer requested further information and links; moreover, it was not an absolute "no".
See also:
"Choosing the Correct Statistical Test" (Derived from Dr. Leeper's work) - General guidelines for choosing a statistical analysis and links showing how-to using SAS, Stata, SPSS and R.
Choosing a statistical test, Data Analysis Resource Center, and GraphPad Analysis Checklist.
I might have time to return for another edit and improve this answer.