
Consider a prediction setting in which we are interested in training a regression or classification function $f$ with inputs $X \in \mathbb{R}^k$ and target $Y$, and assessing its expected generalization performance on new data. For example, in a regression setting, we might want to estimate what our model's mean absolute error (MAE) or root mean square error (RMSE) will be on unseen data, and we typically do that by evaluating our model's predictions on a held-out test set.

Many standard textbooks (for example, Chapter 7, Model Assessment and Selection, of The Elements of Statistical Learning) discuss data partitioning and cross-validation in a context where we are dealing with i.i.d. samples from a joint distribution $F\left(X, Y\right)$. In this case, constructing the test set is straightforward: we take a simple random sample and call it our test set.
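For concreteness, here is a minimal sketch of the i.i.d. baseline I have in mind (simulated data and scikit-learn, purely illustrative): fit a model on a simple random training split and report MAE and RMSE on the held-out test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                      # inputs X in R^k, here k = 3
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

# Under i.i.d. sampling, a simple random split gives a representative test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("MAE :", mean_absolute_error(y_test, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
```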

Are there any standard references that discuss how model assessment strategies should be modified when data are not i.i.d.? I've seen discussions that are specific to particular settings (for example, Data partitioning for spatial data is about spatial data), but I was wondering whether there is a general reference covering multiple settings.

Examples include:

  • Spatial data: the model may generalize more easily to points that are geographically close to the training set, and if we are interested in estimating the model's ability to generalize to points that are far from the training set, we'd need to account for that in our data partitioning / test set construction
  • Data with a natural discrete group or hierarchical structure: for example, if we are dealing with patients and hospitals, the question "how does my model generalize to new patients in a hospital that was included in the training set" differs from "how does my model generalize to new hospitals", and we should construct our test set(s) accordingly; we may even want to answer both questions, which we could do with two different test sets (one on held-out patients and another on held-out hospitals). A short splitter sketch after this list illustrates this grouped case alongside the time series case below.
  • Similarly, if dealing with panel data (such as observing individuals over time), the question "How does my model do when predicting on an individual who was already observed $K$ times?" might have a different answer than "How does my model do on its first prediction for an individual who was never observed before?"
  • Time series data: in some time series contexts (see https://stats.stackexchange.com/a/195438/9330 for an exception), we want to construct our test sets such that they cover a period that is entirely "in the future" relative to the training set
  • More complex real-world examples might involve multiple "complications" relative to the simple i.i.d. case: we might be dealing with spatiotemporal data, for example, and it might also have a hierarchical or group structure
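For illustration (hypothetical data and group labels, using the kind of splitters scikit-learn provides), "constructing the test set(s) accordingly" in the grouped and time series cases might look like this:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

n = 12
X = np.arange(n).reshape(-1, 1)
y = np.arange(n, dtype=float)
hospital = np.repeat([0, 1, 2, 3], 3)          # illustrative group label, e.g. hospital ID

# Grouped data: GroupKFold keeps all rows of a hospital together, so every test
# fold contains only hospitals that were never seen during training.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=hospital):
    print("held-out hospitals:", np.unique(hospital[test_idx]))

# Time series: each test fold lies entirely "in the future" of its training data.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train up to index", train_idx.max(), "-> test", test_idx.min(), "to", test_idx.max())
```

(None of these built-in splitters covers the spatial case directly.)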

Does any textbook or paper discuss cross-validation or test set construction in real-world problems in a way that is general enough to cover all of the examples above (and possibly more)?


Adrian
  • https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation covers some of these topics (grouped data, time series -- but not spatial data). Does anyone have a good textbook reference that covers all of this in more detail? Or a paper? – Adrian Feb 11 '22 at 23:06
  • Excellent question that many people in practice pay too little attention to and I'd love more good references on this. I personally keep sharing this post https://www.fast.ai/2017/11/13/validation-sets/ and keep looking at what people do in places like Kaggle. The "How to win a data science competition" course on Coursera has some nice materials on this, too. – Björn Feb 12 '22 at 09:06
  • That post is definitely related to this question, thank you for the link. The example that says "you would be most interested in how the model performs on drivers you haven’t seen before" could fit into the panel data or group structure examples I gave above. – Adrian Feb 14 '22 at 22:33
  • This heavily depends on the goals and what you expect at deployment time. So it is domain specific and not really a machine learning topic. You need to split on anything that you want to *claim* generalization capacity for. So you must split on person, on sensor brand, source country, etc. depending on the use case. If you'll have the model trained on location-specific data on deployment, you don't need to split on location, as you don't care about generalization wrt location. Depending on the dataset, splitting on many things may force you to discard a lot of data. – isarandi Feb 16 '22 at 21:01
  • I agree that the answer will often be domain-specific, and that "you need to split on anything that you want to claim generalization capacity for" (that's a great way of putting it). But I don't see how that makes this "not a machine learning topic"! – Adrian Feb 18 '22 at 19:37
  • This type of question, a request for resources (and on a broad topic), is not very standard here. The type of question does occur, but it doesn't fit the system of voting and accepting answers well. It is a very broad question, and it is not clear what answer should be upvoted and/or accepted. Adrian, is the answer by @ClosedLimelikeCurves satisfying? If it needs improvement, could you guide people here on how it should be improved? – Sextus Empiricus Feb 22 '22 at 08:05

2 Answers


It really all boils down to two rules of thumb:

  1. When splitting your data, leave out what you want to predict. If you want to generalize to new hospitals, rather than new patients at the same hospital, leave out one hospital at a time when doing CV — do not leave out one patient at a time, as this only tests your ability to generalize to patients at the same hospital.
  2. When doing cross-validation, split your data into folds that can be considered approximately independent. For example, with time series data, you want to leave out a single contiguous run or “chunk” of observations at a time. If you have a time series running from 1900 to 2000 and want to use 10 folds, the first fold should be the first 10 years, the second the next 10, and so on. The idea here is that even if a time series isn’t independent, we can treat 10 years as long enough for most of the correlation to disappear, especially if we’re comparing models that are already reasonable at dealing with the time series structure of our data. If we assign each year to a random fold, a model can easily “cheat” by assuming that 2020 will look much the same as 2019 and 2021, whereas predicting 2020 from data that ends in 2010 is genuinely hard. A correlogram can help you identify how long a lag is needed before you can consider each block “basically independent” of the others. (A short code sketch after this list illustrates both kinds of split.)
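Here is a minimal sketch of both rules. The data, group labels, and year range are made up; only the splitting logic matters, and scikit-learn's LeaveOneGroupOut is just one convenient implementation.

```python
# Rule 1: leave out what you want to predict. To estimate generalization to
# *new hospitals*, hold out one hospital at a time (LeaveOneGroupOut).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X[:, 0] + rng.normal(size=200)
hospital = rng.integers(0, 10, size=200)          # illustrative group labels

scores = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=0),
    X, y, groups=hospital, cv=LeaveOneGroupOut(),
    scoring="neg_mean_absolute_error",
)
print("leave-one-hospital-out MAE:", -scores.mean())

# Rule 2: contiguous, approximately independent blocks for time series.
# E.g. years 1900-1999 split into ten consecutive decades, one decade per fold.
years = np.arange(1900, 2000)
fold = (years - 1900) // 10                       # fold 0 = 1900-1909, etc.
print("fold sizes:", np.bincount(fold))
# These fold labels can be passed to scikit-learn via PredefinedSplit, or the
# folds can be iterated over manually, so each test block is one contiguous decade.
```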

Some relevant papers:

https://onlinelibrary.wiley.com/doi/abs/10.1111/ecog.02881

https://www.sciencedirect.com/science/article/pii/S0020025511006773

https://www.tandfonline.com/doi/full/10.1080/00949655.2020.1783262

You can also check out the sperrorest R package for this.

  • "Data often show temporal, spatial, hierarchical ... or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross-validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause for the poor performance of uncorrected (random) cross-validation, noted often by modellers, are dependence structures in the data that persist as dependence structures in model residuals, violating the assumption of independence." Exactly what I had in mind, thank you! – Adrian Feb 22 '22 at 18:45
  • Also related (although it's specific to spatial data): https://www.nature.com/articles/s41467-020-18321-y `Spatial validation reveals poor predictive performance of large-scale ecological mapping models` – Adrian Feb 22 '22 at 18:50
  • @Adrian forgot to mention -- you can also use time-series models in general and then look at the usual diagnostics like AIC or R^2. If you model your time series well, these measures will let you predict out-of-sample performance, because your data will be *conditionally* independent. For instance, an AR(1) model implies your data are independent, after correcting for a linear function of the previous value. – Closed Limelike Curves Feb 23 '22 at 05:39

In a nutshell:

You are referring to the problem of selection bias in data, and you seem to be concerned with one particular type of selection bias: covariate shift.

Details:

If your data is not iid, your problem already starts in the training phase: all the major models presume iid data; otherwise, they don't work. Take, for example, (deep) neural networks: the data needs to be independent because that is a sufficient condition for stochastic gradient descent (backpropagation) to work; without it, SGD will probably fail. And we need the data to be identically distributed because we are building a single model that is supposed to describe them all.

From your post, I gather that you are mainly concerned with the "identically distributed" part of "iid", so I will focus on that.

The real problem in the situations you describe above is that the training dataset is biased. One way to detect this is by looking at the generalization capabilities of your trained model, and for that to work you don't need special methods; you just need to make sure that your test dataset is representative of the population, i.e. unbiased.

Your question is: "Are there any standard references that discuss how model assessment strategies should be modified when data are not i.i.d.?" But you should change your data selection method, your model, and your training in this case, not the model assessment strategies. Model assessment should assess generalization performance in any case, no matter whether your data is identically distributed or not.

You mention cross-validation (CV): CV approximates the prediction error on test data by applying its folding technique to the training data, and one uses this approximation, for instance, when choosing the model or hyperparameters that minimize it. Now, if CV is given biased data, it cannot properly approximate the prediction error, and its estimate of generalization performance will be misleading. The same goes for other model assessment methods like AIC, BIC, WAIC, ...: they all try to predict generalization performance from the training data, which is futile if this training data is biased. Any method can only use what it is given, the training dataset, and if that is biased, the model assessment will be biased. Given only biased data, there is simply no method that can properly assess model quality (i.e. generalization performance).

Covariate shift as a special type of selection bias

As far as I can see, the situations you are interested in are examples of a particular type of bias: covariate shift. This means that, if your data consist of pairs $(\mathbf{x}_i, y_i)$, the conditional distribution $p_{train}(y|\mathbf{x})$ of the training data and the conditional distribution $p_{test}(y|\mathbf{x})$ of the test data are the same, but the marginal distributions of the independent variables $\mathbf{x}$ differ: $$ \begin{align} p_{train}(y|\mathbf{x}) &= p_{test}(y|\mathbf{x})\\ p_{train}(\mathbf{x}) &\ne p_{test}(\mathbf{x}). \end{align} $$

This is arguably the simplest of all forms of selection bias. Being able to distinguish between different kinds of bias might help when searching for relevant literature and reading the papers referenced below.
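To illustrate the point about CV above with a purely simulated example (made-up data; any regression model and metric would do): $p(y|\mathbf{x})$ is identical for training and test data, but the marginal of $\mathbf{x}$ is shifted, and the random-CV estimate computed on the (biased) training data is far from the error actually incurred on the test distribution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def make_data(x_mean, n=2000):
    x = rng.normal(loc=x_mean, scale=1.0, size=(n, 1))
    y = np.sin(x[:, 0]) + rng.normal(scale=0.1, size=n)   # same p(y|x) everywhere
    return x, y

X_tr, y_tr = make_data(x_mean=0.0)   # training covariates centred at 0
X_te, y_te = make_data(x_mean=3.0)   # test covariates centred at 3 (covariate shift)

model = LinearRegression()
cv_mse = -cross_val_score(model, X_tr, y_tr, cv=10,
                          scoring="neg_mean_squared_error").mean()
test_mse = np.mean((model.fit(X_tr, y_tr).predict(X_te) - y_te) ** 2)
print(f"10-fold CV MSE on training data: {cv_mse:.3f}")   # small
print(f"MSE on the shifted test data:    {test_mse:.3f}") # much larger
```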

What do we do in the rare case that we have information about the population that goes beyond the training data?

Imagine that, besides the training dataset, you have additional information about the population. Maybe you know the distribution of the whole data and you just don't know whether your training subset is biased. Maybe you can draw on your domain knowledge. Maybe you have data from related datasets. There is a lot of ongoing research on how this additional information can help you reduce the impact of selection bias.

For example: what do you need in order to figure out whether the dataset you are given is biased? And if it is biased, is it possible to adapt the learning procedure to this bias? There is a lot of pertinent research on these questions; one notable paper is:

  • Bareinboim, Elias, Jin Tian, and Judea Pearl. "Recovering from selection bias in causal and statistical inference." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 28. No. 1. 2014.

They describe what different kinds of selection bias there are, how to detect them, how to understand whether they hurt your results, and how to recover if they do. Another paper, which deals explicitly with recovering from covariate shift in the training data when the true (unbiased) distribution is known, is:

  • Huang, Jiayuan, et al. "Correcting sample selection bias by unlabeled data." Advances in neural information processing systems 19 (2006).

There are many more papers of this kind, but these two are among the best known.
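As a hedged sketch of the general idea behind such corrections (not the exact algorithm of either paper; Huang et al., for instance, use kernel mean matching): estimate the density ratio $p_{test}(\mathbf{x}) / p_{train}(\mathbf{x})$, here with a probabilistic classifier that separates training from test covariates, and use it to reweight the training samples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=1000)
X_test_unlabeled = rng.normal(2.0, 1.0, size=(1000, 1))   # shifted covariates, no labels

# A classifier separating "train" from "test" covariates yields p(test|x)/p(train|x),
# which is proportional to the density ratio p_test(x)/p_train(x).
domain_X = np.vstack([X_train, X_test_unlabeled])
domain_y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test_unlabeled))])
clf = LogisticRegression().fit(domain_X, domain_y)
p_test = clf.predict_proba(X_train)[:, 1]
weights = p_test / np.clip(1.0 - p_test, 1e-6, None)

# Train the downstream model with these importance weights.
model = LinearRegression().fit(X_train, y_train, sample_weight=weights)
```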

frank
  • "From your post, I gather that you are mainly concerned with the 'identically distributed' part of 'iid'" -- that's a helpful clarifying question, but I'd say the opposite: I'm actually more focused on (violations of) the "independent" part of IID, as in the examples I gave (spatial data, panel data, ...). – Adrian Feb 20 '22 at 00:42
  • Let's consider the spatial data example: there is some true distribution $p(x,y)$ (that we do not know) and from that, we take two samples $(x_1, y_1)$ and $(x_2, y_2)$ (where $x$ is "spatial"). Now, dependence means that the distribution of the second sample differs depending on the first sample: $p((x_2, y_2) | (x_1, y_1) = (a, b)) \ne p((x_2, y_2) | (x_1, y_1) = (c, d))$. But above you referred to the situation where the distribution of $y$ depends on the spatial $x$, i.e. $p(y|x_1) \ne p(y|x_2)$ for $x_1 \ne x_2$. Those are two different concepts. – frank Feb 20 '22 at 04:44
  • @Adrian Maybe you can give me your definition of independence? – frank Feb 21 '22 at 12:24
  • This is not correct. There is a huge literature on training and building models that can deal with correlated data well. Bayesian hierarchical models can deal with clustered data easily; time series analysis is a massive field in statistics; and geostatistics is out there too. Moreover, none of this is related to selection bias. – Closed Limelike Curves Feb 22 '22 at 07:28
  • @Closed Limelike Curves I am not saying that there are no models for correlated data. But the situation the OP refers to, like his example with spatial data, is the problem of selection bias. – frank Feb 22 '22 at 08:40
  • @Closed Limelike Curves The OP says: "we are interested in estimating the model's ability to generalize to points that are far from the training set". So his question is about a model generalizing to other data, not about including that other data in the model, as one would with e.g. time series models, because then that other data would also be part of the training data. – frank Feb 22 '22 at 08:44