7

As far as I know, statistical/machine learning algorithms always assume that data are independent and identically distributed ($iid$).

My question is: what can we do when this assumption is clearly unsatisfied? For instance, suppose that we have a data set with repeated measurements on the same observations, so that both the cross-section and the time dimensions are important (what econometricians call a panel data set, and statisticians call longitudinal data, which is distinct from a time series).

An example could be the following. In 2002, we collect the prices (henceforth $Y$) of 1000 houses in New York, together with a set of covariates (henceforth $X$). In 2005, we collect the same variables on the same houses. The same happens in 2009 and 2012. Say I want to understand the relationship between $X$ and $Y$. Were the data $iid$, I could easily fit a random forest (or any other supervised algorithm, for that matter), thus estimating the conditional expectation of $Y$ given $X$. However, there is clearly some auto-correlation in my data. How can I handle this?

kjetil b halvorsen
Plastic Man

3 Answers

14

There is nothing in the theory of statistical learning or machine learning that requires samples to be i.i.d.

When samples are i.i.d., you can write the joint probability of the samples given some model as a product, namely $P(\{x\}) = \prod_{i} P_i(x_i)$, which makes the log-likelihood a sum of the individual log-likelihoods. This simplifies the calculation, but is by no means a requirement.

In your case you can, for example, model the distribution of a pair $(x_i, y_i)$ with some bivariate distribution, say $z_i = (x_i, y_i)^T$, $z_i \sim \mathcal{N}(\mu, \Sigma)$, and then estimate the parameter $\Sigma$ from the likelihood $P(\{z\}) = \prod_{i} P(z_i \mid \mu, \Sigma)$.
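As a concrete sketch of this illustration (the bivariate normal and the simulated data below are my own toy choices, not a prescription): maximising the likelihood $\prod_i P(z_i \mid \mu, \Sigma)$ for a multivariate normal yields the familiar closed-form estimators, the sample mean and the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 draws of z_i = (x_i, y_i) from a bivariate normal
true_mu = np.array([1.0, 2.0])
true_sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
z = rng.multivariate_normal(true_mu, true_sigma, size=200)

# Maximising the joint likelihood prod_i P(z_i | mu, Sigma) gives the
# closed-form MLEs: the sample mean and the (biased) sample covariance
mu_hat = z.mean(axis=0)
sigma_hat = (z - mu_hat).T @ (z - mu_hat) / len(z)
```

With more structure in the data (e.g. repeated measurements per house), one would extend $\Sigma$ to encode the within-subject correlation rather than treating all pairs as exchangeable.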

It is true that many out-of-the-box algorithm implementations implicitly assume independence between samples, so you are correct in identifying that you will have a problem applying them to your data as is. You will either have to modify the algorithm or find one that is better suited to your case.

J. Delaney
  • 1,447
  • 1
  • 8
  • I see, this is a good answer. I need two clarifications, though. 1) You mean that I could estimate the parameter $\mu$? 2) Your solution works if I am willing to assume a (bivariate) parametric distribution for my data. As far as I understood, ML is all about avoiding such assumptions, and learning the best function from empirical evidence (as discussed in Breiman, 2001). – Plastic Man Feb 07 '22 at 14:04
  • 1
    1) Yes, you could estimate both $\mu$ and $\Sigma$, but note that I just gave this model for illustration; what is best for you will depend on the details of your data. 2) This is a big question by itself, but every ML model is driven by some assumptions, either explicit or implicit (for example: the choice of a loss function). Just as you pointed out, if a certain model assumes that samples are i.i.d., it will not work well for data that is not. There is no "universal" ML model, so you always have to think about which underlying assumptions hold for your data – J. Delaney Feb 07 '22 at 14:34
  • 4
    Assuming iid when it's not satisfied is still a problem for ML. For example if there are non-completely-random dropouts, analyzing the data as if each row comes from a different subject will lead to prediction bias. – Frank Harrell Feb 07 '22 at 14:46
  • @J.Delaney that is exactly my concern. Going back to my example, say I want to fit a random forest (which does not impose any parametric distribution). Is there a way to take into account the repeated-measurements characteristic? Clearly, I cannot just fit a random forest as it is provided in many statistical packages. I have a guess for a simplified case where, for some reason, we can believe that there is no auto-correlation: including the time variable (e.g., year) in the set of explanatory variables. Still, this would treat the same units at different time instances as different units. – Plastic Man Feb 07 '22 at 14:54
  • 1
    It depends on what you want your model to predict - assuming that the prices change over time. Do you want it to predict the average price over the period, the price at a given year (when you supply it as a feature) or the price in the future? – J. Delaney Feb 07 '22 at 18:21
  • I am accepting this answer according to vox populi. – Plastic Man Feb 09 '22 at 08:54
4

Markov processes are not only very general ways to analyze longitudinal data with statistical models; they also lend themselves to machine learning. They work because, by modeling transition probabilities conditional on previous states, the records become conditionally independent and may be treated as coming from different independent subjects. One can use discrete- or continuous-time processes, discrete being simpler. The main work comes from post-estimation processing to convert transition probabilities into state occupancy probabilities that are unconditional on the previous state (a.k.a. current status probabilities). See this and other documents in this.
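A minimal sketch of the discrete-time case (the two-state data and the starting distribution below are invented for illustration, not from the answer): estimate transition probabilities from pooled (previous state, current state) pairs, then chain them forward to get unconditional state occupancy probabilities.

```python
import numpy as np

# Toy data: two states coded 0/1; (previous_state, current_state) pairs
# pooled across subjects, which are conditionally independent given the
# previous state under the Markov assumption.
transitions = [(0, 0), (0, 1), (0, 0), (1, 1), (1, 0), (0, 0), (1, 1), (0, 1)]

# Estimate the transition matrix P[i, j] = Pr(next = j | current = i)
counts = np.zeros((2, 2))
for prev, curr in transitions:
    counts[prev, curr] += 1
P = counts / counts.sum(axis=1, keepdims=True)

# Post-estimation step: convert transition probabilities into state
# occupancy probabilities at each time t, starting everyone in state 0.
occupancy = np.array([1.0, 0.0])
for t in range(1, 4):
    occupancy = occupancy @ P
    print(f"t={t}: occupancy = {occupancy}")
```

A non-stationary version would simply estimate a separate transition matrix per time interval (or include time as a covariate in a transition model) and multiply the matrices in order.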

Frank Harrell
3

There are a few good answers here already but I thought it worth noting that the answer to this question can change drastically depending on how the iid assumption is violated. For example, if a univariate dataset is not iid, but is stationary, then many very simple estimation procedures, such as the sample mean, still converge to the appropriate limit.
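To illustrate the stationary-but-dependent case with a toy simulation (an AR(1) process, my own example): the draws are autocorrelated, so clearly not iid, yet the sample mean still converges to the true mean of zero.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a stationary AR(1) process: x_t = 0.7 * x_{t-1} + eps_t.
# Consecutive draws are correlated (not iid), but the process is
# stationary with mean 0, so the sample mean is still consistent.
n = 100_000
eps = rng.normal(size=n)
x = np.empty(n)
x[0] = eps[0]
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + eps[t]

print(np.mean(x))  # close to the true mean of 0
```

The dependence does slow convergence down: the variance of the sample mean is inflated by the long-run autocorrelation, so standard errors that assume iid would be too small here.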

However, if the iid assumption is violated because the data are non-stationary, then life is much more difficult. Note that the very common machine-learning practice of splitting the dataset into a training, test, and sometimes validation set is invalid in the presence of non-stationarity. If this is the difficulty you face, then your best bet is usually to try to find a transformation of the data that is close to stationary (or ergodic) and work with that instead.
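A toy sketch of that last suggestion (a random walk and its first differences, my own illustrative choice): the raw series is non-stationary, so early and late splits look very different, while the differenced series is stationary and its splits look alike.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random walk is non-stationary: its summary statistics drift over
# time, so a "train" split and a later "test" split disagree.
steps = rng.normal(size=10_000)
walk = np.cumsum(steps)

print(walk[:5000].mean(), walk[5000:].mean())  # typically far apart

# First differences are a standard transformation back to stationarity;
# here they recover the iid increments, whose halves do look alike.
diffs = np.diff(walk)
print(diffs[:5000].std(), diffs[5000:].std())  # both close to 1
```

For real data the right transformation is rarely this obvious (differencing, detrending, and deflating by a scale estimate are common candidates), and stationarity of the result should be checked rather than assumed.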

Colin T Bowers
  • Note that a Markov analysis can easily not assume stationarity. – Frank Harrell Feb 08 '22 at 14:35
  • I would have thought that in the given example you would do a training-validation-test split by houses rather than by observation, so that, for example, the test set contains all the data for all test-set houses. Though it may depend on what you are trying to predict – Henry Feb 08 '22 at 19:32
  • 1
    @Henry For sure. The purpose of my answer was to address a general feature of the question that I felt the other answers had missed, rather than specifically dealing with the example. I just wanted to make the point that while lots of machine learning algorithms are valid under some violations of iid, most of them are invalid when you have nonstationarity. – Colin T Bowers Feb 09 '22 at 03:31