IID in real life /Machine Learning - When is data truly IID?

Question

In a course I am studying at Berkeley, some student said about a particular Dataset "Data is not iid" and the lecturer agreed with him.

https://youtu.be/kl_G95uKTHw?list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX&t=910

So once and for all, I want to understand, for real data, how often is it really iid and how can I decide myself?

For example, let's say that I take 20 million images from Facebook "randomly" (let's say that each image has some unique integer associated with it, and I pick a random integer every time). Is this iid? What if I take a particular user's images, and then his friends' images, etc., in a BFS kind of way? Is this iid now?

While I understand the mathematical definition, I don't have the underlying distribution, so it's very unclear to me. My gut feeling, by the way, is that a real-life dataset is almost never truly iid.

By construction, all data that are independently and randomly sampled from a well-defined population are IID. This suggests your question needs additional context to narrow its scope. — whuber, Apr 10 '18 at 20:31
@whuber I have given two very specific data sets in the question itself, and you have another specific dataset in the lecture. IID? Not IID? why? — Yoni Keren, Apr 11 '18 at 17:18
Related https://stats.stackexchange.com/questions/213464/on-the-importance-of-the-i-i-d-assumption-in-statistical-learning — Tim, Apr 11 '18 at 19:04
Let me try again: *data* are never IID, but your *model* of the data can be. This doesn't raise an objection to your motivation, but it suggests restating the question along the lines of "what justifies the use of IID models for statistical analysis"? This is a broad inquiry, though, so in keeping with the aims of all SE sites, I encourage you to narrow it by describing a particular *analytical setting.* That includes more than a vague description of data (your "BFS kind of way" of selecting images, whatever that might mean): it extends to your objectives and planned analysis as well. — whuber, Apr 12 '18 at 14:38
@whuber Just so we are clear: You are saying that IID is a property of random variables, and since data is specific (for example a specific number of images and their labels) you are saying "data are never IID". Is that right? — Yoni Keren, Apr 12 '18 at 15:17
That's one valid interpretation, but I think the issue ranges further than that. You can always try to view the data as a realization of a (multivariate) random variable, but the question of independence is simultaneously a matter of exactly *how* you model the data as well as being an empirical issue, because you can (usually) evaluate to what extent the data conform to the IID assumption. That evaluation typically involves information external to the data themselves: it considers sample selection and measurement, among other things. — whuber, Apr 12 '18 at 16:26
See also https://stats.stackexchange.com/questions/344794/exchangeability-and-iid-random-variables, https://stats.stackexchange.com/questions/240445/are-we-ignoring-implications-by-de-finettis-theorem-on-regression/336685#336685, https://stats.stackexchange.com/questions/445453/realistically-does-the-i-i-d-assumption-hold-for-the-vast-majority-of-supervis/445477#445477 — kjetil b halvorsen, Mar 11 '20 at 00:17

Aksakal · Answer 1 · 2018-04-11T20:14:45.337

To answer the question, you need to understand which distribution the lecturer is talking about in the video you linked to. For example, he talks about the "distribution of the labels conditional on the image": $$\pi_\theta(\texttt u|\texttt o)$$ where $\theta$ denotes the (hyper)parameters of the NN, $\texttt u$ the labels and $\texttt o$ the images. The output $\texttt u$ could be an action of a self-driving car, etc.

Here we're talking about a univariate distribution of scalar labels $\texttt u$. The fact that our inputs $\texttt o$ are called "tensors" in ML lingo doesn't make this a multivariate or joint distribution. We're not talking about the distribution of the inputs; we're conditioning our output distribution on the inputs (data).

So, in this case, when we talk about IID data, we mean that this distribution should not depend on where the image falls in the sequence: it should be the same $\pi_\theta$ for any input/output pair $o_i,u_i$. Is this a reasonable assumption? I would argue that it depends on the dataset.

Suppose that we're talking about color labels on objects. If your images were tagged by both color-blind and non-color-blind people, then at the very least $\pi_\theta$ may not be the same for all images, and the "ID" (identically distributed) drops out of IID. Here, I'm assuming you don't know whether your tagger was color-blind or not. If the sample is random, then the first "I" (independent) probably still holds.
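To make the color-label example concrete, here is a minimal simulation sketch. It assumes a hypothetical two-color task and made-up error rates (2% for a typical tagger, 40% for a color-blind one); the names `tag` and `simulate` are illustrative and do not come from the lecture. It only shows that the distribution of the label given the image depends on who did the tagging.

```python
import random

def tag(true_color, error_rate, rng):
    """Return the tagger's label: the wrong color with probability error_rate."""
    other = "green" if true_color == "red" else "red"
    return other if rng.random() < error_rate else true_color

def simulate(n_images, error_rate, seed):
    """Fraction of images that one kind of tagger labels incorrectly."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n_images):
        true_color = rng.choice(["red", "green"])
        if tag(true_color, error_rate, rng) != true_color:
            wrong += 1
    return wrong / n_images

# Assumed error rates, chosen only for illustration.
err_normal = simulate(10_000, error_rate=0.02, seed=0)
err_colorblind = simulate(10_000, error_rate=0.40, seed=1)
print("typical tagger error rate:    ", err_normal)
print("color-blind tagger error rate:", err_colorblind)
```

If the tagger's identity is not recorded, both groups end up pooled in one dataset, and a single $\pi_\theta$ has to absorb two quite different labeling behaviors.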

On the other hand, if our data set consists of "cat images", then the IID assumption sounds more reasonable: almost anybody who's asked to tag the images will probably be able to recognize a cat in an image as well as anyone else.

To summarize, it's quite easy to break the IID assumption if you're not careful, e.g. if the sample is not random. However, in many cases you can build a data set that satisfies the IID assumption. I would argue that you can construct a sample from the NIST database of handwritten digits and assume that the labels are IID.
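As a rough illustration of how a non-random sample can break the assumption, the sketch below contrasts the two sampling schemes from the question on a toy friendship graph. Everything in it is an assumption made for the example: the graph is synthetic, users are split into 10 hypothetical "communities", and friendships are mostly within a community. It is not a model of Facebook.

```python
import random
from collections import deque

N_USERS = 1000
N_COMMUNITIES = 10
SAMPLE_SIZE = 50

rng = random.Random(0)

# Hypothetical attribute: each user belongs to one of 10 communities.
community = [u % N_COMMUNITIES for u in range(N_USERS)]
members = {c: [u for u in range(N_USERS) if community[u] == c]
           for c in range(N_COMMUNITIES)}

# Synthetic friendships: mostly inside the community, rarely across.
friends = {u: set() for u in range(N_USERS)}
for u in range(N_USERS):
    for v in rng.sample(members[community[u]], 5):
        if v != u:
            friends[u].add(v)
            friends[v].add(u)
    if rng.random() < 0.05:  # occasional friendship with an arbitrary user
        v = rng.randrange(N_USERS)
        if v != u:
            friends[u].add(v)
            friends[v].add(u)

def random_sample(k):
    """Pick k distinct users uniformly at random."""
    return rng.sample(range(N_USERS), k)

def bfs_sample(start, k):
    """Collect the first k users reached by breadth-first search from start."""
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue and len(order) < k:
        u = queue.popleft()
        for v in sorted(friends[u]):
            if v not in seen:
                seen.add(v)
                order.append(v)
                queue.append(v)
                if len(order) == k:
                    break
    return order

def share_in_community(sample, c):
    """Fraction of sampled users that belong to community c."""
    return sum(community[u] == c for u in sample) / len(sample)

random_share = share_in_community(random_sample(SAMPLE_SIZE), community[0])
bfs_share = share_in_community(bfs_sample(0, SAMPLE_SIZE), community[0])
print("share from user 0's community, random sample:", random_share)
print("share from user 0's community, BFS sample:   ", bfs_share)
```

Under these assumptions the BFS sample is dominated by the starting user's community, so knowing one sampled user tells you a lot about the next one, while the random sample reflects the population as a whole.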

Where does $\pi_\theta$ come from?

I see a lot of confusion in the comments as to the nature of the probability distribution $\pi_\theta$ and even its role in the image recognition problem. Let's take a closer look at how a statistical learning problem is set up. Suppose that our goal is to recognize handwritten digits.

If our goal were to recognize only the digits in the images of the MNIST database, this would be a pure IT problem. Say there are 70,000 images of single digits in MNIST; then we'd create a map with 70,000 entries. The key in the map (dictionary) is the image, and the value is one of the 10 digits. Recognition reduces to finding the input image among the keys by bitwise comparison and pulling the value. This is the case when we treat the MNIST database as the population, i.e. every possible outcome.
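The lookup idea can be sketched in a few lines. The two tiny 2x2 "images" and their labels below are made up for illustration and stand in for the 70,000 MNIST entries; `database` and `recognize` are hypothetical names.

```python
# Toy stand-in for the database: raw image bytes map to a digit label.
database = {
    bytes([0, 1, 0, 1]): 1,
    bytes([1, 1, 1, 1]): 8,
}

def recognize(image):
    """Exact lookup by byte content; returns None for an image not in the map."""
    return database.get(image)

seen = recognize(bytes([0, 1, 0, 1]))    # an image stored in the map
unseen = recognize(bytes([0, 1, 1, 1]))  # differs from a stored image by one pixel
print(seen, unseen)
```

The lookup is perfect on stored images and has no answer at all for anything else, even an image that differs by a single pixel.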

Obviously, this is not a useful approach, because we want to recognize the digits in any picture, not just the ones in the MNIST database. Hence, the MNIST database is just a sample from a population of infinite size. To appreciate the difference from the previous approach, consider the following image from MNIST. Which digit is it, three or five?

[Image: a handwritten digit from MNIST that could be read as either a three or a five]

The points are:

- MNIST is just a sample from the population
- the population includes all possible ways of scribbling digits by hand
- all possible ways of scribbling include difficult cases like the one above
- they also include cases where it is truly impossible to tell which digit was meant, e.g. one scribbled by a child or by a person with a motor impairment

Consider how this data set was constructed and you'll see where the disturbances (errors) come from:

- NIST collected writing samples from hundreds of US Census employees and high school students. Each writer writes digits differently from anybody else.
- Every time we write a digit, it's slightly different from the previous time.
- Camera placement, or slight differences in how the form is placed into a scanner
- The exact cropping of the image from the scan or photo
- The lighting will be different each time a photocopy or camera shot is made.
- The sensor in the copier or camera has thermal noise in its pixels.
- Errors while tagging images
- etc.

All these factors lead to variations and distortions of the ideal digit image every time we take a picture of it. When you have an infinite population shaped by countless factors that affect the final outcome, probabilistic approaches start to make sense.

This is why statistical learning sounds like a reasonable idea to try in image recognition. It's not the only possible way to solve the problem, but it has been very successful in recent years.

$\pi_\theta(u|o)$, like you said, is the distribution of the network's output, and is not part of the dataset. This is not what the lecturer or the student were talking about at the point in the video I linked in the question (in fact, I don't recall anyone being interested in whether a NN's output is IID or not; it's always about the dataset). The dataset is a set of tuples (o,u) where o is the observation and u is the human's response. Aside from that, you wrote a lot about NIST, but I don't see any conclusion as to whether either of those datasets is IID, nor a way to decide that. — Yoni Keren, Apr 12 '18 at 08:13
@YoniKeren, proving that something is IID is difficult, if not impossible. You make the assumption and go with it, maybe running some tests to show that the assumption is reasonable in some ways. Proving independence empirically is a lost cause. — Aksakal, Apr 12 '18 at 15:45
