In this question, it was stated that the i.i.d. assumption for data takes the form $$\begin{aligned} &(X_i, y_i) \sim P(X, y), \quad \forall i = 1, \dots, N \\ &(X_i, y_i) \text{ independent of } (X_j, y_j), \quad \forall i \neq j \in \{1, \dots, N\} \end{aligned}$$ I understand the definition of i.i.d. and the concepts behind it; however, it is still rather unclear to me how this assumption applies.
To illustrate my confusion with an example, say we are looking at a classification problem, where $X$ is the input feature and $y$ is the label.
When we generate $n$ samples for training, I think of it as drawing each $(X_i, y_i)$ from the joint distribution of $X$ and $y$. How is the concept of independent and identically distributed samples relevant here, then? Aren't $(X_i, y_i)$, for $i = 1, \dots, n$, all being drawn from the same distribution of $X$ and $y$?
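
To make the sampling setup concrete, here is a minimal sketch under assumptions that are not part of the original question: a toy joint distribution with $P(y = 1) = 0.5$ and $X \mid y \sim \mathcal{N}(\pm 2, 1)$, and hypothetical helper names `draw_iid`, `draw_dependent`, and `same_label_rate`. Both samplers give every pair the same marginal distribution $P(X, y)$; only the first draws the pairs independently of one another.

```python
import random


def draw_iid(n, seed=0):
    """Draw n pairs (x, y), each independently from the same toy joint P(X, y)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0            # assumed: P(y = 1) = 0.5
        x = rng.gauss(2.0 if y == 1 else -2.0, 1.0)   # assumed: X | y ~ N(+/-2, 1)
        samples.append((x, y))
    return samples


def draw_dependent(n, seed=0, stay=0.9):
    """Draw n pairs with the same marginal P(X, y), but with dependent labels.

    The label follows a symmetric two-state Markov chain that keeps its
    previous value with probability `stay`. Starting from a uniform label,
    every y_i is still 0 or 1 with probability 0.5, so the pairs are
    identically distributed but not independent.
    """
    rng = random.Random(seed)
    samples = []
    y = 1 if rng.random() < 0.5 else 0
    for _ in range(n):
        x = rng.gauss(2.0 if y == 1 else -2.0, 1.0)
        samples.append((x, y))
        if rng.random() > stay:                       # flip with probability 1 - stay
            y = 1 - y
    return samples


def same_label_rate(samples):
    """Fraction of consecutive pairs that share the same label."""
    labels = [y for _, y in samples]
    matches = sum(a == b for a, b in zip(labels, labels[1:]))
    return matches / (len(labels) - 1)
```

Under these assumptions, `same_label_rate(draw_iid(5000))` should come out near 0.5, while `same_label_rate(draw_dependent(5000))` should come out near the `stay` probability, even though a single pair looked at in isolation has the same distribution in both cases.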