I am having trouble developing the intuition behind the difference between a standard generative Markov random field (MRF) and its discriminative counterpart, the conditional random field (CRF).
Here is what I think I have understood so far:
An MRF aims to model the full joint distribution. So, given the observations $X$ and the labels $Y$ we want to predict, we aim to model:
$$ P(X, Y) = P(X|Y) P(Y) $$
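Writing this out over the individual sites (my own notation: $N$ sites, with observation $x_i$ and label $y_i$ at site $i$), the likelihood term is a joint distribution over all the observations at once:
$$ P(X, Y) = P(x_1, \dots, x_N \mid y_1, \dots, y_N)\, P(y_1, \dots, y_N) $$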
My first confusion is that most online references talk about the difficulty of modelling the complex interactions within the input $X$. Is this related to the term $P(X|Y)$, i.e. the data association term? Taking a concrete example, let us say I observe an image and every pixel in the image is an MRF site. I am interested in labelling every pixel as foreground or background. Now, is the problem with the generative model related to the difficulty of modelling $P(X|Y)$ in this case? So, in this example, would this involve modelling the correlations between the different pixels of the observed image? I am at a loss as to why modelling this joint distribution is so difficult.
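For reference, the only way I have seen this made tractable is to assume the observations are conditionally independent given the labels and to put a simple pairwise (Potts-style) smoothness prior on the labels; I may be misrepresenting the standard setup, so please correct me if so:
$$ P(X \mid Y) = \prod_{i=1}^{N} P(x_i \mid y_i), \qquad P(Y) = \frac{1}{Z} \exp\Big(-\beta \sum_{(i,j) \in \mathcal{E}} \mathbf{1}[y_i \neq y_j]\Big), $$
where $\mathcal{E}$ is the set of neighbouring pixel pairs.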
Now, moving on to CRFs: they aim to model the conditional probability distribution directly, i.e. $P(Y|X=x)$. Again, I have no intuition as to why this should be an easier problem than modelling $P(X, Y)$. I can come up with some explanations, e.g. that we can somehow use the observed $X=x$ to our advantage and simplify the modelling, but I have not been able to convince myself.
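For concreteness, here is the CRF form as I understand it (again my own notation, with unary potentials $\phi_i$ and pairwise potentials $\psi_{ij}$); the key point seems to be that the partition function is conditioned on the observed $x$:
$$ P(Y \mid X = x) = \frac{1}{Z(x)} \exp\Big( \sum_{i} \phi_i(y_i, x) + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j, x) \Big), \qquad Z(x) = \sum_{Y'} \exp\Big( \sum_{i} \phi_i(y'_i, x) + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y'_i, y'_j, x) \Big), $$
so the potentials can depend on arbitrary features of the whole image $x$ without ever assigning a probability to $x$ itself. Is this the right way to see why the conditional model is easier?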
If someone can give an intuitive explanation and an example, I would be really grateful.