What is the reason for maximizing these terms?
If we view it from a different perspective, we are optimizing the joint distribution in both cases. For the generative model the joint distribution factorizes this way:
$$p(D, \theta)=p(\theta)\prod_i p(x_i, y_i|\theta)=p(\theta)\prod_i p(x_i|y_i, \lambda)p(y_i|\pi)$$
where $\theta$ consists of $\lambda$ and $\pi$.
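To make this concrete, here is a minimal sketch (not from the original answer) assuming discrete $x$ and $y$: maximizing $\prod_i p(x_i|y_i, \lambda)p(y_i|\pi)$ then reduces to estimating both factors by counting. The function name `fit_generative` and the toy data are made up for illustration.

```python
import numpy as np
from collections import Counter

def fit_generative(xs, ys):
    """MLE of the generative factorization p(x, y) = p(x | y, lambda) p(y | pi)
    for discrete x and y: both factors are estimated by simple counting."""
    n = len(ys)
    pi = {y: c / n for y, c in Counter(ys).items()}   # class prior p(y | pi)
    lam = {}                                           # class-conditionals p(x | y, lambda)
    for y in pi:
        xs_y = [x for x, yi in zip(xs, ys) if yi == y]
        lam[y] = {x: c / len(xs_y) for x, c in Counter(xs_y).items()}
    return lam, pi

# toy data: x in {0, 1}, y in {0, 1}
xs = [0, 0, 1, 1, 1, 0]
ys = [0, 0, 1, 1, 0, 1]
lam, pi = fit_generative(xs, ys)
print(pi)    # estimated p(y)
print(lam)   # estimated p(x | y)
```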
But for the discriminative model the joint distribution factorizes this way:
$$q(D, \theta, \theta')=p(\theta)p(\theta')\prod_i q(x_i, y_i|\theta, \theta')=p(\theta)p(\theta')\prod_i p(y_i|x_i, \theta)p(x_i|\theta')$$
where $\theta$ is the same parameter we would obtain if we ignored $\theta'$ and did not model the distribution over $X$ at all; $\theta'$ is independent of $\theta$, and $p(x_i|\theta')$ is obtained by marginalizing, $p(x_i|\theta')=\sum_y p(x_i, y|\theta')$. In other words, the distribution of $X$ is still modeled when we train a discriminative model, but it has no effect on $\theta$.
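As a rough illustration of why $\theta'$ does not affect $\theta$, here is a sketch in which $p(y|x,\theta)$ is taken to be a simple logistic model and $p(x|\theta')$ a Bernoulli marginal (both are assumptions made only for this example): the joint log-likelihood splits into a $\theta$-only sum and a $\theta'$-only sum, so changing $\theta'$ only shifts the objective by a constant with respect to $\theta$.

```python
import numpy as np

def discriminative_joint_loglik(x, y, theta, theta_prime):
    """log q(D | theta, theta') = sum_i log p(y_i | x_i, theta) + sum_i log p(x_i | theta').
    The two sums share no parameters, so maximizing over theta ignores theta' entirely."""
    # term depending only on theta: logistic model p(y=1 | x, theta) = sigmoid(theta[0] + theta[1] * x)
    logits = theta[0] + theta[1] * x
    log_p_y_given_x = y * logits - np.log1p(np.exp(logits))
    # term depending only on theta': Bernoulli marginal p(x=1 | theta') = theta_prime
    log_p_x = x * np.log(theta_prime) + (1 - x) * np.log(1 - theta_prime)
    return log_p_y_given_x.sum() + log_p_x.sum()

x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 0, 1, 1, 0, 1])
# changing theta' shifts the total by an amount independent of theta,
# so the argmax over theta is unchanged
print(discriminative_joint_loglik(x, y, theta=(0.0, 1.0), theta_prime=0.5))
print(discriminative_joint_loglik(x, y, theta=(0.0, 1.0), theta_prime=0.3))
```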
For more information please refer to this answer.
Why are these optimization objectives different? What is the meaning of this difference?
Because we apply them in different scenarios. In generative models we assume that the $X$'s are conditionally independent given $Y$, whereas in discriminative models we make no such assumption and all of the $X$'s are normally given. As a consequence, generative models can generally also handle cases with missing data.
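To illustrate the missing-data point, here is a small sketch of a generative model over two binary features with made-up numbers: when a feature is missing we simply marginalize it out, since $\sum_x p(x|y)=1$, and we can still compute a posterior over $y$.

```python
import numpy as np

# class prior p(y) and class-conditionals p(x1 | y), p(x2 | y) for binary features
# (toy numbers, assumed only for illustration)
p_y = np.array([0.6, 0.4])                    # p(y=0), p(y=1)
p_x1_given_y = np.array([[0.8, 0.2],          # p(x1=0|y=0), p(x1=1|y=0)
                         [0.3, 0.7]])         # p(x1=0|y=1), p(x1=1|y=1)
p_x2_given_y = np.array([[0.5, 0.5],
                         [0.1, 0.9]])

def posterior_y(x1=None, x2=None):
    """p(y | observed features) under the generative model p(y) p(x1|y) p(x2|y).
    A missing feature is marginalized out: sum_x p(x|y) = 1, so its factor simply drops."""
    joint = p_y.copy()
    if x1 is not None:
        joint = joint * p_x1_given_y[:, x1]
    if x2 is not None:
        joint = joint * p_x2_given_y[:, x2]
    return joint / joint.sum()

print(posterior_y(x1=1, x2=0))   # both features observed
print(posterior_y(x1=1))         # x2 missing: still a valid posterior over y
```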
References:
1. Koller, D. and Friedman, N., *Probabilistic Graphical Models: Principles and Techniques*, The MIT Press, 2009.
2. Minka, T., *Discriminative models, not discriminative training*, Microsoft Research Technical Report, 2005.