What is the reason for maximizing these terms?
If we view it from a different perspective, we are optimizing the joint distribution in both cases. For the generative model the joint distribution factorizes this way:
$$p(D, \theta)=p(\theta)\prod_i p(x_i, y_i|\theta)=p(\theta)\prod_i p(x_i|y_i, \lambda)p(y_i|\pi)$$
where $\theta$ consists of $\lambda$ and $\pi$.
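To make this concrete, here is a minimal sketch (not from the original answer) assuming discrete $x$ and $y$: maximizing $\prod_i p(x_i|y_i, \lambda)p(y_i|\pi)$ then reduces to estimating both factors by counting. The function name `fit_generative` and the toy data are made up for illustration.

```python
import numpy as np
from collections import Counter

def fit_generative(xs, ys):
    """MLE of the generative factorization p(x, y) = p(x | y, lambda) p(y | pi)
    for discrete x and y: both factors are estimated by simple counting."""
    n = len(ys)
    pi = {y: c / n for y, c in Counter(ys).items()}   # class prior p(y | pi)
    lam = {}                                           # class-conditionals p(x | y, lambda)
    for y in pi:
        xs_y = [x for x, yi in zip(xs, ys) if yi == y]
        lam[y] = {x: c / len(xs_y) for x, c in Counter(xs_y).items()}
    return lam, pi

# toy data: x in {0, 1}, y in {0, 1}
xs = [0, 0, 1, 1, 1, 0]
ys = [0, 0, 1, 1, 0, 1]
lam, pi = fit_generative(xs, ys)
print(pi)    # estimated p(y)
print(lam)   # estimated p(x | y)
```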
But for the discriminative model the joint distribution factorizes this way:
$$q(D, \theta, \theta')=p(\theta)p(\theta')\prod_i q(x_i, y_i|\theta, \theta')=p(\theta)p(\theta')\prod_i p(y_i|x_i, \theta)p(x_i|\theta')$$
where $\theta$ is the same parameter we would obtain if we ignored $\theta'$ and did not model the distribution over $X$ at all; $\theta'$ is independent of $\theta$, and $p(x_i|\theta')$ is obtained by marginalizing, $p(x_i|\theta')=\sum_y p(x_i, y|\theta')$. In other words, the distribution of $X$ is still modeled when we train a discriminative model, but it has no effect on $\theta$.
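As a rough illustration of why $\theta'$ does not affect $\theta$, here is a sketch in which $p(y|x,\theta)$ is taken to be a simple logistic model and $p(x|\theta')$ a Bernoulli marginal (both are assumptions made only for this example): the joint log-likelihood splits into a $\theta$-only sum and a $\theta'$-only sum, so changing $\theta'$ only shifts the objective by a constant with respect to $\theta$.

```python
import numpy as np

def discriminative_joint_loglik(x, y, theta, theta_prime):
    """log q(D | theta, theta') = sum_i log p(y_i | x_i, theta) + sum_i log p(x_i | theta').
    The two sums share no parameters, so maximizing over theta ignores theta' entirely."""
    # term depending only on theta: logistic model p(y=1 | x, theta) = sigmoid(theta[0] + theta[1] * x)
    logits = theta[0] + theta[1] * x
    log_p_y_given_x = y * logits - np.log1p(np.exp(logits))
    # term depending only on theta': Bernoulli marginal p(x=1 | theta') = theta_prime
    log_p_x = x * np.log(theta_prime) + (1 - x) * np.log(1 - theta_prime)
    return log_p_y_given_x.sum() + log_p_x.sum()

x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 0, 1, 1, 0, 1])
# changing theta' shifts the total by an amount independent of theta,
# so the argmax over theta is unchanged
print(discriminative_joint_loglik(x, y, theta=(0.0, 1.0), theta_prime=0.5))
print(discriminative_joint_loglik(x, y, theta=(0.0, 1.0), theta_prime=0.3))
```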
For more information please refer to this answer.
Why are these optimization objectives different? What is the meaning of this difference?
Because we apply them in different scenarios. In generative models we assume that the $X$'s are conditionally independent given $Y$, whereas in discriminative models we make no such assumption and all of the $X$'s are normally given. As a consequence, generative models can generally also handle cases with missing data.
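To illustrate the missing-data point, here is a small sketch of a generative model over two binary features with made-up numbers: when a feature is missing we simply marginalize it out, since $\sum_x p(x|y)=1$, and we can still compute a posterior over $y$.

```python
import numpy as np

# class prior p(y) and class-conditionals p(x1 | y), p(x2 | y) for binary features
# (toy numbers, assumed only for illustration)
p_y = np.array([0.6, 0.4])                    # p(y=0), p(y=1)
p_x1_given_y = np.array([[0.8, 0.2],          # p(x1=0|y=0), p(x1=1|y=0)
                         [0.3, 0.7]])         # p(x1=0|y=1), p(x1=1|y=1)
p_x2_given_y = np.array([[0.5, 0.5],
                         [0.1, 0.9]])

def posterior_y(x1=None, x2=None):
    """p(y | observed features) under the generative model p(y) p(x1|y) p(x2|y).
    A missing feature is marginalized out: sum_x p(x|y) = 1, so its factor simply drops."""
    joint = p_y.copy()
    if x1 is not None:
        joint = joint * p_x1_given_y[:, x1]
    if x2 is not None:
        joint = joint * p_x2_given_y[:, x2]
    return joint / joint.sum()

print(posterior_y(x1=1, x2=0))   # both features observed
print(posterior_y(x1=1))         # x2 missing: still a valid posterior over y
```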
References:
1. Koller, D. and Friedman, N., *Probabilistic Graphical Models: Principles and Techniques*, The MIT Press, 2009.
2. Minka, T., *Discriminative models, not discriminative training*, Microsoft Research Technical Report, 2005.