
Most papers define learning as maximizing the expectation of some function.

[Formula image: the learning objective, written as the maximization of an expectation.]

  • Is expectation maximization applicable to most deep learning models in the literature?

  • Can learning be generalized to the EM algorithm?

  • Most of the models can be treated as probability functions, but when is this not the case?

  • Can every deep learning model be expressed as part of an EM training algorithm?

Iordanis
  • EM algorithm is generally used when there is missing information: either an observable variable with some values not available, or a model with what's called a latent (unobservable) variable. Otherwise, direct estimation methods are faster. – Cagdas Ozgenc Feb 04 '20 at 20:25
  • Right, but for a DNN we train in batches; wouldn't that make it equivalent? In all the papers I read, they define DNN learning as an EM step, e.g. https://arxiv.org/pdf/1706.06083.pdf section 2 – Iordanis Feb 04 '20 at 20:29

3 Answers


(The original version of this post, the text of which I kept below the line for reference purposes, generated a lot of dispute and back and forth, mostly around questions of interpretation and ambiguity, so I have updated it with a more direct answer.)

The OP seems to be asking:

Are Deep Learning models a special case of the EM algorithm?

No, they aren't. Deep Learning models are general-purpose function approximators, which can use different types of objective functions and training algorithms, whereas the EM algorithm is a very specific algorithm in terms of both training approach and objective function.

From this perspective, it is possible (although not very common) to use Deep Learning to emulate the EM algorithm. See this paper.

Most of the (deep learning) models can be treated as probability functions, but when is this not the case?

Probability distribution functions have to satisfy certain conditions such as summing up to one (the conditions are slightly different if you consider probability density functions). Deep Learning models can approximate functions in general - i.e. a larger class of functions than those that correspond to probability distributions and densities.

When do they not correspond to probability densities and distributions? Any time the function they approximate doesn't satisfy the axioms of probability theory. For example, a network whose output layer has $\tanh$ activations can take negative values, and therefore doesn't satisfy the conditions for being a probability distribution or density.
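
To make this concrete, here is a minimal numeric sketch in plain R (no deep learning framework; the weights are arbitrary random draws standing in for a trained network, so the exact numbers are illustrative only): the same logits read out through $\tanh$ versus through a softmax, where only the latter satisfies the conditions for a probability distribution.

set.seed(1)
x  <- rnorm(5)                       # an arbitrary input vector
W1 <- matrix(rnorm(15), 3, 5)        # hypothetical hidden-layer weights
h  <- tanh(W1 %*% x)                 # hidden activations
W2 <- matrix(rnorm(9), 3, 3)         # hypothetical output-layer weights
z  <- W2 %*% h                       # output logits

tanh_out    <- tanh(z)               # can be negative and need not sum to 1
softmax_out <- exp(z) / sum(exp(z))  # nonnegative and sums to 1

tanh_out;    sum(tanh_out)
softmax_out; sum(softmax_out)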

There are three ways that a deep learning model can correspond to a probability distribution $P(x)$:

  • Use Deep Learning to learn a probability distribution directly. That is, have your neural network learn the shape of $y=P(x)$. Here $P(x)$ satisfies the conditions for being a probability density or distribution.

  • Have a Deep Learning model learn a general function $y=f(x)$ (that doesn't satisfy the conditions for being a probability distribution). After training the model, we then make assumptions about the probability distribution $P(y|x)$, e.g. that the errors are normally distributed, and then use simulations to sample from that distribution. See here for an example of how that can be done; a small numeric sketch of this approach also follows this list.

  • Have a Deep Learning model learn a general function $y=f(x)$ (that doesn't satisfy the conditions for being a probability distribution), and then interpret the output of the model as representing $P(y \vert x) \sim \delta[y-f(x)]$, with $\delta$ being the Dirac delta function. There are two issues with this last approach. First, there is some debate as to whether the Dirac delta constitutes a valid distribution function: it is popular in the signal processing and physics communities, but not so much among the probability and statistics crowd. Second, it doesn't provide any useful information from a probability and statistics point of view, since it gives no way of quantifying the uncertainty of the output, which defeats the purpose of using a probabilistic model in practice.
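
As a concrete illustration of the second approach, here is a minimal sketch (the assumptions are mine, chosen for illustration: a smoothing spline stands in for the trained network, and the errors are taken to be homoscedastic Gaussian):

set.seed(2)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

fit   <- smooth.spline(x, y)            # stand-in for a trained network y = f(x)
f     <- function(x0) predict(fit, x0)$y
sigma <- sd(y - f(x))                   # estimated residual spread

# sample from the assumed P(y|x) = N(f(x), sigma^2) at x = 5
draws <- rnorm(1000, mean = f(5), sd = sigma)
quantile(draws, c(0.025, 0.5, 0.975))   # approximate predictive interval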


Is expectation maximization applicable to most deep learning models in the literature?

Not really. There are several key differences:

  • Deep Learning models work by minimizing a loss function. Different loss functions are used for different problems, and the training algorithm then focuses on the best way to minimize the particular loss function that is suitable for the problem at hand. The EM algorithm, on the other hand, is about maximizing a likelihood function. The issue here isn't simply that we are maximizing instead of minimizing (both are optimization problems, after all, and the two can coincide; see the sketch after this list), but the fact that EM dictates a specific function to be optimized, whereas Deep Learning can use any loss function as long as it is compatible with the training method (usually some variant of Gradient Descent).
  • EM estimates the parameters of a statistical model by maximizing the likelihood of those parameters. So we choose the model beforehand (e.g. a Gaussian with mean $\mu$ and variance $\sigma^2$), and then use EM to find the best values of those parameters (e.g. which values of $\mu$ and $\sigma^2$ best fit our data). Deep Learning models are non-parametric: they don't make any assumptions about the shape or distribution of the data. Instead they are universal approximators which, given enough neurons and layers, should be able to fit any function.
  • Closely related to the previous point is the fact that Deep Learning models are just function approximators, that can approximate arbitrary functions without having to respect any of the constraints that are imposed on a probability distribution function. An MLE model, or even a non-parametric distribution estimator for that matter, will be bound by the laws of probability and the constraints imposed on probability distributions and densities.
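
On the first point, as the comments below note, minimizing a loss and maximizing a likelihood can coincide. A minimal sketch, assuming a linear model with Gaussian noise (the setup is mine, chosen for illustration):

set.seed(3)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

sq_loss   <- function(b) sum((y - b[1] - b[2] * x)^2)
negloglik <- function(b) -sum(dnorm(y, mean = b[1] + b[2] * x,
                                    sd = exp(b[3]), log = TRUE))

optim(c(0, 0), sq_loss)$par            # minimize squared error
optim(c(0, 0, 0), negloglik)$par[1:2]  # maximize Gaussian likelihood

# both recover approximately the true coefficients (2, 3)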

Now certain types of deep learning models can be considered equivalent to an MLE model, but what is really happening under the hood is that we are specifically asking the neural network to learn a probability distribution, as opposed to a more general arbitrary function, by choosing certain activation functions and adding constraints on the outputs of the network. All that means is that they can act as MLE estimators, not that they are special cases of the EM algorithm.

Is the learning considered to be part of the EM algorithm?

I would say that it is the other way around. It is possible that someone, somewhere, has come up with a Deep Learning model that is equivalent to the EM algorithm, but that would make the EM algorithm a special case of Deep Learning, not the other way around, since for this to work, you would have to use Deep Learning + additional constraints to make the model mimic EM.


In response to the comments:

  1. "Minimizing and maximizing can be the same thing.": Agreed, they can be (almost) equivalent - which what I specified in my response - it is NOT about maximizing vs. minimizing, it is about having to use a specific objective function dictated by MLE, vs. being able to use just about any loss function compatible with backpropagation. "The loss function in this case is the expectation of E p_theta(x|z) where p_theta is the deep neural network model." - again this is possible, but as I point out later, this would make MLE a special case of Deep Learning, not the other way around.
  2. "Parameters in the case of the deep neural networks are the model weights. I don't think your explanation is correct" - the explanation is correct, but you are also correct that the word parametric is ambiguous, and is used in different ways in different sources. Model weights are parameters in the general sense, but in the strict sense of parametric vs. non-parametric models, they aren't the same as the parameters in a parametric model. In parametric model, the parameters have a meaning, they correspond to the mean, the variance, the seasonality in a time series, etc...whereas the parameters in a Deep Learning model don't having any meaning, they are jus the most convenient way for the network to store information. That is why neural networks are criticized for being black box - there is no established way of interpreting the meaning of the weights. Another way you can think of it is in terms of total parameters vs. number of effective parameters: In a truly parametric model that is estimated using EM, the number of fixed parameters is the same as the number of effective parameters. In a neural network, the number of effective parameters may change during training (by reducing weights to zero, or by using drop out, etc....), even if the total number of parameters is defined before hand and is fixed. Which brings us to the real difference between the two approaches: A fixed number of effective parameters means that the shape of the distribution or function is decided before hand, whereas changing effective parameters allows for models to approximate more general, and eventually arbitrary functions, per the universal approximation theorem.
  3. "DNN also try to learn the probability distribution of the data in order to make predictions." only if we configure and constrain them to learn probability distributions. But they can also learn other things besides probability distributions. To this how this is possible, you can simply specify a multi-class neural network, with 4 outputs, with sigmoid outputs instead of softmax outputs, and train it to learn cases where the output is [1, 1, 1, 1]. Since the sum of the outputs is > 1, this is not a probability distribution, but just an arbitrary mapping of the inputs to classes. More generally Neural Networks/Deep Learning models are just general purpose function approximators, which can be configured to the specific case of estimation probability distribution functions, but they are not limited to that case. In computer vision for example, the are often used as filters and segmentation devises, instead of as classifiers or distribution estimators.

As Cagdas Ozgenc points out, just about any supervised learning problem or function approximation problem can be recast as an MLE.

Skander H.
  • 1. Minimizing and maximizing can be the same thing. e.g. if you are minimizing the negative of a function then you are maximizing that function so I don't agree with you there. The loss function in this case is the expectation of E p_theta(x|z) where p_theta is the deep neural network model. – Iordanis Feb 04 '20 at 19:37
  • 2. Parameters in the case of the deep neural networks are the model weights. I don't think your explanation is correct. – Iordanis Feb 04 '20 at 19:38
  • 3. DNN also try to learn the probability distribution of the data in order to make predictions. – Iordanis Feb 04 '20 at 19:39
  • I think the EM training, e.g. calculating expectation and then maximizing that expectation (or minimizing the negative of it) is somehow rooted in the learning objective of all DNN but I am not sure about it or have enough evidence to support that. – Iordanis Feb 04 '20 at 19:41
  • @Iordanis see edit to answer in response to your comments. – Skander H. Feb 04 '20 at 20:14
  • 1. That is a good argument and what my concern is, but I can't find examples that training of a DNN can not be expressed as a MLE problem. EM can be applied to any probability function, and the objective (loss function) can be expressed as a probability function. To address 3. In that case you are learning 4 probability distributions. But still you are learning probability distributions. – Iordanis Feb 04 '20 at 20:22
  • @Iordanis just take the last example I provided, but with the output targets being [2,2,2,2] instead of [1,1,1,1] - and you will have a case which is not an MLE problem and which isn't a case of learning a distribution. – Skander H. Feb 04 '20 at 20:33
  • This answer is flawed in so many ways. Let a network just approximate a deterministic function $y = f(x)$, then you can always think of it as modeling the following probability distribution $p(y|x) \sim \delta[y-f(x)]$, where $\delta$ is the dirac-delta. – Cagdas Ozgenc Feb 04 '20 at 20:36
  • @CagdasOzgenc by your logic, any supervised learning problem, function approximation problem, or smoothing problem can be cast as an MLE. Historically, AFAIK, neural networks started out as biological models of brain neurons and then as function approximators, and only later were they reconciled with the theoretical constraints of statistical learning theory. Even today, it is still very difficult to extract confidence intervals and uncertainty measures from a neural network, which is very straightforward if you are using parametric statistical methods. – Skander H. Feb 04 '20 at 20:46
  • @CagdasOzgenc that being said, I will append your concern to my answer. – Skander H. Feb 04 '20 at 20:47
  • Neural networks are parametric models. http://mlss.tuebingen.mpg.de/2015/slides/ghahramani/gp-neural-nets15.pdf. And I didn’t write anything about MLE – Cagdas Ozgenc Feb 04 '20 at 21:09
  • @CagdasOzgenc see my comment on how the term parametric is itself ill-defined and the difference between all parameters and effective parameters, as well as parameter interpretability. The link you posted to mentions only Bayesian Deep Learning models, not Deep Learning in general. – Skander H. Feb 04 '20 at 21:14
  • As pointed out by others, minimizing a loss is equivalent to maximizing its negative; e.g. minimizing squared loss is equivalent to maximizing a Gaussian likelihood. You can define a neural network in terms of some likelihood function to maximize; this is often done. Moreover, a probabilistic model is not constrained in a way that prevents it from being a function approximator; e.g. Gaussian processes are probabilistic function approximators. An example of both points: Bayesian neural networks, where you define some likelihood and use priors for the parameters. – Tim Feb 04 '20 at 21:18
  • @Iordanis see my latest update (I've wasted waaaay tooo much time on this thread :-( ....) – Skander H. Feb 05 '20 at 00:17
  • Sigmoid outputs can still very easily map onto probabilities. In fact they are the natural choice when you want to assign probabilities to multiple binary events. The fact that the numbers don't add up to 1 doesn't mean that each of them individually isn't a probability. Summing to 1 is only a requirement for distributions over mutually exclusive possibilities. I agree that DNNs don't necessarily have a probabilistic interpretation, but as long as you're optimizing cross-entropies between binary labels and outputs between 0-1, a probabilistic interpretation is never far away. – Ruben van Bergen Feb 05 '20 at 00:29
  • @RubenvanBergen is anybody actually reading the post in detail? I don't say anywhere in my answer that DNNs can't or shouldn't be interpreted as probability densities, only that they are *more general* than probability models. The OP was asking if DNNs are a special case of EM, and they are not. – Skander H. Feb 05 '20 at 00:34

In short, no.

Expectation maximization is a technique for solving statistical problems that consist of an "easy" maximization (if some latent variables were known) and an "easy" expectation calculation on the log-likelihood (if the parameters were known). However, the "how" and "why" of the expectation and maximization steps require ingenuity and are model-specific. So while it's possible that some models from deep learning could be posed in a fashion that might leverage EM, EM is not a generic optimization technique, even for classical statistical models.

EM as minorization/maximization

However, EM can be considered a member of a class of algorithms known as minorization-maximization (MM) algorithms. These algorithms find a surrogate that is a lower bound for the objective function everywhere, but tight at at least one point. The surrogate is maximized (it should be constructed so that it's easier to maximize than the original function), and the process is repeated. Finding such a surrogate also requires ingenuity or structure, but it can be thought of as a generic technique in optimization. So in that sense, the theory behind EM is broadly applicable.
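
As a small illustration of the surrogate idea (my own example, in the mirror-image majorize-minimize form): to minimize $\sum_i |x_i - m|$ over $m$, the quadratic $\sum_i (x_i - m)^2 / (2\,|x_i - m_k|) + c$ lies above the objective and touches it at the current iterate $m_k$; minimizing the surrogate gives a weighted-mean update, and the iterates approach the sample median.

set.seed(4)
x <- rexp(101)                      # arbitrary data
m <- mean(x)                        # initial iterate
for (k in 1:50) {
  w <- 1 / pmax(abs(x - m), 1e-12)  # surrogate weights (guard against /0)
  m <- sum(w * x) / sum(w)          # minimize the quadratic surrogate
}
c(mm = m, median = median(x))       # essentially equal after a few iterations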

A quick search of Google Scholar reveals some relevant literature, though the approach seems to be much less commonly used than stochastic gradient methods, which do not attempt to construct a surrogate.

Andrew M
  • I don't understand why you draw a distinction between EM and Gradient Descent. They seem to me 2 different things. We can take the gradient of the Expectation step and for the maximization (or minimization depending on the sign) we can consider the gradient since we can only do a batch at a time. – Iordanis Feb 04 '20 at 22:42
  • @Iordanis, many models that apply EM don't require an explicit calculation of the gradient of the expected complete likelihood, because the maximization is done exactly and analytically; for instance, the Gaussian finite mixture model. I don't understand what you mean by a "gradient of the expectation step." – Andrew M Feb 06 '20 at 16:37

A short overview of expectation maximization:

  • Marginal likelihood

    Expectation maximization contrasts with 'regular' likelihood maximization in that it maximizes a marginal likelihood.

    $$\underbrace{p(X\vert \theta)}_{\substack{\text{marginal likelihood}\\ \mathcal{L}(\theta \vert X)}} = \int_z \underbrace{p(X, z \vert \theta)}_{\substack{\text{likelihood}\\ \mathcal{L}(\theta \vert X, z)}} \text{d}z = \int_z p(X \vert \theta,z)\, p(z\vert \theta)\, \text{d}z $$

    So this is an integral of the complete-data likelihood over an additional unobserved variable $z$ (e.g. missing data).

  • EM algorithm

    In the EM algorithm this integral is not maximized directly:

    $$\hat\theta = \underset{\theta}{\text{arg max}} \left( \int_z p(X \vert \theta,z)\, p(z\vert \theta)\, \text{d}z \right)$$

    Instead, it is approached iteratively, by maximizing the expected complete-data log-likelihood with the distribution of $z$ evaluated at the previous estimate:

    $$\hat\theta_{k+1} = \underset{\theta}{\text{arg max}} \left( \int_z \log p(X, z \vert \theta)\; p(z\vert X,\hat\theta_k)\, \text{d}z \right)$$

    This is done by picking an initial $\theta_1$ and updating repeatedly. Note that the optimization keeps the term $p(z\vert X,\hat\theta_k)$ fixed, which makes the maximization tractable (often in closed form, by setting derivatives to zero).

Not all problems are like that.

So this marginal likelihood only arises for problems with unobserved data. For instance, in fitting a Gaussian mixture (example on Wikipedia) one may consider an unobserved variable $z$ that indicates the class (which component in the mixture) a measurement belongs to.

There are many problems that do not involve a marginal likelihood and instead estimate parameters directly by optimizing some likelihood function (or some other cost function, but not a marginalized/expected one).

Code example

The use of a marginal likelihood is not always about explicitly missing data.

In the example below, the Gaussian mixture (example on Wikipedia) is worked out numerically in R.

In this case there is no explicit missing data (the likelihood can be defined directly and does not need to be a marginal likelihood integrating over missing data), but one has a mixture of two multivariate Gaussian distributions. The computational problem is that one cannot, as usual, replace the product of terms by a sum of logarithms (which has computational advantages), because each term is itself a sum over the mixture components. It would be possible to compute those logarithms of sums using an approximation (this is not done in the code below; instead the example parameters are chosen to be benign so that no infinite values occur).

library(MASS)

# data example
set.seed(1)
x1 <- mvrnorm(100, mu=c(1,1), Sigma=diag(1,2))
x2 <- mvrnorm(30,  mu=c(3,3), Sigma=diag(sqrt(0.5),2))
x <- rbind(x1,x2)
col <- c(rep(1,100),rep(2,30))
plot(x,col=col)

# Direct likelihood (defined without integrating over z); returns the negative log-likelihood
Lsimple <- function(par,X=x) {
  tau <- par[1]
  mu1 <- c(par[2],par[3])
  mu2 <- c(par[4],par[5])
  sigma_1 <- par[6]
  sigma_2 <- par[7]

  likterms <-    tau*dnorm(X[,1],mean = mu1[1], sd = sigma_1)*
                     dnorm(X[,2],mean = mu1[2], sd = sigma_1)+
             (1-tau)*dnorm(X[,1],mean = mu2[1], sd = sigma_2)*
                     dnorm(X[,2],mean = mu2[2], sd = sigma_2)
  logLik = sum(log(likterms))
  -logLik
}

# EM surrogate: expected complete-data negative log-likelihood,
# with the E-step posterior over z computed from the old parameters
LEM <- function(par,X=x,oldp=oldpar) {
  tau <- par[1]
  mu1 <- c(par[2],par[3])
  mu2 <- c(par[4],par[5])
  sigma_1 <- par[6]
  sigma_2 <- par[7]

  oldtau <- oldp[1]
  oldmu1 <- c(oldp[2],oldp[3])
  oldmu2 <- c(oldp[4],oldp[5])
  oldsigma_1 <- oldp[6]
  oldsigma_2 <- oldp[7]

  f1 <-     oldtau*dnorm(X[,1],mean = oldmu1[1], sd = oldsigma_1)*
                   dnorm(X[,2],mean = oldmu1[2], sd = oldsigma_1)
  f2 <- (1-oldtau)*dnorm(X[,1],mean = oldmu2[1], sd = oldsigma_2)*
                   dnorm(X[,2],mean = oldmu2[2], sd = oldsigma_2)
  pclass <- f1/(f1+f2)   # E-step: posterior probability of component 1

  ### note that each term is now a pure product (the weights appear as powers),
  ### so its logarithm can be taken term by term
  #likterms <- (    tau*dnorm(X[,1],mean = mu1[1], sd = sigma_1)*
  #                     dnorm(X[,2],mean = mu1[2], sd = sigma_1))^(pclass)*
  #            ((1-tau)*dnorm(X[,1],mean = mu2[1], sd = sigma_2)*
  #                     dnorm(X[,2],mean = mu2[2], sd = sigma_2))^(1-pclass)
  loglikterms <-   (log(tau)+dnorm(X[,1],mean = mu1[1], sd = sigma_1, log = TRUE)+
                             dnorm(X[,2],mean = mu1[2], sd = sigma_1, log = TRUE))*(pclass)+
                 (log(1-tau)+dnorm(X[,1],mean = mu2[1], sd = sigma_2, log = TRUE)+
                             dnorm(X[,2],mean = mu2[2], sd = sigma_2, log = TRUE))*(1-pclass)
  logLik = sum(loglikterms)
  -logLik
}


# solving with direct likelihood
par <- c(0.5,1,1,3,3,1,0.5)
p1 <- optim(par, Lsimple, 
          method="L-BFGS-B",
          lower = c(0.1,0,0,0,0,0.1,0.1),
          upper = c(0.9,5,5,5,5,3,3),
          control = list(trace=3, maxit=10^3)) 
p1

# solving with LEM 
# (this is done here indirectly/computationally with optim,
# but could be done analytically by expressing the derivative and solving)
oldpar <- c(0.5,1,1,3,3,1,0.5)

for (i in 1:100) {
  p2 <- optim(oldpar, LEM, 
             method="L-BFGS-B",
             lower = c(0.1,0,0,0,0,0.1,0.1),
             upper = c(0.9,5,5,5,5,3,3),
             control = list(trace=1, maxit=10^3))  
  oldpar <- p2$par
  print(i)
}
p2

# the result is the same:
p1$par
p2$par
Sextus Empiricus