
I want to implement the EM algorithm manually and then compare it to the results of the normalmixEM function in the mixtools package. Of course, I would be happy if they both led to the same results. The main reference is Geoffrey McLachlan (2000), Finite Mixture Models.

I have a mixture density of two Gaussians; in general form, the log-likelihood is given by (McLachlan, p. 48):

$$ \log L_c(\Psi) = \sum_{i=1}^g \sum_{j=1}^n z_{ij}\{\log \pi_i + \log f_i(y_j;\theta_i)\}. $$ The $z_{ij}$ are $1$ if observation $j$ came from the $i$th component density, and $0$ otherwise. $f_i$ is the density of the normal distribution. The $\pi_i$ are the mixture proportions, so $\pi_1$ is the probability that an observation comes from the first Gaussian distribution and $\pi_2$ the probability that it comes from the second.

The E step is then the calculation of the conditional expectation:

$$ Q(\Psi;\Psi^{(0)}) = E_{\Psi^{(0)}}\{\log L_c(\Psi)\mid y\}, $$ which, after a few derivations, leads to the result (p. 49):

\begin{align} \tau_i(y_j;\Psi^{(k)}) &= \frac{\pi_i^{(k)}f_i(y_j;\theta_i^{(k)})}{f(y_j;\Psi^{(k)})} \\[8pt] &= \frac{\pi_i^{(k)}f_i(y_j;\theta_i^{(k)})}{\sum_{h=1}^g \pi_h^{(k)}f_h(y_j;\theta_h^{(k)})} \end{align} and, in the case of two Gaussians (p. 82):

$$ \tau_i(y_j;\Psi) = \frac{\pi_i \phi(y_j;\mu_i,\Sigma_i)}{\sum_{h=1}^g \pi_h\phi(y_j; \mu_h,\Sigma_h)} $$ The M step is now the maximization of Q (page 49):

$$ Q(\Psi;\Psi^{(k)}) = \sum_{i=1}^g\sum_{j=1}^n\tau_i(y_j;\Psi^{(k)})\{\log \pi_i + \log f_i(y_j;\theta_i)\}. $$ This leads to (in the case of two Gaussians) (page 82):

\begin{align} \mu_i^{(k+1)} &= \frac{\sum_{j=1}^n \tau_{ij}^{(k)}y_j}{\sum_{j=1}^n \tau_{ij}^{(k)}} \\[8pt] \Sigma_i^{(k+1)} &= \frac{\sum_{j=1}^n \tau_{ij}^{(k)}(y_j - \mu_i^{(k+1)})(y_j - \mu_i^{(k+1)})^T}{\sum_{j=1}^n \tau_{ij}^{(k)}} \end{align} and we know that (p. 50)

$$ \pi_i^{(k+1)} = \frac{\sum_{j=1}^n \tau_i(y_j;\Psi^{(k)})}{n}\qquad (i = 1, \ldots, g). $$ We repeat the E and M steps until $L(\Psi^{(k+1)})-L(\Psi^{(k)})$ is small.

I tried to write this in R code (the data can be found here).

# EM algorithm manually
# dat is the data

# initial values
pi1       <-  0.5
pi2       <-  0.5
mu1       <- -0.01
mu2       <-  0.01
sigma1    <-  0.01
sigma2    <-  0.02
loglik[1] <-  0
loglik[2] <- sum(pi1*(log(pi1) + log(dnorm(dat,mu1,sigma1)))) + 
             sum(pi2*(log(pi2) + log(dnorm(dat,mu2,sigma2))))

tau1 <- 0
tau2 <- 0
k    <- 1

# loop
while(abs(loglik[k+1]-loglik[k]) >= 0.00001) {

  # E step
  tau1 <- pi1*dnorm(dat,mean=mu1,sd=sigma1)/(pi1*dnorm(x,mean=mu1,sd=sigma1) + 
          pi2*dnorm(dat,mean=mu2,sd=sigma2))
  tau2 <- pi2*dnorm(dat,mean=mu2,sd=sigma2)/(pi1*dnorm(x,mean=mu1,sd=sigma1) + 
          pi2*dnorm(dat,mean=mu2,sd=sigma2))

  # M step
  pi1 <- sum(tau1)/length(dat)
  pi2 <- sum(tau2)/length(dat)

  mu1 <- sum(tau1*x)/sum(tau1)
  mu2 <- sum(tau2*x)/sum(tau2)

  sigma1 <- sum(tau1*(x-mu1)^2)/sum(tau1)
  sigma2 <- sum(tau2*(x-mu2)^2)/sum(tau2)

  loglik[k] <- sum(tau1*(log(pi1) + log(dnorm(x,mu1,sigma1)))) + 
               sum(tau2*(log(pi2) + log(dnorm(x,mu2,sigma2))))
  k         <- k+1
}


# compare
library(mixtools)
gm <- normalmixEM(x, k=2, lambda=c(0.5,0.5), mu=c(-0.01,0.01), sigma=c(0.01,0.02))
gm$lambda
gm$mu
gm$sigma

gm$loglik

The algorithm is not working, since some observations have a likelihood of zero, and the log of this is -Inf. Where is my mistake?

Stat Tistician
  • The problem is not a statistical one, but rather a numerical one. You should add contingencies for likelihoods smaller than machine precision in your code. – JohnRos May 02 '13 at 15:58
  • why don't you try verifying the mixtools function with a very simple example that can be verified by hand, say just five or ten values and two time series, first. Then, if you find it works there, generalize your code and verify at each step. –  Dec 27 '13 at 18:05

2 Answers

You have several problems in the source code:

  1. As @Pat pointed out, you should not use log(dnorm()), as this value can easily underflow to -Inf. You should use logdmvnorm instead.

  2. When you use sum, take care to remove infinite or missing values first.

  3. Your looping variable k is wrong: you update loglik[k], but you should update loglik[k+1].

  4. The initial values for your method and for mixtools differ: you are using $\Sigma$ (a variance) in your method, but mixtools expects $\sigma$, i.e. a standard deviation (see the mixtools manual).

  5. Your data do not look like a mixture of normals (check the histogram I plotted at the end), and one component of the mixture has a very small s.d. I therefore arbitrarily added a line that sets $\tau_1$ and $\tau_2$ equal for some extreme samples, just to make the code work.
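On point 1, it is worth noting (my addition, not part of the original list) that base R's dnorm() has a log argument that computes the log-density directly, without ever forming the underflowing density:

    # dnorm underflows for points far from the mean, but the log-density does not
    dnorm(40, mean = 0, sd = 1)              # 0 (underflow)
    log(dnorm(40, mean = 0, sd = 1))         # -Inf
    dnorm(40, mean = 0, sd = 1, log = TRUE)  # about -800.92, finite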

I also suggest you post complete code (e.g. how you initialize loglik[]) and indent it to make it easier to read.

Finally, thanks for introducing the mixtools package; I plan to use it in my future research.

I also put my working code for your reference:

# EM algorithm manually
# dat is the data
setwd("~/Downloads/")
load("datem.Rdata")
x <- dat

library(mixtools)  # needed for logdmvnorm() below, and for the comparison at the end

# helpers: sum over finite values only, and a log-density robust to underflow
mysum <- function(x) {
  sum(x[is.finite(x)])
}
logdnorm <- function(x, mu, sigma) {
  mysum(sapply(x, function(x) {logdmvnorm(x, mu, sigma)}))
}

# initial values
pi1    <-  0.5
pi2    <-  0.5
mu1    <- -0.01
mu2    <-  0.01
sigma1 <- sqrt(0.01)
sigma2 <- sqrt(0.02)

loglik    <- rep(NA, 1000)
loglik[1] <- 0
loglik[2] <- mysum(pi1*(log(pi1) + log(dnorm(dat, mu1, sigma1)))) +
             mysum(pi2*(log(pi2) + log(dnorm(dat, mu2, sigma2))))

tau1 <- 0
tau2 <- 0
k    <- 2

# loop
while(abs(loglik[k] - loglik[k-1]) >= 0.00001) {

  # E step
  tau1 <- pi1*dnorm(dat, mean = mu1, sd = sigma1) /
          (pi1*dnorm(dat, mean = mu1, sd = sigma1) + pi2*dnorm(dat, mean = mu2, sd = sigma2))
  tau2 <- pi2*dnorm(dat, mean = mu2, sd = sigma2) /
          (pi1*dnorm(dat, mean = mu1, sd = sigma1) + pi2*dnorm(dat, mean = mu2, sd = sigma2))
  # extreme samples underflow to 0/0 = NaN; assign them equally to both components
  tau1[is.na(tau1)] <- 0.5
  tau2[is.na(tau2)] <- 0.5

  # M step
  pi1 <- mysum(tau1)/length(dat)
  pi2 <- mysum(tau2)/length(dat)

  mu1 <- mysum(tau1*x)/mysum(tau1)
  mu2 <- mysum(tau2*x)/mysum(tau2)

  sigma1 <- mysum(tau1*(x - mu1)^2)/mysum(tau1)
  sigma2 <- mysum(tau2*(x - mu2)^2)/mysum(tau2)

  loglik[k+1] <- mysum(tau1*(log(pi1) + logdnorm(x, mu1, sigma1))) +
                 mysum(tau2*(log(pi2) + logdnorm(x, mu2, sigma2)))
  k <- k + 1
}

# compare
gm <- normalmixEM(x, k = 2, lambda = c(0.5, 0.5), mu = c(-0.01, 0.01), sigma = c(0.01, 0.02))
gm$lambda
gm$mu
gm$sigma

gm$loglik

[Histogram of the data]

zhanxw
  • @zhanxw thanks for your answer, so does that mean that my code is wrong? So the basic idea is not working? – Stat Tistician May 03 '13 at 07:16
  • "I also suggest you put complete codes (e.g. how you initialize loglik[]) in your source code and indent the code to make it easy to read." Well this is my code? the loglik[] is defined as I declared it in the code I posted? – Stat Tistician May 03 '13 at 07:17
  • @StatTistician the idea is correct, but the implementation does have flaws. For example, you did not consider under-flow. Also, your looping variable k is confusing: you first set loglik[1] and loglik[2], then after entering the while loop you set loglik[1] again. This is not the natural way to do it. My suggestion about initializing loglik[] means code: `loklik – zhanxw May 03 '13 at 18:24
  • As I posted below: Thanks for your help, but I am dropping this topic, since it is too advanced for me. – Stat Tistician May 04 '13 at 08:41
  • Is there now a way to determine what part of the data belongs to which mixture? – Cardinal Apr 09 '19 at 12:45
  • What is the justification/intuition for setting the missing values to .5? – rsmith54 Sep 25 '19 at 18:25
  • @rsmith54 I don't get your question. Can you give more details? – zhanxw Sep 25 '19 at 21:31
  • @Cardinal you can calculate the probability density function (dnorm) for the data for each of the distribution, say p1 and p2. If p1 > p2, the data belongs to cluster 1. – zhanxw Sep 25 '19 at 21:33
  • `tau1[is.na(tau1)] – rsmith54 Sep 26 '19 at 12:38
  • Some comments for your program, 1. "sigma1 – Deep North Dec 10 '20 at 12:12

I keep getting an error when trying to open your .rar file, but that may just be me doing something silly.

I can see no obvious errors in your code. A possible reason you're getting zeros is floating point precision. Remember, when you calculate $f(y;\theta)$, you're evaluating $\exp(-0.5(y-\mu)^2/\sigma^2)$. It doesn't take a very large difference between $\mu$ and $y$ for this to get rounded down to 0 on a computer. This is doubly noticeable in mixture models, as some of your data will not be "assigned" to each mixture component and so can end up very far away from it. In theory, these points should also end up with a low value of $\tau$ when you evaluate the log likelihood, counteracting the problem - but thanks to the floating point error, the quantity has already been evaluated as -Inf by that stage, so it all breaks :).
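The rounding described above is easy to reproduce at the R console:

    # the smallest positive double is about 5e-324, so exp(-800) underflows to 0
    exp(-800)                          # 0
    dnorm(40, mean = 0, sd = 1)        # 0: the density 40 sd from the mean underflows
    log(dnorm(40, mean = 0, sd = 1))   # -Inf, which then poisons the log-likelihood sum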

If that is the problem, there are some possible solutions:

One is to move your $\tau$ inside the logarithm. So instead of evaluating

$\tau \log(f(y|\theta)) $

evaluate

$\log \left( f(y|\theta)^\tau \right)$.

Mathematically the same, but think about what happens when $f(y|\theta)$ and $\tau$ are $\approx 0$. Currently you get:

  • $0 \log(0) = 0 \cdot (-\infty) = \mathrm{NaN}$

but with tau moved you get

  • $\log \left( 0^0\right) = \log(1) = 0$

assuming R evaluates $0^0 = 1$ (I don't know whether it does, as I tend to use Matlab).
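For what it's worth, R does follow that convention, as a quick check confirms:

    0^0         # 1 in R, so log(0^0) evaluates to 0 as hoped
    0 * log(0)  # NaN: log(0) is -Inf, and 0 * -Inf is NaN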

Another solution is to expand out the stuff inside the logarithm. Assuming you're using natural logarithms:

$\tau \log(f(y|\theta)) $

$= \tau \log(\exp(-0.5(y-\mu)^2/\sigma^2)/\sqrt{2\pi\sigma^2})$

$= -0.5\tau \log(2 \pi\sigma^2) - 0.5 \tau \frac{(y-\mu)^2}{\sigma^2}$.

Mathematically the same, but this should be more resilient to floating point errors, as you've avoided calculating a large negative power. It means you can't use the built-in normal density function any more, but if that's not a problem, this is probably the better answer. For example, say we have the situation where

$-0.5\frac{(y-\mu)^2}{\sigma^2} = -0.5*40^2 = -800$.

Evaluate that as I've just suggested and you get -800. However, in Matlab, if we exp then take the log, we get $\log(\exp(-800)) = \log(0) = -\infty$.
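A direct R translation of the expanded form above (the name logphi is just for illustration; base R's dnorm(..., log = TRUE) computes the same quantity):

    # log of the normal density, computed without ever exponentiating,
    # so it stays finite even far from the mean
    logphi <- function(y, mu, sigma) {
      -0.5 * log(2 * pi * sigma^2) - 0.5 * (y - mu)^2 / sigma^2
    }

    logphi(40, 0, 1)   # about -800.92, where log(dnorm(40, 0, 1)) gives -Inf
    all.equal(logphi(2, 0, 1), dnorm(2, 0, 1, log = TRUE))   # TRUE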

Pat
  • mh, to be honest: I am not good enough to get this thing working. What I was interested in is: Can I get the same result with my algorithm as the implemented version of the mixtools package. But from my point of view this seems to be asking for the moon. But I think you put effort into your answer, so I will accept it! Thanks! – Stat Tistician May 02 '13 at 17:04