
I'm trying to make sense of a derivation I'm following from the lecture notes of Stanford's ML course. Specifically the notes are here: http://cs229.stanford.edu/notes/cs229-notes8.pdf

In particular, I'm interested in the step where they optimize the lower bound on the log-likelihood by taking the derivative of that function w.r.t. the mean (top of page 7). I'm a bit stuck on going from line 1 to line 2 in that sequence. To copy it here (using abbreviated notation for space):

$$ \nabla_{\mu_s} \sum_{i=1}^{m}\sum_{j=1}^{k} w_j^{(i)}\log\frac{\alpha_j \exp\left(\psi_{j}^{(i)}\right)\phi_j}{w_j^{(i)}} \qquad(1) $$ $$ \nabla_{\mu_s} \sum_{i=1}^{m}\sum_{j=1}^{k} w_j^{(i)}\psi_{j}^{(i)}\qquad(2) $$ where $\psi_{j}^{(i)} =-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma^{-1}_{j}(x^{(i)}-\mu_j)$, $\alpha_j=\frac{1}{(2\pi)^{n/2}|\Sigma_j|^{1/2}}$, and $\phi_j = p(z^{(i)}=j)$ is the mixing weight of component $j$. The quantity $\mu_s$ is the mean (of a multivariate Gaussian distribution) for class $s$.

In short, I'm not sure how they get from (1) to (2). Just expanding the log in (1), I see:

$$ \nabla_{\mu_s} \sum_{i=1}^{m}\sum_{j=1}^{k} \left[ w_j^{(i)}\log \left(\alpha_j \exp\left(\psi_{j}^{(i)}\right)\phi_j\right) - w_j^{(i)}\log w_j^{(i)} \right] \qquad(1') $$
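For concreteness, the summand splits as

$$ w_j^{(i)}\log \left(\alpha_j \exp\left(\psi_{j}^{(i)}\right)\phi_j\right) - w_j^{(i)}\log w_j^{(i)} = w_j^{(i)}\log(\alpha_j\phi_j) + w_j^{(i)}\psi_{j}^{(i)} - w_j^{(i)}\log w_j^{(i)}, $$

where only the middle term contains $\mu_s$ explicitly; the first and last terms depend on $\mu_s$ only through $w_j^{(i)}$ (if at all).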

To get to their form (2), my impression is that they pre-emptively discard the terms that would contain derivatives of $w_j^{(i)}$ (w.r.t. $\mu_s$). At least if you assume that much, you get to (2). However, as they show earlier in that PDF,

$$ w_j^{(i)} = p(z^{(i)}=j \,|\, x^{(i)}) =\frac{p(x^{(i)}|z^{(i)}=j)\,p(z^{(i)}=j) }{\sum_{l=1}^{k} p(x^{(i)},z^{(i)}=l) } $$ i.e. this quantity is the posterior distribution of the class labels. Intuitively, it seems like $\partial w_j^{(i)}/ \partial \mu_s$ should be non-zero. After all, changing the location of the mean $\mu_s$ would change your posterior probability for a particular class given the same data. (Right?) Just from a mechanical point of view, $\partial w_j^{(i)}/\partial \mu_j$ would seemingly have some non-zero terms, since $p(x^{(i)}|z^{(i)}=j) \sim \mathcal{N}(\mu_j, \Sigma_j)$. If I work through the algebra, I get

$$ \frac{\partial w_j^{(i)}}{\partial \mu_j} = w_j^{(i)}(1-w_j^{(i)})\frac{\partial \psi^{(i)}_{j}}{\partial \mu_j} $$
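For reference, my algebra there: writing the posterior as a ratio $w_j^{(i)} = N_j^{(i)}/D^{(i)}$, with the shorthand $N_j^{(i)} = \alpha_j \exp\left(\psi_j^{(i)}\right)\phi_j$ and $D^{(i)} = \sum_{l=1}^{k} N_l^{(i)}$, only the $j$th term of the denominator depends on $\mu_j$, so the quotient rule gives

$$ \frac{\partial w_j^{(i)}}{\partial \mu_j} = \frac{N_j^{(i)}\,\frac{\partial \psi_j^{(i)}}{\partial \mu_j}\,D^{(i)} - N_j^{(i)}\,N_j^{(i)}\,\frac{\partial \psi_j^{(i)}}{\partial \mu_j}}{\left(D^{(i)}\right)^2} = w_j^{(i)}\left(1-w_j^{(i)}\right)\frac{\partial \psi_j^{(i)}}{\partial \mu_j}. $$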

What am I missing here? If I take the log-likelihood directly (without even thinking about the EM algorithm),

$$ \ell = \sum_{i=1}^{m}\log \sum_{j=1}^{k}p(x^{(i)}|z^{(i)}=j)\,p(z^{(i)}=j) $$ and differentiate w.r.t. $\mu_s$ (setting the gradient to zero), I correctly get

$$ \mu_s^* = \frac{\sum_{i=1}^{m}w_s^{(i)}x^{(i)}}{\sum_{i=1}^{m}w_s^{(i)}} $$
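Spelling that computation out (my own working, in the notation above): differentiating the log of the inner sum pulls out exactly the posterior weight,

$$ \frac{\partial \ell}{\partial \mu_s} = \sum_{i=1}^{m}\frac{p(x^{(i)}|z^{(i)}=s)\,p(z^{(i)}=s)}{\sum_{l=1}^{k} p(x^{(i)}|z^{(i)}=l)\,p(z^{(i)}=l)}\;\Sigma_s^{-1}\left(x^{(i)}-\mu_s\right) = \sum_{i=1}^{m} w_s^{(i)}\,\Sigma_s^{-1}\left(x^{(i)}-\mu_s\right), $$

and setting this to zero gives the weighted average above.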

so I'm comfortable with the eventual result, just not the intermediate steps.

blawney_dfci

1 Answer


The most difficult part of the EM algorithm [in my teaching experience] is the co-existence of two sets of parameters $\theta$: one is the "current" value of the parameter, $\theta^{(t)}$, obtained at the previous iteration $t$, and one is free [and free to be optimised], both within the target "E" function $$Q(\theta;\theta^{(t)}) = \mathbb{E}_{\theta^{(t)}}\left[\log L^c(\theta|X,Z) \,\middle|\, X=x \right]$$ In the case of a mixture model, this $Q(\theta;\theta^{(t)})$ function writes down as $$\sum_{i=1}^n \sum_{j=1}^k \mathbb{P}_{\theta^{(t)}}\left[Z_i=j\,|\,X_i=x_i\right]\left\{\log(w_j)+ \log f(x_i|\theta_j) \right\}$$ where $\theta_j$ denotes the parameter(s) of the $j$th component of the mixture (and $w_j$ the known or unknown weight of this component). In this expression, the conditional probability depends only on the "current" value of the parameter, $\theta^{(t)}$, and hence is not to be differentiated in terms of the free parameter.
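To make the two roles of $\theta$ concrete, here is a minimal numpy sketch of a single EM iteration for a Gaussian mixture (my own illustration, not taken from the course notes; the function and variable names are arbitrary). The responsibilities are computed once from the current parameters $\theta^{(t)}$ in the E-step, and then enter the M-step only as fixed constants.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mus, Sigmas):
    """One EM iteration for a Gaussian mixture (weights and means updated).

    X: (m, n) data; phi: (k,) mixing weights; mus: (k, n) means; Sigmas: (k, n, n) covariances.
    """
    k = len(phi)

    # E-step: responsibilities w[i, j] = P_{theta^(t)}(Z_i = j | X_i = x_i),
    # computed with the *current* parameters theta^(t).
    w = np.column_stack([
        phi[j] * multivariate_normal(mus[j], Sigmas[j]).pdf(X) for j in range(k)
    ])
    w /= w.sum(axis=1, keepdims=True)

    # M-step: maximise Q(theta; theta^(t)) over the *free* parameters.
    # The w's are held fixed here, so the mean update is the weighted average
    # mu_s = sum_i w_s^(i) x^(i) / sum_i w_s^(i).
    new_mus = (w.T @ X) / w.sum(axis=0)[:, None]
    new_phi = w.mean(axis=0)
    return new_phi, new_mus, w

# Example: two components in 2D (made-up numbers, purely illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
phi = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0], [1.0, 1.0]])
Sigmas = np.array([np.eye(2), np.eye(2)])
phi, mus, w = em_step(X, phi, mus, Sigmas)
```

The key point is the M-step line: `w` appears there only as a fixed set of weights, exactly as the conditional probability appears in the $Q$ function above.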

Xi'an
    Since this is now its own reply, I will say that this was very helpful. Re-reading the original notes, I can now see this subtlety more clearly. Thank you! – blawney_dfci Oct 30 '17 at 18:35