
Christopher Bishop defines the expected value of the complete-data log-likelihood function (i.e., assuming that we are given both the observed data $\textbf X$ and the latent data $\textbf Z$) as follows:

$$ \mathbb{E}_\textbf{Z}[\ln p(\textbf{X},\textbf{Z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \boldsymbol{\pi})] = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk})\{\ln \pi_k + \ln \mathcal{N}(\textbf{x}_n \mid \ \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\} \tag 1 $$

where $\gamma(z_{nk})$ is defined as:

$$ \frac{\pi_k \mathcal{N}(\textbf{x}_n \mid \ \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\textbf{x}_n \mid \ \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \tag 2 $$
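For concreteness, here is a minimal numerical sketch of eqns (1) and (2); the toy data, parameter values, and variable names are all invented for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy setup (invented): N = 4 points, M = 2 dimensions, K = 2 components.
X = np.array([[0.0, 0.1], [0.2, -0.1], [4.9, 5.0], [5.1, 4.8]])
pi = np.array([0.5, 0.5])                      # mixing coefficients pi_k
mu = np.array([[0.0, 0.0], [5.0, 5.0]])        # means mu_k
Sigma = [np.eye(2), np.eye(2)]                 # covariances Sigma_k

# N(x_n | mu_k, Sigma_k) for every (n, k) pair, shape (N, K)
dens = np.column_stack([
    multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    for k in range(len(pi))
])

gamma = pi * dens                              # numerator of eqn (2)
gamma /= gamma.sum(axis=1, keepdims=True)      # normalise over k: eqn (2)

# Eqn (1): expected complete-data log likelihood
expected_ll = np.sum(gamma * (np.log(pi) + np.log(dens)))
```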

The idea, as described, is to consider a Gaussian Mixture Model in which the covariance matrices of the mixture components are given by $\epsilon \textbf{I}$, where $\epsilon$ is a variance parameter that is shared by all of the components, such that:

$$ p(\textbf x \mid \boldsymbol \mu_k, \boldsymbol \Sigma_k) = \frac{1}{(2 \pi \epsilon)^\frac{M}{2}} \exp\big\{{-\frac{1}{2 \epsilon} \|\textbf x - \boldsymbol \mu_k\|^2}\big\} \tag 3 $$

and so, $\gamma(z_{nk})$ is now defined as:

$$ \frac{\pi_k \exp\{ - \| \textbf x_n - \boldsymbol \mu_k\|^2 / 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{ - \| \textbf x_n - \boldsymbol \mu_j\|^2 / 2 \epsilon\}} \tag 4 $$
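Writing $\delta_{nk} = \|\textbf x_n - \boldsymbol \mu_k\|^2$, eqn (4) is simply a softmax over $-\delta_{nk}/2\epsilon$. A minimal sketch (function name and shapes are my own):

```python
import numpy as np

def responsibilities(X, mu, pi, eps):
    """Eqn (4): responsibilities under isotropic components with shared variance eps."""
    # delta[n, k] = ||x_n - mu_k||^2, shape (N, K)
    delta = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    w = pi * np.exp(-delta / (2.0 * eps))      # unnormalised weights pi_k exp{-delta/2eps}
    return w / w.sum(axis=1, keepdims=True)    # normalise over components k
```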

The argument now is the following:

If we consider the limit $\epsilon \to 0$, we see that in the denominator the term for which $\| \textbf x_n - \boldsymbol \mu_j\|^2$ is smallest will go to zero most slowly, and hence the responsibilities $\gamma(z_{nk})$ for the data point $\textbf x_n$ all go to zero except for that term $j$, whose responsibility $\gamma(z_{nj})$ goes to unity. Thus, in this limit, we obtain a hard assignment of data points to clusters, just as in the $K$-means algorithm, so that $\gamma(z_{nk}) \to r_{nk}$,

where $r_{nk}$ is defined as:

$$ r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\textbf x_n - \boldsymbol \mu_j\|^2\\ 0 & \text{otherwise} \end{cases} \tag 5 $$
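A quick numerical check of the claimed limit, reusing the `responsibilities` sketch above (same invented names): as $\epsilon$ shrinks, each row of $\gamma$ collapses onto the nearest mean.

```python
X = np.array([[0.0, 0.0], [3.0, 3.0]])
mu = np.array([[0.0, 0.0], [5.0, 5.0]])
pi = np.array([0.3, 0.7])

for eps in (10.0, 1.0, 0.1):
    print(eps, responsibilities(X, mu, pi, eps).round(4))
# Rows tend to the one-hot vectors [1, 0] and [0, 1]: each point is
# hard-assigned to its nearest mean, i.e. gamma(z_nk) -> r_nk of eqn (5).
```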

My question is: how does the above argument hold? Namely, what does it mean for a term to go to zero $\textbf{most slowly}$? And how does taking the limit $\epsilon \to 0$ in eqn (4) result in a binary responsibility?

    When $\epsilon$ goes to zero, $\exp\{ - \| \textbf x_n - \boldsymbol \mu_k\|^2 / 2 \epsilon\}=\exp\{-\delta_k/2\epsilon\}$ goes to zero for all $k$'s but at different speeds depending on $\delta_k$; the smallest $\delta_k$ then gathers the whole weight in the limit. – Xi'an Jan 11 '15 at 12:53
    (further explanation) If you take $\delta^*$ as the smallest $\delta_k$, you can rewrite all terms as $\exp\{(\delta^*-\delta_k)/2\epsilon\}$, which means all terms go to zero with $\epsilon$ except one, the one for which $\delta^*-\delta_k=0$. – Xi'an Jan 11 '15 at 12:55
  • @Xi'an Do you care to provide more elaboration? What do you mean by "the smallest $\delta_k$ then gathers the whole weight in the limit"? And how does the term for which $\delta^* - \delta_k = 0$ evaluate to unity? I mean, the exponent is 0, right? – BitRiver Jan 11 '15 at 13:40

1 Answer


Let us write $$\|\textbf x_n - \boldsymbol \mu_k\|^2=\delta_k\,.$$ Then $$\frac{\pi_k \exp\{ - \| \textbf x_n - \boldsymbol \mu_k\|^2 / 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{ - \| \textbf x_n - \boldsymbol \mu_j\|^2 / 2 \epsilon\}}=\frac{\pi_k \exp\{ - \delta_k/ 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{ - \delta_j/ 2 \epsilon\}}\,.$$

If we take $$\delta^*=\min_k\delta_k\,,$$ we can multiply both numerator and denominator by $\exp\{\delta^*/2\epsilon\}$ to get \begin{align*} \frac{\pi_k \exp\{ - \delta_k/ 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{ - \delta_j/ 2 \epsilon\}}&=\frac{\pi_k \exp\{(\delta^*- \delta_k)/ 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{(\delta^* - \delta_j)/ 2 \epsilon\}}\,, \end{align*} where $\delta^*-\delta_k<0$ except for the minimising index $k=k^*$, where $\delta^*-\delta_{k^*}=0$.

So, for all $k\ne k^*$, $$\lim_{\epsilon\to 0} \frac{\pi_k \exp\{(\delta^*- \delta_k)/ 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{(\delta^* - \delta_j)/ 2 \epsilon\}}=\lim_{\epsilon\to 0} \frac{\pi_k \exp\{(\delta^*- \delta_k)/ 2 \epsilon\}}{\pi_{k^*}+\sum_{j\ne k^*} \pi_j \exp\{(\delta^* - \delta_j)/ 2 \epsilon\}}=0\,,$$ since, for $a>0$, $$\lim_{\epsilon\to 0}\exp\{-a/\epsilon \}=0\,,$$ while $$\lim_{\epsilon\to 0} \frac{\pi_{k^*} \exp\{(\delta^*- \delta_{k^*})/ 2 \epsilon\}}{\sum_{j=1}^K \pi_j \exp\{(\delta^* - \delta_j)/ 2 \epsilon\}}=\lim_{\epsilon\to 0} \frac{\pi_{k^*} \times 1}{\pi_{k^*}+\sum_{j\ne k^*} \pi_j \exp\{(\delta^* - \delta_j)/ 2 \epsilon\}}=\frac{\pi_{k^*}}{\pi_{k^*}}=1\,.$$
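Incidentally, the rescaling step in this answer is also the standard trick for evaluating eqn (4) stably on a computer: subtracting $\delta^*$ before exponentiating avoids the $0/0$ that the naive formula produces for small $\epsilon$. A minimal sketch under that observation (function name and shapes invented):

```python
import numpy as np

def responsibilities_stable(X, mu, pi, eps):
    """Eqn (4) computed via the rescaling used in this answer."""
    delta = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)   # delta[n, k]
    # Multiply numerator and denominator by exp{delta*/2eps}: the k* term
    # becomes exp(0) = 1 and every other exponent is strictly negative.
    shifted = (delta.min(axis=1, keepdims=True) - delta) / (2.0 * eps)
    w = pi * np.exp(shifted)
    return w / w.sum(axis=1, keepdims=True)

# For tiny eps every row is (numerically) one-hot at k* = argmin_k delta_k,
# matching r_nk in eqn (5), where the naive formula would underflow.
```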
