Distance between two Gaussian mixtures to evaluate cluster solutions

Question

I'm running a quick simulation to compare different clustering methods, and currently hit a snag trying to evaluate the cluster solutions.

I know of various validation metrics (many found in cluster.stats() in R), but I assume those are best used if the estimated number of clusters actually equals the true number of clusters. I want to maintain the ability to measure how well a clustering solution performs when it doesn't specify the correct number of clusters in the original simulation (i.e., how well does a three cluster solution model data that were simulated to have a 4-cluster solution). Just for your information, clusters are simulated to possess identical covariance matrices.

I thought KL divergence between two mixtures of Gaussians would be useful to implement, but no closed form solution exists (Hershey and Olson (2007)) and implementing a Monte Carlo simulation is starting to be computationally expensive.

Are there any other solutions that might be easy to implement (even if just an approximation)?

The L2 distance between two Gaussian mixtures is available in closed form. Use this and you should be all set. — , Feb 07 '14 at 04:36
I don't know how you would do it, but it does not sound like a good idea to me. Take a mixture, permute the components (no change to p(x)) and the L2 distance can be anything. Also, L2 distance is not a good idea on covariance matrices. — bayerj, Oct 09 '14 at 08:23
Posterior predictive probability of a held out test dataset. I suspect you'd need priors on k though. — conjectures, May 14 '15 at 19:37

Danica · Answer 1 · 2018-10-25T10:59:06.833

Suppose we have two Gaussian mixtures in $\mathbb R^d$:$\DeclareMathOperator{\N}{\mathcal N} \newcommand{\ud}{\mathrm{d}} \DeclareMathOperator{\E}{\mathbb E} \DeclareMathOperator{\MMD}{\mathrm{MMD}}$ $$ P = \sum_{i=1}^{n} \alpha_i P_i = \sum_{i=1}^n \alpha_i \N(\mu_i, \Sigma_i) \qquad Q = \sum_{j=1}^m \beta_j Q_j = \sum_{j=1}^m \N(m_j, S_j) .$$ Call their densities $p(\cdot)$ and $q(\cdot)$, respectively, and denote the densities of their components $P_i$, $Q_j$ by $p_i(x) = \N(x; \mu_i, \Sigma_i)$, $q_j(x) = \N(x; m_j, S_j)$.

The following distances are available in closed form:

$L_2$ distance, as suggested in a comment by user39665. This is: \begin{align} L_2(P, Q)^2 &= \int (p(x) - q(x))^2 \,\ud x \\&= \int \left( \sum_{i} \alpha_i p_i(x) - \sum_j \beta_j q_j(x) \right)^2 \ud x \\&= \sum_{i,i'} \alpha_i \alpha_{i'} \int p_i(x) p_{i'}(x) \ud x + \sum_{j,j'} \beta_j \beta_{j'} \int q_j(x) q_{j'}(x) \ud x \\&\qquad - 2 \sum_{i,j} \alpha_i \beta_j \int p_i(x) q_j(x) \ud x .\end{align} Note that, as seen for example in section 8.1.8 of the matrix cookbook: \begin{align} \int \N(x; \mu, \Sigma) \N(x; \mu', \Sigma') \,\ud x &= \N(\mu; \mu', \Sigma + \Sigma') \end{align} so this can be evaluated easily in $O(m n)$ time.
The maximum mean discrepancy (MMD) with a Gaussian RBF kernel. This is a cool distance, not yet super-well-known among the statistics community, that takes a bit of math to define.

Letting $$k(x, y) := \exp\left( - \frac{1}{2 \sigma^2} \lVert x - y \rVert^2 \right),$$ define the Hilbert space $\mathcal{H}$ as the reproducing kernel Hilbert space corresponding to $k$: $k(x, y) = \langle \varphi(x), \varphi(y) \rangle_{\mathcal H}$.

Define the mean map kernel as $$ K(P, Q) = \E_{X \sim P, Y \sim Q} k(X, Y) = \langle \E_{X \sim P} \varphi(X), \E_{Y \sim Q} \varphi(Y) \rangle .$$

The MMD is then \begin{align} \MMD(P, Q) &= \lVert \E_{X \sim P}[\varphi(X)] - \E_{Y \sim Q}[\varphi(Y)] \rVert \\&= \sqrt{K(P, P) + K(Q, Q) - 2 K(P, Q)} \\&= \sup_{f : \lVert f \rVert_{\mathcal H} \le 1} \E_{X \sim P} f(X) - \E_{Y \sim Q} f(Y) .\end{align}

For our mixtures $P$ and $Q$, note that $$ K(P, Q) = \sum_{i, j} \alpha_i \beta_j K(P_i, Q_j) $$ and similarly for $K(P, P)$ and $K(Q, Q)$.

It turns out, using similar tricks as for $L_2$, that $K(\N(\mu, \Sigma), \N(\mu', \Sigma'))$ is $$ (2 \pi \sigma^2)^{d/2} \N(\mu; \mu', \Sigma + \Sigma' + \sigma^2 I) .$$

As $\sigma \to 0$, clearly this converges to a multiple of the $L_2$ distance. You'd normally want to use a different $\sigma$, though, one on the scale of the data variation.

Closed forms are also available for polynomial kernels $k$ in the MMD; see

Muandet, Fukumizu, Dinuzzo, and Schölkopf (2012). Learning from Distributions via Support Measure Machines. In Advances in Neural Information Processing Systems (official version). arXiv:1202.6504.

For a lot of nice properties of this distance, see

Sriperumbudur, Gretton, Fukumizu, Schölkopf, and Lanckriet (2010). Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11, 1517–1561. arXiv:0907.5309.
Quadratic Jensen-Rényi divergence. The Rényi-$\alpha$ entropy is defined as $$ H_\alpha(p) = \frac{1}{1-\alpha} \log\left( \int p(x)^\alpha \,\ud x \right) .$$ Its limit as $\alpha \to 1$ is the Shannon entropy. The Jensen-Rényi divergence is $$ \mathrm{JR}_\alpha(p, q) = H_\alpha\left( \frac{p + q}{2} \right) - \frac{H_\alpha(p) + H_\alpha(q)}{2} $$ where $\frac{p + q}{2}$ denotes an equal mixture between $p$ and $q$. It turns out that, when $\alpha = 2$ and when $P$ and $Q$ are Gaussian mixtures (as here), you can compute a closed form for $\mathrm{JR}_2$. This was done by

Wang, Syeda-Mahmood, Vemuri, Beymer, and Rangarajan (2009). Closed-Form Jensen-Renyi Divergence for Mixture of Gaussians and Applications to Group-Wise Shape Registration. Med Image Comput Comput Assist Interv., 12(1), 648–655. (free pubmed version)

score 0 · Answer 2 · answered Jul 18 '18 at 15:02

Here is a generalization of the Mahalanobis D to GMMs using the Fisher Kernel method and other techniques:

Tipping, Michael E. "Deriving cluster analytic distance functions from Gaussian mixture models." (1999): 815-820. https://pdfs.semanticscholar.org/08d2/0f55442aeb79edfaaaafa7ad54c513ee1dcb.pdf

See also: Is there a multi-Gaussian version of the Mahalanobis distance ?

score 0 · Answer 3 · answered Oct 04 '13 at 08:05

0

If your clusters are actually not Gaussian mixtures but arbitrarily shaped, your results may actually be much better when you produce much more clusters, then merge some again afterwards.

In many cases, one just chooses k to be arbitrarily high, e.g. 1000 for a large data set; in particular when you aren't really interested in the models, but just want to reduce the complexity of the data set via vector quantization.

answered Oct 04 '13 at 08:05

Has QUIT--Anony-Mousse

39,639
7
61
96

I simulated the clusters to be drawn from a Gaussian mixture, so I think my assumption is valid. The goal here is not to reduce complexity or come up with a decision criterion for choosing k, but to compare how well k clusters models the data when k is actually incorrect. Some incorrect choices might model the data better than others, and I'm trying to quantify this degree of misfit with some calculation (like KL divergence, but easier to implement for Gaussian mixtures). – dmartin Oct 04 '13 at 13:07

Distance between two Gaussian mixtures to evaluate cluster solutions

3 Answers3

Linked