
Question: In general, I am interested in understanding how to choose the hyperparameters when the goal is clustering bivariate vectors under a mixture of Gaussians with a conjugate Normal-inverse-Wishart prior on the component-specific parameters. How should I fix the hyperparameters?

For simplicity, let us focus on the following specific case.

Data: $\mathbf{y}=(y_1,\ldots,y_n)$, where $y_i \in \mathbb{R}^2$ and $n=1000$.

Model: \begin{align} Y_i \mid (\mu_i, \Sigma_i) &\overset{ind}{\sim} \mbox{N}(\mu_i, \Sigma_i)\\ (\mu_i, \Sigma_i) &\overset{iid}{\sim} P \\ P &\sim \mbox{DP}(\alpha,\mbox{NIW}(\mu_0, \lambda_0, \mathbf{\Psi}, \nu_0)). \end{align}

Goal of the analysis: clustering. Formally, we want to make inference on the random partition of $\{1,\ldots,n\}$ induced by the ties among $\{(\mu_i, \Sigma_i): i=1,\ldots,n\}$. That is, observations $i$ and $i^\prime$ are clustered together if and only if $(\mu_i, \Sigma_i)=(\mu_{i^\prime}, \Sigma_{i^\prime})$. We can rewrite the model directly in terms of the cluster membership indicators $(Z_1,\ldots,Z_n)$, where $Z_i$ is the cluster membership indicator of the $i$th observation; observations $i$ and $i^\prime$ are then clustered together if and only if $Z_i=Z_{i^\prime}$. We call $K_n$ the random number of clusters, that is, the number of unique values among $(Z_1,\ldots,Z_n)$, and $(\mu_k^\ast, \Sigma_k^\ast)$, $k=1,\ldots, K_n$, the associated unique parameters.

Model reparametrization:

\begin{align} Y_i \mid (\mu_k^\ast, \Sigma_k^\ast), \{Z_i=k\} &\overset{ind}{\sim} \mbox{N}(\mu_k^\ast, \Sigma_k^\ast)\\ (\mu_k^\ast, \Sigma_k^\ast) \mid K_n &\overset{iid}{\sim} \mbox{NIW}(\mu_0, \lambda_0, \mathbf{\Psi}, \nu_0)\\ (Z_1,\ldots,Z_n) &\sim \mbox{CRP}(\alpha) \end{align}
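
To get a feel for what $\alpha$ implies about $K_n$, I find it useful to simulate partitions from the $\mbox{CRP}(\alpha)$ prior and look at the implied distribution of the number of clusters, recalling that a priori $\mathbb{E}[K_n] = \sum_{i=1}^n \alpha/(\alpha+i-1) \approx \alpha\log(1+n/\alpha)$. Below is a minimal sketch of such a prior check, assuming only numpy (the function name is mine):

```python
# Prior check: what does alpha imply about the number of clusters K_n when n = 1000?
import numpy as np

rng = np.random.default_rng(0)

def crp_num_clusters(n, alpha, rng):
    """Sample one partition from CRP(alpha) sequentially and return the number of clusters."""
    counts = []                          # counts[k] = current size of cluster k
    for i in range(n):                   # i customers already seated
        probs = np.array(counts + [alpha], dtype=float) / (alpha + i)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)             # open a new cluster
        else:
            counts[k] += 1
    return len(counts)

n = 1000
for alpha in [0.1, 0.5, 1.0, 5.0]:
    ks = [crp_num_clusters(n, alpha, rng) for _ in range(200)]
    exact = np.sum(alpha / (alpha + np.arange(n)))   # E[K_n] = sum_i alpha / (alpha + i - 1)
    print(f"alpha={alpha}: simulated mean K_n = {np.mean(ks):.1f}, exact = {exact:.1f}")
```

For instance, with $n=1000$ and $\alpha=1$ the prior expected number of clusters is about $7.5$, so $\alpha$ already encodes a rather precise opinion about $K_n$.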

My attempt and thoughts:

  • Prior belief: if I have strong prior opinions, I should translate them into probabilistic assumptions.

  • Cluster meaning: moreover, the notion of a useful cluster changes with the application, so a single answer valid for every clustering problem is not possible. In my application I regard clusters as compact clouds of points in the bivariate space (e.g., points in high-density regions of different Gaussian components).

  • Hyperprior: putting a prior on the hyperparameters and learning them is certainly a meaningful strategy, even if it shifts the problem to another level, introducing more uncertainty and potentially encoding a weaker prior opinion. However, I do not want to do that here; I want to understand the different reasonable choices of the hyperparameters (data-dependent choices, such as empirical Bayes, are also fine) and what they imply for the clustering inference.

  • Practical observations: setting a small value of $\lambda_0$ (e.g. $\lambda_0=0.01$) lets the location parameter $\mu_k^\ast$ be learned from the data, ending up close to the within-cluster empirical mean even for small clusters. On the other hand, since $\mu_k^\ast \mid \Sigma_k^\ast \sim \mbox{N}(\mu_0, \Sigma_k^\ast/\lambda_0)$, a small $\lambda_0$ inflates the prior predictive spread of the observations within a cluster, so it should be partially counterbalanced by a choice of $\Psi$ that is not too vague (see the sketch after this list).
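
To see the interplay between $\lambda_0$ and $\mathbf{\Psi}$ concretely, I simulate the prior predictive of a single cluster: draw $(\mu^\ast, \Sigma^\ast) \sim \mbox{NIW}(\mu_0, \lambda_0, \mathbf{\Psi}, \nu_0)$ and then $Y \sim \mbox{N}(\mu^\ast, \Sigma^\ast)$, whose covariance is $(1+1/\lambda_0)\,\mathbf{\Psi}/(\nu_0-d-1)$ whenever $\nu_0 > d+1$. Here is a minimal sketch, assuming numpy and scipy (the values of $\mu_0$, $\nu_0$, $\mathbf{\Psi}$ are only illustrative):

```python
# Prior predictive check: how large is a "cluster" implied by (lambda0, Psi, nu0)?
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(1)
d = 2
mu0 = np.zeros(d)
nu0 = d + 2                    # smallest integer giving a finite prior mean for Sigma
Psi = np.eye(d)                # so that E[Sigma] = Psi / (nu0 - d - 1) = I

def prior_predictive_sd(lambda0, n_draws=5000):
    """Monte Carlo sd of the first coordinate of Y under the NIW prior predictive."""
    ys = np.empty(n_draws)
    for s in range(n_draws):
        Sigma = invwishart.rvs(df=nu0, scale=Psi, random_state=rng)
        mu = rng.multivariate_normal(mu0, Sigma / lambda0)   # mu* | Sigma* ~ N(mu0, Sigma*/lambda0)
        ys[s] = rng.multivariate_normal(mu, Sigma)[0]        # Y | mu*, Sigma* ~ N(mu*, Sigma*)
    return ys.std()

for lam in [0.01, 0.1, 1.0]:
    analytic = np.sqrt((1 + 1 / lam) * Psi[0, 0] / (nu0 - d - 1))   # sd of the first coordinate of Y
    print(f"lambda0={lam}: MC sd = {prior_predictive_sd(lam):.2f}, analytic = {analytic:.2f}")
```

With $\lambda_0=0.01$ the implied within-cluster standard deviation is about $\sqrt{101}\approx 10$ times the one implied by $\mathbf{\Psi}$ alone, which is exactly why I would shrink $\mathbf{\Psi}$ when $\lambda_0$ is small.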

I was wondering whether practical considerations like the ones above are available somewhere (to avoid rediscovering known facts, and also to avoid mistakes), and whether there are reasonable, well-justified default methods to set the hyperparameters (with all the limits that a single default answer has).

Thank you, any advice is appreciated!

Notation and references:

  • $Y_i$ is the observable random variable whose realization is $y_i$;
  • $\overset{iid}{\sim}$ refers to independent and identically distributed and $\overset{ind}{\sim}$ refers to independently distributed;
  • $\mbox{N}(\mu, \Sigma)$ represents the bivariate Gaussian density (and also the Gaussian probability law, according to the context) with mean $\mu \in \mathbb{R}^2$ and covariance matrix $\Sigma \in \mathbb{R}^{2\times 2}$. Moreover, $\mbox{N}(y \mid \mu, \Sigma)$ refers to the Gaussian density evaluated at the point $y \in \mathbb{R}^2$;
  • $\mbox{NIW}(\mu_0, \lambda_0, \mathbf{\Psi}, \nu_0)$ represents the Normal-inverse-Wishart distribution with parameters $\mu_0\in \mathbb{R}^2$, $\lambda_0 > 0$, $\mathbf{\Psi}\in \mathbb{R}^{2\times 2}$ positive definite and $\nu_0 > 1$;
  • $\mbox{DP}(\alpha,P_0)$ refers to the Dirichlet process with non-atomic base measure $P_0$ and concentration parameter $\alpha >0$;
  • $\mbox{CRP}(\alpha)$ refers to the Chinese restaurant process with parameter $\alpha$.