I have some microarray data (~15 samples) that I've clustered with PAM over a range of cluster counts, and I want to use BIC to find the optimal k.

I basically want to re-implement the BIC score from the X-means paper [1], and this stat.stackexchange post answered some basic questions. But their definition of sigma seems to be for the one-dimensional case. How would I calculate the covariance matrix for my multidimensional dataset to plug into the multivariate Gaussian log-likelihood function?

I could be missing something obvious, but I can't find a reference that explains the multivariate case for cluster models. I can add a reproducible example if needed.

Update: here's the formula for the variance:

$$ \sigma^2 = \frac{1}{R-K}\sum_{i}(x_i - \mu_{(i)})^2 $$

Here, $x_i$ is a sample point and $\mu_{(i)}$ is the center of the cluster that the sample belongs to ($R$ is the number of points and $K$ the number of clusters). In the multivariate case, a point is a vector of size $n$ (for example, row $i$ of the data matrix), so the mean $\mu_{(i)}$ should also be a 1-by-$n$ vector. How, then, do they get a single number for the variance?
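
One possible reading of the formula, offered as an assumption rather than a confirmed answer: if every cluster is modeled as a spherical Gaussian sharing a single variance, $(x_i - \mu_{(i)})^2$ can be taken as the squared Euclidean distance between the point and its center, which collapses each vector difference to one number. A minimal Python sketch of that reading follows; the function name and toy data are hypothetical, and with PAM the centers passed in would be medoids rather than means.

```python
def pooled_variance(points, labels, centers):
    """Pooled variance under the ASSUMED squared-Euclidean reading.

    points  : list of n-dimensional tuples (R of them)
    labels  : cluster index of each point
    centers : list of n-dimensional tuples (K of them)
    """
    R = len(points)
    K = len(centers)
    total = 0.0
    for x, k in zip(points, labels):
        mu = centers[k]
        # Squared Euclidean distance: sum of squared per-coordinate differences.
        total += sum((xj - mj) ** 2 for xj, mj in zip(x, mu))
    return total / (R - K)


# Toy data (made up): two clusters of two points each in 2-D.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
labels = [0, 0, 1, 1]
centers = [(1.0, 0.0), (10.0, 11.0)]

sigma2 = pooled_variance(points, labels, centers)
print(sigma2)
```

On this toy input every point lies at squared distance 1 from its center, so the estimate is 4 / (4 - 2) = 2.0.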


[1] Pelleg & Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters"

zzk
  • I haven't read that paper, but I know how SPSS computes the BIC clustering criterion in its TwoStep cluster procedure. The logarithms of the sigmas (variances) of all variables are summed. So it isn't "univariate"; however, covariances are not computed or used, which means the variables are assumed to be uncorrelated within each cluster. This is not an uncommon assumption; the famous old Calinski-Harabasz clustering criterion makes it, too. – ttnphns Mar 11 '13 at 06:20
  • ...or, I would rephrase it: not "an uncorrelatedness assumption is made" but rather "the covariance structure is not taken into account" when computing multivariate variability. Really, it is reasonable: whatever the covariances, the overall variability is the same. If you replace the covariance matrix of the variables with the covariance matrix of their principal components, the trace of these matrices is the same. – ttnphns Mar 11 '13 at 06:49
  • Thanks for the reply. Upon reading, I realized my problem is pretty basic... I also flipped my thinking: the question is about multi-dimensional variance, not multivariate. Oops! I edited the question accordingly. – zzk Mar 11 '13 at 06:56
  • Update: I'm assuming that (x_i - u_i) is the Manhattan distance until someone tells me otherwise. – zzk Mar 11 '13 at 23:13
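
The trace argument in the comments above can be checked numerically. The sketch below (hypothetical helper names, made-up 2-D data) sums the per-variable sample variances, which is the trace of the covariance matrix, before and after a rotation. Principal components are one particular rotation of the centered data, so the same invariance applies to them.

```python
import math


def total_variance(rows):
    """Sum of per-variable sample variances (trace of the covariance matrix)."""
    n = len(rows)
    dims = len(rows[0])
    total = 0.0
    for j in range(dims):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        total += sum((v - mean) ** 2 for v in col) / (n - 1)
    return total


def rotate(rows, angle):
    """Rotate 2-D points about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in rows]


# Made-up, strongly correlated 2-D data.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]

before = total_variance(data)
after = total_variance(rotate(data, 0.7))
print(before, after)
```

The two printed totals agree up to floating-point rounding, even though the covariance between the two variables changes under the rotation.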
