
For my dataset of ~19K data points to cluster, I want to use a criterion to choose the number of clusters. BIC (Bayesian Information Criterion) gives too few clusters (~180), while AIC (Akaike Information Criterion) gives too many (~1400). Intuitively, I feel that ~500 clusters would be optimal, putting ~40 data points in each cluster on average. But apparently I need a statistical explanation for choosing ~500. Is there a way to combine AIC and BIC so that we get neither too few nor too many clusters?

I am not asking about choosing one of AIC or BIC over the other. I already know that BIC penalizes the number of free parameters much more than AIC, but based on prior information about my data, I want a penalty that is not as high as BIC's and not as low as AIC's.

I could just select 500 clusters and go ahead, but reviewers of submitted papers always want some statistical justification for the choice of cluster count; that is why I need one.

Here are the formulas that I use for BIC and AIC:

BIC: $-2 \ln(L) + \ln(p) \times k \times n$

AIC: $-2 \times ln(L) + 2\times k\times n$

where

p = the number of data points to cluster
k = the number of clusters
n = the number of dimensions of each data point
L = the likelihood.
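To make the comparison concrete, here is a minimal sketch of these criteria in Python, using the convention (mentioned in the comments below) that for k-means $-\ln(L)$ is the total within-cluster sum of squares. The generalized form with a tunable penalty weight $\lambda$ recovers AIC at $\lambda = 2$ and BIC at $\lambda = \ln(p)$; the intermediate choice $\lambda = \sqrt{2\ln(p)}$ (the geometric mean of the two weights) is purely illustrative, not a standard criterion.

```python
import math

def criterion(wss, p, k, n, lam):
    """Generalized information criterion for k-means-style clustering.

    wss : total within-cluster sum of squares (stands in for -ln(L))
    p   : number of data points
    k   : number of clusters
    n   : number of dimensions per data point
    lam : penalty weight per free parameter;
          lam = 2 gives AIC, lam = ln(p) gives BIC.
    """
    return 2.0 * wss + lam * k * n

def aic(wss, p, k, n):
    return criterion(wss, p, k, n, 2.0)

def bic(wss, p, k, n):
    return criterion(wss, p, k, n, math.log(p))

def mid(wss, p, k, n):
    # Illustrative intermediate penalty: geometric mean of 2 and ln(p).
    return criterion(wss, p, k, n, math.sqrt(2.0 * math.log(p)))
```

For a fixed fit quality (same `wss`), the intermediate criterion always penalizes `k * n` free parameters more heavily than AIC and less heavily than BIC, which is exactly the behaviour asked for; the open question is whether any particular $\lambda$ between the two can be justified to a reviewer.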
user5054
  • For something canned, you might want to consider using ICL (Integrated Completed Likelihood - a classification-like version of BIC) or NEC (Normalised Entropy Criterion). – usεr11852 Jun 30 '16 at 01:23
  • 1
    Why not just choose 500 clusters? – Sycorax Jun 30 '16 at 03:48
  • 1
    Possible duplicate of [Is there any reason to prefer the AIC or BIC over the other?](http://stats.stackexchange.com/questions/577/is-there-any-reason-to-prefer-the-aic-or-bic-over-the-other) – Xi'an Jun 30 '16 at 06:04
  • You are presumably speaking of AIC/BIC _clustering criterions_? Please give their formulas or direct link to them and how they are used! So far it is unclear what you were doing. – ttnphns Jun 30 '16 at 06:43
  • @General Abrial, Xi'an, ttnphns, thanks for the comments. I updated the question accordingly. – user5054 Jun 30 '16 at 06:58
  • But AIC and BIC (original) themselves are not clustering criterions, they can't help choosing the number of clusters. There exist clustering criterions based on AIC or BIC. And I'm asking: bring in their formulas. Show what you are using, please! Display how you compute the number of clusters. – ttnphns Jun 30 '16 at 07:04
  • Updated the question, now have the formulas. – user5054 Jun 30 '16 at 08:19
  • 1
    Have you tried AICc? This is AIC with an extra term to penalise overfitting. https://en.wikipedia.org/wiki/Akaike_information_criterion#AICc – arboviral Jun 30 '16 at 08:26
  • A bit strange formulas. Where did you get them from? `log-likelihood which is the negative of the total intra-cluster sum of squares` Log-likelihood should itself imply a logarithm inside; but you then take logarithm one more time of it. [Here](http://stats.stackexchange.com/q/55147/3277) I gave computation of AIC and BIC clustering criterions as they are computed in TwoStep cluster analysis of SPSS. – ttnphns Jun 30 '16 at 08:35
  • @ttnphns That's right, the inside log is not needed. Corrected.. Thanks for the link! – user5054 Jun 30 '16 at 08:49
  • @GeneralAbrial: That would be a strong prior! :D – usεr11852 Jun 30 '16 at 09:14
  • As explained on stackoverflow.com/questions/15839774/…, for k-means the $-\ln(L)$ term in the BIC and AIC formulas is the k-means objective to minimize, which is the total intra-cluster sum of squares. I think this comes from the Gaussian distribution formula. – user5054 Jun 30 '16 at 10:36

0 Answers