12

I would like to code a kmeans clustering in python using pandas and scikit learn. In order to select the good k, I would like to code the Gap Statistic from Tibshirani and al 2001 (pdf).

I would like to know if I could use inertia_ result from scikit and adapt the gap statistic formula without having to recode all the distances calculation.

Does anyone know the inertia formula used in scikit / know an easy way to recode the gap statistic using high level distance functions?

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
Scratch
  • 754
  • 2
  • 6
  • 17
  • 1
    I think this question has sufficient statistical content to be on-topic for CV, but note that it requires fairly sophisticated programming & Python knowledge as well. It may be difficult to get a good answer. You may want to ask for / be willing to settle for *pseudocode* as well, &/or you may need to split this question up into 2 parts, 1 here about the statistical aspects & 1 part on [Stack Overflow](http://stackoverflow.com/) about the Python programming aspects. (Or maybe not, I don't know for sure, but I just want to give you fair warning; we'll see how it goes.) – gung - Reinstate Monica Dec 02 '13 at 16:49
  • 1
    This question needs the term "inertia" be defined. It looks like its coined within `python`. – ttnphns Dec 02 '13 at 17:21

1 Answers1

9

I guess I found my answer for kmeans clustering:

By looking at the git source code, I found that for scikit learn, inertia is calculated as the sum of squared distance for each point to it's closest centroid, i.e., its assigned cluster. So $I = \sum_{i}(d(i,cr))$ where $cr$ is the centroid of the assigned cluster and $d$ is the squared distance.

Now the formula of gap statistic involves $$ W_k = \sum_{r=1}^{k}\frac 1 {(2*n_r) }D_r $$ where $D_r$ is the sum of the squared distances between all points in cluster $r$.

By introducing $+c$, $-c$ in the squared distance formula ($c$ being the centroid of cluster $r$ coordinates), I have a term that corresponds to Inertia (as in scikit) + a term that disappears if each $c$ is the barycentre of each cluster (which it is supposed to be in kmeans). So I guess $W_k$ is in fact scikit Inertia.

I have still two questions:

  1. Do you think my calculus is correct? (For example, I don't know if it holds for hierarchical clustering.)
  2. If I am correct above, I have coded the gap statistic (as difference of log inertias between estimation and clustering) and it performs badly especially on the iris dataset, has anyone tried it?
Scratch
  • 754
  • 2
  • 6
  • 17
  • 2
    It is best not to pose questions in your answers. If this isn't really the answer to your question, but just a partial solution to clarify the real question, it would be better to edit your question & paste this information in. – gung - Reinstate Monica Feb 04 '14 at 02:16
  • 1
    @Scratch did you ever get a python implementation of the gap statistic to work on the Iris data set? I am struggling with the same issue. – Zelazny7 Feb 25 '14 at 00:23
  • Yes I coded one a few month ago. How can I send you that ? – Scratch Feb 25 '14 at 08:40
  • 1
    Should not be formula be this $$W_k = \sum_{r=1}^{k}\frac {D_r} {(2*n_r) }$$ ? – Biswanath Jun 03 '14 at 15:06