Clustering inertia formula in scikit learn

Question

I would like to code a kmeans clustering in python using pandas and scikit learn. In order to select the good k, I would like to code the Gap Statistic from Tibshirani and al 2001 (pdf).

I would like to know if I could use inertia_ result from scikit and adapt the gap statistic formula without having to recode all the distances calculation.

Does anyone know the inertia formula used in scikit / know an easy way to recode the gap statistic using high level distance functions?

I think this question has sufficient statistical content to be on-topic for CV, but note that it requires fairly sophisticated programming & Python knowledge as well. It may be difficult to get a good answer. You may want to ask for / be willing to settle for *pseudocode* as well, &/or you may need to split this question up into 2 parts, 1 here about the statistical aspects & 1 part on [Stack Overflow](http://stackoverflow.com/) about the Python programming aspects. (Or maybe not, I don't know for sure, but I just want to give you fair warning; we'll see how it goes.) — gung - Reinstate Monica, Dec 02 '13 at 16:49
This question needs the term "inertia" be defined. It looks like its coined within `python`. — ttnphns, Dec 02 '13 at 17:21

Scratch · Accepted Answer · 2014-06-03T16:05:57.403

9

I guess I found my answer for kmeans clustering:

By looking at the git source code, I found that for scikit learn, inertia is calculated as the sum of squared distance for each point to it's closest centroid, i.e., its assigned cluster. So $I = \sum_{i}(d(i,cr))$ where $cr$ is the centroid of the assigned cluster and $d$ is the squared distance.

Now the formula of gap statistic involves $$ W_k = \sum_{r=1}^{k}\frac 1 {(2*n_r) }D_r $$ where $D_r$ is the sum of the squared distances between all points in cluster $r$.

By introducing $+c$, $-c$ in the squared distance formula ($c$ being the centroid of cluster $r$ coordinates), I have a term that corresponds to Inertia (as in scikit) + a term that disappears if each $c$ is the barycentre of each cluster (which it is supposed to be in kmeans). So I guess $W_k$ is in fact scikit Inertia.

I have still two questions:

Do you think my calculus is correct? (For example, I don't know if it holds for hierarchical clustering.)
If I am correct above, I have coded the gap statistic (as difference of log inertias between estimation and clustering) and it performs badly especially on the iris dataset, has anyone tried it?

edited Jun 03 '14 at 16:05

answered Dec 03 '13 at 10:36

Scratch

754
2
6
17

2

It is best not to pose questions in your answers. If this isn't really the answer to your question, but just a partial solution to clarify the real question, it would be better to edit your question & paste this information in. – gung - Reinstate Monica Feb 04 '14 at 02:16
1

@Scratch did you ever get a python implementation of the gap statistic to work on the Iris data set? I am struggling with the same issue. – Zelazny7 Feb 25 '14 at 00:23
Yes I coded one a few month ago. How can I send you that ? – Scratch Feb 25 '14 at 08:40
1

Should not be formula be this $$W_k = \sum_{r=1}^{k}\frac {D_r} {(2*n_r) }$$ ? – Biswanath Jun 03 '14 at 15:06

Clustering inertia formula in scikit learn

1 Answers1