Simple technique to identify number of clusters in dataset

Question

I have a survey app (built with Ruby on Rails), and I need to cluster the responses.

I am using a Ruby library called AI4R, and my code (in case it is useful) looks like the following (example code borrowed from AI4R):

require 'ai4r'
include Ai4r::Data        # provides DataSet
include Ai4r::Clusterers  # provides Diana

# 5 questions on a post-training survey
questions = [   "The material covered was appropriate for someone with my level of knowledge of the subject.", 
                "The material was presented in a clear and logical fashion", 
                "There was sufficient time in the session to cover the material that was presented", 
                "The instructor was respectful of students", 
                "The instructor provided good examples"]

# Answers to each question go from 1 (bad) to 5 (excellent)
# The answers array has one element per completed survey.
# Each completed survey is in turn an array with the answer to each question.
answers = [ 
            [ 1, 2, 3, 2, 2],   # Answers of person 1
            [ 5, 5, 3, 2, 2],   # Answers of person 2
          ]

data_set = DataSet.new(:data_items => answers, :data_labels => questions)

# Let's group answers in 4 groups
clusterer = Diana.new.build(data_set, 4)

This in turn lets me create graphs like this (the survey has questions which are linked to themes/axes).

[Image: example graph of the clustered responses]

The problem is that right now you have to pick (read: guess) the number of clusters to pass into AI4R.

I saw on Wikipedia that there is a technique called the Elbow Method (illustrative picture from Wikipedia),

[Image: elbow plot from Wikipedia]

which compares the number of clusters with the variance that they explain. This technique would be perfect for my needs, but I don't know how to implement it in Ruby (or with pen and paper).

What stats technique can I use to calculate the number of clusters vs the percentage variance that they explain?

Closely related: [Elbow criteria to determine number of cluster](http://stats.stackexchange.com/q/11175/930), [How to tell if data is “clustered” enough for clustering algorithms to produce meaningful results?](http://stats.stackexchange.com/q/11691/930) — chl, Oct 02 '13 at 13:12
"Variance explained" is the between-cluster variance. It is all about ANOVA. Perform one-way ANOVA by factor = cluster variable, for every feature (question). Sum SSerror across the features (this is **W**). Sum SStotal across the features (this is **T**). The proportion of variance explained is **(T-W)/T**. — ttnphns, Oct 02 '13 at 13:36
Do you have a prior on the distribution of the number of clusters? If so you could just pick the number of clusters with the maximum likelihood. — Mimshot, Oct 02 '13 at 14:16

score 2 · Accepted Answer · edited Oct 02 '13 at 13:13

The elbow criterion is comparable to the scree plot method in factor analysis, but unfortunately both are qualitative methods in a certain sense (Horn's parallel analysis is an alternative for PCA). The intuition is the following: any increase in the number of factors or clusters will allow you to "explain" more variance (for cluster analysis, this means decreasing the within-cluster variance relative to the total variance, i.e. the sum of the between-cluster and within-cluster variance). If you had as many clusters as data points, there would be no variance left. The assumption is that any meaningful clusters will be different in the sense that they explain more variance than could be expected from random variation alone, whereas the last clusters will lead to smaller increments in variance explained. The resulting plot would therefore show a point from which on there is a more or less straight line, creating an angle akin to a human elbow. One way to quantify this procedure is to repeat the analysis a large number of times with random data, and plot the obtained pattern together with the average of the random runs. Neither technique will guarantee, of course, that you will discover the "true" cluster structure, if there even is such a thing in the problem that you are looking at.
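The random-data comparison could be sketched in Ruby roughly as follows. This is only an illustration under assumptions of my own: "random data" is generated here by shuffling each question's answers independently across respondents (which keeps each question's distribution but breaks any cluster structure), and `curve_for` is a hypothetical block that you supply yourself. It should run your clustering on a dataset for each candidate number of clusters and return the resulting variance-explained values as an array. Neither `shuffled_columns` nor `random_baseline` is part of AI4R.

```ruby
# Build one random reference dataset by shuffling every column
# (question) independently across rows (respondents).
def shuffled_columns(points, rng)
  columns = points.transpose.map { |col| col.shuffle(random: rng) }
  columns.transpose
end

# Average the curve returned by curve_for over `runs` random datasets.
# curve_for (hypothetical, supplied by the caller) takes a dataset and
# returns an array with one value per candidate number of clusters.
def random_baseline(points, runs:, seed: 42, &curve_for)
  rng = Random.new(seed)
  curves = Array.new(runs) { curve_for.call(shuffled_columns(points, rng)) }
  # Transpose so that each entry collects all runs for one cluster count.
  curves.transpose.map { |values| values.sum / values.size.to_f }
end
```

You would then plot the curve from your real data next to the averaged baseline and look for the number of clusters beyond which the real curve no longer gains noticeably more than the baseline does.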

Basically you will have to calculate the relevant statistics for each cluster solution in the sequence.
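As a minimal sketch of one such statistic, the Ruby below computes the proportion of variance explained by a single cluster solution as (T - W) / T, where T is the total sum of squares and W is the within-cluster sum of squares, both summed over all questions. The method names and the `labels` array (one cluster id per respondent) are assumptions for illustration; how you obtain the cluster membership from AI4R is up to you.

```ruby
# Sum of squared deviations from the column means, over all columns.
def sum_of_squares(rows)
  return 0.0 if rows.empty?
  dims = rows.first.size
  means = (0...dims).map { |j| rows.sum { |r| r[j].to_f } / rows.size }
  rows.sum { |r| (0...dims).sum { |j| (r[j] - means[j])**2 } }
end

# points: array of answer arrays (one per respondent)
# labels: array of cluster ids, same length and order as points
def variance_explained(points, labels)
  total = sum_of_squares(points)                            # T
  within = points.zip(labels)
                 .group_by { |_, label| label }
                 .values
                 .sum { |pairs| sum_of_squares(pairs.map(&:first)) } # W
  return 0.0 if total.zero?
  (total - within) / total
end

# Toy example: two tight groups of two respondents each.
points = [[1, 1], [1, 3], [5, 5], [5, 7]]
labels = [0, 0, 1, 1]
puts variance_explained(points, labels).round(3) # prints 0.889
```

Running this for each candidate number of clusters and plotting the results against the number of clusters gives the curve in which you look for the elbow.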

comment from Iee (transferred from message): I don't know that much about stats (the discipline)...I get that I will have to calculate for each possible cluster number some stats and compare the results. But what are the relevant statistics here, for instance with the elbow method it is the percentage variance explained, but how would I calculate that (a high level algorithm for the easiest technique that will get the job done would be massively appreciated!). — jank, Oct 02 '13 at 13:31
if you use a simple ANOVA with cluster membership as factor you would get the relevant statistics out of that (Eta Squared...) — jank, Oct 02 '13 at 13:32
