I have a large data set with 9000 observations and 25 categorical variables, which I've transformed into binary data. I've performed hierarchical clustering and k-modes clustering on it in R.

library(klaR)

# Fit k-modes for k = 1, ..., 8 and store each result under the name "k.<k>".
cluster <- list()
for (k in 1:8) {
  cluster[[paste0("k.", k)]] <- kmodes(data, k, iter.max = 100)
}

I would like to know

1) if it's better to specify the number of modes k (where the algorithm chooses a random set of distinct rows from the data as the initial modes) or to specify the initial modes myself (give it a set of distinct initial cluster modes in place of k). If the latter, how do you decide on meaningful initial modes? For example, for k=4, can I specify the initial modes to be 4 rows from the hierarchical binary clustering output where I cut the tree at k=4?

2) how many times I should run the algorithm and

3) if 100 iterations is adequate.

PennyR

2 Answers

Do not transform the data into binary for k-modes.

Consider a categorical attribute which has 40% A, 30% B, 30% C.

The mode of that attribute is A.

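If you transform to binary, each indicator column gets its own mode, and all of them are 0: 60% of the rows are 0 in the A column, and 70% are 0 in each of the B and C columns. The resulting mode (0, 0, 0) is neither A, B, nor C.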
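The effect can be checked on a small made-up attribute. The following is only an illustrative sketch in base R: the ten observations and the helper `most_frequent` are assumptions for the example, not part of the question's data or of any package.

```r
# Hypothetical attribute with 40% A, 30% B, 30% C (10 observations).
x <- factor(c(rep("A", 4), rep("B", 3), rep("C", 3)))

# Helper (not from any package): the most frequent value of a vector.
most_frequent <- function(v) names(which.max(table(v)))

# Mode of the original categorical attribute.
mode_categorical <- most_frequent(x)

# Binary (one-hot) encoding: one 0/1 indicator column per level.
binary <- model.matrix(~ x - 1)
colnames(binary) <- levels(x)

# Column-wise mode of the binary encoding.
mode_binary <- apply(binary, 2, most_frequent)

print(mode_categorical)
print(mode_binary)
```

The categorical mode is one of the original levels, while the column-wise mode of the indicators is a row that matches none of them.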

If you don't have a good reason to start with particular centers, use the default.
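If you do want to try the idea from the question, one possible way to derive starting modes from a hierarchical clustering is sketched below. Everything here is an assumption made for illustration: the toy data, the simple matching distance, average linkage, and the choice to use the per-cluster mode of each variable (rather than picking raw rows). As the question notes, the resulting rows could then be supplied to `kmodes` in place of a number, provided they are distinct.

```r
# Toy categorical data (hypothetical): 4 planted groups of 10 rows each.
base_levels <- rep(c("a", "b", "c", "d"), each = 10)
toy <- data.frame(v1 = base_levels, v2 = base_levels,
                  v3 = base_levels, v4 = base_levels,
                  stringsAsFactors = FALSE)
# Perturb one value per group so the groups are not perfectly uniform.
toy$v4[1]  <- "z"
toy$v3[11] <- "z"
toy$v2[21] <- "z"
toy$v1[31] <- "z"

# Simple matching distance: number of attributes on which two rows differ.
mismatch <- Reduce(`+`, lapply(toy, function(col) {
  outer(as.character(col), as.character(col), "!=")
}))

# Hierarchical clustering on that distance, cut into k = 4 groups.
hc <- hclust(as.dist(mismatch), method = "average")
groups <- cutree(hc, k = 4)

# Helper (not from any package): the most frequent value of a vector.
most_frequent <- function(v) names(which.max(table(v)))

# One candidate starting mode per group: the per-variable mode.
init_modes <- do.call(rbind, lapply(split(toy, groups), function(g) {
  as.data.frame(lapply(g, most_frequent), stringsAsFactors = FALSE)
}))

# Initial modes must be distinct before they are used as starting values.
stopifnot(!any(duplicated(init_modes)))
print(init_modes)
```

Note that the full distance matrix grows quadratically with the number of rows, so for 9000 observations this step is noticeably more expensive than on the toy data.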

Choosing k is hard. Good luck! Try, and study the result. Then repeat.

Has QUIT--Anony-Mousse

How you encode categorical data, and how you scale any numeric data, both matter. k-means assumes approximately equally sized clusters, and for ideal results they should be roughly spherical and without holes (so concentric rings, for example, will not work well with k-means).

The value of $k$ will depend on the data (i.e., look at your results!). There are techniques which automatically search for the best $k$, e.g., x-means clustering.

In terms of initialization, consider k-means++: its seeding alone already carries a theoretical guarantee on result quality (in expectation, the cost is within an $O(\log k)$ factor of the optimum), before any further iterations are run.
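For numeric data, the k-means++ seeding step can be written out by hand in a few lines. This is a minimal sketch, assuming squared Euclidean distance; `kmeanspp_init` is a hypothetical helper name (not a base R function), and the simulated data are made up for the example.

```r
# k-means++ seeding, written out by hand for numeric data.
# `kmeanspp_init` is a hypothetical helper, not part of base R.
kmeanspp_init <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  centers <- matrix(NA_real_, nrow = k, ncol = ncol(x))
  # First center: a data point chosen uniformly at random.
  centers[1, ] <- x[sample.int(n, 1), ]
  # Squared distance from each point to its nearest chosen center so far.
  d2 <- rowSums(sweep(x, 2, centers[1, ])^2)
  for (j in seq_len(k)[-1]) {
    # Next center: a data point sampled with probability proportional to d2.
    idx <- sample.int(n, 1, prob = d2)
    centers[j, ] <- x[idx, ]
    d2 <- pmin(d2, rowSums(sweep(x, 2, centers[j, ])^2))
  }
  centers
}

# Simulated numeric data with three loose groups (for illustration only).
set.seed(1)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Seed with k-means++, then hand the centers to base R's kmeans().
init <- kmeanspp_init(x, k = 3)
fit <- kmeans(x, centers = init, iter.max = 100)
print(fit$centers)
```

Because already chosen points have zero sampling weight, the seeds come out distinct as long as the data contain at least k distinct points.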

Generally, one would set the number of iterations to be sufficient for convergence - i.e., the centers move very little from iteration to iteration; this can be determined dynamically.
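With base R's `kmeans`, one way to check this after the fact is to compare the number of iterations actually used against the cap. The sketch below uses simulated data and arbitrary settings (`iter.max = 100`, `nstart = 10`) purely as an illustration; it is not a recommendation for the data in the question.

```r
# Simulated numeric data with two well-separated groups (illustration only).
set.seed(42)
x <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 6), ncol = 2)
)

# Allow up to 100 iterations and keep the best of 10 random starts.
max_iter <- 100
fit <- kmeans(x, centers = 2, iter.max = max_iter, nstart = 10)

# `iter` is the number of iterations used by the returned fit;
# `ifault` is 0 when the algorithm flagged no problem.
converged <- fit$iter < max_iter && fit$ifault == 0

print(fit$iter)
print(converged)
```

If the fit stops well short of the cap, raising the cap further would not change the result; if it hits the cap, a larger `iter.max` is worth trying.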

MotiNK