
During a cluster analysis, how should I go about finding an appropriate number of clusters in my data? I've been experimenting a little with k-means, doing the following:

  1. Run k-means (with m clusters) on my feature set n times; I repeat it n times to try to overcome the algorithm's random outcomes.
  2. Pick the "majority vote" out of the n "cluster votes" to choose the appropriate cluster membership.
  3. Iterate, i.e. repeat the procedure over a range of assumed numbers of clusters in the data.

What are alternatives to the approach sketched above?

Another issue is that I have "ordered, categorical" (ordinal) data in my dataset. I know that this might be a problem with k-means. What are my alternatives algorithm-wise?

Thanks in advance, Andi

ttnphns
A. Neumann
    Search this site and the internet for `clustering criterions`, `cluster analysis validation`, `choose number of clusters`. K-means requires interval-level variables. – ttnphns Jul 12 '16 at 17:30

1 Answer


During a cluster analysis, how should I go about finding an appropriate number of clusters in my data?

What are alternatives to the approach sketched above?

Different clustering techniques follow different rules. k-means procedures seek to minimize the within-cluster sum of squares. Here is an example in R:

http://www.statmethods.net/advstats/cluster.html
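The linked example is in R. As a rough illustration of the same idea, here is a minimal Python sketch using only the standard library. It runs k-means from several random starts for each candidate k, keeps the smallest within-cluster sum of squares, and prints the values so you can look for the "elbow". The toy data, the helper names (`kmeans`, `best_wss`, `make_blobs`) and the defaults (`n_starts=10`, the seeds) are assumptions made for this illustration, not part of any library.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    centers = rng.sample(points, k)  # random start: k distinct observations
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, wss

def best_wss(points, k, n_starts=10, seed=0):
    """Smallest within-cluster sum of squares over several random starts."""
    rng = random.Random(seed)
    return min(kmeans(points, k, rng)[1] for _ in range(n_starts))

def make_blobs(seed=1):
    """Toy data (an assumption for this sketch): three 2-D Gaussian blobs."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [
        (cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
        for cx, cy in centers
        for _ in range(30)
    ]

points = make_blobs()
wss_by_k = {k: best_wss(points, k) for k in range(1, 7)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 1))
```

Keeping the best of several starts plays the same role as repeating the run n times in the question, without having to match cluster labels across runs.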

These techniques are not based on a probability model and often rely on a "best guess" approach. Model-based clustering approaches also exist; one of the best known uses Gaussian mixture models. The mclust package in R takes this approach; here is a reference:

https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html

Unlike k-means, mclust allows models to be compared via the Bayesian Information Criterion (BIC). For a summary of how mclust uses BIC for model selection, see the thread "Mclust model selection".
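To make the BIC idea concrete, here is a minimal Python sketch. It is not mclust and does not reproduce its model families. It fits one-dimensional Gaussian mixtures with unequal variances by EM for several numbers of components and compares them with BIC, using the sign convention in which larger values are better (2 × log-likelihood minus the number of parameters × log n). The toy data, the helper names (`fit_mixture`, `bic`), the deterministic starting values and the variance floor are all assumptions made for this illustration.

```python
import math
import random

def normal_logpdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_sum_exp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def fit_mixture(data, k, n_iter=200, var_floor=1e-3, tol=1e-8):
    """EM for a 1-D Gaussian mixture with k components; returns the log-likelihood."""
    xs = sorted(data)
    n = len(xs)
    # Deterministic start: means from k equal-sized chunks of the sorted data.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    grand_mean = sum(xs) / n
    overall_var = sum((x - grand_mean) ** 2 for x in xs) / n
    variances = [max(overall_var, var_floor)] * k
    weights = [1.0 / k] * k
    loglik = -math.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp, new_loglik = [], 0.0
        for x in xs:
            logs = [
                math.log(weights[j]) + normal_logpdf(x, means[j], variances[j])
                for j in range(k)
            ]
            total = log_sum_exp(logs)
            new_loglik += total
            resp.append([math.exp(v - total) for v in logs])
        converged = abs(new_loglik - loglik) < tol
        loglik = new_loglik
        if converged:
            break
        # M-step: weighted updates; the guards avoid division by zero
        # and variances collapsing onto a single point.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, var_floor)
    return loglik

def bic(loglik, k, n):
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    return 2 * loglik - n_params * math.log(n)  # larger is better

# Toy data (an assumption for this sketch): two well-separated groups.
rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(8, 1) for _ in range(100)]

bic_by_k = {k: bic(fit_mixture(data, k), k, len(data)) for k in range(1, 5)}
best_k = max(bic_by_k, key=bic_by_k.get)
for k, value in bic_by_k.items():
    print(k, round(value, 1))
print("number of components with the largest BIC:", best_k)
```

The penalty term is what lets you compare fits with different numbers of components: adding a component always gives the likelihood room to improve, so the raw log-likelihood alone would tend to favor the largest model.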

Another issue is that I have "ordered, categorical" (ordinal) data in my dataset. I know that this might be a problem with k-means. What are my alternatives algorithm-wise?

Table 9.1 of Cluster Analysis, 5th edition, by Everitt et al. discusses various clustering approaches for various data types. For mixed data types, model-based approaches are a suggested option, so mclust would be a good tool to use in your situation.

Best of luck!

Jon