
I am trying to cluster the data set 'How Americans spend their time' using k-means clustering.

The data set contains education, gender and age range (55-60, 60-65, etc.) as categorical variables. The remaining variables, such as hours spent socializing and relaxing, hours spent shopping and hours spent watching TV, are all integers.

I have converted the categorical variables to dummy variables. The next step is centering and scaling. Should I center and scale the dummy variables along with the numeric variables?
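To make concrete what centering and scaling does to a dummy, here is a minimal sketch using only the Python standard library. The column is made up (10% ones, standing in for something like a rare education level), and `standardized_gap` is a hypothetical helper name, not part of any library. A 0/1 column with share p of ones has standard deviation sqrt(p(1 - p)), so after z-scoring the two levels sit 1 / sqrt(p(1 - p)) apart instead of 1 apart.

```python
import math
import statistics


def standardized_gap(p):
    """Distance between the 0 and 1 levels of a dummy after z-scoring.

    A 0/1 dummy with share p of ones has population standard deviation
    sqrt(p * (1 - p)); dividing by it stretches the unit gap accordingly.
    """
    return 1.0 / math.sqrt(p * (1.0 - p))


# Made-up dummy column: 10 ones and 90 zeros (p = 0.1).
x = [1.0] * 10 + [0.0] * 90

mean = statistics.fmean(x)
sd = statistics.pstdev(x)  # population standard deviation (divides by n)
z = [(v - mean) / sd for v in x]

gap_observed = max(z) - min(z)       # gap measured on the z-scored column
gap_formula = standardized_gap(mean)  # gap predicted from p alone

print(round(gap_observed, 3), round(gap_formula, 3))
```

For p = 0.1 the gap is about 3.33, while for a balanced dummy (p = 0.5) it is exactly 2. So scaling gives rare categories more pull on the Euclidean distances that k-means uses, which is one reason the two preprocessing choices can lead to different clusters.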

I get very different clusters when I center and scale the dummy variables (along with the numeric variables) than when I center and scale the numeric variables only. Which approach should I rely on? My feeling is that I should also center and scale the dummy variables.
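For reference, the comparison can be sketched roughly as follows. This is an illustration in Python with scikit-learn, which is an assumption since the post does not say which software was used. The data are synthetic stand-ins, not the actual survey; the column meanings, the number of clusters (3) and the helper name `cluster` are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the survey (NOT the real data): two numeric
# columns of hours and two 0/1 dummy columns.
hours = rng.gamma(shape=2.0, scale=1.5, size=(n, 2))
dummies = np.column_stack([
    rng.random(n) < 0.5,  # e.g. a gender dummy
    rng.random(n) < 0.1,  # e.g. a rare education level
]).astype(float)

hours_scaled = StandardScaler().fit_transform(hours)

# Approach A: center and scale the numeric columns only; dummies stay 0/1.
X_a = np.hstack([hours_scaled, dummies])
# Approach B: center and scale every column, dummies included.
X_b = np.hstack([hours_scaled, StandardScaler().fit_transform(dummies)])


def cluster(X, k=3):
    """Fit k-means with a fixed seed and return the cluster labels."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)


labels_a = cluster(X_a)
labels_b = cluster(X_b)

# Adjusted Rand index: 1.0 means identical partitions, values near 0 mean
# agreement no better than chance.
ari = adjusted_rand_score(labels_a, labels_b)
print(f"ARI between the two clusterings: {ari:.3f}")
```

The adjusted Rand index only measures how much the two partitions agree. It does not say which preprocessing is the right one.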

kjetil b halvorsen
user3282777
    K-means is not recommended for binary data, and for dummy variables it is simply inappropriate. See http://stats.stackexchange.com/q/174556/3277; http://stats.stackexchange.com/a/81549/3277; http://stats.stackexchange.com/q/148417/3277. – ttnphns Nov 10 '16 at 10:10
    See https://stats.stackexchange.com/questions/140711 for how standardization changes clusters. – whuber Dec 30 '19 at 22:10

0 Answers