I have a large dataset with mixed data types (example):

Age  Price  Town Size           Interests
            Small  Middle  Big  Traveling  Cooking  TV
21   0      1      0       0    1          1        1
34   100    0      1       0    0          1        0
81   200    0      0       1    1          1        0
54   0      0      0       1    1          0        1

I want to perform a hierarchical cluster analysis, but I am not sure which metric to use. I have read that the Gower metric may be suitable (Hierarchical clustering with mixed type data - what distance/similarity to use?), but I want to weight the variables so that age, price, town size, and interests each contribute equally to the final result, rather than having age, small town, and big town treated on the same level. Is the Gower distance the right choice? Is there a Python function that computes the Gower distance with adjustable variable weights? Is there anything else I can do (such as modifying the dataset)?

  • You have 6 variables: age (scale), price (scale), townsize (nominal or ordinal, as you choose), travelling, cooking, tv (these three are binary). This is ok for using Gower. If your function can weight variables - weight them as you like. For example: 1 1 1 .333 .333 .333. – ttnphns Jul 20 '21 at 11:06
  • Convert your three dummies (small, middle, big) into a single categorical column. – ttnphns Jul 20 '21 at 11:08
  • Ok, thank you. I am actually looking for a function in Python where weights can be applied. Yes, I can shrink the Town Size into one, but what about the Interests? I want this variable to appear on the same level as the others. – user327865 Jul 23 '21 at 14:15
  • The town-size columns are dummy variables; they represent a single categorical variable. Interests is a multiple response set: binary variables which you cannot and should not "shrink" into one. – ttnphns Jul 23 '21 at 14:54
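The suggestions in the comments can be sketched in Python roughly as follows. This is a minimal hand-rolled illustration, not a specific library's API: the helper `weighted_gower`, the `TownSize` column, and the choice of average linkage with two clusters are all assumptions made for the example. It also assumes no missing values, treats town size as nominal, and treats the three interest variables as symmetric binary (simple mismatch), each with weight 1/3 so that "Interests" as a whole counts as much as age, price, or town size.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# The four example rows from the question.
df = pd.DataFrame({
    "Age": [21, 34, 81, 54],
    "Price": [0, 100, 200, 0],
    "Small": [1, 0, 0, 0],
    "Middle": [0, 1, 0, 0],
    "Big": [0, 0, 1, 1],
    "Traveling": [1, 0, 1, 1],
    "Cooking": [1, 1, 1, 0],
    "TV": [1, 0, 0, 1],
})

# Collapse the three town-size dummies into one categorical column
# (the column name "TownSize" is hypothetical).
df["TownSize"] = df[["Small", "Middle", "Big"]].idxmax(axis=1)


def weighted_gower(data, numeric_cols, categorical_cols, weights):
    """Pairwise weighted Gower dissimilarity (hypothetical helper).

    Numeric columns contribute |x_i - x_j| / range; categorical and
    binary columns contribute 0 on a match and 1 on a mismatch.
    Assumes there are no missing values.
    """
    n = len(data)
    total = np.zeros((n, n))
    for col in numeric_cols:
        x = data[col].to_numpy(dtype=float)
        value_range = x.max() - x.min()
        scale = value_range if value_range > 0 else 1.0
        total += weights[col] * np.abs(x[:, None] - x[None, :]) / scale
    for col in categorical_cols:
        codes = pd.factorize(data[col])[0]
        total += weights[col] * (codes[:, None] != codes[None, :])
    weight_sum = sum(weights[c] for c in numeric_cols + categorical_cols)
    return total / weight_sum


numeric_cols = ["Age", "Price"]
categorical_cols = ["TownSize", "Traveling", "Cooking", "TV"]
# Weights as suggested in the comments: 1 1 1 .333 .333 .333
weights = {"Age": 1, "Price": 1, "TownSize": 1,
           "Traveling": 1 / 3, "Cooking": 1 / 3, "TV": 1 / 3}

D = weighted_gower(df, numeric_cols, categorical_cols, weights)

# Hierarchical clustering on the precomputed dissimilarity matrix.
# linkage() expects the condensed (upper-triangle) form.
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.round(D, 3))
print(labels)
```

Because the distances are precomputed, only linkage methods that are meaningful for arbitrary dissimilarities (such as average, complete, or single) should be used here; Ward and centroid linkage assume Euclidean distances.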
