Weighting related attributes in hierarchical clustering

Question

I have two questions about the output of hierarchical clustering and how to improve it.

I'm trying to learn more about performing hierarchical clustering in R, so I started looking at a simple dataset I created of sushi rolls at a local restaurant. I went through every roll on the menu and created a distinct list of the union of all ingredients.

Then for each roll I put a 1 if it had that particular ingredient or a 0 if it did not. I then calculated distances based on Jaccard distance and created a dendrogram using four different linkage methods.

So my first question is how to interpret these correctly. Since many rolls share quite a few ingredients, is single-linkage producing clusters that aren't significantly different from each other? Are complete-linkage and the other two methods better at reflecting the diversity in the rolls, where some are significantly different? I'm not sure I really understand the subtleties between these four methods.

[Image: dendrograms of the sushi rolls for the four linkage methods]

My second question is how would one handle related attributes? Say for example there exists {Tuna, White Tuna, Spicy Tuna} as three different ingredients. I think we would agree that the distance between {Tuna,Avocado} & {Salmon, Avocado} should be larger than {Tuna, Avocado} & {Spicy Tuna, Avocado}. Are the attributes typically collapsed into {Tuna} or is there another way to reflect the relationship?

Regarding the different "metaphors" behind the different hierarchical methods, look [here](http://stats.stackexchange.com/a/63549/3277). — ttnphns, Aug 01 '14 at 07:01
I didn't understand what your objects were, i.e. did you compute the distance matrix between the rolls or between the ingredients? — ttnphns, Aug 01 '14 at 07:09
Roll 1 has {Tuna, Avocado}, Roll 2 has {Salmon, Avocado, Cucumber}. The Jaccard distance is 1 - 1/4 = 0.75 (one shared ingredient out of four distinct ones). The distance matrix is computed the same way for every pair of rolls. — ElPresidente, Aug 01 '14 at 13:16
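
The arithmetic in the last comment can be checked with a short sketch. It is written in Python purely for illustration (the question itself uses R), and `jaccard_distance` is a hypothetical helper defined here, not a library function.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two ingredient sets: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention assumed here: two empty rolls are identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


roll_1 = {"Tuna", "Avocado"}
roll_2 = {"Salmon", "Avocado", "Cucumber"}

# One shared ingredient (Avocado) out of four distinct ingredients.
d = jaccard_distance(roll_1, roll_2)
print(d)
```

Applying the same function to every pair of rolls gives the roll-by-roll distance matrix that the clustering is run on.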

score 1 · Answer 1 · answered Aug 03 '14 at 11:59

The different linkages are explained everywhere, including Wikipedia. I'm not going to copy and paste this here...

As for related attributes: often it makes a lot of sense to merge highly related items into one. If you look at text data, stemming does exactly this, merging "go", "going" and "gone" into one word. It's a best practice, but of course sometimes you may not want to do this, just as sometimes you don't want to use the stemmed words.
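
A minimal sketch of the merging idea, using the tuna example from the question. It is in Python for illustration only; the `CANONICAL` mapping and the helper names are assumptions made up for this example, and which ingredients count as "the same" is a judgment call.

```python
# Hypothetical mapping from specific ingredients to a merged attribute.
CANONICAL = {"White Tuna": "Tuna", "Spicy Tuna": "Tuna"}


def collapse(ingredients):
    """Replace each ingredient by its merged name, if it has one."""
    return {CANONICAL.get(item, item) for item in ingredients}


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


tuna = {"Tuna", "Avocado"}
spicy = {"Spicy Tuna", "Avocado"}
salmon = {"Salmon", "Avocado"}

# Without merging, both pairs share 1 of 3 distinct ingredients.
raw_related = jaccard_distance(tuna, spicy)
raw_unrelated = jaccard_distance(tuna, salmon)

# After merging, the two tuna rolls become identical sets.
merged_related = jaccard_distance(collapse(tuna), collapse(spicy))
merged_unrelated = jaccard_distance(collapse(tuna), collapse(salmon))
```

The cost of merging is that the tuna variants become indistinguishable: the two tuna rolls end up at distance 0, which may be more collapse than you want.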

If you keep multiple terms where you know they are highly correlated, you may also want to assign them less weight. So when your attributes are "pink", "rose", "salmon", "green", "blue"; then you may want to assign weights 1/3, 1/3, 1/3, 1, 1 to have less emphasis on the 50 shades of magenta.
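
One possible way to apply such weights is a weighted Jaccard distance, where each attribute contributes its weight instead of a count of 1. This is only a sketch under that assumption, in Python for illustration; the weights below reuse the 1/3 scheme on the question's tuna variants, and `weighted_jaccard_distance` is a hypothetical helper.

```python
# Assumed weights: the three tuna variants share one unit of weight.
WEIGHTS = {
    "Tuna": 1 / 3,
    "White Tuna": 1 / 3,
    "Spicy Tuna": 1 / 3,
    "Salmon": 1.0,
    "Avocado": 1.0,
}


def weighted_jaccard_distance(a, b, weights):
    """1 - (weight of shared attributes) / (weight of all attributes present)."""
    a, b = set(a), set(b)
    # Attributes missing from `weights` default to weight 1.
    total = sum(weights.get(item, 1.0) for item in a | b)
    if total == 0:
        return 0.0
    shared = sum(weights.get(item, 1.0) for item in a & b)
    return 1.0 - shared / total


tuna = {"Tuna", "Avocado"}
spicy = {"Spicy Tuna", "Avocado"}
salmon = {"Salmon", "Avocado"}

# Related pair: 1 - 1 / (1 + 1/3 + 1/3) = 0.4
d_related = weighted_jaccard_distance(tuna, spicy, WEIGHTS)
# Unrelated pair: 1 - 1 / (1 + 1/3 + 1) = 4/7
d_unrelated = weighted_jaccard_distance(tuna, salmon, WEIGHTS)
```

With these weights the two tuna rolls are closer to each other than the tuna roll is to the salmon roll, while the tuna variants still remain distinguishable.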
