
I am trying to build a basic content-based recommender engine based on movie genres; the data set is from MovieLens. Some of the movies have more than two genres.

Part of my code:

 head(genre_matrix2,15)

       Action Adventure Animation Children Comedy
1         0         1         1        1      1
2         0         1         0        1      0
3         0         0         0        0      1
4         0         0         0        0      1
5         0         0         0        0      1
6         1         0         0        0      0
7         0         0         0        0      1
8         0         1         0        1      0
9         1         0         0        0      0
10        1         1         0        0      0
11        0         0         0        0      1
12        0         0         0        0      1
13        0         1         1        1      0
14        0         0         0        0      0

For example, the first movie belongs to Adventure, Animation, Children and Comedy genres.

Because there are 18 genres in total, and some movies share the same genre combinations, I want to run a cluster analysis to group the movies into smaller, more precise categories. For example, if three movies all have the Comedy, Action and Romance genres, they are likely to end up in the same category.

I used hierarchical cluster analysis:

e.dist <- dist(genre_matrix2, method = "euclidean")
h.E.cluster <- hclust(e.dist)                    # default: complete linkage
plot(h.E.cluster)
h.cluster <- hclust(e.dist, method = "ward.D2")  # Ward's method
plot(h.cluster)

[image: dendrograms of the two clusterings]

cut.h.cluster <- cutree(h.cluster, k=5)
table(cut.h.cluster)

   1    2    3    4    5 
1675 1159 1844 2001 2446
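To see what the five clusters actually capture, one option is to profile them by average genre membership. This is only a sketch, assuming `genre_matrix2` is a numeric 0/1 matrix or data frame and reusing `cut.h.cluster` from above:

```r
# Mean genre indicator per cluster: values near 1 mean almost every
# movie in the cluster carries that genre.
cluster.profile <- aggregate(as.data.frame(genre_matrix2),
                             by = list(cluster = cut.h.cluster),
                             FUN = mean)
round(cluster.profile, 2)
```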

Then I used the Elbow Method to check whether this is the optimal number of clusters:

library(factoextra)  # provides fviz_nbclust() and hcut()

fviz_nbclust(genre_matrix2,
             FUNcluster = hcut,   # hierarchical clustering
             method = "wss",      # total within-cluster sum of squares
             k.max = 18
             ) +
  labs(title = "Elbow Method for HC") +
  geom_vline(xintercept = 5,      # candidate k from cutree() above
             linetype = 2)

[image: elbow plot — total within-cluster sum of squares vs. number of clusters]

There's no elbow in this plot. Does that mean I shouldn't do a cluster analysis on this data? Any suggestion is welcome!

Makiyo
  • This sort of question is _frequently_ asked by users inexperienced with cluster analysis (I have seen it many times on this site alone). First, it is very questionable to use Euclidean distance with binary (or any categorical) data. There are a heap of better distance measures specially designed for binary attributes. At the very least you could use Hamming distance, which for binary data is equal to Manhattan distance and to squared Euclidean distance. – ttnphns Sep 13 '17 at 21:51
  • Second, don't use Ward's linkage method with binary data – it is likewise questionable. [Use](https://stats.stackexchange.com/q/195446/3277) average, complete or single linkage. Third, Ward's dendrogram is deceptive ([read here](https://stats.stackexchange.com/a/63549/3277)). I can't say specifically for the R function you used (I'm not an R user), but if the dendrogram is based on squared increments it is visually, falsely "optimistic". For example, your dendrogram does not seem to me to show any cluster structure. – ttnphns Sep 13 '17 at 21:58
  • @ttnphns what advantages do the methods you outline have over Euclidean distance measures for binary data, and is there a consensus on what to use if you have both categorical and numerical data? – Dale C Sep 13 '17 at 23:40
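Following the comments, a sketch of the suggested changes: in base R's `dist()`, `method = "binary"` is the asymmetric binary (Jaccard) distance, and `method = "manhattan"` equals the Hamming distance on 0/1 data; average linkage then replaces Ward's:

```r
# Jaccard (asymmetric binary) distance: ignores shared 0s, which is
# usually what you want for sparse genre indicators.
j.dist <- dist(genre_matrix2, method = "binary")

# Alternatively, Hamming distance (= Manhattan on 0/1 data):
# ham.dist <- dist(genre_matrix2, method = "manhattan")

# Average linkage instead of Ward's, as suggested in the comments.
j.cluster <- hclust(j.dist, method = "average")
plot(j.cluster)
cut.j.cluster <- cutree(j.cluster, k = 5)
table(cut.j.cluster)
```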

2 Answers


In those situations I just use another method to choose the number of clusters: divide the total within-cluster sum of squares by the total sum of squares, and pick enough clusters to keep XX% of the variance.
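A sketch of that rule in R, assuming `genre_matrix2` and the `h.cluster` tree from the question; the 0.80 threshold below is only a placeholder for whatever XX% you choose:

```r
# Total sum of squares around the overall centroid.
tss <- sum(scale(genre_matrix2, scale = FALSE)^2)

# Within-cluster sum of squares for a given cluster assignment.
wss <- function(x, clusters) {
  sum(sapply(split(as.data.frame(x), clusters),
             function(g) sum(scale(g, scale = FALSE)^2)))
}

# Fraction of variance explained for each candidate k.
ks <- 2:18
explained <- sapply(ks, function(k)
  1 - wss(genre_matrix2, cutree(h.cluster, k = k)) / tss)

# Smallest k reaching the chosen threshold (placeholder: 0.80).
k.opt <- ks[which(explained >= 0.80)[1]]
```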


Clustering such data in a meaningful way is difficult. It is much easier to get bad results.

Challenges include:

  1. Extremely low data resolution. Many movies will have only 1–3 genres, i.e. you have only those few bits that differ from zero.
  2. Similarity measure: what is an adequate similarity measure for this data? Intersection? Hamming? Jaccard? Dice? And because of the previous issue, you will get many identical distances, so the result of hierarchical clustering will be very much order-dependent.
  3. Noise. A lot of objects may not belong to a meaningful cluster at all.
  4. Overlapping clusters. E.g. if we have a cluster of "teenie comedy films" it will overlap with the cluster "bad comedy films". If you use a clustering algorithm that does not allow overlap, you won't be able to find such clusters.
  5. Parameterization: how many clusters? Choosing based on the knee, VRC, silhouette, the Dunn index, etc. will likely produce meaningless results because of all the issues above (and even if you solved all of them, parameterization would remain a problem). Don't blindly rely on such heuristics!
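To illustrate points 1 and 2 with made-up genre vectors: with only a few bits set, exact distance ties are common, and ties are what make hierarchical clustering order-dependent:

```r
# Three toy movies over five genres (Action, Adventure, Comedy, Drama, Romance).
a <- c(1, 0, 1, 0, 1)  # Action + Comedy + Romance
b <- c(1, 0, 1, 0, 0)  # Action + Comedy
c <- c(0, 0, 1, 0, 1)  # Comedy + Romance

# Jaccard distance: 1 - |intersection| / |union|.
jaccard <- function(x, y) 1 - sum(x & y) / sum(x | y)

jaccard(a, b)  # 1/3
jaccard(a, c)  # 1/3 -- an exact tie: which pair merges first is arbitrary
```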
Has QUIT--Anony-Mousse