
When computing hierarchical clustering over a data matrix, a dissimilarity matrix is first computed in order to build the tree (dendrogram). For example:

library(pheatmap)
data(iris)

# Make a heatmap
rownames(iris) <- paste0("r", 1:nrow(iris))
p1 <- pheatmap(iris[, 1:4], annotation_row = iris[,5, drop=FALSE])

[Figure: iris data heatmap]

# Explicitly compute the tree and verify it's equal to the one on the plot
dmat <- dist(iris[,1:4])
tree <- hclust(dmat)
identical(tree$height, p1$tree_row$height)

Let's say I am interested in visualizing the (dis)similarities of the observations with respect to other observations, and therefore I plot the heatmap of the distance matrix directly.

[Figure: iris data distance matrix]

Question: is it valid to overlay the tree from the first plot directly on the distance matrix plot, or is that misleading? For example:

# Plot the distance matrix with the previous tree
pheatmap(as.matrix(dmat), 
         clustering_distance_rows = dmat,
         clustering_distance_cols = dmat)

[Figure: distance matrix with clustering of input data]
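Another way to make sure the heatmap carries exactly the tree computed earlier is to pass the `hclust` object itself, since `cluster_rows` and `cluster_cols` accept either a logical value or an `hclust` object. A minimal sketch, assuming a pheatmap version that supports this; the name `p2` and the use of `silent = TRUE` are choices made here for illustration:

```r
library(pheatmap)
data(iris)
rownames(iris) <- paste0("r", 1:nrow(iris))

# Distances between observations, and the tree built from them
dmat <- dist(iris[, 1:4])
tree <- hclust(dmat)

# Reuse the same hclust object for rows and columns of the distance heatmap.
# silent = TRUE only suppresses drawing; the returned object still holds the trees.
p2 <- pheatmap(as.matrix(dmat),
               cluster_rows = tree,
               cluster_cols = tree,
               silent = TRUE)

# The tree stored in the heatmap object should be the one that was passed in
identical(p2$tree_row$height, tree$height)
```

Passing the object avoids relying on pheatmap's default linkage matching the one used in `hclust()`.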

The confusion arises because we could also run hierarchical clustering with the distance matrix itself as the input data (i.e. internally, this would mean computing a distance matrix on the distance matrix), and the resulting tree would be different. I believe that in that case the tree would answer the question "how similar are the observations with respect to their distances to the other observations?", whereas the tree on the input data answers "how similar are the observations with respect to their features?" Is this understanding correct?
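The two analyses contrasted above can be put side by side in base R. This is only a sketch: the names `tree_features` and `tree_profiles` are introduced here for illustration, and the cophenetic correlation is just one possible way to compare two trees.

```r
data(iris)
x <- iris[, 1:4]
rownames(x) <- paste0("r", seq_len(nrow(x)))

# Tree 1: distances between observations computed from their features
dmat <- dist(x)
tree_features <- hclust(dmat)

# Tree 2: treat each row of the distance matrix as a feature vector,
# i.e. compute a distance matrix on the distance matrix
tree_profiles <- hclust(dist(as.matrix(dmat)))

# The merge heights live on different scales, so the trees are not identical
identical(tree_features$height, tree_profiles$height)

# Correlation between the two sets of cophenetic distances
# indicates how similar the two hierarchies are overall
cor(cophenetic(tree_features), cophenetic(tree_profiles))
```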

  • Doing hierarchical cluster analysis of the cases of a cases x features dataset means first computing the cases x cases distance matrix (as you noticed), and the clustering algorithm then runs on that matrix. So it is correct to plot the distance matrix and the resulting dendrogram together. Inputting the distance matrix as a cases x features dataset to cluster is a completely different analysis and, most of the time, an unusual thing to do at all. – ttnphns Mar 12 '19 at 13:32
  • In your first picture, the cluster analysis was run twice. The first run was a clustering of cases (i.e. of the cases x cases distance matrix). The second was a clustering of features (i.e. of the features x features distance matrix). There also exists so-called two-way clustering or biclustering, which runs both ways and gives a kind of "averaged" or "adjusted" result. – ttnphns Mar 12 '19 at 13:37
  • If we input the distance matrix of the original data as cases x features, it would be like doing hclust(dist(dmat)), in which case the output tree would correspond to the distance matrix, i.e. the tree describes the similarity of the objects using that distance matrix as input data. Therefore, if the distance matrix has its own tree, which is different from the tree of the original input data, why is it OK (or not) to overlay the original tree that was computed by running hclust over that distance matrix? To me, it looks like these trees describe different things and aren't interchangeable. – drgxfs Mar 12 '19 at 14:59
  • There exists no "tree of the original input data". Hierarchical agglomerative clustering works on square matrices of distances, so the dendrogram pertains to the distance matrix and, _through_ it, to the data: either to its rows (cases) or to its columns (features). – ttnphns Mar 12 '19 at 15:53
