Questions tagged [hierarchical-clustering]

Hierarchical cluster analysis is a method of cluster analysis that builds, step by step, a hierarchy of clusters, usually drawn as a dendrogram. The most popular variant is agglomerative hierarchical clustering (HAC), which starts from individual objects and merges them into bigger and bigger clusters.
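As a rough illustration of the agglomerative idea described above, here is a minimal sketch in plain Python. It assumes single linkage and made-up 1-D data; the function name `single_linkage` is hypothetical and the naive search is only meant to show the merge order, not to stand in for a library implementation.

```python
from itertools import combinations


def single_linkage(points):
    """Naive agglomerative clustering on 1-D points (illustrative only).

    Returns the merges in order as (members_a, members_b, height).
    """
    # Start with every object in its own cluster (tuples of point indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Single linkage: the distance between two clusters is the smallest
        # distance between any member of one and any member of the other.
        best = None
        for a, b in combinations(clusters, 2):
            d = min(abs(points[i] - points[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges


points = [1.0, 1.5, 5.0, 5.25, 9.0]  # made-up data, assumed for the example
merges = single_linkage(points)
for a, b, height in merges:
    print(a, "+", b, "merged at height", height)
```

Each recorded height is the distance at which two clusters were joined, which is what the vertical axis of a dendrogram shows.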

435 questions
55
votes
3 answers

How to select a clustering method? How to validate a cluster solution (to warrant the method choice)?

One of the biggest issues with cluster analysis is that we may reach different conclusions depending on the clustering method used (including different linkage methods in hierarchical clustering). I would like to know your…
51
votes
2 answers

Choosing the right linkage method for hierarchical clustering

I am performing hierarchical clustering on data I've gathered and processed from the reddit data dump on Google BigQuery. My process is the following: get the latest 1000 posts in /r/politics; gather all the comments; process the data and compute an…
33
votes
3 answers

How to interpret the dendrogram of a hierarchical cluster analysis

Consider the R example below: plot( hclust(dist(USArrests), "ave") ) What exactly does the y-axis "Height" mean? Looking at North Carolina and California (rather on the left). Is California "closer" to North Carolina than Arizona? Can I make this…
Richi W
27
votes
1 answer

Using correlation as distance metric (for hierarchical clustering)

I would like to hierarchically cluster my data, but rather than using Euclidean distance, I'd like to use correlation. Also, since the correlation coefficient ranges from -1 to 1, with both -1 and 1 denoting "co-regulation" in my study, I am…
Megatron
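One common way to treat both -1 and 1 as "close" is to use one minus the absolute correlation as the dissimilarity. The sketch below assumes that choice; it is not necessarily what the asker ended up using, and the helper names `pearson` and `abs_corr_distance` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def abs_corr_distance(x, y):
    """Dissimilarity that is near 0 for r close to +1 or -1, and 1 for r = 0."""
    return 1.0 - abs(pearson(x, y))


x = [1, 2, 3, 4]
y = [-2, -4, -6, -8]   # perfectly anti-correlated with x
z = [1, -1, -1, 1]     # uncorrelated with x

d_xy = abs_corr_distance(x, y)
d_xz = abs_corr_distance(x, z)
```

A matrix of such pairwise values can then be passed to any hierarchical clustering routine that accepts a precomputed dissimilarity matrix.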
21
votes
4 answers

How to understand the drawbacks of Hierarchical Clustering?

Can someone explain the pros and cons of Hierarchical Clustering? Does Hierarchical Clustering have the same drawbacks as K means? What are the advantages of Hierarchical Clustering over K means? When should we use K means over Hierarchical…
21
votes
2 answers

Clustering -- Intuition behind Kleinberg's Impossibility Theorem

I've been thinking about writing a blog post on this interesting analysis by Kleinberg (2002) that explores the difficulty of clustering. Kleinberg outlines three seemingly intuitive desiderata for a clustering function and then proves that no such…
12
votes
1 answer

Hierarchical clustering with categorical variables

Can categorical variables be used in hierarchical clustering? I have heard that only continuous variables are used, but I have also seen people discussing whether categorical variables may be used as well. Can anyone provide insight?
Windstorm1981
11
votes
1 answer

How to interpret dendrogram height for clustering by correlation

Given the following data frame: df <- data.frame(x1 = c(26, 28, 19, 27, 23, 31, 22, 1, 2, 1, 1, 1), x2 = c(5, 5, 7, 5, 7, 4, 2, 0, 0, 0, 0, 1), x3 = c(8, 6, 5, 7, 5, 9, 5, 1, 0, 1, 0, 1), x4 = c(8,…
Waldir Leoncio
11
votes
4 answers

Choosing the number of clusters in hierarchical agglomerative clustering

I have a set of points that I want to cluster into groups according to a number of computed features. I have a distance matrix containing the distances between all pairs of points. I have tried K-Means and DBSCAN first, but since I have no…
Moustafa Alzantot
9
votes
2 answers

Does k-means have any advantages over HDBSCAN except for runtime?

I have recently learned about HDBSCAN (a fairly new method for clustering, not yet available in scikit-learn) and am really surprised at how good it is. The following picture illustrates that the predecessor of HDBSCAN - DBSCAN - is already the only…
Thomas
9
votes
2 answers

What is the interpretation of eps parameter in DBSCAN clustering?

I want to cluster lat-long data such that all clusters formed will have radius <= 1000 meters. Questions: What is the actual meaning of the eps parameter? Please give an example. Will setting eps=1000 serve my purpose if the distance measure is haversine in…
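In DBSCAN, eps is the maximum distance at which two points count as neighbours, not a bound on cluster radius, so density-connected chains can span far more than eps. The sketch below illustrates this with made-up coordinates in plain Python; `haversine_m`, the Earth radius constant, and the sample points are assumptions for the example. It also shows the conversion of eps from meters to radians, which is needed when an implementation's haversine metric works on the unit sphere (scikit-learn's does, as far as I know).

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


eps_m = 1000.0
# eps expressed in radians, for metrics that measure on the unit sphere.
eps_rad = eps_m / EARTH_RADIUS_M

# Made-up points along a meridian, roughly 780 m apart.
chain = [(0.000, 0.0), (0.007, 0.0), (0.014, 0.0), (0.021, 0.0), (0.028, 0.0)]

# Every consecutive pair is within eps, so the points are density-connected...
neighbor_gaps = [haversine_m(a, b) for a, b in zip(chain, chain[1:])]
# ...yet the two ends of the chain are about 3 km apart.
span = haversine_m(chain[0], chain[-1])
```

If a hard cap on cluster extent is really required, complete-linkage hierarchical clustering cut at the desired diameter may be a closer fit than DBSCAN.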
9
votes
2 answers

Does a distance have to be a "metric" for a hierarchical clustering to be valid on it?

Let us say that we define a distance, which is not a metric, between N items. Based on this distance we then use agglomerative hierarchical clustering. Can we use each of the known algorithms (single/maximum/average linkage etc.) to get…
Tal Galili
8
votes
1 answer

Hierarchical clustering of correlation matrix

I have a correlation matrix of size 8,854 × 8,854. The values in the matrix are Pearson correlation coefficients. I want to perform hierarchical clustering and create good-resolution images like the one I have attached. A step-by-step explanation would be…
bsoni
8
votes
2 answers

Can sub-optimality of various hierarchical clustering methods be assessed or ranked?

Classic agglomerative hierarchical clustering methods are based on a greedy algorithm. This means that they (many of them) are prone to give sub-optimal solutions instead of the global optimum result, especially on later steps of agglomeration. To…
ttnphns
8
votes
2 answers

Choosing the number of clusters - clustering validation criteria vs domain theoretical considerations

I often face the issue of having to choose the number of clusters k. The partition I end up choosing is more often based on visual and theoretical concerns than on quality criteria. I have two main questions. The first concerns the general…