Questions tagged [clustering]

Cluster analysis is the task of partitioning data into subsets of objects according to their mutual "similarity," without using preexisting knowledge such as class labels. [Clustered-standard-errors and/or cluster-samples should be tagged as such; do NOT use the "clustering" tag for them.]

Tag Usage

Clustered-standard-errors and/or cluster-samples should be tagged as such; do not use the "clustering" tag for them. Both these methodologies take clusters as given, rather than discovered.

Overview

Clustering, or cluster analysis, is a statistical technique of uncovering groups of units in multivariate data. It is separate from classification (clustering could be called "classification without a teacher"), as there is no units with known labels, and even the number of clusters is usually unknown, and needs to be estimated. Clustering is a key challenge of data mining, in particular when done in large databases.

Although there are many clustering techniques, they fall into several broad classes: hierarchical clustering (in which a hierarchy is built from each unit representing their own cluster up to the whole sample being one single cluster), centroid-based clustering (in which are units are put into the cluster nearest to a specific centroid), distribution- or model-based clustering (in which clusters are assumed to follow a specific distribution, such as multivariate Gaussian), and density-based clustering (in which clusters are obtained as the areas of the highest estimated density).

References

Consult the following questions for resources on clustering:

3772 questions
419
votes
5 answers

How to understand the drawbacks of K-means

K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assumptions, i.e., give me a dataset and a pre-specified number of clusters, k, and I just apply this algorithm which minimizes the sum of…
KevinKim
  • 6,347
  • 4
  • 21
  • 35
328
votes
8 answers

Why is Euclidean distance not a good metric in high dimensions?

I read that 'Euclidean distance is not a good distance in high dimensions'. I guess this statement has something to do with the curse of dimensionality, but what exactly? Besides, what is 'high dimensions'? I have been applying hierarchical…
125
votes
7 answers

Clustering on the output of t-SNE

I've got an application where it'd be handy to cluster a noisy dataset before looking for subgroup effects within the clusters. I first looked at PCA, but it takes ~30 components to get to 90% of the variability, so clustering on just a couple of…
generic_user
  • 11,981
  • 8
  • 40
  • 63
109
votes
7 answers

Detecting a given face in a database of facial images

I'm working on a little project involving the faces of twitter users via their profile pictures. A problem I've encountered is that after I filter out all but the images that are clear portrait photos, a small but significant percentage of twitter…
ʞɔıu
  • 1,107
  • 2
  • 8
  • 5
99
votes
5 answers

What is the relation between k-means clustering and PCA?

It is a common practice to apply PCA (principal component analysis) before a clustering algorithm (such as k-means). It is believed that it improves the clustering results in practice (noise reduction). However I am interested in a comparative and…
mic
  • 3,848
  • 3
  • 23
  • 38
95
votes
6 answers

What is the difference between Multiclass and Multilabel Problem

What is the difference between a multiclass problem and a multilabel problem?
Learner
  • 4,007
  • 11
  • 37
  • 39
90
votes
7 answers

Euclidean distance is usually not good for sparse data (and more general case)?

I have seen somewhere that classical distances (like Euclidean distance) become weakly discriminant when we have multidimensional and sparse data. Why? Do you have an example of two sparse data vectors where the Euclidean distance does not perform…
shn
  • 2,479
  • 9
  • 31
  • 38
88
votes
6 answers

How to tell if data is "clustered" enough for clustering algorithms to produce meaningful results?

How would you know if your (high dimensional) data exhibits enough clustering so that results from kmeans or other clustering algorithm is actually meaningful? For k-means algorithm in particular, how much of a reduction in within-cluster variance…
xuexue
  • 2,098
  • 2
  • 16
  • 11
84
votes
6 answers

Why does k-means clustering algorithm use only Euclidean distance metric?

Is there a specific purpose in terms of efficiency or functionality why the k-means algorithm does not use for example cosine (dis)similarity as a distance metric, but can only use the Euclidean norm? In general, will K-means method comply and be…
curious
  • 971
  • 1
  • 7
  • 7
80
votes
6 answers

Choosing a clustering method

When using cluster analysis on a data set to group similar cases, one needs to choose among a large number of clustering methods and measures of distance. Sometimes, one choice might influence the other, but there are many possible combinations of…
Brett
  • 5,708
  • 3
  • 29
  • 41
73
votes
7 answers

Where to cut a dendrogram?

Hierarchical clustering can be represented by a dendrogram. Cutting a dendrogram at a certain level gives a set of clusters. Cutting at another level gives another set of clusters. How would you pick where to cut the dendrogram? Is there something…
Eduardas
  • 2,239
  • 4
  • 23
  • 22
71
votes
2 answers

Performance metrics to evaluate unsupervised learning

With respect to the unsupervised learning (like clustering), are there any metrics to evaluate performance?
user3125
  • 2,617
  • 4
  • 25
  • 33
69
votes
2 answers

How can an artificial neural network ANN, be used for unsupervised clustering?

I understand how an artificial neural network (ANN), can be trained in a supervised manner using backpropogation to improve the fitting by decreasing the error in the predictions. I have heard that an ANN can be used for unsupervised learning but…
65
votes
9 answers

Clustering with a distance matrix

I have a (symmetric) matrix M that represents the distance between each pair of nodes. For example, A B C D E F G H I J K L A 0 20 20 20 40 60 60 60 100 120 120 120 B 20 0 20 20 60 80 80 80 120 140 140…
yassin
  • 753
  • 1
  • 6
  • 6
61
votes
3 answers

Clustering with K-Means and EM: how are they related?

I have studied algorithms for clustering data (unsupervised learning): EM, and k-means. I keep reading the following : k-means is a variant of EM, with the assumptions that clusters are spherical. Can somebody explain the above sentence? I do…
1
2 3
99 100