Questions tagged [k-means]

k-means is a method to partition data into clusters by finding a specified number of means, k, s.t. when data are assigned to clusters w/ the nearest mean, the w/i cluster sum of squares is minimized

K-means clustering attempts to achieve the following objective:

Given an integer $k$ and a set of $n$ data points in $\mathbb{R}^{d}$, the goal is to choose $k$ centers so as to minimize the total squared distance between each point and its closest center, also known as the within-group sum of squares.

To solve this problem exactly is in fact NP-hard, so instead an approximation algorithm is used:

  1. Choose $k$ initial centroids. The most basic method is to choose $k$ samples from the dataset $X$, although other variations exist.
  2. Assign each data point to its nearest centroid.
  3. Create new centroids by taking the mean value of all of the data points assigned to each previous centroid. Find the difference. This is the within-group sum of squares.
  4. Repeats steps 2. and 3. until the difference is less than a threshold.

Mathematically, k-means attempts to choose centroids that minimize the following objective function:

$$ \sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)$$

where $x_j$ are data points and $\mu_i$ is the $i^{th}$ centroid.

980 questions
419
votes
5 answers

How to understand the drawbacks of K-means

K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assumptions, i.e., give me a dataset and a pre-specified number of clusters, k, and I just apply this algorithm which minimizes the sum of…
KevinKim
  • 6,347
  • 4
  • 21
  • 35
125
votes
7 answers

Clustering on the output of t-SNE

I've got an application where it'd be handy to cluster a noisy dataset before looking for subgroup effects within the clusters. I first looked at PCA, but it takes ~30 components to get to 90% of the variability, so clustering on just a couple of…
generic_user
  • 11,981
  • 8
  • 40
  • 63
120
votes
5 answers

What are the main differences between K-means and K-nearest neighbours?

I know that k-means is unsupervised and is used for clustering etc and that k-NN is supervised. But I wanted to know concrete differences between the two?
nsc010
  • 1,467
  • 2
  • 10
  • 9
99
votes
5 answers

What is the relation between k-means clustering and PCA?

It is a common practice to apply PCA (principal component analysis) before a clustering algorithm (such as k-means). It is believed that it improves the clustering results in practice (noise reduction). However I am interested in a comparative and…
mic
  • 3,848
  • 3
  • 23
  • 38
89
votes
4 answers

How to produce a pretty plot of the results of k-means cluster analysis?

I'm using R to do K-means clustering. I'm using 14 variables to run K-means What is a pretty way to plot the results of K-means? Are there any existing implementations? Does having 14 variables complicate plotting the results? I found something…
88
votes
6 answers

How to tell if data is "clustered" enough for clustering algorithms to produce meaningful results?

How would you know if your (high dimensional) data exhibits enough clustering so that results from kmeans or other clustering algorithm is actually meaningful? For k-means algorithm in particular, how much of a reduction in within-cluster variance…
xuexue
  • 2,098
  • 2
  • 16
  • 11
84
votes
6 answers

Why does k-means clustering algorithm use only Euclidean distance metric?

Is there a specific purpose in terms of efficiency or functionality why the k-means algorithm does not use for example cosine (dis)similarity as a distance metric, but can only use the Euclidean norm? In general, will K-means method comply and be…
curious
  • 971
  • 1
  • 7
  • 7
61
votes
3 answers

Clustering with K-Means and EM: how are they related?

I have studied algorithms for clustering data (unsupervised learning): EM, and k-means. I keep reading the following : k-means is a variant of EM, with the assumptions that clusters are spherical. Can somebody explain the above sentence? I do…
61
votes
2 answers

Are mean normalization and feature scaling needed for k-means clustering?

What are the best (recommended) pre-processing steps before performing k-means?
pedrosaurio
  • 1,283
  • 2
  • 14
  • 19
60
votes
5 answers

Is it important to scale data before clustering?

I found this tutorial, which suggests that you should run the scale function on features before clustering (I believe that it converts data to z-scores). I'm wondering whether that is necessary. I'm asking mostly because there's a nice elbow point…
Jeremy
  • 1,259
  • 3
  • 12
  • 17
57
votes
11 answers

How to decide on the correct number of clusters?

We find the cluster centers and assign points to k different cluster bins in k-means clustering which is a very well known algorithm and is found almost in every machine learning package on the net. But the missing and most important part in my…
petrichor
  • 1,615
  • 2
  • 15
  • 17
49
votes
3 answers

Clustering a long list of strings (words) into similarity groups

I have the following problem at hand: I have a very long list of words, possibly names, surnames, etc. I need to cluster this word list, such that similar words, for example words with similar edit (Levenshtein) distance appears in the same cluster.…
Ufuk Can Bicici
  • 2,028
  • 1
  • 17
  • 26
38
votes
5 answers

Clustering a dataset with both discrete and continuous variables

I have a dataset X which has 10 dimensions, 4 of which are discrete values. In fact, those 4 discrete variables are ordinal, i.e. a higher value implies a higher/better semantic. 2 of these discrete variables are categorical in the sense that for…
36
votes
4 answers

Determine different clusters of 1d data from database

I have a database table of data transfers between different nodes. This is a huge database (with nearly 40 million transfers). One of the attributes is the number of bytes (nbytes) transfers which range from 0 bytes to 2 tera bytes. I would like to…
Shaun
  • 361
  • 1
  • 3
  • 3
36
votes
1 answer

How would PCA help with a k-means clustering analysis?

Background: I want to classify the residential areas of a city into groups based on their social-economic characteristics, including housing unit density, population density, green space area, housing price, number of schools / health centers / day…
enaJ
  • 567
  • 1
  • 6
  • 11
1
2 3
65 66