Questions tagged [k-means]

k-means is a method to partition data into clusters by finding a specified number of means, k, s.t. when data are assigned to clusters w/ the nearest mean, the w/i cluster sum of squares is minimized

K-means clustering attempts to achieve the following objective:

Given an integer $k$ and a set of $n$ data points in $\mathbb{R}^{d}$, the goal is to choose $k$ centers so as to minimize the total squared distance between each point and its closest center, also known as the within-group sum of squares.

To solve this problem exactly is in fact NP-hard, so instead an approximation algorithm is used:

Choose $k$ initial centroids. The most basic method is to choose $k$ samples from the dataset $X$, although other variations exist.
Assign each data point to its nearest centroid.
Create new centroids by taking the mean value of all of the data points assigned to each previous centroid. Find the difference. This is the within-group sum of squares.
Repeats steps 2. and 3. until the difference is less than a threshold.

Mathematically, k-means attempts to choose centroids that minimize the following objective function:

$$ \sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)$$

where $x_j$ are data points and $\mu_i$ is the $i^{th}$ centroid.

980 questions

419

votes

5 answers

How to understand the drawbacks of K-means

K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assumptions, i.e., give me a dataset and a pre-specified number of clusters, k, and I just apply this algorithm which minimizes the sum of…

asked Jan 16 '15 at 04:38

KevinKim

6,347
4
21
35

125

votes

7 answers

Clustering on the output of t-SNE

I've got an application where it'd be handy to cluster a noisy dataset before looking for subgroup effects within the clusters. I first looked at PCA, but it takes ~30 components to get to 90% of the variability, so clustering on just a couple of…

clustering interpretation k-means tsne

asked Feb 23 '17 at 01:39

generic_user

11,981
8
40
63

120

votes

5 answers

What are the main differences between K-means and K-nearest neighbours?

I know that k-means is unsupervised and is used for clustering etc and that k-NN is supervised. But I wanted to know concrete differences between the two?

machine-learning k-means k-nearest-neighbour

asked Apr 18 '13 at 17:15

nsc010

1,467
2
10
9

votes

5 answers

What is the relation between k-means clustering and PCA?

It is a common practice to apply PCA (principal component analysis) before a clustering algorithm (such as k-means). It is believed that it improves the clustering results in practice (noise reduction). However I am interested in a comparative and…

clustering pca k-means

asked Nov 23 '15 at 22:42

mic

3,848
3
23
38

votes

4 answers

How to produce a pretty plot of the results of k-means cluster analysis?

I'm using R to do K-means clustering. I'm using 14 variables to run K-means What is a pretty way to plot the results of K-means? Are there any existing implementations? Does having 14 variables complicate plotting the results? I found something…

data-visualization classification k-means unsupervised-learning

asked Jun 25 '12 at 17:47

JEquihua

3,525
2
24
44

votes

6 answers

How to tell if data is "clustered" enough for clustering algorithms to produce meaningful results?

How would you know if your (high dimensional) data exhibits enough clustering so that results from kmeans or other clustering algorithm is actually meaningful? For k-means algorithm in particular, how much of a reduction in within-cluster variance…

clustering k-means

asked Jun 08 '11 at 00:04

xuexue

2,098
2
16
11

votes

6 answers

Why does k-means clustering algorithm use only Euclidean distance metric?

Is there a specific purpose in terms of efficiency or functionality why the k-means algorithm does not use for example cosine (dis)similarity as a distance metric, but can only use the Euclidean norm? In general, will K-means method comply and be…

clustering k-means distance-functions euclidean

asked Jan 07 '14 at 11:53

curious

votes

3 answers

Clustering with K-Means and EM: how are they related?

I have studied algorithms for clustering data (unsupervised learning): EM, and k-means. I keep reading the following : k-means is a variant of EM, with the assumptions that clusters are spherical. Can somebody explain the above sentence? I do…

machine-learning clustering data-mining k-means expectation-maximization

asked Nov 18 '13 at 11:47

Myna

votes

2 answers

Are mean normalization and feature scaling needed for k-means clustering?

What are the best (recommended) pre-processing steps before performing k-means?

clustering normalization k-means

asked Jan 17 '12 at 09:55

pedrosaurio

1,283
2
14
19

votes

5 answers

Is it important to scale data before clustering?

I found this tutorial, which suggests that you should run the scale function on features before clustering (I believe that it converts data to z-scores). I'm wondering whether that is necessary. I'm asking mostly because there's a nice elbow point…

clustering k-means

asked Mar 12 '14 at 21:27

Jeremy

1,259
3
12
17

votes

11 answers

How to decide on the correct number of clusters?

We find the cluster centers and assign points to k different cluster bins in k-means clustering which is a very well known algorithm and is found almost in every machine learning package on the net. But the missing and most important part in my…

clustering k-means

asked Feb 09 '12 at 14:45

petrichor

1,615
2
15
17

votes

3 answers

Clustering a long list of strings (words) into similarity groups

I have the following problem at hand: I have a very long list of words, possibly names, surnames, etc. I need to cluster this word list, such that similar words, for example words with similar edit (Levenshtein) distance appears in the same cluster.…

clustering k-means pattern-recognition

asked Nov 07 '14 at 10:32

Ufuk Can Bicici

2,028
1
17
26

votes

5 answers

Clustering a dataset with both discrete and continuous variables

I have a dataset X which has 10 dimensions, 4 of which are discrete values. In fact, those 4 discrete variables are ordinal, i.e. a higher value implies a higher/better semantic. 2 of these discrete variables are categorical in the sense that for…

clustering k-means discrete-data continuous-data gaussian-mixture-distribution

asked May 10 '12 at 10:44

ptikobj

votes

4 answers

Determine different clusters of 1d data from database

I have a database table of data transfers between different nodes. This is a huge database (with nearly 40 million transfers). One of the attributes is the number of bytes (nbytes) transfers which range from 0 bytes to 2 tera bytes. I would like to…

clustering k-means

asked Oct 15 '12 at 14:58

Shaun

votes

1 answer

How would PCA help with a k-means clustering analysis?

Background: I want to classify the residential areas of a city into groups based on their social-economic characteristics, including housing unit density, population density, green space area, housing price, number of schools / health centers / day…

clustering pca k-means dimensionality-reduction

asked Jun 18 '15 at 18:25

enaJ

2 3

…

65 66 Next