
Is it OK to use k-means with binary variables? That is, is Euclidean distance appropriate for them? I suspect the binary variables will have the most influence on the result.

Look at the following example:

data <- data.frame(a = c(1, 0, 1, 1), b = c(0.1, 0.2, 0.6, 0.8))
plot(data)
kmeans(data, 2)
## Clustering vector: [1] 1 2 1 1

So the result is determined by the binary variable.

Is there a way to treat binary variables differently? Should I use Manhattan distance for all variables?

gung - Reinstate Monica
GabyLP
  • (1) K-means implies Euclidean distance (only), but it does not work with a matrix of pairwise distances at all. (2) You may do k-means with binary data if fractional means make sense for you. This implies that you treat the data as discretized rather than naturally categorical, or that the mean has the meaning of a proportion for you. – ttnphns Apr 26 '15 at 19:31

1 Answer


K-means uses the mean.

Relevant properties of the mean:

  • minimizes the L2 errors (sum of squares, squared Euclidean distance)
  • is continuous
  • assumes linear data (see below for an example)
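
To make the first two properties concrete, here is a minimal sketch using the binary column from the question. It numerically searches for the center that minimizes the sum of squared errors and compares it with the mean. The helper name `sse` and the search interval are illustrative assumptions, not part of the original answer.

```r
# Binary column `a` from the question's toy data.
x <- c(1, 0, 1, 1)

# Sum of squared errors around a candidate center m (hypothetical helper).
sse <- function(m) sum((x - m)^2)

# One-dimensional numerical search for the minimizer on [0, 1].
best <- optimize(sse, interval = c(0, 1))$minimum

# The minimizer agrees with the mean up to the optimizer's tolerance.
print(c(numerical_minimizer = best, mean = mean(x)))
```

The minimizer is a fractional value, which is not itself a valid binary value. This is the sense in which the mean "is continuous".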

Technically, you can run k-means on binary data, but as you have observed, the algorithm tends to converge to local minima that are determined by a single bit or a few bits.

You can easily provoke the opposite effect, too. Scale your continuous attribute to 10000000 and the algorithm will ignore the binary attributes.
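
A rough sketch of that opposite effect on the question's toy data: the continuous column `b` is multiplied by 1e7, and the grouping then follows `b` alone. The seed and `nstart = 10` are assumptions added only to make the run repeatable. Cluster labels are arbitrary, so only the grouping matters.

```r
# Toy data from the question: one binary and one continuous attribute.
data <- data.frame(a = c(1, 0, 1, 1), b = c(0.1, 0.2, 0.6, 0.8))

# Blow up the continuous attribute by the factor mentioned in the text.
scaled <- transform(data, b = b * 1e7)

set.seed(42)  # k-means starts from random centers (seed is an assumption)
fit <- kmeans(scaled, centers = 2, nstart = 10)

# Rows 1 and 2 (small b) should share a cluster, and rows 3 and 4 (large b)
# should share the other, regardless of the value of the binary column a.
print(fit$cluster)
```

Compare this with the unscaled run in the question, where row 2 (the only row with `a = 0`) was split off on its own.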

K-means assumes that all attributes are equally important; more precisely, that a difference of x has the same importance regardless of the attribute in which it occurs and the absolute values at which it occurs. So the difference of a binary value is as important as the difference of \$0 to \$1 in the price of a burger, or \$9999 to \$10000 when buying a house... If this invariance does not hold for your data, do not use k-means (or preprocess your data until it seems to hold).
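
As an illustration of what preprocessing can change, the sketch below compares how much each attribute can contribute to a squared Euclidean distance before and after z-score standardization with `scale()`. Standardization is only one possible choice and is an assumption here, not something the answer recommends. The helper `max_sq_diff` is hypothetical.

```r
data <- data.frame(a = c(1, 0, 1, 1), b = c(0.1, 0.2, 0.6, 0.8))

# Largest squared difference within one attribute: the most that attribute
# can add to a squared Euclidean distance between two rows.
max_sq_diff <- function(x) diff(range(x))^2

raw_contrib <- sapply(data, max_sq_diff)

# One possible preprocessing (an assumption): z-score each column so that
# both columns have unit standard deviation.
z <- as.data.frame(scale(data))
z_contrib <- sapply(z, max_sq_diff)

print(rbind(raw = raw_contrib, standardized = z_contrib))
```

On the raw scale the binary column can contribute roughly twice as much as `b`. After standardization the two are on a similar footing. Whether that footing is meaningful for your data is exactly the judgment the paragraph above asks you to make.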

Has QUIT--Anony-Mousse
  • `assumes linear data`. In what sense, or how, "linear"? Can you unpack that point? – ttnphns Apr 27 '15 at 06:50
  • As in the money example at the bottom of the answer. A \$1 difference at \$0 is not the same as at \$1000000 in most scenarios. In this sense, income, for example, is not a linear attribute. For an average person, \$1000 is a substantial raise; for a billionaire it is peanuts. – Has QUIT--Anony-Mousse Apr 27 '15 at 12:44