Is it OK to use kmeans with binary variables? More specifically, is Euclidean distance appropriate when some of the variables are binary and others are continuous? I suspect the binary variables will have the most influence on the result.
Look at the following example:
data <- data.frame(a = c(1, 0, 1, 1), b = c(0.1, 0.2, 0.6, 0.8))
plot(data)
kmeans(data,2)
## Clustering vector: [1] 1 2 1 1
So the result is determined by the binary variable.
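To show where I think this comes from, here is a rough sketch that splits the squared Euclidean distance (which is what kmeans minimises within clusters) into the part contributed by each variable. The names d_a and d_b are made up for illustration only.

```r
# Same toy data as in the example above
data <- data.frame(a = c(1, 0, 1, 1), b = c(0.1, 0.2, 0.6, 0.8))

# Pairwise squared Euclidean distances, computed one variable at a time
d_a <- as.matrix(dist(data["a"]))^2  # part due to the binary variable
d_b <- as.matrix(dist(data["b"]))^2  # part due to the continuous variable

# Largest amount each variable can add to any pairwise squared distance
max(d_a)
max(d_b)
```

On this data the binary variable can add up to 1 to a squared distance, while b adds at most about 0.49, which would explain why a drives the split.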
Is there a way to treat binary variables differently? Should I use Manhattan distance for all variables?