I have read some related questions, such as: Why are mixed data a problem for euclidean-based clustering algorithms?, What data structure to use for my cluster analysis or what cluster analysis to use for my data?, K-Means Clustering with Dummy Variables, k means with binary variables, How to use both binary and continuous variables together in clustering?, etc.
I can't quite find a consensus view, or many cited references, on the following questions.
I am using the DBSCAN algorithm to perform cluster analyses. DBSCAN requires numeric data, so I turn categorical variables, such as race, into numeric data by creating dummy-coded binary variables. (I also normalize all continuous variables to the unit interval.)
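Concretely, the preprocessing looks roughly like this (a minimal toy sketch; the column names are made up, and `model.matrix()` is just one of several ways to build the dummies):

# toy data: one continuous and one categorical column (made-up values)
toy <- data.frame(age = c(23, 35, 52, 41), race = factor(c("a", "b", "c", "a")))
# dummy-code the categorical variable: one 0/1 column per level
dummies <- model.matrix(~ race - 1, data = toy)
# min-max normalize the continuous variable to the unit interval
age01 <- (toy$age - min(toy$age)) / (max(toy$age) - min(toy$age))
toy_numeric <- data.frame(age01, dummies)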
When a categorical variable has only two levels, dummy coding creates a pair of perfectly correlated columns. For instance, take gender. Dummy coding this variable creates `gender_male` and `gender_female`. These two are perfectly (negatively) correlated, since by construction $\text{F} = 1 - \text{M}$: $\text{Pr}(\text{F} = 1 \mid \text{M} = 0) = 1$, and likewise $\text{Pr}(\text{F} = 0 \mid \text{M} = 1) = 1$.
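A quick toy check of what I mean (made-up vector):

g <- c("male", "female", "female", "male")
gender_male <- as.numeric(g == "male")
gender_female <- as.numeric(g == "female")
all(gender_female == 1 - gender_male)  # TRUE: mirror images of each other
cor(gender_male, gender_female)        # exactly -1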
In the data I have been using, I generally get one cluster and a number of outliers. I have two questions regarding this:
1. It seems like typical k-means clustering cannot be used with binary variables. Is there an authoritative reference on this, laying out the reasons why? DBSCAN uses the same distance measures as k-means does; it just assigns points to clusters differently. Does this mean that binary variables (i.e., either zero or one) are not useful for DBSCAN, either? If not, how could I best represent categorical data (e.g., gender, race, political affiliation, whether or not people play video games) in clustering algorithms? (One alternative I sketch at the end of this post is replacing Euclidean distance with Gower dissimilarity.)
2. Does having highly related (or perfectly correlated) variables in the data throw off DBSCAN? I know why it is an issue in something like ordinary least squares, but those concerns do not apply to the algorithm being used here. (Since $z = 1 - y$ in the simulation below, keeping both columns should at most double that dimension's contribution to the squared Euclidean distance: $(y_i - y_j)^2 + (z_i - z_j)^2 = 2(y_i - y_j)^2$.) It doesn't appear to change anything, as the results I get below don't shift when I drop the perfectly redundant variable:
library(dbscan)
## generating data -------------------------
set.seed(1839)
n <- 5000
x <- rbeta(n, .4, .4)      # continuous variable, already in [0, 1]
y <- rbinom(n, 1, x)       # binary variable whose probability depends on x
z <- ifelse(y == 1, 0, 1)  # z = 1 - y: perfectly redundant with y
dat <- data.frame(x, y, z)
## selecting eps ---------------------------
kNNdistplot(dat, k = 9)  # 9-NN distance plot for choosing eps
abline(h = .0045)        # at the elbow
## clustering ------------------------------
# with the redundant column
set.seed(1839)
c1 <- dbscan(dat, eps = .0045, minPts = 9)
# without the redundant column
set.seed(1839)
c2 <- dbscan(dat[, -3], eps = .0045, minPts = 9)
## equivalent? -----------------------------
all.equal(c1$cluster, c2$cluster)  # TRUE: identical cluster assignments
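For completeness, here is the kind of alternative I have in mind for question 1 (a sketch only: the eps value is a placeholder that would need its own tuning, since Gower dissimilarities are on a different scale than Euclidean distances). cluster::daisy() handles numeric and categorical columns natively, and dbscan() accepts a precomputed dist object:

library(cluster)
# keep the categorical variable as a factor so daisy() treats it as nominal
dat2 <- data.frame(x = x, y = factor(y))
d <- as.dist(daisy(dat2, metric = "gower"))  # Gower dissimilarity matrix
c3 <- dbscan(d, eps = .05, minPts = 9)       # eps = .05 is only a guess

Whether something like this is preferable to dummy coding is part of what I am asking.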