How to deal with one hot encoded variables in a clustering problem?

Question

I'm using a dataset with customer card transactions to solve a clustering problem.

As a first approach, I'm trying k-means using the R packages NbClust and cluster.

My dataframe is normalized and it contains the following (sample):

as_tibble(full_dataset_log.stand)
# A tibble: 33,215 x 9
   monetary frequency recency_days GENDER_F0 GENDER_F1 GENDER_FNA
      <dbl>     <dbl>        <dbl>     <dbl>     <dbl>      <dbl>
 1   0.292    -1.10         1.02       1.28     -1.28     -0.0325
 2  -2.15     -1.10         0.301      1.28     -1.28     -0.0325
 3  -0.905     1.15        -0.614     -0.782     0.784    -0.0325
 4   0.968     1.77        -0.844     -0.782     0.784    -0.0325
 5   1.90      2.06        -2.15      -0.782     0.784    -0.0325
 6   1.90      2.06        -2.15      -0.782     0.784    -0.0325
 7  -1.10     -0.231       -0.423     -0.782     0.784    -0.0325
 8   1.55      1.77        -0.543     -0.782     0.784    -0.0325
 9   0.0536    0.196        0.0471    -0.782     0.784    -0.0325
10   0.523     0.0808       0.558     -0.782     0.784    -0.0325
# ... with 33,205 more rows, and 3 more variables:
#   GENDER_M0 <dbl>, GENDER_M1 <dbl>, GENDER_MNA <dbl>

This is the code I'm trying with 6 clusters:

k.means.fit_log <- kmeans(full_dataset_log.stand, 6)

My issue is how to deal with the GENDER variables, which have been one-hot encoded:

GENDER_F0
GENDER_F1
GENDER_FNA
GENDER_M0
GENDER_M1
GENDER_MNA

Keeping them as separate variables doesn't seem to make sense, and I was wondering how I can solve this problem.

Originally, the variables were:

GENDER_M: can be 0, 1 or NA
GENDER_F: can be 0, 1 or NA

Now, in this other question I wrote that one-hot encoding these variables didn't work out very well. I tried:

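GENDER_M0: 1 for all the records that contain 0 in column GENDER_M, 0 otherwise
GENDER_M1: 1 for all the records that contain 1 in column GENDER_M, 0 otherwise
GENDER_MNA: 1 for all the records that contain NA in column GENDER_M, 0 otherwise
GENDER_F0: same as GENDER_M0, for column GENDER_F
GENDER_F1: same as GENDER_M1, for column GENDER_F
GENDER_FNA: same as GENDER_MNA, for column GENDER_F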

So, in total, I have 5 possible combinations:

NA/NA
0/0
0/1
1/0
1/1

1 means that the respective gender is present in the buying patterns of the customer. For example, if a customer buys razors repeatedly, he will get a 1 in column GENDER_M.

Thanks for any help, I'm quite new to R and data science!

score 2 · Answer 1 · answered May 01 '19 at 11:37

Obviously this encoding is very badly done.

A better encoding would use just 2 or 3 variables: M, F, maybe "other". NA can be simply encoded by setting neither of them, and I doubt you'll have many records with both or neither.
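
A minimal sketch of the two-variable encoding in R, assuming the original columns are named GENDER_M and GENDER_F as in the question and hold 0, 1 or NA. The data frame `customers`, its values and the helper `na_to_zero` are made up for illustration.

```r
# Hypothetical toy data: the original columns from the question, NA allowed.
customers <- data.frame(
  GENDER_M = c(1, 0, NA, 1, 0),
  GENDER_F = c(0, 1, NA, 1, 0)
)

# Encode NA by setting neither indicator: NA becomes 0, other values are kept.
na_to_zero <- function(x) ifelse(is.na(x), 0, x)

encoded <- data.frame(
  M = na_to_zero(customers$GENDER_M),
  F = na_to_zero(customers$GENDER_F)
)

print(encoded)
```

Note that this merges the NA/NA and 0/0 cases into one, so it only fits if that distinction does not matter for the analysis.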

But in the end, your problem is much more fundamental. It's not about getting k-means to run, but about asking the right question. I assume the question shouldn't be "how does k-means cluster the data if I make all these encoding and preprocessing choices". K-means is a least-squares minimization technique: it attempts to find a good solution to "what is the least-squares reduction of this data to k vectors?". But what good is a least-squares fit on your one-hot variables? It is probably not worth running.
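
To make "least squares" concrete: the quantity k-means minimizes is the total squared distance from each row to its assigned cluster center. The sketch below uses made-up data (the `toy` matrix, the seed and the choice of 2 clusters are assumptions, not the asker's setup) and recomputes that quantity by hand.

```r
set.seed(1)  # arbitrary seed, only so the toy data is reproducible

# Hypothetical toy data: one continuous column and one 0/1 indicator.
toy <- cbind(
  monetary = rnorm(200),
  gender   = rbinom(200, size = 1, prob = 0.5)
)
toy <- scale(toy)  # standardize, as in the question

fit <- kmeans(toy, centers = 2, nstart = 10)

# Recompute the objective: squared distance of each row to its own center.
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
manual_ss <- sum((toy - assigned_centers)^2)

# The two numbers should agree up to floating-point error.
print(c(kmeans = fit$tot.withinss, manual = manual_ss))
```

A standardized 0/1 column takes only two distinct values, so its contribution to this sum depends only on which of the two values a row has.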

In the end, you'll only find that your data is best represented by three clusters: males, females, and NA.

@Anony_Mousse, thanks for your help. Actually, it's not just male, female or NA, please see my edited question above. — Delete my account, May 01 '19 at 18:14
Clearly encoding 5 disjoint cases using 6 variables is redundant, isn't it? Plus, you ignore your domain knowledge about the problem to solve... — Has QUIT--Anony-Mousse, May 02 '19 at 01:00

score 1 · Answer 2 · answered May 01 '19 at 05:18

There are a couple of options for this case:

1. Look at the data dictionary (assuming you have it) and decode the meaning of the encodings.
2. If you do not have the data dictionary AND you did not collect the data AND the data was provided in its current format, THEN decide whether you want this feature or not.

2.1. Assuming you want to keep this feature and you don't have the data dictionary, THEN I think the best guess is to figure out whether the feature values make any sense to you (based on common values etc.).
3. If you collected the data, then I think you should know what the feature encoding values are.
4. If you did not collect the data AND you do not have the data dictionary AND the feature values make no sense to you, THEN discard the feature, BECAUSE a computer is a dumb machine. If you can't interpret the feature values, don't expect any algorithm to do it for you either.
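
One way to check whether the feature values make sense is to count how often each combination occurs. This is a sketch on made-up data; the `customers` data frame and its values are hypothetical, and only the column names come from the question.

```r
# Hypothetical toy data with the question's original columns.
customers <- data.frame(
  GENDER_M = c(1, 0, NA, 1, 0, 1),
  GENDER_F = c(0, 1, NA, 1, 0, 0)
)

# Count every observed combination, keeping NA as its own level.
combos <- table(
  GENDER_M = customers$GENDER_M,
  GENDER_F = customers$GENDER_F,
  useNA = "ifany"
)
print(combos)
```

Combinations that are very rare or that contradict what you know about the data are a hint that the encoding needs a closer look.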
