
I'm doing some PCA on scaled features but where I also have some binary variables.

When I include the binary features they seem to strongly impact the PCA, and I'm concerned they will also disproportionately influence the k-means cluster analysis I'm planning to do afterwards.

library(tidyverse)
library(factoextra)
df <- diamonds %>% select(depth:z) %>% 
  lapply(function(x) {(x - min(x)) / (max(x) - min(x))}) %>% # min-max scale each column to [0, 1]
  as.data.frame()
df$cut <- diamonds$cut
df_small <- df %>% sample_n(1000)           # work on a random sample of 1,000 rows
clust_fact <- df_small$cut %>% factor()     # keep cut as a factor for later use
df_small <- df_small %>% select(-cut)
df.pc <- princomp(df_small)

pc1_cont <- fviz_contrib(df.pc, choice = "var", axes = 1)

When you then enter pc1_cont into the console you see:

[bar chart: contribution of variables (%) to Dim-1]

This is the contribution of each feature to PC1. You can see that price contributes the most to PC1.

Now, if I add a binary feature, look what happens:

df <- diamonds %>% select(depth:z) %>% 
  mutate(carat_binary = if_else(diamonds$carat >= 0.8, 1, 0)) %>% # add binary feature
  lapply(function(x) {(x - min(x)) / (max(x) - min(x))}) %>% as.data.frame() # scale between 0 and 1
df$cut <- diamonds$cut
df_small <- df %>% sample_n(1000)
clust_fact <- df_small$cut %>% factor()
df_small <- df_small %>% select(-cut)
df.pc <- princomp(df_small)


pc1_cont <- fviz_contrib(df.pc, choice = "var", axes = 1)

[bar chart: contribution of variables (%) to Dim-1, with carat_binary the largest]

The new binary feature is shown as having the largest contribution. This echoes what I've found on my own actual data: each time I include a binary variable, it shows up as having the most impact.

I found some posts on here about handling this but struggled to understand the guidance; I was mostly pointed to academic papers.

Is there a conventional approach to 'taking the edge off' binary features in PCA and clustering? In my mind it makes sense that they distort things, since they will always be at the extreme of the scale of the entire data set, 0 or 1.

I was thinking of doing something crude, like recoding the true case as the mean across all scaled numeric variables in my data frame and the false case as, perhaps, the first-quartile value (see the sketch below). But I'm reaching for arbitrary solutions here.
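Roughly, a minimal sketch of that crude idea, assuming the df_small and carat_binary objects from the second code block above (the choice of mean and first quartile is arbitrary):

# Crude remapping: replace the binary's 1s with the overall mean of the
# scaled numeric columns and its 0s with their first quartile (arbitrary choices)
num_cols <- setdiff(names(df_small), "carat_binary")
vals <- unlist(df_small[num_cols])
df_small$carat_binary <- ifelse(df_small$carat_binary == 1,
                                mean(vals),
                                quantile(vals, 0.25))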

Is there a straightforward approach in R for dealing with this?

Doug Fir
  • What problem are you trying to solve? You haven't described why it's a Bad Thing for the binary features to strongly influence PCA. – Sycorax Aug 07 '19 at 21:16
  • I'm trying to prevent my analysis from being too strongly influenced by the addition of 2 binary features out of a total of ~50 features. I'm clustering and I don't want the clusters to be too heavily determined by these binary vars – Doug Fir Aug 07 '19 at 21:39
  • Why are you trying to do this? As far as I can tell, doing so will make your cluster analysis worse. And what is "too heavily"? – Peter Flom Aug 08 '19 at 11:29
  • @PeterFlom when you say 'why are you trying to do this' do you mean why am I adding the two binary features? Are you then suggesting I drop them and don't include them per FR1's answer below? Are you aware of any scaling approaches to still be able to include the binary features? The reason I want to include them is that the stakeholder expressed particular interest in these two features. – Doug Fir Aug 08 '19 at 13:57
  • I'm asking why you want to make them less important. – Peter Flom Aug 08 '19 at 17:14
  • @PeterFlom because it looks like, since they are at the extreme end of the scale by being either 1 or 0, they are distorting my analysis. See my second screenshot above, where the added binary variable becomes the largest contributor to PC1. Similar happened with my real data when I added binary vars. The rest of the features are between 0 and 1, but not many are at or close to either end – Doug Fir Aug 08 '19 at 17:20
  • The impact on PCA is roughly related to the variance of a feature. Have you considered standardizing them first? You'll then get yet another result - which should show how random your final result is when you apply all these scaling and weighting *heuristics*. On such "mixed type" data, you probably can get "any" result you want by choosing preprocessing and weighting, and there is no "right" way; clustering this data is an ill-posed problem. – Has QUIT--Anony-Mousse Aug 10 '19 at 07:11
  • There *is* the hidden underlying assumption in PCA that all features are continuous, linear (price probably is too skewed!) and of equal importance and scale (price and x likely just *cannot* be compared). – Has QUIT--Anony-Mousse Aug 10 '19 at 07:15
  • @Anony-Mousse when you say have I tried standardizing them, do you mean scaling my data? I did that, I scaled all features to be between 0 and 1. It's just that these two binary variables are at the extreme end, one or zero. What did you mean by standardize in this case? – Doug Fir Aug 10 '19 at 17:44
  • Standardization is a different way of scaling: not to [0, 1], but based on the standard deviation (see the sketch after these comments). Note that I'm not saying it will be *better*, just different. The problem is that on your data, your *problem is ill-defined* already. So there is no "right" way. – Has QUIT--Anony-Mousse Aug 10 '19 at 18:19
  • @Anony-Mousse thanks for the suggestion. I will at least see what it looks like after standardizing, without necessarily expecting a 'better' outcome. – Doug Fir Aug 10 '19 at 22:00
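Following up on the standardization suggestion in the comments, a minimal sketch assuming the df_small from the question (z-scoring with base R's scale() instead of min-max scaling; not claimed to be better, just different):

# Standardize each column (mean 0, sd 1) instead of scaling to [0, 1],
# then re-run the PCA and look at the contributions again
df_std <- as.data.frame(scale(df_small))
df_std.pc <- princomp(df_std)
fviz_contrib(df_std.pc, choice = "var", axes = 1)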

1 Answer


I quote your comment: "Is there a conventional approach to 'taking the edge off' binary features in PCA and clustering? In my mind it makes sense that they distort things since they will always be at the extreme of the scale of the entire data set, 0 or 1." ... Exactly, and that is why, in one of my answers, Principle Component Analysis on categorical predictors, I suggested that if there are just a few of them, you leave them aside when computing the PCA and transforming the other features, especially if they are not very correlated with each other and do not need simplification. Look also here for a good discussion of the topic: Doing principal component analysis or factor analysis on binary data.
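A minimal sketch of that suggestion, assuming the objects from the question: run the PCA on the continuous columns only and, if the binary must stay in the analysis, append it to the component scores that feed k-means rather than mixing it into the PCA itself (the number of components and clusters here are arbitrary).

# PCA on the continuous (min-max scaled) columns only
cont_cols <- setdiff(names(df_small), "carat_binary")
pc <- princomp(df_small[cont_cols])

# Keep the binary out of the PCA but add it back for the clustering step
clust_input <- cbind(as.data.frame(pc$scores[, 1:2]),
                     carat_binary = df_small$carat_binary)
km <- kmeans(clust_input, centers = 5, nstart = 25)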

Fr1