
I am looking at this tutorial: Factoextra R Package: Easy Multivariate Data Analyses and Elegant Visualization

Especially the contributions of the variables to the first 2 dimensions:

library(factoextra)   # provides fviz_contrib(); res.pca is a PCA result, e.g. from FactoMineR::PCA() or prcomp()

# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

Can this be used for feature selection, i.e. take all variables with values above the red line?

[Image: fviz_contrib bar plots of variable contributions to PC1 and PC2, with a red dashed line marking the reference contribution]

I would not think so if the first 2 dimensions do not describe much of the variance in the data. Is this correct? However, if the first 2 dimensions describe more than 80% of the variance, maybe it could work?
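
For reference, this is how I would check how much variance the first few dimensions capture before trusting such a cutoff (a minimal sketch; it assumes res.pca is the same PCA object as above and uses factoextra's get_eigenvalue() and fviz_eig()):

library(factoextra)
# Eigenvalue, percentage of variance, and cumulative percentage per dimension
eig.val <- get_eigenvalue(res.pca)
head(eig.val)
# Cumulative variance captured by the first two dimensions
eig.val[2, "cumulative.variance.percent"]
# Scree plot of the percentage of explained variance per dimension
fviz_eig(res.pca, addlabels = TRUE)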

Thanks!

cs0815
  • See https://stats.stackexchange.com/search?q=scree+plot. In light of the literature on the subject (of choosing the number of principal components to use in a PCA), you might find it more constructive to inquire about what ways would be suitable in your case, rather than asking whether your approach might work (to which the answer is maybe, but only if you're lucky). – whuber Nov 06 '20 at 14:36
  • @whuber thanks. I know how to plot a scree plot and choose the top n PCs. I was more wondering whether I can rank the features based on their quality and contribution to n dimensions/PCs. I personally do not think it makes sense to rank features based on, let us say, 2 dimensions if these 2 dimensions only account for 10% of the variance ... – cs0815 Nov 06 '20 at 14:45
  • Ranking doesn't necessarily depend on variance at all. But your question doesn't ask about ranking: it asks about "feature selection." – whuber Nov 06 '20 at 14:46
  • @whuber maybe I need to understand this a bit more. I was thinking about ranking the variables according to their contribution and quality and taking the top n to select "good features". My current task is purely exploratory/unsupervised: given a bunch of rows and 100 variables, find something interesting (clear as mud). If I project the high-dimensional data onto 2 dimensions and plot the variables, I only get a mess of characters ... – cs0815 Nov 06 '20 at 14:51

1 Answer


We have a dedicated thread for that very specific purpose: Using principal component analysis (PCA) for feature selection.

Just a few points regarding the interpretation of those visual displays, and some reflections on the question at hand:

  • This graphical output is a visual aid to see which variables contribute the most to the definition of a principal component. If you have a "PCA" object constructed using FactoMineR::PCA, the variable contribution values are stored in the $var$contrib slot of your object. The contribution is a scaled version of the squared correlation between a variable and the component axis (or of the squared cosine, from a geometrical point of view). The squared cosine is used to assess the quality of the representation of the variables on the principal component, and the contribution is computed as $\cos^2(\text{variable}, \text{axis}) \times 100$ divided by the total $\cos^2$ of the component (see the sketch after this list).

  • It might not always be relevant to select a subset of variables based on their contribution to each principal component. Sometimes a single variable can drive a component (this is sometimes known as a size effect, and it might simply result from a single variable capturing most of the variance along the first principal axis, which would yield a very high loading for that variable and very low loadings for the remaining ones); other times the signal is driven by a few variables in higher dimensions (e.g., past the 10th component); finally, a variable might have a high weight on one component, yet also a weight above your threshold (10%) on another component: does that mean it is more "important" than those variables that only load on (or drive) a single component?

  • It will be hard to cope with highly correlated variables, yet one of the principled approaches to feature selection is to get rid of collinearity (sometimes simply as a side effect of the algorithm itself) by selecting only one variable from each cluster of highly correlated variables.

  • Beware that any arbitrary cutoff (10% for variable contribution, or 80% for the total explained variance) should be motivated by pragmatic or computational arguments.
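
To make these quantities concrete, here is a minimal sketch following the factoextra tutorial setup (the decathlon2 data shipped with the package and FactoMineR's default scaled PCA are assumptions for illustration, not part of the original question):

library(FactoMineR)   # PCA()
library(factoextra)   # get_pca_var(), fviz_contrib(), decathlon2 data

data(decathlon2)
res.pca <- PCA(decathlon2[1:23, 1:10], graph = FALSE)

# Contribution (in %) of each variable to each component,
# also available as res.pca$var$contrib
var <- get_pca_var(res.pca)
head(var$contrib)

# The contributions to a given component sum to 100
colSums(var$contrib)

# Contribution = cos2 * 100 / total cos2 of the component
# (this identity holds for the default scaled PCA)
all.equal(100 * sweep(var$cos2, 2, colSums(var$cos2), "/"),
          var$contrib, check.attributes = FALSE)

# The red dashed line drawn by fviz_contrib() is the expected average
# contribution if every variable contributed equally: 100 / p
100 / nrow(var$contrib)

# Total contribution to the first two components (weighted by their eigenvalues)
fviz_contrib(res.pca, choice = "var", axes = 1:2, top = 10)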

To sum up, this approach to selecting variables might work, whether used in a single-pass algorithm or as a recursive procedure, but it really depends on the dataset. If the objective is to perform feature selection on a multivariate dataset with a primary outcome, why not use techniques dedicated to this task (the Lasso, Random Forests, Gradient Boosting Machines, and the like), since they generally rely on an objective loss function and provide a more interpretable measure of variable importance?
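
If a primary outcome were available, a supervised importance measure would look something like the following sketch (df and the outcome column y are hypothetical placeholders, not available in the unsupervised setting described in the question; the randomForest package is used here):

library(randomForest)
# df: data frame of predictors plus a hypothetical outcome column y
rf <- randomForest(y ~ ., data = df, importance = TRUE)
importance(rf)    # permutation- and impurity-based variable importance
varImpPlot(rf)    # plot both importance measures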

chl
  • Thanks, this all makes sense. I would usually use random forests etc. when I have a target variable, which I do not have in this case. I did some PCA/FAMD and hierarchical clustering using all eigenvalues. Tbh the clusters did not make any sense... – cs0815 Nov 08 '20 at 20:11