I (think I) understand the process of PCA and the advantages it offers in pre-processing data for classification models and lower-dimensional visualisation. I also understand that you can look at each Principal Component and see the loadings of each feature.
Let's say I have a live data set in which I record 100 or so features (i.e. columns) for each sample, and let's say each feature takes about the same time/effort/cost to measure.
I do a PCA and find that 99% of the variance is explained by the first 50 Principal Components. Great, now I can trim my data before running classification models, which saves me time and effort.
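For concreteness, this is roughly the step I mean, as a minimal sketch assuming scikit-learn and a samples-by-features array `X` (the data here is a random stand-in for my ~100-feature set):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))              # placeholder for my real data

X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive
pca = PCA(n_components=0.99)                  # keep enough PCs for 99% of variance
X_reduced = pca.fit_transform(X_scaled)

print(pca.n_components_)                      # e.g. ~50 components in my case
print(pca.explained_variance_ratio_.sum())    # >= 0.99
```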
Now it may be the case that some features have loadings near 0 on all of the first 50 Principal Components, so they are near useless and a waste of my time to measure.
Is there a practical way of detecting these 'useless' features in the PCA? Are there any cut-offs that are usually used? Is this using a sledge-hammer to crack a nut? Should I just use univariate analysis to find useless features?
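To make the question concrete, below is the kind of detection I'm imagining: score each original feature by its squared loadings on the retained components, weighted by each component's explained variance ratio, and flag the lowest scorers. The cut-off here is an arbitrary placeholder, which is exactly the rule of thumb I'm missing (again assuming scikit-learn; the data and the threshold are stand-ins):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))               # placeholder data as above

pca = PCA(n_components=0.99).fit(StandardScaler().fit_transform(X))

loadings = pca.components_                    # shape (n_retained_pcs, n_features)
weights = pca.explained_variance_ratio_       # shape (n_retained_pcs,)

# per-feature score: variance-weighted sum of squared loadings across retained PCs
feature_score = (weights[:, None] * loadings ** 2).sum(axis=0)

# arbitrary placeholder cut-off; this is the sort of rule of thumb I'm asking about
low_threshold = 0.2 * feature_score.mean()
useless = np.flatnonzero(feature_score < low_threshold)
print("candidate 'useless' feature indices:", useless)
```

Is something along these lines sensible, or is there a more principled criterion?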
I understand it is similar to this question: Using principal component analysis (PCA) for feature selection. But the answers there do not give any pragmatic rules of thumb or methods for defining a non-informative feature, nor do they compare such methods to other ways of removing non-informative features.
This Stack Overflow answer explains some methods of comparing feature importance using the Iris dataset, but does not show how one would choose a feature to drop: https://stackoverflow.com/a/50845697/3562522
This example of PCA uses graphical methods to look at the top 40 features in each principal component to gain insights, but again does not attempt to find useless features.