
Context: I have a hyperspectral image (here Indiana Pines) with 200 bands that needs to be reduced to a lower dimension; a genetic search algorithm (GSA) is to be used for this.

What are possible metrics to grade the various dimension reductions?

Work attempted so far:

  • Using KMeans clustering as a measure of how well the distribution is preserved. Problem: KMeans depends heavily on the random_state, and a simple relabelling of the clusters would make a good reduction score poorly;
  • Using the inter-point distance matrix to compare results. Problem: there are $\approx 2 \times 10^4$ points, so the matrix has $\approx 2 \times 10^8$ entries, which is computationally heavy;
  • Fitting an SVM on the data and grading on its accuracy. Problem: fitting and scoring the SVM is again computationally heavy, so it is not a suitable metric for dimension reduction;
  • Fraction of the variance of the data points preserved. Problem: it does not capture all of the information in the data;
  • Computing distances to neighbours and comparing the original vs. reduced-dimensional distance matrices (a minimal sketch of this is below). Problem: the result is not a normalized value.
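Here is a minimal sketch of that last attempt (PCA, `k = 10`, and the `make_blobs` data are just stand-ins for whatever reduction and image the GSA actually works on):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# Stand-in for the flattened hyperspectral image (n_pixels x n_bands).
X, _ = make_blobs(n_samples=2000, n_features=200, random_state=0)

# Placeholder reduction; in practice this is whatever the GSA proposes.
X_red = PCA(n_components=30).fit_transform(X)

k = 10
d_orig, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
d_red, _ = NearestNeighbors(n_neighbors=k).fit(X_red).kneighbors(X_red)

# Discrepancy between neighbour distances; note it is in the (arbitrary)
# units of the spectral bands, i.e. not a normalized value.
print(np.abs(d_orig - d_red).mean())
```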

Any help will be appreciated.

  • Why do you want to preserve all variance, as in your bullet point 4? Dimensionality reduction is achieved by selecting a subset of dimensions that explain a fraction of the original variance, reducing dimensions while still keeping high explanatory power, and we can reverse the transformation: https://stats.stackexchange.com/questions/229092/how-to-reverse-pca-and-reconstruct-original-variables-from-several-principal-com. We do not want to preserve all data. – Patrick Bormann Mar 16 '21 at 11:56
  • @PatrickBormann By preserving data I meant that absolute variance alone is not a good criterion, because the features selected this way will next be fed to a simple SVM to benchmark the feature selections over PCA/ICA/.., and variance preservation does not necessarily result in a good fit, at least as far as I know. If there is such a basis, please correct me. – Girish Srivatsa Mar 16 '21 at 12:23
  • If you mean fitting in terms of accuracy, then this is not always the case, as dimension reduction can also cancel noise that distorted the data upfront. I believe you already googled and came across this article, where an SVD dimension reduction is fed to an SVM and accuracy increases because of the dimension reduction: https://blogs.oracle.com/r/using-svd-for-dimensionality-reduction – Patrick Bormann Mar 16 '21 at 12:32
  • Yes, because we had a question where we are supposed to use genetic search algorithms to perform dimension reduction. Most research papers split this into two parts, feature selection and feature extraction, but for either one I need a robust and fast scoring method for the population. That was the primary focus of this question. – Girish Srivatsa Mar 16 '21 at 12:43
  • @PatrickBormann I could find no canonical methods or literature providing a way to broadly compare various feature selection/extraction algorithms. This question is about that. – Girish Srivatsa Mar 16 '21 at 12:47
  • Feature selection itself can be achieved in several ways. I would recommend the sklearn docs, where they mention this article: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.24.4369&rep=rep1&type=pdf, but I believe it is not done with genetic algorithms, because this depends on your fitting criterion; the GA I wrote selects features and parameters at random while saving only the best combination. Have you considered a GA that fits an extraTreeRegressor or ExtraTreeClassifier, quick and dirty, as the fitting criterion for your feature selection? – Patrick Bormann Mar 16 '21 at 13:38
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/120919/discussion-between-girish-srivatsa-and-patrick-bormann). – Girish Srivatsa Mar 16 '21 at 13:40

2 Answers


The reason you haven't found a canonical answer to this question is that the best measure of the efficacy of a dimension reduction is the measure of whatever you will ultimately use the reduced dimensions for. When fitting a model in machine learning or statistics, feature engineering and dimension reduction are part of the model, so their efficacy is judged on the same metrics a model is judged on (MSE, MAPE, likelihood, etc.). If you are reducing dimensions to plot them and make a decision, then the efficacy of the dimension reduction should be judged on how well it enables the decision of interest (power, minimum detectable effect, etc.). I recommend that your next step be to define precisely what the data will ultimately be used for, and to use that context to determine which method of dimension reduction works best.
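As a rough illustration (not the asker's setup; PCA and LinearSVC are just placeholders for whatever reduction and downstream model are being compared), the reduction can be scored by the cross-validated metric of the pipeline it sits in:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stand-in data; replace with the image pixels (X) and ground-truth labels (y).
X, y = make_classification(n_samples=1000, n_features=200, n_informative=30,
                           random_state=0)

# The reduction is part of the model, so it is scored with the model's metric.
pipe = make_pipeline(PCA(n_components=30), LinearSVC(max_iter=5000))
score = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
print(score)  # efficacy of PCA(30) for this particular task
```

Any other reduction under consideration can be dropped into the same pipeline and scored identically.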

R Carnell
  • Yes, that is true, but we were tasked with building a genetic search for dimension reduction. If you see the chat with Patrick, you will see that I have attempted most model-fitting criteria, which have the problem that the time is ~50000 ms. HistGradientRegressor is faster (~5000 ms) but scores PCA/SVD below the original image. – Girish Srivatsa Mar 17 '21 at 02:31
  • Also, as mentioned in the question, it is a hyperspectral image (Indiana Pines to be specific), and the model-based scoring methods would only be practical for small populations, which are difficult to trust. So I wanted to use some unsupervised scorers to first get a good population, followed by a model-based scorer. Shouldn't this perform better? – Girish Srivatsa Mar 17 '21 at 02:35
  • I’m not sure how your task relates to the question you asked here. If your task is dimension reduction using a genetic algorithm, then maybe this paper is useful. https://ieeexplore.ieee.org/document/850656. The paper is in the context of the accuracy of a classification model which is the objective of the feature reduction. Are you trying to classify pine trees from the spectral images? – R Carnell Mar 17 '21 at 02:42
  • In a way, yes. The task was to perform dimension reduction of hyperspectral images; there is no guarantee of ground truths, but even assuming ground truths are available, as in this case, the time required to train a model and score accuracy is high and restricts us to small population sizes and few generations. – Girish Srivatsa Mar 17 '21 at 02:55
  • Regarding the paper, I have viewed it and attempted such methods, but as mentioned, runtime is a constraint. So I thought of using efficient unsupervised methods to generate a "good" population, to reduce my dependency on the random initial population of the supervised methods. – Girish Srivatsa Mar 17 '21 at 02:57

Adding to the excellent answer by @RCarnell, I wanted to note that there are different approaches to dimensionality reduction with different levels of generality. The great thing about PCA is that it allows you to get rid of useless information if you know your signal-to-noise ratio. Namely, components whose eigenvalues are significantly smaller than the noise level may be discarded, as any information they may have contained is already corrupted beyond recognition. The same philosophy applies to other techniques from this family, such as ICA, FA, and NMF.
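A minimal sketch of this rule, assuming the per-band noise variance is known or estimated elsewhere (the factor-of-2 margin and the synthetic data are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 200))               # noise with variance 1 per band
X[:, :10] += 5 * rng.normal(size=(2000, 10))   # a few strong "signal" directions

sigma2 = 1.0                                   # assumed per-band noise variance
pca = PCA().fit(X)

# Keep only components whose eigenvalues sit clearly above the noise floor
# (the factor of 2 is an arbitrary safety margin).
keep = pca.explained_variance_ > 2 * sigma2
X_red = pca.transform(X)[:, keep]
print(f"kept {keep.sum()} of {keep.size} components")
```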

However, if you have a noise-less data structure and you can't guarantee that small changes in the structure won't result in big changes in classification, then this route is closed. One approach could be to try to find precise symmetries in the data. It is important to represent the data in a sensible way: split categorical data into separate dimensions, including categorical data hidden in floating-point values. Generally, expanding the data into many dimensions, each as simple as possible, before applying a dimensionality-reduction procedure is a good way to go. Further, nonlinear dimensionality-reduction techniques can be used to hunt for the intrinsic dimensionality. Things like kernel PCA and even total correlation come to mind (a small kernel PCA sketch follows).
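A minimal kernel PCA sketch on a toy nonlinear dataset (the RBF kernel and its gamma are illustrative, untuned choices):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: no linear projection separates them.
X, _ = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# An RBF kernel PCA "unfolds" the nonlinear structure; on real data the
# kernel and its parameters would have to be chosen for the task at hand.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_red = kpca.fit_transform(X)
print(X_red.shape)
```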

But, as @RCarnell said, the only thing that eventually matters is whether any given feature is actually predictive. Without prior knowledge, the only way to know is to check.

Aleksejs Fomins