
My Problem: I'm trying to classify data into two groups, A and B, based on 25 observations (data points) and 100 features. I used a Gradient Boosting Machine (GBM) to find out which features contribute most to classifying the data points into group A or B, and extracted the top 4 most important features; call them X1, X2, X3, and X4. Now I'd like to plot all pairwise combinations of these 4 features against each other as scatter plots and see whether the data points belonging to category A or B cluster together in certain regions. For example, I plot X2 versus X1 as a scatter plot and use KMeans to identify clusters. The problem is that KMeans did a pretty bad job on each of those pairwise scatter plots (i.e., scatter plots of Xi vs. Xj, 1 <= i, j <= 4).
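A minimal sketch of this workflow using scikit-learn, with synthetic placeholder data standing in for the 25×100 matrix (the names `X`, `y`, and the random data are illustrative, not your actual dataset):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import KMeans

# Synthetic stand-in for the 25-observation, 100-feature dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 100))
y = rng.integers(0, 2, size=25)  # groups A/B encoded as 0/1

# Fit the GBM and rank features by importance
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)
top4 = np.argsort(gbm.feature_importances_)[::-1][:4]

# Run KMeans (k=2, matching the two groups) on each pair of top features
for i in top4:
    for j in top4:
        if i < j:
            labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, [i, j]])
```

With real data you would compare `labels` against the known A/B membership on each scatter plot, e.g. with `sklearn.metrics.adjusted_rand_score`.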

My Approach: Given the bad KMeans clustering results, I decided to apply PCA to these 4 features and reduce the dimensionality further to two variables, X'1 and X'2. I then ran KMeans again on these two variables, and the clustering results improved significantly.
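That step can be sketched as follows, again with placeholder data standing in for the extracted X1–X4 (the standardization step is an assumption on my part; it is common practice before PCA but not stated in the question):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_top4 = rng.normal(size=(25, 4))  # stand-in for the columns X1..X4

# Standardize so that no single feature dominates the variance, then
# project onto the two directions of highest variance (X'1, X'2)
Z = StandardScaler().fit_transform(X_top4)
pcs = PCA(n_components=2).fit_transform(Z)

# Cluster in the reduced 2-D space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pcs)
```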

My Question: Does it make sense to use PCA on the top 4 features extracted from GBM (i.e., X1, X2, X3, and X4)? I'm wondering because we already reduced the dimensionality from 100 to 4 by using GBM and extracting the 4 most important features. Does it make sense to apply PCA and reduce those 4 features further to the two variables with the highest variance? If your opinion is no, can you give me a precise explanation and possibly some references? If yes, an explanation and references would likewise help me a lot.

  • Have you tried using PCA on all of the features? If so, are the results the same? – Carl Nov 13 '18 at 23:00
  • @Sycorax Thanks for the comment! Can you elaborate a bit more on the similarity between my problem and the linked answer, perhaps in an answer? In the linked question the OP asked about using linear regression after a random forest, but my problem is built on GBM. – Mithridates the Great Nov 19 '18 at 17:08
  • @Sycorax Really good point! I think it's worth posting as an answer so it gets better visibility for future users. – Mithridates the Great Nov 19 '18 at 17:34

1 Answer


You're using so-called "feature importance" metrics derived from a tree-based model to determine the inputs to a linear method, just as in the linked post. Using a linear method doesn't make much sense when there are relevant nonlinear terms in the phenomenon you are studying; to a linear method, the nonlinearity is "invisible." By using GBM, your thesis is that the outcome is a nonlinear function of the inputs and is well-approximated by a tree structure. But by using PCA to tease out subpopulations, your thesis is that you just need to rotate the data. These are not consistent.
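The "just a rotation" point can be verified directly: a full PCA of centered data is an orthogonal transformation, so it cannot create separation that a linear view of the data does not already contain. A minimal numerical check (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 4))
Xc = X - X.mean(axis=0)  # PCA operates on centered data

pca = PCA(n_components=4).fit(Xc)
V = pca.components_  # rows form an orthonormal basis

# The component matrix is orthogonal, and rotating into the PC basis
# and back recovers the centered data exactly: PCA loses nothing and
# adds nothing, it only re-expresses the same linear structure.
assert np.allclose(V @ V.T, np.eye(4))
assert np.allclose(Xc @ V.T @ V, Xc)
```

Truncating to the top two components then simply discards the low-variance directions of that rotation; it does not recover any nonlinear structure the GBM may have exploited.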

A cousin to this problem (replacing gradient-boosted trees with random forest, and replacing PCA with a GLM), with an answer elaborating about how linear and nonlinear models can give different answers to the same question, can be found in Can a random forest be used for feature selection in multiple linear regression?

Sycorax