My Problem: I'm trying to classify data into two groups, A and B, based on 25 observations (data points) and 100 features. I used a Gradient Boosting Machine (GBM) to find out which features contribute most to classifying the data points into group A or B, and extracted the top 4 most important features; call them X1, X2, X3, and X4. Now I'd like to plot all pairwise combinations of these 4 features against each other as scatter plots, to see whether the data points belonging to category A or B cluster together in certain regions. For example, I plot X2 versus X1 as a scatter plot and use KMeans to identify clusters. The problem is that KMeans did a pretty bad job on every one of those pairwise scatter plots (i.e. scatter plots of Xi vs. Xj for 1 <= i, j <= 4).
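Here is a minimal sketch of this step, assuming scikit-learn; the X and y below are random placeholders standing in for my real 25 x 100 dataset and A/B labels, and the hyperparameters are not necessarily the ones I used:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 100))     # placeholder: 25 observations, 100 features
y = rng.integers(0, 2, size=25)    # placeholder: group A = 0, group B = 1

# Fit the GBM and take the 4 features with the largest impurity-based importance
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)
top4 = np.argsort(gbm.feature_importances_)[-4:][::-1]   # indices of "X1..X4"

# Run 2-cluster KMeans on every pairwise scatter of the top 4 features
for a in top4:
    for b in top4:
        if a < b:
            pair = X[:, [a, b]]
            labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pair)
```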
My Approach: Since KMeans clustered the points so poorly, I decided to apply PCA to these 4 features and reduce the dimensionality further, down to two variables X'1 and X'2. Then I ran KMeans again on these two components, and the clustering results improved significantly.
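Concretely, continuing from the sketch above (again with placeholder data; standardizing before PCA is my assumption here, not a settled choice):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# X and top4 come from the previous snippet
X_top4 = X[:, top4]                                # the 4 GBM-selected features
X_std = StandardScaler().fit_transform(X_top4)     # zero mean, unit variance
X_pca = PCA(n_components=2).fit_transform(X_std)   # columns are X'1 and X'2

# KMeans on the two principal components instead of the raw feature pairs
labels_pca = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_pca)
```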
My Question: Does it make sense to use PCA on those top 4 features extracted from the GBM (i.e. X1, X2, X3, and X4)? I'm wondering because we already reduced the dimensionality from 100 to 4 by using the GBM and extracting the top 4 most important features. Does it make sense to use PCA to further reduce those 4 features to the two components with the highest variance? If your answer is no, could you give a precise explanation or some references? If it's yes, an explanation and references would also help me a lot.