Is it okay to use AdaBoost for feature selection (selecting a subset of dimensions $S$ from a high-dimensional feature vector $V$)? I divided the samples into four non-overlapping sets: $A$ (training 1), $B$ (validation), $C$ (training 2), $D$ (testing). There are two possible procedures for AdaBoost feature selection:
Procedure 1:
i. Run AdaBoost, training on $A$ and validating on $B$, to determine a good subset of feature dimensions $S$ from the high-dimensional feature vector $V$.
ii. Using only the low-dimensional features $S$ (a subset of $V$), train an SVM classifier on $C$ and evaluate it on $D$ (sketched below).
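Here is roughly what I mean, as a minimal sketch assuming scikit-learn; the synthetic data, the candidate sizes for $S$, and using AdaBoost's `feature_importances_` as the selection criterion are illustrative choices, not a fixed recipe:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

# Stand-in data: 500-dimensional V, split into A, B, C, D.
X, y = make_classification(n_samples=400, n_features=500,
                           n_informative=20, random_state=0)
X_A, y_A = X[:100],    y[:100]     # training 1 (for AdaBoost)
X_B, y_B = X[100:200], y[100:200]  # validation
X_C, y_C = X[200:300], y[200:300]  # training 2 (for the SVM)
X_D, y_D = X[300:],    y[300:]     # testing

# Step i: rank the dimensions of V by AdaBoost (decision stumps by
# default), trained on A; use B to choose how many top features to keep.
booster = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X_A, y_A)
ranking = np.argsort(booster.feature_importances_)[::-1]

best_k, best_acc = None, -np.inf
for k in (10, 20, 50, 100):  # candidate sizes for S (illustrative)
    idx = ranking[:k]
    acc = SVC().fit(X_A[:, idx], y_A).score(X_B[:, idx], y_B)
    if acc > best_acc:
        best_k, best_acc = k, acc
S = ranking[:best_k]

# Step ii: train the final SVM on C with only the selected features S,
# then evaluate once on the held-out test set D.
svm = SVC().fit(X_C[:, S], y_C)
print("test accuracy on D:", svm.score(X_D[:, S], y_D))
```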
Procedure 2:
i. Run AdaBoost, training on $A$ and validating on $B$, to determine a good subset of feature dimensions $S$ from the high-dimensional feature vector $V$.
ii. Using only the low-dimensional features $S$ (a subset of $V$), train an SVM classifier on $A$ and evaluate it on $D$ (see below).
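Continuing the sketch above, Procedure 2 differs only in step ii:

```python
# Procedure 2, step ii: reuse A (the set AdaBoost was trained on)
# for the final SVM, instead of the fresh set C.
svm = SVC().fit(X_A[:, S], y_A)
print("test accuracy on D:", svm.score(X_D[:, S], y_D))
```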
Procedure 1 sounds more rigorous, but in practice it doesn't work: the selected subset $S$ is correlated with the training set $A$, so if you train on the different set $C$, $S$ is no longer good. It just behaves like a random subset of $V$.
So, is Procedure 2 appropriate? Is AdaBoost suitable for this task? Are there better ways to discard bad features?