How to determine significant subgroups of data inputs?

Question

I have a large $(10000 \times 5001)$ table representing $10000$ samples and $5001$ different features of these samples. One of these features represents an output variable of each sample. In other words, I have $5000$ input variables and one output variable for each sample.

I know that most of these inputs are irrelevant. Therefore, what I would like to do is determine the subset of input variables that predicts the output variable best. What is the best/simplest way to go about doing this in R?

To all reviewers: please note that the answers in the duplicate question are NOT limited to binary classification; they can be applied to multiclass or regression problems as well. – mlwida, Apr 30 '13 at 09:05

score 0 · Answer 1 · answered Jun 03 '12 at 12:45

What people typically do is test the correlation between each feature and the response, compute and save the p-values, sort them, and then drop everything except the small percentage of features with the lowest p-values. Don't take these p-values seriously: this is only intended as a quick screening device. Once you are down to a small enough number of features, you can go ahead with the standard variable selection techniques used in regression.
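As a rough illustration of that screening step in R, here is a minimal sketch. Everything in it is an assumption made for the example: the data are simulated (smaller than the table in the question, for speed), the names `X`, `y`, `keep_fraction` and `screened` are hypothetical, and the 5% cutoff is arbitrary. The final stage uses backward stepwise selection by AIC via `step()` purely as one example of a regression variable selection method; no specific method is named above.

```r
# Simulated stand-in for the real table: n samples, p inputs, only 3 relevant.
set.seed(1)
n <- 1000
p <- 200
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("x", seq_len(p))
y <- 2 * X[, 1] - 1.5 * X[, 2] + X[, 3] + rnorm(n)

# Screening: p-value of the Pearson correlation test between each input and y.
p_values <- apply(X, 2, function(col) cor.test(col, y)$p.value)

# Keep only the small fraction of inputs with the lowest p-values.
keep_fraction <- 0.05                      # arbitrary cutoff for illustration
n_keep <- ceiling(keep_fraction * p)
screened <- names(sort(p_values))[seq_len(n_keep)]

# Follow-up on the survivors: backward stepwise selection by AIC
# (one possible choice of method, not the only one).
screened_data <- data.frame(y = y, X[, screened, drop = FALSE])
full_fit <- lm(y ~ ., data = screened_data)
selected_fit <- step(full_fit, direction = "backward", trace = 0)
selected <- names(coef(selected_fit))[-1]  # drop the intercept
```

With a real table you would replace the simulated `X` and `y` with your own input columns and output column. The screening p-values only rank the features; they should not be reported as evidence of significance, since thousands of tests are run and the survivors are then reused for model fitting.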

Michael R. Chernick

And could you please say what standard variable selection techniques are? – Olga Feb 07 '13 at 15:55
