I have a large $(10000 \times 5001)$ table representing $10000$ samples and $5001$ different features of these samples. One of these features represents an output variable of each sample. In other words, I have $5000$ input variables and one output variable for each sample.

I know that most of these inputs are irrelevant. Therefore, what I would like to do is determine the subset of input variables that predicts the output variable best. What is the best/simplest way to go about doing this in R?

chl
Coder
  • To all reviewers: please note that the answers in the duplicate question are NOT focused on binary classification; they can be applied to multiclass or regression problems as well. – mlwida Apr 30 '13 at 09:05

1 Answer

What people typically do is test the correlation between each feature and the response, then compute, save, and order the p-values, and drop everything except the small percentage of features with the lowest p-values. Don't take the p-values seriously; this is only intended as a quick screening device. Once you are down to a small enough number of features, you can go ahead with the standard variable selection techniques used in regression.
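As a rough illustration, the screening step could be sketched in R along the following lines. The simulated data, the names `X`, `y`, and `keep_frac`, the 10% cutoff, and the use of `step()` as the follow-up selection method are all assumptions made for this example; with real data you would substitute your own table and your preferred selection technique.

```r
# Minimal sketch on simulated data (assumed: 200 samples, 50 features,
# of which only x1 and x2 actually influence the response).
set.seed(1)
n <- 200
p <- 50
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("x", seq_len(p))
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)

# Test the correlation between each feature and the response,
# saving one p-value per feature.
pvals <- apply(X, 2, function(col) cor.test(col, y)$p.value)

# Order the p-values and keep only a small fraction of the features
# (10% here, an arbitrary choice for illustration).
keep_frac <- 0.10
n_keep <- ceiling(keep_frac * p)
keep <- names(sort(pvals))[seq_len(n_keep)]

# Hand the survivors to a standard selection method; stepwise AIC
# via step() is just one possible choice.
screened <- data.frame(y = y, X[, keep, drop = FALSE])
fit <- step(lm(y ~ ., data = screened), trace = 0)
```

Here `keep` holds the names of the screened features and `fit` is the model that remains after stepwise selection. The p-values are used only to rank features, not to draw inferences.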

Michael R. Chernick