
I have data with 53 records and 52 variables and want to find a suitable predictive model. I think it makes sense to do some dimension reduction and select only a subset of predictors. My data contain 7 categorical predictors. The rest of the variables are numeric but not independent and with different scaling/distribution. My question:

Which method is useful for this structure of data to find out variable importance and reduce dimension (PCA, Random Forest, MARS)?

R_FF92
  • [Related question](http://stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont) – Gumeo Oct 08 '15 at 09:15

1 Answer


Finding variable importance and reducing dimension are different tasks. You can:

  • Rank the variables according to the importance you suspect they have with respect to a target (sometimes referred to as filtering). This could be: retain only the predictors correlated with the target, or rank them by random forest importance scores and keep the $n$% most important variables.
  • Perform a "blind" dimension reduction, regardless of the target (random projections, PCA...)
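A minimal sketch of the filtering approach, using scikit-learn and synthetic data (the data, the forest settings, and the 20% cutoff are all illustrative assumptions, not part of the question):

```python
# Hypothetical sketch: rank predictors by random-forest importance
# and keep the top fraction ("filtering"), on synthetic data shaped
# like the question (53 records).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=53, n_features=45, noise=0.5, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
order = np.argsort(rf.feature_importances_)[::-1]  # most important first
top = order[: int(0.2 * X.shape[1])]               # keep the top 20%
X_filtered = X[:, top]
print(X_filtered.shape)  # (53, 9)
```

Note that with so few records the importance ranking itself is noisy, which is exactly the overfitting risk mentioned below.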

As you don't have many examples, I suspect that filtering will lead to overfitting. As your numeric predictors are correlated and on different scales, scaling followed by dimension reduction with PCA would be a good first step.
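That first step can be sketched as follows with scikit-learn (synthetic data; the 90% explained-variance threshold is an assumption, not something from the answer):

```python
# Standardize the numeric predictors (they have different scales),
# then reduce with PCA, keeping enough components to explain ~90%
# of the variance.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_regression(n_samples=53, n_features=45, random_state=0)

# a float n_components asks PCA for the smallest number of components
# reaching that fraction of explained variance
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.90))
X_reduced = pipe.fit_transform(X)
print(X_reduced.shape)  # 53 rows, fewer than 45 columns
```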

As for the categorical predictors, just merge or remove the scarce levels if there are any. For example, say you decomposed an age variable into levels (20-29, 30-39, 40-49, 50-59, 60+) and you discover that the 60+ category occurs only once. Then it is better to merge it with the 50-59 category (calling it 50+) so that the merged category has more observations (and you have dropped a category with only one observation).
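The age example above can be illustrated with pandas (the toy data is made up):

```python
# Merge a scarce categorical level into an adjacent one.
import pandas as pd

age = pd.Series(["20-29", "30-39", "40-49", "50-59", "60+",
                 "30-39", "50-59", "20-29"])
print(age.value_counts())   # "60+" occurs only once

# merge the scarce "60+" level into "50-59", calling the result "50+"
age = age.replace({"60+": "50+", "50-59": "50+"})
print(age.value_counts())   # "50+" now has 3 observations
```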

RUser4512
  • Thank you for the answer. Can you explain a bit more in detail what you mean with 'remove the scarce levels'? – R_FF92 Oct 08 '15 at 10:35
  • @fabian92 I added more details – RUser4512 Oct 08 '15 at 10:38
  • Ah ok, I understand. But for me there is one important question left: the values of some numeric predictors are discrete or even 0/1. I did not treat them as categorical because 1 should have more impact on the target than 0 (problem: in the observations, 90% are 1 and 10% are 0). Other variables are continuous, for example with range 0-100. I am not sure how I should deal with those different distributions while performing a PCA – R_FF92 Oct 08 '15 at 11:05
  • What follows is just an opinion; I would cross-validate the model if I were you. But you could treat them as factors, and remove the columns if 95% (or more) of the values are 0s (resp. 1s) – RUser4512 Oct 08 '15 at 15:06
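A small sketch of the suggestion in that last comment, dropping near-constant binary columns (the toy matrix and the 95% threshold come from the comment; everything else is illustrative):

```python
# Drop binary columns where more than 95% of the values are 0s
# (resp. 1s), on a toy 53-row design matrix.
import numpy as np

X = np.zeros((53, 4))
X[::2, 0] = 1.0    # ~51% ones -> varies enough, keep
X[:20, 1] = 1.0    # ~38% ones -> keep
X[0, 2] = 1.0      # ~2% ones  -> near-constant, drop
X[1:, 3] = 1.0     # ~98% ones -> near-constant, drop

frac = X.mean(axis=0)                      # fraction of ones per column
keep = (frac >= 0.05) & (frac <= 0.95)     # keep columns that vary enough
X_kept = X[:, keep]
print(X_kept.shape)  # (53, 2)
```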