
I have a relatively big dataset (50,000 rows, 165 columns). The response variable is an integer ranging from 200 to 900. The predictors are a mix of integer and categorical variables.

I am trying to decide which method is best for prediction, since I need a very low absolute error. The method also needs to finish in a reasonable amount of time (a random forest with method=anova ran for over 3 hours before I had to interrupt it).

I use R.


1 Answer


With 165 columns, many models will be slow to train. In cases like this, it is worth applying feature selection or dimensionality reduction techniques. PCA (Principal Component Analysis) is one of the most popular dimensionality reduction methods. With the reduced set of features you can then fit different models and compare their accuracy.

Note 1: If you use R, you may read about PCA here.
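
For instance, here is a minimal sketch with base R's prcomp(); the data frame dat and the response column y are placeholder names, not the asker's actual objects:

    ## Minimal sketch, assuming a data frame `dat` whose response column is `y`
    ## (placeholder names). PCA needs numeric input, so only the numeric
    ## predictors are used here.
    num_preds <- dat[, sapply(dat, is.numeric) & names(dat) != "y"]

    pca <- prcomp(num_preds, center = TRUE, scale. = TRUE)

    ## keep enough components to explain ~95% of the variance
    cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
    k <- which(cum_var >= 0.95)[1]

    ## the component scores become the reduced feature set
    reduced <- data.frame(pca$x[, 1:k, drop = FALSE], y = dat$y)

    ## any regression model can now be fit on the reduced data, e.g.
    fit <- lm(y ~ ., data = reduced)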

Note 2: If your categorical variables are not ordinal, it is suggested to convert them into dummy variables. To read why, please refer to this question.
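
As an illustration, a small sketch using base R's model.matrix() to expand factor columns into 0/1 dummies (same placeholder names as above):

    ## Minimal sketch, again assuming a data frame `dat` with response `y`:
    ## model.matrix() expands factor columns into 0/1 dummy columns.
    X <- model.matrix(y ~ ., data = dat)[, -1]   # drop the intercept column
    dat_dummies <- data.frame(y = dat$y, X)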

  • Thank you Hrant. First, I should say that my categorical variables are already ordinal. Second, I have considered PCA, but it shifts the difficulty to making predictions and, above all, to communicating the results. I'll try it nevertheless. – gbarel Apr 23 '17 at 07:40