I have a data set with around 10,000 samples and around 100 features. I've created a training set and a test set and am trying to predict a numeric value. I've used rpart to determine the most important feature by having it generate just two terminal nodes (a single split) on the training set. I then take that split and apply it to the test set:
1) root 3288 159847905.80 34.59281022
2) FeatureOne< 0.455 1946 87599096.20 19.00446043 *
3) FeatureOne>=0.455 1342 71090239.25 57.19707899 *
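For reference, here's roughly how I'm producing that (a minimal sketch; the train/test data frames and the target column name are placeholders, not my actual names):

library(rpart)

# Force a single split (two leaves) by limiting the tree depth to 1
stump <- rpart(target ~ ., data = train, method = "anova",
               control = rpart.control(maxdepth = 1))
print(stump)   # shows the FeatureOne < 0.455 split

# The same kind of two-leaf tree grown on the test set, restricted to FeatureOne
print(rpart(target ~ FeatureOne, data = test, method = "anova",
            control = rpart.control(maxdepth = 1)))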
So I can see that splitting on FeatureOne alone already improves the model fairly significantly.
My next step is to try different models to improve on this result. In theory, training algorithms that take more variables into account should beat the original, basic rpart result that looks at only a single variable. I include FeatureOne as well as many other features that I know are predictive (albeit slightly less so than FeatureOne).
The problem is that the several modeling algorithms I've tried mostly underperform the basic rpart result I saw originally. In fact, most are far worse than just looking at dataset[dataset$FeatureOne>=0.455,] from the original rpart split.
Here's the result using a support vector machine from the e1071 library (the svm function); as before, I grow a two-leaf rpart tree on the test set, this time splitting on the model's predictions:
1) root 3243 156686211.90 32.84847980
2) predFromSvm< -76.292973 1738 79285639.65 18.49298044 *
3) predFromSvm>=-76.292973 1505 76628786.36 49.42645847 *
rmse = 245.51
cor = 0.1068
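Here's roughly how I'm fitting and scoring each of these models (again a rough sketch with placeholder names; my actual call options may differ):

library(e1071)
library(rpart)

# Fit the SVM on all features in the training set, then predict the test set
svmFit      <- svm(target ~ ., data = train)
predFromSvm <- predict(svmFit, newdata = test)

# Two-leaf rpart on the prediction, which produces the output above
evalDf <- data.frame(target = test$target, predFromSvm = predFromSvm)
print(rpart(target ~ predFromSvm, data = evalDf, method = "anova",
            control = rpart.control(maxdepth = 1)))

# Test-set error and correlation
rmse <- sqrt(mean((test$target - predFromSvm)^2))
corr <- cor(test$target, predFromSvm)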
Here's a result using randomForest:
1) root 3243 156686211.90 32.84847980
2) predFromRf< 42.16955833 1631 71891832.78 17.54488657 *
3) predFromRf>=42.16955833 1612 84025916.57 48.33245037 *
rmse = 220.78
cor = 0.039
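The randomForest fit is analogous (defaults assumed, placeholder names again):

library(randomForest)

# Random forest regression on all features, scored on the test set
rfFit      <- randomForest(target ~ ., data = train)
predFromRf <- predict(rfFit, newdata = test)

rmse <- sqrt(mean((test$target - predFromRf)^2))
corr <- cor(test$target, predFromRf)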
Here's a result using glm:
1) root 3243 156686211.90 32.84847980
2) predFromGlm< 32.33149937 1826 79956524.01 15.16580504 *
3) predFromGlm>=32.33149937 1417 75422994.20 55.63504587 *
rmse = 292.29
cor = 0.0490
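And the glm fit (a Gaussian GLM, i.e. ordinary least squares; placeholder names again):

# Linear model on all features, scored on the test set
glmFit      <- glm(target ~ ., data = train, family = gaussian())
predFromGlm <- predict(glmFit, newdata = test)

rmse <- sqrt(mean((test$target - predFromGlm)^2))
corr <- cor(test$target, predFromGlm)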
Why are these algorithms worse than simply looking at the single most important variable as determined by a basic rpart call? I would expect them to be significantly better, since they are looking at many variables, including the highly predictive FeatureOne.
Any suggestions on what to try next or ways that I might be using the algorithms incorrectly?