I have a data set with around 10,000 samples and around 100 features. I've created a training set and a test set and am trying to predict a numeric value. I've used rpart to determine the most important feature by having it generate just two terminal nodes (a single split) on the training set. I then take that split and apply it to the test set:
1) root 3288 159847905.80 34.59281022
2) FeatureOne< 0.455 1946 87599096.20 19.00446043 *
3) FeatureOne>=0.455 1342 71090239.25 57.19707899 *
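For reference, here's roughly how I'm producing that (a minimal sketch; the train/test data frames and the target column name are placeholders, not my actual names):

library(rpart)

# Force a single split (two leaves) by limiting the tree depth to 1
stump <- rpart(target ~ ., data = train, method = "anova",
               control = rpart.control(maxdepth = 1))
print(stump)   # shows the FeatureOne < 0.455 split

# The same kind of two-leaf tree grown on the test set, restricted to FeatureOne
print(rpart(target ~ FeatureOne, data = test, method = "anova",
            control = rpart.control(maxdepth = 1)))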
So I can see that splitting on FeatureOne alone already improves the model fairly significantly.
My next step is to try different models to improve on this result. In theory, training algorithms that take more variables into account should beat the original, basic rpart result that looks at only a single variable. I include FeatureOne as well as many other features that I know are predictive (albeit slightly less so than FeatureOne).
The problem is that the several modeling algorithms I've tried mostly underperform the basic rpart result I saw originally. In fact, most are far worse than just looking at dataset[dataset$FeatureOne>=0.455,] from the original rpart split.
Here's the result using a support vector machine from the e1071 library (the svm function); as before, I grow a two-leaf rpart tree on the test set, this time splitting on the model's predictions:
1) root 3243 156686211.90 32.84847980
2) predFromSvm< -76.292973 1738 79285639.65 18.49298044 *
3) predFromSvm>=-76.292973 1505 76628786.36 49.42645847 *
rmse = 245.51
cor = 0.1068
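Here's roughly how I'm fitting and scoring each of these models (again a rough sketch with placeholder names; my actual call options may differ):

library(e1071)
library(rpart)

# Fit the SVM on all features in the training set, then predict the test set
svmFit      <- svm(target ~ ., data = train)
predFromSvm <- predict(svmFit, newdata = test)

# Two-leaf rpart on the prediction, which produces the output above
evalDf <- data.frame(target = test$target, predFromSvm = predFromSvm)
print(rpart(target ~ predFromSvm, data = evalDf, method = "anova",
            control = rpart.control(maxdepth = 1)))

# Test-set error and correlation
rmse <- sqrt(mean((test$target - predFromSvm)^2))
corr <- cor(test$target, predFromSvm)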
Here's a result using randomForest:
1) root 3243 156686211.90 32.84847980
2) predFromRf< 42.16955833 1631 71891832.78 17.54488657 *
3) predFromRf>=42.16955833 1612 84025916.57 48.33245037 *
rmse = 220.78
cor = 0.039
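The randomForest fit is analogous (defaults assumed, placeholder names again):

library(randomForest)

# Random forest regression on all features, scored on the test set
rfFit      <- randomForest(target ~ ., data = train)
predFromRf <- predict(rfFit, newdata = test)

rmse <- sqrt(mean((test$target - predFromRf)^2))
corr <- cor(test$target, predFromRf)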
Here's a result using glm:
1) root 3243 156686211.90 32.84847980
2) predFromGlm< 32.33149937 1826 79956524.01 15.16580504 *
3) predFromGlm>=32.33149937 1417 75422994.20 55.63504587 *
rmse = 292.29
cor = 0.0490
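And the glm fit (a Gaussian GLM, i.e. ordinary least squares; placeholder names again):

# Linear model on all features, scored on the test set
glmFit      <- glm(target ~ ., data = train, family = gaussian())
predFromGlm <- predict(glmFit, newdata = test)

rmse <- sqrt(mean((test$target - predFromGlm)^2))
corr <- cor(test$target, predFromGlm)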
Why are these algorithms worse than simply looking at the single most important variable as determined by a basic rpart call? I would expect them to be significantly better, since they are looking at many variables, including the highly predictive FeatureOne.
Any suggestions on what to try next or ways that I might be using the algorithms incorrectly?