
I've been selecting features for a regression problem and have obtained a list of the best-performing feature sets. (Note: my actual list is several thousand lines long.)

188.493 186.989 [379.45, 0.68, 99.51, 102.71, 109.91, 2.07] 50,12,48

188.352 187.391 [465.3, 0.63, 116.43, 134.18, 104.84, 2.3] 42,36,27

188.007 187.506 [443.08, 0.67, 93.73, 116.96, 110.67, 2.26] 50,42,27

185.867 192.012 [398.89, 0.81, 81.6, 99.44, 124.01, 2.41] 72,53,48

The first number is the MSE from 10-fold CV on the training set while optimizing hyperparameters. The second number is the MSE on the test set. The third and fourth items are the hyperparameters and the feature set indices (not important here).
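For concreteness, a hedged sketch of how the two MSE columns for one candidate feature set could be produced; the use of scikit-learn, the Ridge estimator, and the function name are assumptions for illustration, not my actual pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

def score_feature_set(X_train, y_train, X_test, y_test, feature_idx, alpha=1.0):
    # Restrict to the candidate feature subset (hypothetical column indices)
    Xtr, Xte = X_train[:, feature_idx], X_test[:, feature_idx]
    model = Ridge(alpha=alpha)

    # First number: mean MSE over 10-fold CV on the training set
    cv_mse = -cross_val_score(model, Xtr, y_train,
                              cv=10, scoring="neg_mean_squared_error").mean()

    # Second number: MSE on the held-out set after refitting on all training data
    model.fit(Xtr, y_train)
    test_mse = mean_squared_error(y_test, model.predict(Xte))
    return cv_mse, test_mse
```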

My question is: would the best model be the one that performed best on the test set, or should I also be concerned with how it performed on the training set? For example, the fourth line performed well on the training set but much worse on the test set, while the first line performed better on the test set than on the training set.

Should I be looking for feature sets that perform similarly on both the training-set CV and the test set, or just take the model that does best on the test set?

Or would it be best to use a combination of models? Any help is greatly appreciated. Thanks.

user79587

2 Answers


In my experience, I'd use a model whose performance on the training, cross-validated, and test data is the closest (no large gaps one way or the other), as this is likely to be your most generalizable model. You haven't mentioned the size of your test set, but generally, if you only have the cross-validated and test MSEs as in your example, I'd pick the model with the lowest cross-validated MSE and ignore the test MSE.
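As a minimal illustration of that selection rule, here is a sketch that picks the candidate with the lowest cross-validated MSE; the tuple layout below just mirrors the rows in the question and is not the poster's actual data structure:

```python
# Each candidate: (cv_mse, test_mse, hyperparameters, feature_set)
candidates = [
    (188.493, 186.989, [379.45, 0.68, 99.51, 102.71, 109.91, 2.07], (50, 12, 48)),
    (188.352, 187.391, [465.30, 0.63, 116.43, 134.18, 104.84, 2.30], (42, 36, 27)),
    (188.007, 187.506, [443.08, 0.67, 93.73, 116.96, 110.67, 2.26], (50, 42, 27)),
    (185.867, 192.012, [398.89, 0.81, 81.60, 99.44, 124.01, 2.41], (72, 53, 48)),
]

# Select by cross-validated MSE only, ignoring the test/validation column
best = min(candidates, key=lambda row: row[0])
print("Selected feature set:", best[3], "with CV MSE", best[0])
```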

veemo
  • Thanks for the reply. The training set that I did the CV and hyperparameter optimization on is about 1500 data points, and the validation set is about 1000, so I'm fairly sure it's safe to assume that if a model performs about the same on the validation set, it generalizes well. I have 400 or so points left for final testing, but I believe I'm not supposed to use these at all for model selection. I'm hesitant to pick the best MSE on the training data because I'm worried about overfitting. – user79587 Jun 03 '16 at 20:30

Pick the model that did best on the validation set. Models that do substantially better on the training set than on the validation set have overfit and won't generalize well. Randomness will sometimes cause a model to do better on the validation set than on the training set, but this is rare. The MSE on the validation set is not the final error, though.

Important note: You can't use either the training or the validation set to gauge your selected model's accuracy in production. As soon as you pick a final model, all previous MSE measurements of that model become biased (the error looks lower than it actually is) because of the selection bias introduced by picking the model with the lowest MSE. You'll need a third dataset, the "test" set, to gauge the final model's true accuracy.
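A hedged sketch of that three-way workflow, assuming a scikit-learn setup with illustrative split sizes and candidate models (none of which come from the question): selection uses only the validation set, and the test set is touched exactly once at the end.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def select_and_report(X, y, candidate_alphas=(0.1, 1.0, 10.0)):
    # Carve out train / validation / test (roughly 60 / 25 / 15 here)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.375, random_state=0)

    # Model selection uses only the validation set
    fitted = []
    for alpha in candidate_alphas:
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        fitted.append((mean_squared_error(y_val, model.predict(X_val)), model))
    val_mse, best_model = min(fitted, key=lambda t: t[0])

    # The test set is used exactly once, after selection, for an unbiased estimate
    test_mse = mean_squared_error(y_test, best_model.predict(X_test))
    return best_model, val_mse, test_mse
```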

See this question to better understand why you need a third dataset to gauge the final model's accuracy: What is the difference between test set and validation set?

Ryan Zotti
  • Thanks for the reply. I do have 400 or so data points left over for final testing, although I was under the impression that I shouldn't use these in any way for model selection, only to test the final performance once a model has been selected. – user79587 Jun 03 '16 at 20:33
  • Perfect. That's the proper approach. – Ryan Zotti Jun 03 '16 at 20:39