
I'm using k-fold cross-validation to compare different models.

I split my dataset into 6 chunks and used 4 random chunks as the training set and the remaining 2 as the test set.

Now I fitted n different models to the training set and calculated the RMSE on both the training and the test sets. From what I understand, the model with the lower RMSE on the test set should be the preferable one.

For the sake of clarity, by RMSE I mean: RMSE = sqrt( sum( (fitted - observed)^2 ) / n.observations )
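To make the setup concrete, here is a rough sketch of one repetition in R (the data frame `dat`, the response `y`, and the model formula are placeholders, and `lm()` is just for illustration; my actual models may differ):

```r
# Minimal sketch of one repetition (placeholder data frame `dat`, response `y`).
set.seed(1)
k <- 6                                             # number of chunks
chunk <- sample(rep(1:k, length.out = nrow(dat)))  # assign each row to a chunk
test_chunks <- sample(1:k, 2)                      # 2 random chunks as test set
train <- dat[!(chunk %in% test_chunks), ]
test  <- dat[chunk %in% test_chunks, ]

# RMSE over the observations for which both fitted and observed values are available
rmse <- function(fitted, observed) {
  sqrt(mean((fitted - observed)^2, na.rm = TRUE))
}

fit <- lm(y ~ x1 + x2, data = train)               # one of the candidate models
rmse_train <- rmse(predict(fit, newdata = train), train$y)
rmse_test  <- rmse(predict(fit, newdata = test),  test$y)
```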

The models differ from one another in some independent variables, which have different amounts of NA values (in particular, since some variables represent the cumulative effect of others, the number of NAs increases the more variables I cumulate).

So I find myself comparing a first model with, say, n NAs with a second one having 10n NAs. In this way I'm comparing models that are fitted to different numbers of observations.
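To illustrate the problem, one could count how many complete observations each model is actually fitted to, along these lines (the formulas here are again just placeholders):

```r
# Count the complete cases available to each candidate model (placeholder formulas).
model_formulas <- list(
  m1 = y ~ x1,
  m2 = y ~ x1 + x2_cum    # cumulative variables carry more NAs
)

n_complete <- sapply(model_formulas, function(f) {
  vars <- all.vars(f)                # response and predictors used by this model
  sum(complete.cases(dat[, vars]))
})
n_complete                           # each model is fitted to a different n
```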

1) Is this an issue when comparing RMSE calculated on the test set?

I know, for example, that if I were comparing models on the training set, the AIC would not be meaningful in this case; I'm less sure about the R-squared...

2) Since I run each model 10 times on 10 training sets and test it on 10 test sets (see the beginning for an explanation), for a given model I have an average RMSE and its standard error on both the training and test sets. How should I interpret differences between the training and test RMSE?
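Concretely, for one model I end up summarising the 10 repetitions like this (a sketch; `rmse_train_runs` and `rmse_test_runs` are assumed to hold the 10 per-run RMSEs):

```r
# Summarise the 10 repetitions for one model: mean RMSE and its standard error.
summ <- function(x) c(mean = mean(x), se = sd(x) / sqrt(length(x)))

summ(rmse_train_runs)                   # training RMSE across the 10 runs
summ(rmse_test_runs)                    # test RMSE across the 10 runs
summ(rmse_test_runs - rmse_train_runs)  # per-run gap between test and training RMSE
```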

Any suggestion appreciated!

Filippo
  • I think it's a bad idea to use RMSE as a model comparison metric - even if you didn't have the problem of missing observations or different sample sizes. I say this because you may get two values of RMSE that are very close to each other. How do you know if the small difference is statistically significant? You can't use the Central Limit Theorem (i.e., a Z-test) to see whether the difference is statistically significant because the observations are *not* independent. In other words, I would abandon the RMSE altogether and use a more robust metric like the AIC or the BIC to compare models. – rocinante Jul 03 '14 at 21:51
  • ISTR reading that a rule of thumb is that a 2% difference is insignificant, 10% somewhat significant, and 30% very significant. However, I have no idea what it's based on. – JenSCDC Jul 03 '14 at 22:04
  • 1
    @rocinante I see your point but now: 1) is it possible to calculate AIC (or other stats) when using the model to predict values from new observations) 2) why then RMSE seems to be a common metric used in CV? – Filippo Jul 03 '14 at 22:09
  • I will address 1) as an answer because there's not enough space in the comments. For 2), I think we need to remember that forecasting is a relatively under-developed area when you compare it to the rest of statistics. Sub-optimal methods like RMSE are used because they're easy to apply and "better than nothing". To see what I mean, just look at your own thought process with regard to your problem. You want to evaluate your forecasting model through cross validation. 1/2 – rocinante Jul 03 '14 at 23:06
  • 2/2 Did your model specify how many steps ahead you were forecasting? Do you have a way of evaluating how long the forecast is good for? On what basis did you decide on the loss function? Do you have a method to perform forecast verification? Chances are no, because the process of forecasting has only been streamlined in meteorology. Most other fields make forecasts and then at whatever arbitrary future date just make another model altogether and forecast again, without regard to the adequacy or comparison of what happened before. – rocinante Jul 03 '14 at 23:11

0 Answers