I recently read a paper in which the authors claim that, to compare the forecasting performance of two non-nested models A and B, a valid procedure is to fit both models on the same data set and then compare the average likelihood of the fitted models computed on a hold-out data set. All that is required is that models A and B are expressed as probability densities for the same variable. (I am using the paper's language here; strictly speaking, the quantities being compared are fitted densities evaluated at the hold-out points, not likelihoods.)

Accepting this "likelihood" as the measure of forecasting accuracy, the procedure has some intuitive appeal, but I have misgivings. Can these likelihoods be meaningfully compared without some sort of normalization? The out-of-sample probabilities will not add up to one. Still, I have not been able to come up with a straightforward example in which this comparison would give a spurious result.
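To make the question concrete, here is the comparison as I understand it (the notation is mine, not the paper's). With $\hat f_A$ and $\hat f_B$ the densities fitted on the training data, and $y_1,\dots,y_m$ the hold-out observations (conditioning on covariates where relevant), model A is preferred when

$$\frac{1}{m}\sum_{i=1}^{m}\hat f_A(y_i) \;>\; \frac{1}{m}\sum_{i=1}^{m}\hat f_B(y_i),$$

where, as I read the paper, the average is taken over the fitted density values themselves rather than their logarithms.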
Update: I was able to produce a simple example in which direct comparison of out-of-sample likelihoods gave a misleading result. Let the dependent variable y be a linear trend plus a normally distributed error term, and let the explanatory variable x be a linear trend plus an independent normally distributed error term; generate 100 points of each. Model A is a linear regression with normal errors, while model B is a regression with Student-t errors (two degrees of freedom). I trained on the first 50 points and tested on the remaining 50, then repeated with the training and test sets interchanged, and repeated the whole exercise for three choices of variance in the data-generating process. Model B gave the higher average out-of-sample likelihood in every case. The example is a bit contrived, but it illustrates my concern; a sketch of the simulation is below.
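Here is a minimal sketch of the kind of simulation I ran, in Python with numpy/scipy. The trend slope, the error scales (0.5, 1, 2), the random seed, and the Nelder-Mead fit of the t-error model are my own illustrative choices, so exact numbers will vary from run to run; the sketch reproduces the structure of the comparison (fit both models on the training half, then average the fitted density values over the test half), not necessarily the specific results quoted above.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)  # seed is arbitrary

def fit_normal(x, y):
    """Model A: linear regression with normal errors, fitted by least squares / ML."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma = np.sqrt(np.mean((y - X @ beta) ** 2))  # ML estimate of the error scale
    return beta, sigma

def normal_density(x, y, beta, sigma):
    """Fitted normal density evaluated at the hold-out points."""
    return stats.norm.pdf(y, loc=beta[0] + beta[1] * x, scale=sigma)

def fit_t(x, y, df=2):
    """Model B: regression with Student-t errors (df fixed), fitted by numerical ML."""
    def negloglik(params):
        b0, b1, log_scale = params
        resid = y - b0 - b1 * x
        return -np.sum(stats.t.logpdf(resid, df=df, scale=np.exp(log_scale)))
    beta0, sigma0 = fit_normal(x, y)  # use the normal fit as a starting point
    res = optimize.minimize(negloglik, x0=[beta0[0], beta0[1], np.log(sigma0)],
                            method="Nelder-Mead")
    b0, b1, log_scale = res.x
    return np.array([b0, b1]), np.exp(log_scale)

def t_density(x, y, beta, scale, df=2):
    """Fitted t density evaluated at the hold-out points."""
    return stats.t.pdf(y - beta[0] - beta[1] * x, df=df, scale=scale)

n = 100
trend = np.arange(n, dtype=float)
for sigma_true in (0.5, 1.0, 2.0):                     # three error scales for the DGP
    x = trend + rng.normal(scale=sigma_true, size=n)   # trend plus noise
    y = trend + rng.normal(scale=sigma_true, size=n)   # trend plus independent noise
    halves = [(slice(0, 50), slice(50, 100)), (slice(50, 100), slice(0, 50))]
    for train, test in halves:                         # train/test, then swapped
        beta_a, sigma_a = fit_normal(x[train], y[train])
        beta_b, scale_b = fit_t(x[train], y[train])
        avg_a = normal_density(x[test], y[test], beta_a, sigma_a).mean()
        avg_b = t_density(x[test], y[test], beta_b, scale_b).mean()
        print(f"sigma={sigma_true}: average hold-out density A={avg_a:.4f}, B={avg_b:.4f}")
```

Note that the data-generating process has normal errors, so model A is the correctly specified model; the concern is that the t-error model can nonetheless come out ahead on the averaged hold-out densities.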