I started out looking for a way to test the difference in MSPE between two models (Question here), when (thanks to @Richard Hardy) I ended up reading a paper by Diebold on the Diebold-Mariano test ("Comparing Predictive Accuracy, Twenty Years Later: A Personal Perspective on the Use and Abuse of Diebold-Mariano Tests").
Diebold claims two things that I find curious:
... the errors are driven by forecasts, not models
and
The DM test was intended for comparing forecasts; it has been, and remains, useful in that regard. The DM test was not intended for comparing models.
Since it is the models that do the forecasting, how come the errors are not driven by the models?
And if the DM test compares forecasts (generated by the models), how come this is not effectively comparing models?
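For concreteness, here is a minimal sketch of what the DM statistic actually consumes. This is an illustrative, hypothetical implementation (simple squared-error loss, no HAC variance correction, so only valid under the assumption of a serially uncorrelated loss differential, e.g. 1-step-ahead forecasts); note that its inputs are two series of forecast errors only, with no reference to whatever models produced them:

```python
import numpy as np

def dm_stat(e1, e2):
    """Toy Diebold-Mariano statistic under squared-error loss.

    e1, e2: forecast-error series from two competing forecasts.
    The test sees only these errors, never the models behind them.
    No HAC correction: assumes the loss differential is serially
    uncorrelated (illustrative sketch only).
    """
    d = e1 ** 2 - e2 ** 2            # loss differential d_t
    T = len(d)
    dbar = d.mean()                  # mean loss differential
    var_dbar = d.var(ddof=1) / T     # naive variance of the mean
    return dbar / np.sqrt(var_dbar)

# Two error series -- could come from models, judgmental forecasts,
# surveys, anything that yields forecasts:
rng = np.random.default_rng(0)
e1 = rng.normal(0.0, 1.0, 200)       # errors of forecast A
e2 = rng.normal(0.0, 1.2, 200)       # errors of forecast B (noisier)
print(dm_stat(e1, e2))
```

This may be part of the point behind the quotes: the statistic is a function of forecast errors alone, so it applies equally to forecasts that never came from an estimated model.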