I am implementing a general-purpose prediction tool for time series. Since I want it to tolerate missing values, I decided to build it around dynamic linear models (DLMs). To make it useful on as many datasets as possible, I want it to try several different models, estimate the parameters of each, and then produce the forecast with the one that fits best. This should allow the tool to pick up as many relevant patterns as possible and make the forecasts as accurate as possible.
Here is my question: in every source I have read so far, model selection is done with the likelihood or related criteria such as the AIC. This does not seem optimal for forecasting: these criteria tell you how well the model fits the data statistically, but not how well it will predict future values. I think it would make much more sense to assess models on their predictive power within the dataset itself. For instance, you could compute the mean squared error between the one-step-ahead forecasts at all intermediate time points and the actual realizations of the series, which is straightforward thanks to the recursive nature of a DLM. With this approach you would not risk over-fitting, since you are assessing exactly the quantity you care about: forecasting accuracy.

Do you see any reason why I would be wrong? Why does everybody use maximum likelihood? Have you seen any references that use something close to what I am suggesting?
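To make the idea concrete, here is a minimal sketch of what I have in mind, for the simplest possible case (a local level DLM with fixed, known variances). The function name `one_step_mse`, the candidate variance pairs, and the toy series are just placeholders, not part of any actual implementation:

```python
import numpy as np

def one_step_mse(y, obs_var, state_var, m0=0.0, C0=1e7):
    """One-step-ahead forecast MSE for a local level DLM (random walk plus noise),
    computed with the standard Kalman filter recursions. NaNs in y are treated as
    missing: the forecast step still runs, but the update step is skipped."""
    m, C = m0, C0                      # filtered mean and variance of the state
    sq_errors = []
    for t, yt in enumerate(y):
        # Forecast step: prior for the state and the observation at time t
        a, R = m, C + state_var        # one-step-ahead state forecast
        f, Q = a, R + obs_var          # one-step-ahead observation forecast
        if np.isnan(yt):
            m, C = a, R                # missing observation: carry the prior forward
            continue
        if t > 0:                      # skip the first point (diffuse prior)
            sq_errors.append((yt - f) ** 2)
        # Update step
        K = R / Q                      # Kalman gain
        m = a + K * (yt - f)
        C = (1 - K) * R
    return np.mean(sq_errors)

# Model selection by predictive accuracy: pick the candidate variance pair
# (hypothetical values here) with the lowest one-step-ahead forecast MSE.
y = np.array([1.0, 1.2, np.nan, 1.5, 1.4, 1.8, 2.1, 2.0])
candidates = [(1.0, 0.1), (0.5, 0.5), (0.1, 1.0)]   # (obs_var, state_var)
best_obs_var, best_state_var = min(candidates, key=lambda p: one_step_mse(y, *p))
```

In the real tool the candidates would be different model structures (trend, seasonality, regression components, etc.), but the selection criterion would be this kind of in-sample one-step-ahead forecast error rather than the likelihood.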