Suppose I have two forecasting models, A and B, and multiple datasets; on each dataset I perform a Diebold-Mariano test comparing the two models. My aim is to find out which of the two models is better overall. Performing these tests gives me one p-value per dataset.
What is the right method of combining the conclusions of each of my tests into an overall conclusion? More specifically, what is wrong (if anything) with the following two simple methods:
1. Sum the z-scores across all tests and divide by the number of tests.
2. Sum the p-values across all tests and divide by the number of tests.
Intuitively, method 1 seems better to me since it does not operate directly on probabilities; the averaged z-score would then be converted back into an overall p-value. But it also seems to be missing something, namely that a greater frequency of extreme observations should earn some kind of "bonus".
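For concreteness, here is a minimal Python sketch of the two proposed methods, placed alongside the standard Stouffer and Fisher combinations from `scipy.stats.combine_pvalues` for comparison. The z-scores are hypothetical and the one-sided convention is an assumption for illustration:

```python
import numpy as np
from scipy.stats import norm, combine_pvalues

# Hypothetical Diebold-Mariano z-scores, one per dataset (illustration only)
z = np.array([1.8, 2.3, 0.4, 1.1, 2.6])
p = norm.sf(z)  # corresponding one-sided p-values

# Method 1: average the z-scores, then convert back to a p-value
p_avg_z = norm.sf(z.mean())

# Method 2: average the p-values directly
p_avg_p = p.mean()

# Standard combinations: Stouffer (sum of z divided by sqrt(k)) and Fisher
stat_s, p_stouffer = combine_pvalues(p, method='stouffer')
stat_f, p_fisher = combine_pvalues(p, method='fisher')

print(f"avg-z: {p_avg_z:.4f}  avg-p: {p_avg_p:.4f}  "
      f"Stouffer: {p_stouffer:.4f}  Fisher: {p_fisher:.4f}")
```

Note that Stouffer's method divides the summed z-scores by the square root of the number of tests rather than the number of tests itself, which is what makes the combined statistic standard normal under the null.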