
According to Elliott & Timmermann, "Economic Forecasting" (2016), pp. 429-430:

Calibration requires that if a density forecast assigns a certain probability to an event, then the event should occur with the stated probability over successive observations.
<...>
For any <...> event, $A$, if the associated density forecast satisfies $\int_A p_Y(y|z)\, dy = p$, calibration requires that $P(y_{t+1} \in A)$ is indeed equal to $p$, conditional on the same information.
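
To make this concrete (my own example, not from the book): if the density forecast for $y_{t+1}$ is standard normal and $A = (1.645, \infty)$, then $\int_A p_Y(y|z)\, dy \approx 0.05$, so calibration requires that, conditional on the same information, the realization exceeds 1.645 in about 5% of cases over successive forecasts.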

I wonder how one could assess calibration of a density forecast. I think a Kolmogorov-Smirnov (KS) test applied to the probability integral transform (PIT) of the realized values vs. the theoretical Uniform[0,1] distribution could be used for that, following Section 3 of Diebold et al. (1998). The PIT would be based on the distribution implied by the density forecast. However, use of the test is not mentioned in the textbook (it says "Most attempts to examine calibration lead to informal rather than formal hypothesis tests" and goes on to discuss some difficulties with assessing calibration), so I am probably missing something.
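
For concreteness, here is a minimal R sketch of what I have in mind; the Gaussian forecast densities and their time-varying parameters are made up purely for illustration:

```r
set.seed(42)
n <- 500

# Hypothetical one-step-ahead forecast densities: y_t ~ N(mu_t, sigma_t),
# with conditional mean and standard deviation changing over time.
mu    <- 0.5 * sin(seq_len(n) / 10)
sigma <- 0.5 + 0.3 * abs(cos(seq_len(n) / 7))

# Realizations drawn from the forecast densities themselves,
# i.e. the density forecasts are correctly specified.
y <- rnorm(n, mean = mu, sd = sigma)

# Probability integral transform: each forecast CDF evaluated at its own realization.
pit <- pnorm(y, mean = mu, sd = sigma)

# Kolmogorov-Smirnov test of the PIT values against Uniform[0,1];
# under correct specification it should reject only at the nominal rate.
ks.test(pit, "punif")
```

If the realizations instead came from, say, a fatter-tailed distribution than the forecast density, the PIT values would pile up near 0 and 1 and the test would tend to reject. One caveat: `ks.test` treats the sample as i.i.d., whereas PIT values from multi-step or overlapping forecasts are typically serially dependent, so the nominal p-values need not be reliable there.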

Q: Does a Kolmogorov-Smirnov test applied to the probability integral transform (PIT) of the realized values vs. the theoretical Uniform[0,1] distribution assess calibration of a density forecast?

References

  • Elliott, G. & Timmermann, A. (2016). *Economic Forecasting*. Princeton University Press.
  • Diebold, F. X., Gunther, T. A. & Tay, A. S. (1998). "Evaluating Density Forecasts with Applications to Financial Risk Management." *International Economic Review, 39*(4), 863-883.

Richard Hardy
  • Do you mean the probability integral transform using the expected distribution or *empirical* probability integral transform? – Dave Jun 09 '21 at 14:39
  • @Dave, using the distribution implied by the density forecast. – Richard Hardy Jun 09 '21 at 15:30
  • I don't know the answer to this but suspect it depends on similar considerations as the "evaluating models" vs "evaluating forecasts" distinction of the West-Clark-McCracken and Diebold-Mariano strands of literature for point forecasts. The latter point of view is typically simpler to test. As in the book you cite, personally I prefer looking at various diagnostic plots of the PIT rather than doing formal hypothesis tests because it can reveal directions for potential improvement (e.g. tails are too thin, should include skew, etc). – Chris Haug Aug 19 '21 at 15:02
  • @ChrisHaug, thanks! I thought the distinction was more about evaluating models at true parameter values (West-Clark-McCracken) which is irrelevant in practice vs. at estimated parameter values (Diebold-Mariano) which is the only thing that is relevant. I am aware of Diebold's and others' thoughts on evaluating forecasts, not models, but I think that is again of limited relevance given that data generating processes change over time and we hardly ever have a sample in which the DGP is constant so that their efficiency arguments fully apply. – Richard Hardy Aug 20 '21 at 07:17
  • @ChrisHaug, in any case, I am not going to be evaluating models at true parameter values, so West-Clark-McCracken is not relevant for me. Regarding plots vs. formal tests, this is a general argument that applies in many situations and for many tests. I buy it. On the other hand, we still have formal tests, and I would like to apply one in the setting I am facing e.g. for satisfying a reviewer. Thus the question. KS test on PIT seems like an obvious thing to do, and I think I have seen it used in similar contexts. But if it is not in the textbook, then I wonder what is wrong with it... – Richard Hardy Aug 20 '21 at 07:21
  • @Dave, any further thoughts maybe? – Richard Hardy Aug 24 '21 at 11:29
  • It seems roundabout. Why not test the observed results against the theoretical? Why do the transformation? I might simulate this later and post my code in an answer. – Dave Aug 24 '21 at 11:39
  • @Dave, not sure I understand you. We *are* testing observed results against theoretical. Theoretically, the PIT values should be Uniform[0,1], and we are testing the observed PIT values against that. The idea was proposed in Diebold FX, Gunther TA and Tay AS (1998). ["Evaluating Density Forecasts with Applications to Financial Risk Management."](https://www.nber.org/papers/t0215) *International Economic Review, 39* (4), 863-883, Section 3. I am not sure why the textbook is not referring to it. (I should probably update the question now that I have found the reference I had forgotten before.) – Richard Hardy Aug 24 '21 at 12:19
  • You have some data and transform with some CDF, right? Why not compare the data and the CDF? – Dave Aug 24 '21 at 12:27
  • There is even an R function in the `GAS` package doing that in the context of density forecasting: [`PIT_test`](https://www.rdocumentation.org/packages/GAS/versions/0.3.3/topics/PIT_test). Regarding why not compare data with CDF: because the CDF may be different for each datapoint under $H_0$ of a correct density forecast, making it inconvenient to test. How would one construct a test statistic from that? Meanwhile, the PIT will be Uniform[0,1] for each datapoint under $H_0$. I think the main argument for choosing PIT over raw data is convenience. – Richard Hardy Aug 24 '21 at 12:28
  • @ChrisHaug, I found a relevant reference: Diebold et al. (1998) (see updated post). I wonder if I might be misusing it. If not, I do not see why Elliott & Timmermann (2016) do not cite it there and do not endorse the method. – Richard Hardy Aug 24 '21 at 12:37
  • Richard, I second [your rationale for why one usually uses the PIT](https://stats.stackexchange.com/questions/529963/assess-calibration-of-a-density-forecast-by-kolmogorov-smirnov-test-on-pit-of-re/541269#comment993911_529963). Its distribution under H0 is uniform, and it's much easier to use one of a plethora of tests for uniformity, rather than "roll one's own" test for whatever other distribution one suspects. (Often, tests for uniformity z-transform the data and then test for normality.) – Stephan Kolassa Aug 24 '21 at 13:37

1 Answer


First off, the PIT is not uniform in the discrete case, so the answer is "no" here. You can use randomization, an approach which apparently has been invented multiple times independently, but I suspect you are thinking of continuous densities, so let's assume this from now on.
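
(As an aside, here is a minimal R sketch of such a randomization, assuming a Poisson forecast density purely for illustration; the idea is to draw uniformly between the forecast CDF evaluated just below and at the realization.)

```r
set.seed(1)
n      <- 500
lambda <- 3                   # hypothetical Poisson forecast mean, purely for illustration
y      <- rpois(n, lambda)    # realizations drawn from the forecast density itself

# Plain PIT: F(y) only takes the values of the Poisson CDF, so it cannot be Uniform[0,1].
pit_plain <- ppois(y, lambda)

# Randomized PIT: draw uniformly between F(y - 1) and F(y);
# under a correctly specified forecast this is exactly Uniform[0,1].
pit_rand <- ppois(y - 1, lambda) + runif(n) * dpois(y, lambda)

ks.test(pit_plain, "punif")   # rejects (and warns about ties) although the forecast is correct
ks.test(pit_rand,  "punif")   # should not reject beyond the nominal error rate
```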

The IMO canonical reference is Gneiting, Balabdaoui & Raftery (2007, Journal of the Royal Statistical Society: Series B). They define multiple different reasonable flavors of calibration (probabilistic, marginal and exceedance), show by examples that they are indeed logically independent and note that

Probabilistic calibration is essentially equivalent to the uniformity of the PIT values.

So in principle, yes, you could use the K-S test here. (Shameless piece of self-promotion here: an alternative would be data-driven tests for uniformity, which I used in Kolassa, 2016, IJF.)

However, Gneiting et al. write:

Uniformity is usually assessed in an exploratory sense, and one way of doing this is by plotting the empirical CDF of the PIT values and comparing it with the CDF of the uniform distribution. ... However, the use of formal tests is often hindered by complex dependence structures, particularly in cases in which the PIT values are spatially aggregated.
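
To make the exploratory route concrete, a minimal R sketch of those two standard plots (with `pit` here just a stand-in for the PIT values of your own forecasts):

```r
# Stand-in for the PIT values computed from your density forecasts.
pit <- runif(500)

# Empirical CDF of the PIT values against the Uniform[0,1] CDF (the 45-degree line).
plot(ecdf(pit), main = "Empirical CDF of PIT values", xlab = "u")
abline(0, 1, lty = 2)

# PIT histogram on the density scale; the uniform density is the horizontal line at 1.
# A U-shape suggests the forecast densities are too narrow, a hump that they are too wide.
hist(pit, breaks = 20, freq = FALSE, main = "PIT histogram", xlab = "PIT")
abline(h = 1, lty = 2)
```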

Gneiting et al. go on to discuss other diagnostic tools for probabilistic and other flavors of calibration, and give a number of pointers to literature. (Their main recommendation is proper scoring rules.) I would recommend you take a look at this paper, and at subsequent papers that cite it.

Stephan Kolassa
  • Thank you, that makes sense. In Diebold's own textbook "Forecasting: In Economics, Business, Finance and Beyond", he suggests using the KS test but then immediately notes that it is not constructive, i.e. it does not tell where the violation is coming from. But that is enough for me. Funny that I had forgotten Diebold's work when I posted the question. I still wonder why Elliott & Timmermann do not even mention this possibility... – Richard Hardy Aug 24 '21 at 14:03