
I wish to devise a test that determines whether an individual is clairvoyant (or whether a black-box model works). Let us assume the clairvoyant believes they can estimate a person's height (or any other statistic, like income, whose distribution we know) from their name (or from a mental model incorporating multiple factors that we do not know).

We randomly sample $n$ people from the population with heights $h_i$, $i \in \{1,2,\dots,n\}$. The clairvoyant gives $n$ intervals of height (in cm) as guesses, e.g. $I_1 = (162, 180)$, $I_2 = (152, 154)$, ..., $I_n = (134, 155)$. The clairvoyant is deemed correct on trial $i$ if the person's height satisfies $h_i \in I_i$. We know the distribution of height in the population, so we can calculate the probability that a randomly selected person's height falls in any given interval. To establish whether the individual is clairvoyant, we need to choose a cut-off for the hit rate (the number of times the clairvoyant is correct). How does one compute such a cut-off, and how does one devise a test of how competent the clairvoyant is? Or is computing errors the only way around this?

devrat
  • You essentially have a set of binomial trials where the probability of success changes with each trial (depending on what interval the clairvoyant selected). See https://stats.stackexchange.com/questions/9510/probability-distribution-for-different-probabilities. You use that to find the distribution of "number of correct guesses" (given the guesses that were made), which you can use to find the probability that the person got at least $N$ correct by chance; a sketch of this approach appears after these comments. Of course, you should set beforehand what level of evidence you'll require to believe this person is clairvoyant. – Nuclear Hoagie May 11 '20 at 15:46
  • [Forecast accuracy metric that involves prediction intervals](https://stats.stackexchange.com/q/194660/1352) may be helpful. – Stephan Kolassa May 11 '20 at 15:57
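The Poisson-binomial approach from the first comment can be sketched as follows. This is only an illustration, not part of the original question: the normal population model (mean 170 cm, SD 10 cm), the choice of two hits, and the function names are all assumptions made for the demo.

```python
import numpy as np
from scipy import stats

# Hypothetical population model: heights ~ N(170, 10) cm. Substitute the
# height distribution that is actually known for the population.
height_dist = stats.norm(loc=170, scale=10)

def hit_probabilities(intervals):
    """P(h in I_i) for a randomly drawn person, one probability per interval."""
    return np.array([height_dist.cdf(u) - height_dist.cdf(l)
                     for l, u in intervals])

def poisson_binomial_pmf(p):
    """PMF of the number of successes in independent trials with per-trial
    success probabilities p. Dynamic program: after each trial, mix the
    'miss' (no shift) and 'hit' (shift by one) versions of the current PMF."""
    pmf = np.array([1.0])
    for pi in p:
        pmf = np.append(pmf, 0.0) * (1 - pi) + np.append(0.0, pmf) * pi
    return pmf

def p_value_at_least(intervals, k_observed):
    """P(at least k_observed hits) under the null of uninformed guessing."""
    pmf = poisson_binomial_pmf(hit_probabilities(intervals))
    return pmf[k_observed:].sum()

# Example: the three guessed intervals from the question, of which
# (say) two turned out to contain the true height.
guesses = [(162, 180), (152, 154), (134, 155)]
print(p_value_at_least(guesses, k_observed=2))
```

Declaring the person clairvoyant when this p-value falls below a significance level fixed in advance gives exactly the kind of cut-off the question asks for.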

1 Answer


Don't use the hit rate as a quality measure for interval predictions. (Or if you do, do not be surprised if your winning algorithm predicts an interval of $(0,300)$ for all instances and gets a hit rate of 100%.)

Your quality measure needs to balance coverage and length of the prediction intervals: yes, we want high coverage, but we also want short intervals.

There is a quality measure that does precisely this and has attractive properties: the interval score. Let $\ell$ and $u$ be the lower and the upper end of the prediction interval. The score is given by

$$ S(\ell,u,h) = (u-\ell)+\frac{2}{\alpha}(\ell-h)1(h<\ell)+\frac{2}{\alpha}(h-u)1(h>u). $$

Here $1$ is the indicator function, and $\alpha$ sets the coverage your algorithm is aiming for: the score is designed for a central $(1-\alpha)\times 100\%$ prediction interval. (You will need to prespecify this, based on what you plan on doing with the prediction interval. It makes no sense to aim for 100% coverage, i.e. $\alpha \to 0$, because the resulting intervals will be too wide to be useful for anything.)
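In code, the score is a one-liner. Here is a minimal Python sketch (the function name is mine); it works elementwise on NumPy arrays as well as on scalars:

```python
def interval_score(l, u, h, alpha):
    """Interval score of Gneiting & Raftery (2007): interval length plus
    penalties, weighted by 2/alpha, when h falls outside (l, u)."""
    return ((u - l)
            + (2 / alpha) * (l - h) * (h < l)
            + (2 / alpha) * (h - u) * (h > u))

# A hit only costs the interval's length ...
print(interval_score(162, 180, 170, alpha=0.2))  # 18
# ... while a miss adds a penalty growing with the distance to the interval.
print(interval_score(152, 154, 160, alpha=0.2))  # 2 + 10*(160-154) = 62
```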

You can then average the interval score over many predictions. The lower the average score, the better. See Gneiting & Raftery (2007, JASA) for a discussion and pointers to further literature. A scaled version of this score was used, for instance, to assess prediction intervals in the recent M4 forecasting competition.

Now, as to whether your algorithm is clairvoyant or your black box "works"... well, you will need to figure out whether it is "clairvoyant enough". A clairvoyant should be able to perfectly predict all heights, shouldn't they? So all $u=\ell=h$, and the score should be zero. This sounds like a rather high (or low) bar to clear. So the question really is whether your algorithm is good enough, or better than some competing algorithm or a simple benchmark. For instance, you should certainly test whether your algorithm performs better than just taking empirical intervals over all your training data, which is the simplest naive benchmark (a sketch of this comparison follows). This may be helpful once you have arrived at this stage.
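To make the benchmark comparison concrete, here is a sketch with simulated data. The normal population, the sample sizes, and the "clairvoyant" stand-in (person-specific intervals centred near the truth) are all invented for the illustration; in practice you would plug in the real predictions:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.2  # aiming for a central 80% prediction interval

def interval_score(l, u, h, alpha):
    # interval score as defined above; works elementwise on arrays
    return (u - l) + (2/alpha)*(l - h)*(h < l) + (2/alpha)*(h - u)*(h > u)

# Hypothetical population: heights ~ N(170, 10) cm.
train = rng.normal(170, 10, size=1000)
test = rng.normal(170, 10, size=500)

# Naive benchmark: the same empirical (q_.10, q_.90) interval for everyone.
lo, hi = np.quantile(train, [alpha / 2, 1 - alpha / 2])
bench = interval_score(lo, hi, test, alpha).mean()

# Stand-in for the clairvoyant: intervals centred on a noisy version of
# the truth. Replace with the actual per-person predictions.
centres = test + rng.normal(0, 4, size=test.size)
clair = interval_score(centres - 8, centres + 8, test, alpha).mean()

print(bench, clair)  # the lower average score wins
```

Whether a difference in average scores is meaningful can then be checked with a paired test on the per-person score differences.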

Stephan Kolassa
  • This scoring rule would give the same score to a clairvoyant who chooses some interval and consistently sees the true height come in half an interval too low, and one who chooses an interval twice the size and is always correct. I agree that the problem becomes trivial if you allow arbitrarily large prediction intervals, but it seems strange that this might rank an always-wrong prediction as better than an always-right prediction with a bigger interval. – Nuclear Hoagie May 11 '20 at 18:43
  • @NuclearWang: I'm afraid I don't quite follow. What is "half an interval too low" and "always correct" in this context? Can you perhaps illustrate with some specific numbers? – Stephan Kolassa May 11 '20 at 18:59
  • Suppose they predict [140, 160], and the actual height is 130 (half the interval range below the lower bound); this prediction is incorrect. Taking $\alpha=1$, the score is $20 + 2 \times 10 = 40$. If the prediction is [130, 170], the score is also 40, but this prediction is correct. Extending this to many samples, the score suggests that the incorrect predictor is just as "good" as the correct one, just because its intervals are smaller. – Nuclear Hoagie May 11 '20 at 19:10
  • @NuclearWang: $\alpha=1$ does not make sense; we need $0<\alpha<1$. – Stephan Kolassa May 11 '20 at 19:29
  • @NuclearWang: also, I don't think looking at precise outcomes is very helpful. After all, we are evaluating prediction intervals here. So let's posit some true distribution, say $N(0,1)$. Then the interval score is minimized in expectation by the correct $(q_{\frac{\alpha}{2}},q_{1-\frac{\alpha}{2}})$ prediction interval, and there is no possibility for "gaming" it, i.e., getting a lower expected score by specifying some other interval. – Stephan Kolassa May 12 '20 at 07:20
  • But changing alpha just changes which widths of prediction intervals are "equivalent"; nothing fundamentally changes with a different alpha. In my example, when alpha=0.5, an incorrect prediction interval is as good as a correct one that's 3x as big: for a value of 130, the incorrect interval [140, 160] now gives a score of 60, just like the correct interval [120, 180]. We're still saying that incorrect predictions with smaller intervals are as good as correct ones with larger intervals, and I don't quite get why we'd want to score a predictor that's correct the same as one that's not. – Nuclear Hoagie May 12 '20 at 12:54
  • @NuclearWang: take a look at my last comment. Looking at single outcomes is not useful. We have to look at prediction intervals for *distributions*, and assess scores *in expectation*. Assume the correct distribution is $U[90,210]$ and $\alpha=0.5$, so we are aiming for a $(q_{.25},q_{.75})$ PI. The interval $(120,180)$ is the correct one; it gives an expected score of 90. The misspecified interval $(140,160)$ gives an expected score of about 103. (I simulated; see the sketch below.) That is, minimizing the score pulls us towards reporting the correct interval. – Stephan Kolassa May 12 '20 at 13:17
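The simulation from the last comment is easy to reproduce. A minimal sketch (the sample size and seed are mine); the exact expectations work out to $90$ and $\approx 103.3$:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.5
h = rng.uniform(90, 210, size=1_000_000)  # the posited true distribution

def interval_score(l, u, h, alpha):
    # interval score as defined in the answer; elementwise on arrays
    return (u - l) + (2/alpha)*(l - h)*(h < l) + (2/alpha)*(h - u)*(h > u)

# Correct central 50% interval vs. the too-narrow one from the comment.
print(interval_score(120, 180, h, alpha).mean())  # ~90
print(interval_score(140, 160, h, alpha).mean())  # ~103.3
```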