I'd like to know how to compare model predictions with binary data in the following example, and be pointed to more on the subject.
Specific Example - Comparing weathermen:
It either rains or doesn't rain each day; the long-run (ensemble) probability of rain is 0.5.
There are two weathermen. The lazy one (weatherman A) just says there's a 50% chance of rain every day, but the hard-working one (weatherman B) always gives either an 80% or a 20% chance of rain.
Weathermen A and B are both correct over a long period of time, in the sense that the probabilities they state match the observed frequencies of rain.
Weatherman B's confusion matrix:
$\begin{array}{c | c c} & \textrm{shine} & \textrm{rain} \\ \hline \textrm{shine} & 0.4 & 0.1 \\ \textrm{rain} & 0.1 & 0.4 \end{array}$
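As a sanity check, the matrix above can be reproduced from the setup. This is only a minimal sketch: I'm assuming that weatherman B issues the 80% and 20% forecasts equally often (which is what a 0.5 base rate requires) and that each stated probability matches the actual frequency of rain. The helper name `joint_table` and the reading of rows as forecasts and columns as outcomes are my own choices for illustration.

```python
def joint_table(forecast_probs, forecast_weights):
    """Joint probabilities of (forecast, outcome) for a calibrated forecaster.

    forecast_probs:   stated probabilities of rain, e.g. [0.2, 0.8]
    forecast_weights: how often each forecast is issued (assumed; sums to 1)
    Returns {stated probability: (P(forecast, shine), P(forecast, rain))}.
    """
    table = {}
    for p, w in zip(forecast_probs, forecast_weights):
        # Calibration assumption: given a stated probability p, it rains with probability p.
        table[p] = (w * (1 - p), w * p)
    return table


# Weatherman B: 20% and 80% forecasts, each issued half the time (assumed).
b = joint_table([0.2, 0.8], [0.5, 0.5])
# Weatherman A: always 50%.
a = joint_table([0.5], [1.0])

for name, table in (("A", a), ("B", b)):
    for p, (shine, rain) in table.items():
        print(f"{name} says {p:.0%} rain: P(shine)={shine:.2f}, P(rain)={rain:.2f}")
```

Under these assumptions the "20%" row comes out as roughly (0.4, 0.1) and the "80%" row as roughly (0.1, 0.4), which agrees with the matrix, while weatherman A has a single row of (0.5, 0.5).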
Question(s)
It's clear that weatherman B is better, since his predictions are actually useful, but how would one mathematically justify that weatherman B is better?
One ad hoc metric I've come up with is
$1-\left( P(\textrm{shine} \mid \textrm{shine}) + P(\textrm{rain} \mid \textrm{rain}) \right)$
but this metric would break down in a place such as Arizona, where it's almost always "shine".
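To make the metric concrete, here is a rough sketch of how I'd compute it from a joint table. It assumes that rows are forecasts and columns are outcomes, and that $P(x \mid x)$ means "probability the outcome is $x$ given the forecast was $x$"; the function name `ad_hoc_metric` and the `always_shine` table are hypothetical and only there to illustrate one way the metric can fail.

```python
import math


def ad_hoc_metric(joint):
    """1 - (P(shine | shine) + P(rain | rain)) from a 2x2 joint table.

    joint[i][j] = P(forecast i, outcome j), with index 0 = shine, 1 = rain
    (the row/column convention is an assumption).
    Returns NaN if some forecast is never issued, since the corresponding
    conditional probability is then undefined.
    """
    total = 0.0
    for k in (0, 1):
        p_forecast = joint[k][0] + joint[k][1]  # marginal P(forecast k)
        if p_forecast == 0:
            return math.nan
        total += joint[k][k] / p_forecast  # P(outcome k | forecast k)
    return 1 - total


# Weatherman B's matrix from above.
weatherman_b = [[0.4, 0.1], [0.1, 0.4]]

# Hypothetical desert forecaster who never forecasts rain (made-up numbers).
always_shine = [[0.95, 0.05], [0.0, 0.0]]

print(ad_hoc_metric(weatherman_b))
print(ad_hoc_metric(always_shine))
```

With this reading, weatherman B scores about -0.6, and the forecaster who never says "rain" gets NaN because $P(\textrm{rain} \mid \textrm{rain})$ would require dividing by zero.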