For predicted labels $\hat{y}$ and true labels $y\in\{0,1\}$, the confusion matrix is given by
\begin{array}{c|c:c|c}
& y=0 & y=1 & \\
\hline
\hat{y}=0 & \mathrm{TN} & \mathrm{FN} & \hat{\mathrm{N}} \\
\hdashline
\hat{y}=1 & \mathrm{FP} & \mathrm{TP} & \hat{\mathrm{P}} \\
\hline
& \mathrm{N} & \mathrm{P} & (n_{\mathrm{obs}})
\end{array}
where the entries are counts, $\mathrm{N}$ = "Negative", $\mathrm{P}$ = "Positive", $\mathrm{T}$ = "True", and $\mathrm{F}$ = "False".
The confusion matrix proper is contained within the solid-outlined box, to which I have added the column sums ($\mathrm{N}$, $\mathrm{P}$), row sums ($\hat{\mathrm{N}}$, $\hat{\mathrm{P}}$), and total sum ($n_{\mathrm{obs}}$ = number of paired observations).
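As a concrete illustration, here is a minimal NumPy sketch that tallies these counts and margins from a pair of label vectors (the example arrays `y` and `y_hat` are made up for this post):

```python
import numpy as np

# Hypothetical example labels (any paired 0/1 vectors would do).
y     = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])   # true labels
y_hat = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 0])   # predicted labels

# Counts laid out exactly as in the table: rows = y_hat, columns = y.
TN = np.sum((y_hat == 0) & (y == 0))
FN = np.sum((y_hat == 0) & (y == 1))
FP = np.sum((y_hat == 1) & (y == 0))
TP = np.sum((y_hat == 1) & (y == 1))
cm = np.array([[TN, FN],
               [FP, TP]])

# Margins: row sums (N_hat, P_hat), column sums (N, P), and the total.
N_hat, P_hat = cm.sum(axis=1)
N, P         = cm.sum(axis=0)
n_obs        = cm.sum()
```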
The confusion matrix is essentially an empirical estimate of the joint distribution of $\hat{y}$ and $y$, i.e. when the entries are normalized by $n_{\mathrm{obs}}$ we get
\begin{array}{c|c:c|c}
& y=0 & y=1 & \\
\hline
\hat{y}=0 & p[\sim\!\hat{y},\sim\!y] & p[\sim\!\hat{y},\phantom{\sim\!}y] & p[\sim\!\hat{y}] \\
\hdashline
\hat{y}=1 & p[\phantom{\sim}\,\hat{y},\sim\!y] & p[\phantom{\sim}\,\hat{y},\phantom{\sim\!}y] & p[\phantom{\sim}\,\hat{y}] \\
\hline
& p[\phantom{\sim\hat{y}}\sim\!y] & p[\phantom{\sim\hat{y},}\,y] & (1)
\end{array}
where I have switched to a Boolean-style notation with $\sim$ = "not".
In the margins of the table (outside the box), the normalized row and column sums are now the marginal probabilities.
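Continuing the sketch above, the normalization is a single division, and summing the normalized table along an axis recovers those marginal probabilities:

```python
# Empirical joint distribution: joint[i, j] = p[y_hat = i, y = j].
joint = cm / n_obs

# Marginals are the normalized row and column sums from the table margins.
p_y_hat = joint.sum(axis=1)   # (p[~y_hat], p[y_hat])
p_y     = joint.sum(axis=0)   # (p[~y],     p[y])
assert np.isclose(joint.sum(), 1.0)
```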
Within this framework, many of the standard confusion-matrix-based metrics correspond directly to the various conditional probabilities of the above joint distribution.
If we condition on $\boldsymbol{y}$ (i.e. divide each column of the count table by its column sum, $\mathrm{N}$ or $\mathrm{P}$), the table becomes
\begin{array}{|c:c|}
\hline
p[\sim\!\hat{y}\mid\sim\!y] & p[\sim\!\hat{y}\mid\phantom{\sim\!}y] \\
\hdashline
p[\phantom{\sim}\,\hat{y}\mid\sim\!y] & p[\phantom{\sim}\,\hat{y}\mid\phantom{\sim\!}y] \\
\hline
\end{array}
where the entries correspond to the metrics
\begin{array}{|c:c|}
\hline
\text{specificity}
&
\text{miss rate} \\
\hdashline
\text{fall-out}
&
\text{sensitivity (recall)}
\\
\hline
\end{array}
(Note that these metrics can also be referred to by appending "rate" to the corresponding name from the confusion matrix: e.g. sensitivity is the true positive rate (TPR) and fall-out is the false positive rate (FPR).)
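In code, conditioning on $y$ is just dividing each column of the count table by its column sum, e.g. $\text{sensitivity} = p[\hat{y}\mid y] = p[\hat{y},y]/p[y] = \mathrm{TP}/\mathrm{P}$. Continuing the sketch:

```python
# Divide each column by its sum (N or P): each column becomes a
# conditional distribution over y_hat given the true label.
cond_on_y = cm / cm.sum(axis=0, keepdims=True)

specificity = cond_on_y[0, 0]   # p[~y_hat | ~y] = TN / N
miss_rate   = cond_on_y[0, 1]   # p[~y_hat |  y] = FN / P
fall_out    = cond_on_y[1, 0]   # p[ y_hat | ~y] = FP / N
sensitivity = cond_on_y[1, 1]   # p[ y_hat |  y] = TP / P  (recall)

assert np.allclose(cond_on_y.sum(axis=0), 1.0)  # columns sum to one
```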
Alternatively, if we condition on $\boldsymbol{\hat{y}}$ (i.e. divide each row of the count table by its row sum, $\hat{\mathrm{N}}$ or $\hat{\mathrm{P}}$), the table becomes
\begin{array}{|c:c|}
\hline
p[\sim\!y\mid\sim\!\hat{y}] & p[\phantom{\sim\!}y\mid\sim\!\hat{y}] \\
\hdashline
p[\sim\!y\mid\phantom{\sim}\hat{y}] & p[\phantom{\sim\!}y\mid\phantom{\sim}\hat{y}] \\
\hline
\end{array}
where the entries correspond to the metrics
\begin{array}{|c:c|}
\hline
\text{negative predictive value}
&
\text{false omission rate}^* \\
\hdashline
\text{false discovery rate}
&
\text{positive predictive value (precision)}
\\
\hline
\end{array}
(*This one was not in Wikipedia except in their "big table". I was curious why it was the only one of the conditional probabilities not given a special name.)
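In code, conditioning on $\hat{y}$ divides each row by its row sum instead, e.g. $\text{precision} = p[y\mid\hat{y}] = \mathrm{TP}/\hat{\mathrm{P}}$. Continuing the same sketch:

```python
# Divide each row by its sum (N_hat or P_hat): each row becomes a
# conditional distribution over y given the prediction.
cond_on_y_hat = cm / cm.sum(axis=1, keepdims=True)

npv                 = cond_on_y_hat[0, 0]   # p[~y | ~y_hat] = TN / N_hat
false_omission_rate = cond_on_y_hat[0, 1]   # p[ y | ~y_hat] = FN / N_hat
fdr                 = cond_on_y_hat[1, 0]   # p[~y |  y_hat] = FP / P_hat
precision           = cond_on_y_hat[1, 1]   # p[ y |  y_hat] = TP / P_hat (PPV)

assert np.allclose(cond_on_y_hat.sum(axis=1), 1.0)  # rows sum to one
```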