I am using the caret package to perform predictive modeling on a binary target variable. The outcome is very unbalanced, so it is suggested to use the Kappa statistic to evaluate the binary classifier. I am trying to evaluate the performance of various predictive models on a hold-out dataset where I have the scores of the models (estimated probabilities) and the actual observations (N/Y). AFAIK, the Kappa statistic is $K=\frac{O-E}{1-E}$, where $O$ and $E$ are the observed and expected (no-information) accuracy; in my case $E$ is the share of the majority class, i.e. 1 - mean(outcome). My questions are:

- How can I estimate $O$? I suppose I need to define a cutoff to allocate each observation to the category N/Y. Is 0.5 the right cutoff? Is $K$ insensitive to the cutoff?
- Is there an R function that does this?
1 Answer
For a binary classification task, kappa equals: $$\kappa=\frac{p_o-p_c}{1-p_c}$$ The values of $p_o$ and $p_c$ can be calculated from a contingency table as below, where $L$ is the trusted label and $P$ is the predicted value. The cells $a$ through $d$ contain the counts of objects with each combination of $L$ and $P$. $$ \begin{array}{|l|c|c|} \hline & L=1 & L=0 \\ \hline P=1& a & b \\ P=0& c & d \\ \hline \end{array} $$
$$n=a+b+c+d$$
Observed agreement is the proportion of objects where the predicted value matches the trusted label.
$$p_o=\frac{a+d}{n}$$
Finally, chance agreement is estimated (for Cohen's kappa) from the marginal proportions of the table, under the assumption that the predictions and the true labels are independent.
$$p_c=\bigg(\frac{a+b}{n}\bigg)\bigg(\frac{a+c}{n}\bigg)+\bigg(\frac{c+d}{n}\bigg)\bigg(\frac{b+d}{n}\bigg)$$
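To make the arithmetic concrete, here is a minimal R sketch that computes $\kappa$ directly from the four cell counts; the values of $a$ through $d$ are made up for illustration:

```r
# Hypothetical cell counts from the contingency table above
a  <- 40    # P = 1, L = 1
b  <- 10    # P = 1, L = 0
c_ <- 5     # P = 0, L = 1 (named c_ to avoid masking base R's c())
d  <- 945   # P = 0, L = 0
n  <- a + b + c_ + d

p_o <- (a + d) / n                          # observed agreement
p_c <- ((a + b) / n) * ((a + c_) / n) +
       ((c_ + d) / n) * ((b + d) / n)       # chance agreement
kappa <- (p_o - p_c) / (1 - p_c)
kappa
```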
Since your algorithm outputs continuous scores rather than discrete predictions, you will need to dichotomize the scores using a threshold (i.e., a cutoff value). Note that $\kappa$ is computed from the dichotomized predictions, so it is *not* insensitive to the cutoff: changing the threshold changes the contingency table and hence $\kappa$. You can try different threshold values, although most algorithms are optimized with one in mind. For instance, SVMs are usually optimized so that a threshold of $0$ distance to the class-separating hyperplane works best within the training set. I would guess that a threshold of $0.5$ would work best if the output scores are probabilities. If you want to visualize the trade-offs inherent in using different threshold values, you can generate a receiver operating characteristic (ROC) curve or a cost curve. However, to calculate $\kappa$ you will need to select a specific threshold, as in the sketch below.
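As for an R function: caret itself computes Kappa. Once you have dichotomized the scores, confusionMatrix() reports Kappa alongside accuracy. A minimal sketch, assuming your hold-out scores and actual labels live in vectors named score and outcome (hypothetical data):

```r
library(caret)

# Hypothetical hold-out data: estimated probabilities and observed N/Y labels
score   <- c(0.91, 0.05, 0.62, 0.13, 0.78, 0.02, 0.44, 0.09)
outcome <- factor(c("Y", "N", "Y", "N", "N", "N", "Y", "N"),
                  levels = c("N", "Y"))

# Dichotomize at a 0.5 cutoff; rerun with other cutoffs to see Kappa change
pred <- factor(ifelse(score >= 0.5, "Y", "N"), levels = c("N", "Y"))

# confusionMatrix() reports Accuracy and Kappa among its overall statistics
cm <- confusionMatrix(pred, outcome, positive = "Y")
cm$overall[c("Accuracy", "Kappa")]
```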
