What are the strengths and weaknesses of using the highest Cohen's kappa score to select a classifier from a pool of candidates?
Back-story:
I had a pool of candidate classifiers (~200 candidate models), and I did the best I could to find the "strongest".
For each classifier I did this:
- compute the true positive rate (correctly estimated true / total actual true)
- compute the true negative rate (correctly estimated false / total actual false)
- multiply the two rates together to create a score
Then I ranked the candidates by score and picked the highest. It worked out okay, but I like to revisit past problems; I'm always in the market for a better analytic tool.
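A minimal sketch of that procedure in Python (the names `tpr_tnr_score`, `candidates`, and the count ordering are illustrative, not from my actual pipeline):

```python
# Score each candidate by TPR * TNR and keep the highest-scoring one.
# Counts are (tp, fn, fp, tn), matching the matrices below
# (rows = estimated, columns = actual).

def tpr_tnr_score(tp, fn, fp, tn):
    tpr = tp / (tp + fn)  # correctly estimated true / total actual true
    tnr = tn / (tn + fp)  # correctly estimated false / total actual false
    return tpr * tnr

# candidates: {model_name: (tp, fn, fp, tn)} -- illustrative counts
candidates = {
    "model_a": (45, 25, 16, 14),
    "model_b": (45, 14, 16, 25),
}

best = max(candidates, key=lambda name: tpr_tnr_score(*candidates[name]))
print(best)  # -> model_b
```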
Proposed approach:
If I have this data: $$ \begin{matrix} & Actual \, True & Actual \, False \\ Est \, True & 45 & 16\\ Est \, False & 25 & 14 \end{matrix} $$
and I compute Cohen's kappa of the estimated versus actual labels, I get this:
$$ \kappa = \frac{p_o - p_e}{1 - p_e} $$ where $$ p_o = \frac{45 + 14}{45 + 16 + 25 + 14} = 0.59 $$ and the chance-agreement terms are the products of the marginal probabilities: $$ p_{True} = p_{est\,true} \cdot p_{actual\,true} = \frac{45+16}{100} \cdot \frac{45+25}{100} = 0.427 $$ $$ p_{False} = p_{est\,false} \cdot p_{actual\,false} = \frac{25+14}{100} \cdot \frac{16+14}{100} = 0.117 $$ therefore $$ p_e = p_{True} + p_{False} = 0.427 + 0.117 = 0.544 $$
so the Cohen's kappa for this is:
$$ \kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.59 - 0.544}{1 - 0.544} \approx 0.101 $$
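Here is the same hand calculation as a small Python check (pure arithmetic, no libraries; the function name and argument order are just for illustration):

```python
def cohen_kappa(tp, fp, fn, tn):
    # Rows = estimated, columns = actual, matching the matrix above.
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n                          # observed agreement
    p_true = ((tp + fp) / n) * ((tp + fn) / n)   # chance agreement on "true"
    p_false = ((fn + tn) / n) * ((fp + tn) / n)  # chance agreement on "false"
    p_e = p_true + p_false
    return (p_o - p_e) / (1 - p_e)

print(cohen_kappa(tp=45, fp=16, fn=25, tn=14))  # -> ~0.101
```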
If I had an alternative, better estimator with the following confusion matrix: $$ \begin{matrix} & Actual \, True & Actual \, False \\ Est \, True & 45 & 16\\ Est \, False & 14 & 25 \end{matrix} $$
then $p_o = 0.70$ and $p_e = 0.61 \cdot 0.59 + 0.39 \cdot 0.41 = 0.5198$, so the kappa is $\approx 0.375$.
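As a cross-check, scikit-learn's `cohen_kappa_score` gives the same value; it takes two label vectors rather than a confusion matrix, so the counts are expanded here (assuming scikit-learn is installed):

```python
from sklearn.metrics import cohen_kappa_score

# Expand the second confusion matrix into label vectors:
# 45 (est 1, act 1), 16 (est 1, act 0), 14 (est 0, act 1), 25 (est 0, act 0).
actual    = [1] * 45 + [0] * 16 + [1] * 14 + [0] * 25
estimated = [1] * 45 + [1] * 16 + [0] * 14 + [0] * 25

print(cohen_kappa_score(actual, estimated))  # -> ~0.375
```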
If these were the two classifiers to choose between, I would be wise to prefer the one with the kappa of 0.375.
Extended questions:
- Does this approach have a name beyond "using Cohen's kappa"? Is it equivalent to another, better-known and more thoroughly studied method? Is it textbook?
- Are there known problems with this approach? What are the weaknesses here?
UPDATE:
- I was using the KNIME "Scorer" node and noticed that Cohen's kappa is given as a measure of learner performance.