Cohen's kappa was originally intended to be applied to human classifiers. You can find this described in his article, "A coefficient of agreement for nominal scales", where he motivates the statistic with the following example:
... two psychiatrists independently making a schizophrenic-nonschizophrenic distinction on outpatient clinic admissions might report 82 percent agreement, which sounds pretty good. But is it? Assume for a moment that instead of carefully interviewing every admission, each psychiatrist classifies 10 percent of the admissions as schizophrenic, but does so blindly, i.e., completely at random. Then the expectation is that they will jointly "diagnose" .10 × .10 = .01 of the sample as schizophrenic and .90 × .90 = .81 as nonschizophrenic, a total of .82 "agreement," obviously a purely chance result. This is no more impressive than an ESP demonstration of correctly calling the results of coin tosses blindfolded 50 percent of the time!

The example is a bit extreme, but the principle it illustrates is unexceptionable: the proportion or percent of observed agreement between two judges assigning cases to a set of k mutually exclusive, exhaustive categories inevitably contains an expected proportion that is entirely attributable to chance.
Thus was kappa born. It is simply the proportion of agreement corrected for chance.
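Concretely, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. Here is a minimal sketch that plugs in the numbers from Cohen's example above (the function name cohens_kappa is my own, not from the paper):

```python
def cohens_kappa(p_observed, p_expected):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    return (p_observed - p_expected) / (1.0 - p_expected)

# Each psychiatrist calls 10% of cases "schizophrenic" at random, so the
# chance agreement is .10*.10 + .90*.90 = .82 -- exactly the 82% reported.
p_e = 0.10 * 0.10 + 0.90 * 0.90
p_o = 0.82

print(cohens_kappa(p_o, p_e))  # ~0.0: no agreement beyond chance
```

So an impressive-sounding 82 percent raw agreement collapses to a kappa of zero once the chance component is removed.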
That said, it has been used by others to evaluate statistical classifiers. See, for example, the papers "Chance-corrected Classification for Use in Discriminant Analysis: Ecological Applications" and "Comparison of classification accuracy using Cohen's Weighted Kappa".
Or you can follow the post "Cohen's kappa in plain English" as well.
To your question, "Are there any practical (or theoretical) concerns related to that when using the Kappa statistic in performance evaluation of a classification model?": to my knowledge, no research has identified concerns with applying an evaluation tool intended for human raters to statistical classifiers. However, it is widely used for that purpose, so you can have some confidence that using this tool for non-human classifiers is not a problem.
If you have two classifiers, you can use the statistic to evaluate their agreement with each other, but you should not use it to compare a classifier against the "true" values, since (it is assumed) the "true" values are factual rather than estimates.
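As a rough, self-contained sketch of that use, scikit-learn provides cohen_kappa_score; the synthetic data and the two example models below are placeholders, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import cohen_kappa_score

# Placeholder data and models, just to produce two sets of predictions.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pred_a = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict(X_test)
pred_b = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

# Kappa between the two classifiers' predictions: agreement corrected for chance.
print(cohen_kappa_score(pred_a, pred_b))
```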
Hope this helps.