I have a classifier Y that selects between three categories: A, B and C.
I need to be able to show quantitatively that my model is better (and by how much) than a random classifier R that randomly picks among categories A, B, and C.
I intend to proceed as follows (a rough sketch of these steps appears after the list):
- Generate classifications using classifier R
- Generate a confusion matrix for the output of classifier R
- Generate classifications using classifier Y
- Generate a confusion matrix for the output of classifier Y
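
Here is a minimal sketch of what I have in mind. The data and variable names (`y_true`, `y_pred_Y`, etc.) are placeholders standing in for my real labels and model output, not my actual code:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
labels = ["A", "B", "C"]

# Placeholder data standing in for the real labels and the output of Y.
y_true = rng.choice(labels, size=1000)
y_pred_Y = y_true.copy()                      # pretend Y is often right ...
flip = rng.random(len(y_true)) < 0.3          # ... but wrong ~30% of the time
y_pred_Y[flip] = rng.choice(labels, size=flip.sum())

# Classifier R: picks uniformly at random among A, B, C.
y_pred_R = rng.choice(labels, size=len(y_true))

cm_R = confusion_matrix(y_true, y_pred_R, labels=labels)
cm_Y = confusion_matrix(y_true, y_pred_Y, labels=labels)
print("Confusion matrix for R:\n", cm_R)
print("Confusion matrix for Y:\n", cm_Y)
```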
However, having generated the two confusion matrices above, I'm not sure how to use them to solve my problem.
The "intuition" behind using the confusion matrices is that I can "visually" check and "compare" the sensitivity, specificity etc between the models etc.
I would like to be able to use the confusion matrices (if possible) to perform some test of the null hypothesis that the Bookmaker Informedness of Y is no better than that of R.
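
For reference, this is roughly how I am computing Bookmaker Informedness from a single confusion matrix. For the multiclass case I am assuming a one-vs-rest generalisation (per-class recall minus false-positive rate, averaged with prevalence weights); that weighting is my own assumption, so please correct me if it is not the standard definition:

```python
import numpy as np

def bookmaker_informedness(cm: np.ndarray) -> float:
    """Prevalence-weighted one-vs-rest informedness (my assumed multiclass form).

    cm is a square confusion matrix with rows = true class, columns = predicted class.
    """
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    per_class = []
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fn = cm[k, :].sum() - tp
        fp = cm[:, k].sum() - tp
        tn = n - tp - fn - fp
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
        per_class.append(recall - fpr)      # binary informedness = TPR - FPR
    prevalence = cm.sum(axis=1) / n
    return float(np.dot(prevalence, per_class))

# e.g. compare bookmaker_informedness(cm_Y) with bookmaker_informedness(cm_R)
```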
Can anyone help with how I can test this hypothesis, given data from the two confusion matrices?