Why not use more thresholds? Like:
- If probability < 0.25 then class = "weak"
- If 0.25 <= probability <= 0.75 then class = "mediocre"
- If probability > 0.75 then class = "strong"
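As a minimal sketch of what that banding might look like in practice (the probabilities below are made up purely for illustration):

```python
import numpy as np

# Hypothetical predicted probabilities for five new observations
probs = np.array([0.10, 0.60, 0.55, 0.80, 0.95])

def to_band(p):
    """Map a predicted probability to one of the three bands above."""
    if p < 0.25:
        return "weak"
    elif p <= 0.75:
        return "mediocre"
    else:
        return "strong"

bands = [to_band(p) for p in probs]
print(bands)  # ['weak', 'mediocre', 'mediocre', 'strong', 'strong']
```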
But remember that if your goal is to correctly predict (in this case, classify) new observations, you won't be able to compare the truth (2 classes) with the predictions (3 classes).
The prediction labels must always be the same as the truth labels in a classification problem.
If you accept that your model is not perfect, you can still use the estimated probabilities as a "score" for each observation and apply the 3-class definition above. But you won't be able to say, for example, that the model has an accuracy of 80%, because the number of labels differs between the predictions and the truth.
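To make that concrete, here is a small sketch (with made-up labels and probabilities) showing that accuracy still has to be computed from a 2-class prediction, typically via a single cutoff, while the 3 bands can only serve as a descriptive score:

```python
import numpy as np

# Hypothetical ground truth (2 classes) and predicted probabilities
y_true = np.array([0, 0, 1, 1, 1])
probs  = np.array([0.10, 0.60, 0.55, 0.80, 0.95])

# Accuracy requires predictions on the same 2 labels as the truth,
# so a single cutoff (here 0.5) is still needed for that comparison.
y_pred = (probs > 0.5).astype(int)
accuracy = (y_pred == y_true).mean()
print(accuracy)  # 0.8 with these made-up numbers

# The 3-band labels ("weak"/"mediocre"/"strong") can be reported alongside,
# but only as a score; they cannot be compared element-wise with the
# 2-class truth, so no accuracy can be computed from them.
```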