
I fit a logistic regression model in R on an unbalanced dataset (the positive class is rare).

The problem is that I get a precision of 0.4 and a recall of 0.0018, so I want to modify the classification threshold (cutoff) in order to bring the two indicators (precision and recall) closer together.

Is there a function in R to modify the cutoff? I have seen some workarounds in Python, but the code I need is in R.
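
Roughly, this is where I am; the simulated data frame `df` below just stands in for my real data:

```r
## Sketch of my setup; `df` stands in for my real (imbalanced) data
set.seed(10)
df <- data.frame(x = rnorm(1000))
df$y <- rbinom(1000, size = 1, prob = plogis(-4 + df$x))   # very few 1s, as in my data

mod  <- glm(y ~ x, data = df, family = binomial)
pred <- predict(mod, type = "response")   # predicted probabilities in [0, 1]

## This is the step I am asking about: how do I turn `pred` into 0/1
## labels using a cutoff of my own choosing (not the usual 0.5)?
```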

asked by jfcb
  • Questions that are only about software (e.g. error messages, code or packages, etc.) are generally off topic here. If you have a substantive machine learning or statistical question, please edit to clarify. – gung - Reinstate Monica Mar 01 '20 at 16:25
  • The [docs](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/predict.glm) suggest that predictions give probabilities. As such, this question actually has nothing to do with `glm`. I think you may want to ask a question "How do I round a value in [0,1] according to a given cut-off value $x \in [0,1]$ so that anything less than $x$ returns 0 and anything greater than $x$ returns 1?" If you need to do that in `R`, it probably belongs on StackOverflow. – Him Mar 02 '20 at 17:11
  • 1
    Yes, my question is that "How do I round a value in [0,1] according to a given cut-off value x∈[0,1] so that anything less than x returns 0 and anything greater than x returns 1"? I really need that!, THanks – jfcb Mar 03 '20 at 18:54

1 Answer


Don't use [thresholds](https://stats.stackexchange.com/a/312124/1352) at all.

Don't use precision and recall. Every criticism that applies to [accuracy](https://stats.stackexchange.com/a/312787/1352) applies equally to precision and recall.

Unbalanced datasets are not a problem if you use appropriate quality measures (i.e., not accuracy, precision or recall).
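
To make "appropriate quality measures" concrete, here is a minimal sketch of the Brier score, a proper scoring rule that is discussed further in the comments below; the data are simulated purely for illustration:

```r
## Simulated imbalanced data, for illustration only
set.seed(1)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-4 + x))   # positives are rare

fit <- glm(y ~ x, family = binomial)
p   <- predict(fit, type = "response")            # predicted probabilities

## Brier score: mean squared difference between predicted probability
## and observed outcome. Lower is better, and no threshold is needed.
mean((p - y)^2)
```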

If you still feel you need to work with thresholds, simply use predicted probabilities with predict(..., type="response") (see ?predict.glm) and compare them with your threshold.
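
A sketch of that comparison, continuing with `fit`, `p` and `y` from the snippet above (the cutoff of 0.2 is an arbitrary choice for illustration, not a recommendation):

```r
## Turn probabilities into 0/1 labels by comparing them with a cutoff
cutoff <- 0.2                        # arbitrary; choose based on your application
y_hat  <- as.integer(p > cutoff)     # 1 if above the cutoff, 0 otherwise

## Cross-tabulate hard predictions against the observed outcome
table(predicted = y_hat, observed = y)

## The same works for new data, e.g. (with a hypothetical data frame `newdat`):
## p_new <- predict(fit, newdata = newdat, type = "response")
## as.integer(p_new > cutoff)
```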

answered by Stephan Kolassa
  • "if you use appropriate quality measures" Would you suggest anything in particular? Although analysis via a single cut off point may be quite narrow, the set of confusion matrices at every cut off contains a lot of information about the performance of a classifier. For binary classification problems, accuracy, precision and recall are essentially the confusion matrix. – Him Mar 01 '20 at 17:27
  • @Scott: proper scoring rules are the tool of choice, see the first two links. Per both links, I do not believe confusion matrices are useful. They are improper and misleading, and their very simplicity makes them doubly dangerous. – Stephan Kolassa Mar 01 '20 at 17:31
  • Neither of those links seems to suggest anything. They merely repeat your claim that confusion matrices are not useful. – Him Mar 01 '20 at 17:38
  • Indeed, the answer to this question "What are the consequences of deciding to treat a new observation as class 1 vs. 0? Do I then send out a cheap marketing mail to all 1s? Or do I apply an invasive cancer treatment with big side effects?" goes hand-in-hand with the contents of the confusion matrix since it tells you how many people without cancer are getting your invasive cancer treatment. – Him Mar 01 '20 at 18:59
  • Do you have some example code of: "If you still feel you need to work with thresholds, simply use predicted probabilities with predict(..., type="response") (see ?predict.glm) and compare them to your threshold" Thanks a lot in advance – jfcb Mar 01 '20 at 19:38
  • @Scott: have you looked at [my answer](https://stats.stackexchange.com/a/312787/1352) at the second thread I linked? There are multiple paragraphs on scoring rules as alternatives to accuracy, and as I write, the exact same criticisms that apply to accuracy apply equally to precision and recall, and therefore also to the confusion matrix. Also, [the first link](https://stats.stackexchange.com/a/312124/1352) explicitly addresses that very often we will not have *two* possible actions (treating a case as "positive" vs. "negative"), but more (collect more data if we are unsure). – Stephan Kolassa Mar 02 '20 at 09:27
  • @josecorti: `predict.glm(..., type="response")` will give you probabilistic predictions, i.e., numbers between zero and one. You can compare them to a `threshold` value using straightforward comparison operators. – Stephan Kolassa Mar 02 '20 at 09:30
  • @Scott: on the shortcomings of the confusion matrix, [here is another example](https://stats.stackexchange.com/a/405049/1352). Yes, I do write about this often, I'll admit it's a bugbear of mine. – Stephan Kolassa Mar 02 '20 at 09:32
  • Many of your criticisms are fair. However, I think that it is often necessary to use the confusion matrix precisely because of the reasons that your answers say not to. Very frequently, we make a decision based on the outcome of a classifier, and the intermediate probability calculation is irrelevant to how the model ends up being used. Arguably, people should take the pseudo-probabilistic output into account somehow, but ML systems being what they are these days, that rarely happens: the machine makes a decision totally independently of any human input. – Him Mar 02 '20 at 14:50
  • It's also worth noting that many of the criticisms of accuracy are of using accuracy *only*. The same applies to using recall and precision *only*, but those criticisms go away when one considers the entire confusion matrix. Not all of them, but one can never address every possible criticism of anything, I suppose. – Him Mar 02 '20 at 15:05