
I have a random forest classification model built on balanced positive and negative classes.

I am trying to estimate the number of false positives and false negatives in the new data, in order to select an appropriate threshold. However, I don't have a good estimate for the number of positives in the new data.

I know that N_negatives >> N_positives, so I can estimate false_positives = N * FPR. Is it possible to estimate the number of true positives?
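
To make the gap concrete (the numbers here are only illustrative): with, say, 100,000 new cases and an FPR of 0.01 estimated from training, I would expect roughly 100,000 * 0.01 = 1,000 false positives. The analogous calculation true_positives = N_positives * TPR would need an estimate of N_positives, which is exactly what I don't have.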

thc
    Note that the False Discovery Rate and the False Positive Rate are two very different things. I took the liberty of editing your post. – Stephan Kolassa Nov 15 '17 at 07:27
  • I think you may be confused about the FDR and the FPR. [The FDR is the expected proportion of rejected null hypotheses that are actually true.](http://engr.case.edu/ray_soumya/mlrg/controlling_fdr_benjamini95.pdf) It is specific to the null hypothesis significance testing (NHST) framework. The FPR is the rate at which cases that are truly negative are wrongly classified as positive. It can be used outside NHST. You *could* say that "the FDR is the FPR for null hypotheses", but you have zero NHST content, so the FDR is meaningless here. – Stephan Kolassa Nov 15 '17 at 07:52
  • I'm using the definition here: https://en.wikipedia.org/wiki/Precision_and_recall#Definition_.28classification_context.29. FDR = `false_positive / predicted_positive`. That's the value I'm interested in. – thc Nov 15 '17 at 17:21

1 Answer


If you don't know which cases in your new data are truly positive and which are truly negative, then you cannot say whether any particular classification is a true or a false positive (or negative).

Therefore, you cannot estimate the False Positive Rate on the new data, and you cannot compute false_positives = N * FPR either, because you don't know the FPR there.

If you truly need this, then you could go back one step: partition your training data (where you do know the true positives and negatives, right?) into a training and a test sample, and then assess the FPR on that test sample.
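
For concreteness, here is a minimal sketch of that idea in Python/scikit-learn; the synthetic data, forest settings and cutoff are placeholders rather than anything from your actual setup:

```python
# Sketch: estimate the FPR at a given threshold on a held-out split
# of the labelled training data (synthetic stand-in data below).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Balanced synthetic data as a placeholder for your labelled training set
X, y = make_classification(n_samples=2000, weights=[0.5, 0.5], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

threshold = 0.5  # the cutoff under consideration
y_pred = (clf.predict_proba(X_test)[:, 1] >= threshold).astype(int)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
fpr = fp / (fp + tn)  # estimated False Positive Rate at this threshold
print(f"Estimated FPR at threshold {threshold}: {fpr:.3f}")
```

Repeating the thresholding and confusion-matrix steps over a grid of cutoffs shows how the estimated FPR moves with the threshold.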

I also recommend reading more on why choosing a threshold for hard zero-one classification is a bad idea here and in the linked blog posts by Frank Harrell. In addition, if you have a balanced training sample but an unbalanced test sample, then your training sample differs systematically from the true population you want to apply your model to, which will bias your model. It is better to train on a representative sample and to work with probabilistic predictions.
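
To illustrate the last point, here is a sketch (the probabilities and costs are made-up placeholders) of keeping the probabilistic predictions and only turning them into decisions once the costs of the two error types are specified:

```python
# Sketch: derive a decision threshold from error costs instead of
# fixing an arbitrary cutoff. All numbers are illustrative.
import numpy as np

p = np.array([0.05, 0.40, 0.85, 0.97])  # predicted P(positive) for four cases

cost_fp = 1.0   # cost of acting on a case that is actually negative (placeholder)
cost_fn = 10.0  # cost of ignoring a case that is actually positive (placeholder)

# Acting is worthwhile when p * cost_fn > (1 - p) * cost_fp,
# i.e. when p exceeds the cost-derived threshold:
threshold = cost_fp / (cost_fp + cost_fn)
decisions = p >= threshold
print(threshold, decisions)  # prints ~0.0909 and [False  True  True  True]
```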

Stephan Kolassa
  • Thanks. I am interested in the false discovery rate (I reverted your edit to my post). I have a good estimate of the false positive rate, because I can estimate it from my training data using the out-of-bag error. As you say, I don't know the number of real positives in my dataset. However, I don't believe it is impossible to get an estimate, even a very rough one, e.g. based on the distribution of the posterior probabilities. – thc Nov 15 '17 at 07:36
  • Also, yes, ideally it would be good to have representative training data drawn from the real population, but for technical reasons that isn't possible. – thc Nov 15 '17 at 07:41
  • Of course you can use ML algorithm A to predict which test samples are positive (e.g. via Bayesian posterior probabilities), then use this prediction to estimate ML algorithm B's FPR. But that's kind of trying to pull yourself out of the swamp by your own bootstraps, and it's not the *good* kind of bootstrap. (I did think about putting this in.) I don't think this will be helpful, because if it is, then why don't you just use ML algorithm A straight away? – Stephan Kolassa Nov 15 '17 at 07:47