
I am currently training a random forest regressor (scikit-learn) on the Titanic dataset.

My question is related to this issue (https://stackoverflow.com/questions/19984957/scikit-predict-default-threshold) on Stack Overflow.

I noticed that I was not getting the same values as scikit-learn for measures such as precision, recall and F1-score. After investigating, I found the reason: I assigned individuals with a predicted probability of exactly 0.5 to class 1, while scikit-learn assigns them to class 0.

So here are my questions:

  • Is it better to assign individuals with a probability of exactly 0.5 to class 0 or to class 1? On the Titanic data, for example, the choice can change these measures significantly.
  • Would it be legitimate to exclude these individuals? I do not think so, because that biases the results and would tend to improve them.
  • What about classification with more than two classes? If one individual has probabilities 1/3, 1/3, 1/3, what should I do?
  • Is there any performance measure that is free from this problem?
  • Does scikit-learn always assign 0.5 to class 0, or can it be random or depend on the model selected?
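To make the effect concrete, here is a minimal pure-Python sketch on made-up toy data (the labels, probabilities and helper names `predict_label`, `precision_recall` and `brier_score` are all hypothetical). It assumes the predicted class is the argmax over the class probabilities with ties going to the lowest class index, which would be consistent with 0.5 ending up in class 0, and compares that with sending ties to class 1. It also computes a Brier score as one example of a measure that works on the probabilities directly and so never needs a tie rule.

```python
def predict_label(probs):
    """Index of the largest probability; ties go to the lowest index (argmax-style)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def precision_recall(y_true, y_pred):
    """Precision and recall for class 1 from hard 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def brier_score(y_true, p1):
    """Mean squared difference between P(class 1) and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, p1)) / len(y_true)


# Made-up toy data: true labels and predicted P(class 1) for six individuals.
y_true = [1, 0, 1, 1, 0, 0]
p1 = [0.9, 0.5, 0.5, 0.5, 0.2, 0.7]

# Convention A (assumed argmax rule): a 0.5/0.5 tie goes to class 0.
ties_to_0 = [predict_label([1 - p, p]) for p in p1]
# Convention B: a probability of exactly 0.5 goes to class 1.
ties_to_1 = [1 if p >= 0.5 else 0 for p in p1]

print("ties -> 0:", precision_recall(y_true, ties_to_0))
print("ties -> 1:", precision_recall(y_true, ties_to_1))

# The Brier score uses the probabilities themselves, so both conventions agree.
print("Brier score:", brier_score(y_true, p1))

# Three-way tie: the same argmax rule picks the first class.
print("1/3, 1/3, 1/3 ->", predict_label([1 / 3, 1 / 3, 1 / 3]))
```

On this toy data the three tied individuals alone move precision from 0.5 to 0.6 and recall from 1/3 to 1, while the probability-based score is unaffected by the choice.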
Scratch
  • @FrankHarrell makes arguments in this thread that bear directly on this question, namely whether such cutoffs are desirable: http://stats.stackexchange.com/questions/65382/adding-weights-for-highly-skewed-data-sets-in-logistic-regression#comment165335_65382 – Sycorax Feb 04 '14 at 14:52
  • Yes, my question is not far from that link. However, I am not limiting the context to highly unbalanced datasets. Imagine a balanced dataset with a lot of 0.5 probabilities for classification (in {0,1}). I need to know what to do with those. It is not so much about the tradeoff as about how to derive a performance measure for the standard maximum likelihood prediction. – Scratch Feb 04 '14 at 15:10
  • This may sound dumb, but you can avoid this altogether by generating an ensemble with an odd number of trees, e.g. ntree=1001 – JEquihua Feb 05 '14 at 15:06
  • This is not dumb, but I'm not sure it would be correct, since not every individual is in every tree because of the bootstrap step... – Scratch Feb 05 '14 at 16:10
