I am currently training a random forest regressor (scikit-learn) on the Titanic dataset.
My question is related to this Stack Overflow question (https://stackoverflow.com/questions/19984957/scikit-predict-default-threshold).
I noticed that my values for measures such as precision, recall and F1-score did not match those computed by scikit-learn. After investigating, I found the reason: I was assigning individuals with a predicted probability of exactly 0.5 to class 1, while scikit-learn assigns them to class 0.
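To make the discrepancy concrete, here is a minimal sketch of the two tie rules. The labels and probabilities are made up for illustration (they are not my Titanic results), and the names `pred_ties_to_1` and `pred_ties_to_0` are just placeholders:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities of class 1 (NOT the Titanic data).
y_true = np.array([0, 0, 1, 1, 1, 0])
proba = np.array([0.2, 0.5, 0.5, 0.8, 0.4, 0.5])

# Rule A (what I did): a probability of exactly 0.5 goes to class 1.
pred_ties_to_1 = (proba >= 0.5).astype(int)

# Rule B (what I observed from scikit-learn): exactly 0.5 goes to class 0.
pred_ties_to_0 = (proba > 0.5).astype(int)

for name, pred in [("0.5 -> 1", pred_ties_to_1), ("0.5 -> 0", pred_ties_to_0)]:
    p = precision_score(y_true, pred)
    r = recall_score(y_true, pred)
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
```

Only the three individuals at exactly 0.5 change class between the two rules, yet precision and recall both move noticeably on this toy data.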
So here are my questions:
- Is it better to assign individuals with a probability of exactly 0.5 to class 0 or to class 1? On Titanic, for example, this choice can significantly change the values of such measures.
- Would it be legitimate to leave these individuals out of the evaluation? I do not think so, because it biases the results and may tend to improve them.
- What about classification with more than two classes? If the probabilities for one individual are 1/3, 1/3, 1/3, what should I do?
- Is there any performance measure that is not affected by this problem?
- Does scikit-learn always assign a probability of 0.5 to class 0, or can this be random or depend on the model selected?
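
For the multi-class and "threshold-free" questions, the sketch below shows what I have in mind. It rests on two assumptions that I have not confirmed: that the predicted class is obtained with something like `np.argmax` over the class probabilities, and that measures computed directly from the probabilities (here `roc_auc_score` and `log_loss`) are the kind of measure I am asking about. The data is again made up:

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Three-way tie: np.argmax returns the FIRST index holding the maximum,
# so under the argmax assumption the tie would go to the first class.
tied = np.array([[1 / 3, 1 / 3, 1 / 3]])
tie_winner = np.argmax(tied, axis=1)
print("argmax on a three-way tie:", tie_winner)

# Made-up binary labels and probabilities of class 1 (NOT the Titanic data).
y_true = np.array([0, 0, 1, 1, 1, 0])
proba = np.array([0.2, 0.5, 0.5, 0.8, 0.4, 0.5])

# Both measures take the probabilities themselves, so no rule for
# "exactly 0.5" is applied before they are computed.
auc = roc_auc_score(y_true, proba)
ll = log_loss(y_true, proba)
print("ROC AUC:", auc)
print("log loss:", ll)
```

Whether such measures are an acceptable replacement for precision, recall and F1-score in my setting is part of what I am asking.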