3

I am getting a strange output from sklearn's LogisticRegression, where my trained model classifies all observations as 1s.

In [1]:
import numpy as np
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(C=10e9, random_state=42)
model = logit.fit(X_train, y_train)
classes = model.predict(X_test)
probs = model.predict_proba(X_test)

print(np.bincount(classes))

Out [1]: 
[   0 2458]

But look at the predicted probabilities: [histogram of predicted probabilities]

How is this possible?

I know that there is another post on this (here), but it does not answer this question. I understand that my classes are imbalanced (this uniform classification goes away when I pass the argument `class_weight='balanced'`).

However, I want to understand why sklearn appears to classify observations with predicted probabilities of less than 0.5 as positive.

Thoughts?

NLR
  • This focuses on sklearn but I think it's really a question about logistic regression so I am voting to leave it open. – Peter Flom Jun 30 '18 at 13:53
  • Hi @PeterFlom, I see your point, but I would argue that it is more about what sklearn does with the output than logistic regression per se. The issue and the answer both center on how sklearn's `predict_proba` returns predictions for both classes. From a regression standpoint, it is curious that the model only predicts 1s, but that is an artifact of class imbalance. – NLR Jun 30 '18 at 17:47

1 Answer

8

Notice how your plot is symmetric? That's because predict_proba has shape (n_samples, n_classes), so half the data you've plotted is redundant with the other half (since $p_i + (1 - p_i) = 1$).

If you look at probs[:,1] by itself I'm sure it will make sense.
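A minimal sketch of what I mean, using made-up, heavily imbalanced data (the variable names here are just for illustration): each row of `predict_proba` sums to 1, which is why plotting the whole array gives a histogram that is mirrored around 0.5, and `predict` simply takes the argmax across the columns, which for two classes is equivalent to a 0.5 threshold on `probs[:, 1]`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, heavily imbalanced data: roughly 95% of labels are 1
rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 2))
y = (rng.uniform(size=1000) < 0.95).astype(int)

model = LogisticRegression(C=10e9, random_state=42).fit(X, y)
probs = model.predict_proba(X)

# One column per class, and the columns are complements of each other,
# so plotting both columns produces a symmetric histogram.
print(probs.shape)                           # (1000, 2)
print(np.allclose(probs.sum(axis=1), 1.0))   # True

# predict() is just the argmax over columns: a 0.5 threshold for 2 classes.
print(np.array_equal(model.predict(X),
                     model.classes_[probs.argmax(axis=1)]))  # True
```

So nothing is being classified as positive at a probability below 0.5; the sub-0.5 mass in the plot is the class-0 column of `predict_proba`.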

Sycorax