3

I am getting a strange output from sklearn's LogisticRegression, where my trained model classifies all observations as 1s.

In [1]:
import numpy as np
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(C=10e9, random_state=42)
model = logit.fit(X_train, y_train)
classes = model.predict(X_test)
probs = model.predict_proba(X_test)

print(np.bincount(classes))

Out [1]: 
[   0 2458]

But look at the predicted probabilities: [histogram of predicted probabilities]

How is this possible?

I know that there is another post on this (here), but it does not answer this question. I understand that my classes are imbalanced (this uniform classification goes away when I pass the argument `class_weight='balanced'`).

However, I want to understand why sklearn appears to classify observations with predicted probabilities of less than 0.5 as positive.

Thoughts?

NLR
  • This focuses on sklearn but I think it's really a question about logistic regression so I am voting to leave it open. – Peter Flom Jun 30 '18 at 13:53
  • Hi @PeterFlom, I see your point, but I would argue that it is more about what sklearn does with the output than logistic regression per se. The issue and the answer both center on how sklearn's `predict_proba` returns predictions for both classes. From a regression standpoint, it is curious that the model only predicts 1s, but that is an artifact of class imbalance. – NLR Jun 30 '18 at 17:47

1 Answer

8

Notice how your plot is symmetric? That's because predict_proba has shape (n_samples, n_classes), so half the data you've plotted is redundant with the other half (since $p_i + (1 - p_i) = 1$).

If you look at probs[:,1] by itself I'm sure it will make sense.
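A minimal sketch of what I mean, using made-up, heavily imbalanced data (the variable names here are just for illustration): each row of `predict_proba` sums to 1, which is why plotting the whole array gives a histogram that is mirrored around 0.5, and `predict` simply takes the argmax across the columns, which for two classes is equivalent to a 0.5 threshold on `probs[:, 1]`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, heavily imbalanced data: roughly 95% of labels are 1
rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 2))
y = (rng.uniform(size=1000) < 0.95).astype(int)

model = LogisticRegression(C=10e9, random_state=42).fit(X, y)
probs = model.predict_proba(X)

# One column per class, and the columns are complements of each other,
# so plotting both columns produces a symmetric histogram.
print(probs.shape)                           # (1000, 2)
print(np.allclose(probs.sum(axis=1), 1.0))   # True

# predict() is just the argmax over columns: a 0.5 threshold for 2 classes.
print(np.array_equal(model.predict(X),
                     model.classes_[probs.argmax(axis=1)]))  # True
```

So nothing is being classified as positive at a probability below 0.5; the sub-0.5 mass in the plot is the class-0 column of `predict_proba`.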

Sycorax