10

I am running an analysis on the probability of loan default using logistic regression and random forests.

When I use logistic regression, the predictions are all '1' (which means a good loan). I have never seen this before and do not know where to start in trying to sort out the issue. There are 22 columns and 600K rows. When I decrease the number of columns, I get the same result with logistic regression.

Why could the logistic regression be so wrong?

**Actual from the data**

0 :   41932

1 :   573426

**Logistic regression output** 

prediction for 1 when actually 0: 41932
prediction for 1 when actually 1: 573426

**As you can see, it always predicts a 1.**


**Random forests does better:**

actual 0, pred 0 : 38800 
actual 1, pred 0 : 27 
actual 0, pred 1 : 3132
actual 1, pred 1 : 573399
Ferdi
ivan7707
  • This doesn't make sense. Logit will not predict exactly 0; it may predict a low value which you interpreted as 0. So the problem _could_ be due to the threshold, not the model itself – Aksakal Aug 26 '15 at 20:37
  • @Aksakal, I am using the scikit learn .predict method. [predict class labels for samples in X](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict) – ivan7707 Aug 31 '15 at 19:43
  • Are you familiar with ROC curves? You can extract the predicted probabilities, then play with the threshold to classify the data yourself. The threshold is your trade-off lever between identifying either defaults or non-defaults. – Aksakal Aug 31 '15 at 19:48
  • Yes, I am using ROC curves and AUC to look at the output from the models I am using. – ivan7707 Aug 31 '15 at 20:06
  • See my answer below; you can also use the ROC curve to find the sweet spot for your logit classifier between sensitivity and specificity – Aksakal Aug 31 '15 at 20:15
  • Don't use `predict` in sklearn on a probability model; it's useless. ALWAYS use `predict_proba`. – Matthew Drury Sep 07 '16 at 21:42

6 Answers

16

The short answer is that logistic regression is for estimating probabilities, nothing more and nothing less. You can estimate probabilities no matter how imbalanced $Y$ is. ROC curves and some of the other measures given in the discussion don't help. If you need to make a decision or take an action, you apply the loss/utility/cost function to the predicted risk and choose the action that optimizes the expected utility. It seems that a lot of machine learning users do not really understand risks and optimal decisions.
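To make the last point concrete, here is a minimal sketch of turning a predicted risk into a decision by minimising expected cost. The dollar figures and the `best_action` helper are made up for illustration; they are not from the answer.

```python
# Hypothetical costs for illustration only: approving a loan earns 100
# if it is repaid and loses 900 if it defaults; rejecting costs nothing.
COST_APPROVE_REPAID = -100.0   # negative cost = profit
COST_APPROVE_DEFAULT = 900.0
COST_REJECT = 0.0

def best_action(p_default):
    """Pick the action with the lowest expected cost for a predicted risk."""
    expected_approve = ((1 - p_default) * COST_APPROVE_REPAID
                        + p_default * COST_APPROVE_DEFAULT)
    expected_reject = COST_REJECT
    return "approve" if expected_approve < expected_reject else "reject"

# With these costs the break-even risk is 100/1000 = 0.1:
# approve below it, reject above it.
```

The threshold falls out of the cost function rather than being a fixed property of the model, which is exactly why the probabilities themselves are the useful output.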

Frank Harrell
    (+1) Yes, the question is "are you solving a classification problem, or are you solving a decision-support problem?". – GeoMatt22 Sep 08 '16 at 14:54
    I'm uncertain about that. Estimation of probabilities is a great end result. And note that the majority of "classification" problems are better addressed using optimal Bayes decisions. Other than visual and audio pattern recognition, most problems where classification methods are applied would be better addressed with direct probability estimation. – Frank Harrell Sep 08 '16 at 15:17
  • @FrankHarrell Is it correct that interpreting the output as probabilities requires a design that allows such an interpretation (e.g. a cohort study)? And if we don't have such a design, do we then have to make decisions based on the "risk scores"? Further, although there is literature discussing this in the non-calibrated setting, it is not that common in practice. Is this correct? – julieth Sep 12 '16 at 02:31
    Please describe how the sampling used to assemble the dataset used for model development differs from the customers to whom you will apply the predictions. – Frank Harrell Sep 12 '16 at 12:17
  • For example, case-control sampling for which target prevalence is unknown. Or moderately-sized convenience samples. – julieth Sep 13 '16 at 01:06
  • I know the ways this can happen. I was asking you about the particular sampling design **you** are dealing with that makes you worry about a model intercept adjustment. – Frank Harrell Sep 13 '16 at 01:49
  • Sorry I misinterpreted. I often encounter a convenience sample collected from one site (a clinic). I don't expected a calibrated model even for this site and not for sites in general. Thanks for sharing your insights. – julieth Sep 13 '16 at 02:17
  • If your model development sample is created in the same way as the future sample you may still be OK with regard to calibration. – Frank Harrell Sep 13 '16 at 03:18
  • Can you provide a reference to learn more about your comment? – Frank Oct 09 '20 at 01:15
5

Well, it does make sense that your model always predicts 1. Have a look at your data set: it is severely imbalanced in favor of your positive class. The negative class makes up only ~7% of your data. Try re-balancing your training set or use a cost-sensitive algorithm.

JimBoy
  • thanks for the input. Is there a rule of thumb for what is acceptable for unbalanced data, or good sources for how to re-balance that you could suggest? – ivan7707 Aug 26 '15 at 20:48
  • Unfortunately, there is no rule for how to pick an algorithm beyond the "no free lunch" theorem. In your particular case I would go with Ross Quinlan's C5.0 package first. Then you could experiment with different costs and sampling techniques like up- and down-sampling, SMOTE, etc. In addition, Max Kuhn's site offers a nice summary of established algorithms. – JimBoy Aug 26 '15 at 21:09
    (+1) In the absence of a cost function there seems to be no reason to use logistic regression as a *classifier*: you have the predicted probabilities & can use a proper scoring rule to assess your model's performance. See e.g.[What's the measure to assess the binary classification accuracy for imbalanced data?](http://stats.stackexchange.com/q/163221/17230). Imbalance is not a problem per se: see [Does down-sampling change logistic regression coefficients?](http://stats.stackexchange.com/q/67903/17230). – Scortchi - Reinstate Monica Aug 27 '15 at 16:04
  • @Scortchi, thanks for the links and the idea of using models with costs. I was able to find this paper [link](http://www.csd.uwo.ca/faculty/ling/papers/cost_sensitive.pdf) which gets me going in the right direction. – ivan7707 Aug 31 '15 at 16:29
  • No, it doesn't make sense that his model always predicts 1s, because 7% is a rather high default rate and logit is used widely for loan defaults. Consider AAA-rated loans, which default at 0.1% annually; his are basically junk loans. – Aksakal Aug 31 '15 at 19:49
3

If the problem is indeed the imbalance between the classes, I would simply start by balancing the class weights:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(class_weight='balanced')

With this setting, the penalties for false predictions in the loss function are weighted in inverse proportion to the class frequencies. This can solve the problem you describe.
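As a quick sanity check, here is a self-contained sketch on synthetic data with roughly the same ~7% minority class as the question; the dataset and all parameter values are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: class 0 is the ~7% minority, as in the question.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.07], random_state=0)

log_reg = LogisticRegression(class_weight='balanced', max_iter=1000)
log_reg.fit(X, y)

# Keep the probabilities as well -- they are the model's real output.
probs = log_reg.predict_proba(X)[:, 1]
```

With `class_weight='balanced'`, each class contributes equally to the loss, so the fitted model is no longer free to ignore the minority class.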

Tal Yifat
  • It is not clear to me that you have pinpointed the problem. I think Matthew Drury hit on the problem which had to do with the use of sklearn. – Michael R. Chernick Nov 21 '17 at 20:34
  • Michael may be right, but this did solve my problem and now my model is predicting 1's when it was not before! – embulldogs99 Jun 10 '21 at 21:48
  • @embulldogs99, my model has also started predicting both classes now, but the accuracy is even lower than when all the observations were assigned to the same class! – Martund Sep 03 '21 at 05:28
2

When you classify using logit, this is what happens.

The logit predicts the probability of default (PD) of a loan, which is a number between 0 and 1. Next, you set a threshold D such that you mark a loan as a default if PD > D, and mark it as a non-default if PD ≤ D.

Naturally, in a typical loan population PD << 1. So in your case, 7% is a rather high probability if this is one year of data (PDs are normally reported on an annual basis). If this is multi-year data, then we're talking about the so-called cumulative PD; a cumPD of 7% is not a high number for, say, 10 years of data. Hence, by any standard, I wouldn't say that your data set is problematic. I'd describe it as at least typical for loan default data, if not great (in the sense that you have a relatively large number of defaults).

Now, suppose that your model predicts the following three levels of PD (group sizes in parentheses):

  • 0.1 (563,426)
  • 0.5 (20,000)
  • 0.9 (31,932)

Suppose also that the actual numbers of defaults in these groups were:

  • 0
  • 10,000
  • 31,932

Now you can set D to different values and see how the matrix changes. Let's use D = 0.4 first:

  • Actual default, predict non-default: 0
  • Actual default, predict default: 41,932
  • Actual non-default, predict non-default: 563,426
  • Actual non-default, predict default: 10,000

If you set D = 0.6:

  • Actual default, predict non-default: 10,000
  • Actual default, predict default: 31,932
  • Actual non-default, predict non-default: 573,426
  • Actual non-default, predict default: 0

If you set D = 0.99:

  • Actual default, predict non-default: 41,932
  • Actual default, predict default: 0
  • Actual non-default, predict non-default: 573,426
  • Actual non-default, predict default: 0

The last case is what you see in your model results. Here I'm emphasizing the role of the threshold D in a classifier: a simple change in D may improve certain characteristics of your forecast. Note that in all three cases the predicted PDs stayed the same; only the threshold D changed.

It is also possible that your logit regression itself is crappy, of course. So in this case you have at least two levers: the logit specification and the threshold D. Both affect your forecasting power.
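The threshold mechanics above can be sketched in a few lines, with default coded as 1 and a toy score vector; the `confusion_at_threshold` helper is illustrative, not from any library.

```python
import numpy as np

def confusion_at_threshold(y_true, pd_hat, D):
    """Confusion counts when a loan is flagged as default iff its PD > D."""
    pred = (pd_hat > D).astype(int)
    return {
        "default, predicted default": int(((y_true == 1) & (pred == 1)).sum()),
        "default, predicted non-default": int(((y_true == 1) & (pred == 0)).sum()),
        "non-default, predicted default": int(((y_true == 0) & (pred == 1)).sum()),
        "non-default, predicted non-default": int(((y_true == 0) & (pred == 0)).sum()),
    }

# Toy example: two defaults scored 0.9 and 0.5, two non-defaults 0.5 and 0.1.
y = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.5, 0.5, 0.1])
table = confusion_at_threshold(y, p, D=0.4)
```

Sweeping D over a grid and recomputing this table traces out exactly the trade-off the three worked cases above illustrate; the fitted probabilities never change.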

Aksakal
  • You do realize that you're proposing a technique to deal with imbalanced data, don't you? Therefore, you're admitting the effect of the smaller class on prediction accuracy. In addition, you're proposing a technique that the original model isn't using at all. You can't just change the circumstances to your liking and then make up a statement as you go along. – JimBoy Aug 31 '15 at 20:30
  • In loan default analysis/forecasting the data is always "imbalanced" in this sense. It's the normal state of affairs. – Aksakal Aug 31 '15 at 20:46
  • That may be so. Nonetheless, you should have a look at what Max Kuhn describes as the "no information rate", which is nothing other than the proportion of the largest class in the data set. So have another look at the table Ivan provided: the results make perfect sense for the model he used. That you can optimize those results with different techniques is another question, and entirely possible. – JimBoy Aug 31 '15 at 21:24
  • @JimBoy, I saw his table, and have seen many more like it. His is rather simple; we usually deal with loan delinquency data, where the states run all the way from Current through 30, 60, and 90 days past due to Default and Closed. In a good portfolio you can have 95% of loans in the Current (clean) state and only 1% in Default. People use multinomial logit for this kind of thing all the time in the industry. – Aksakal Aug 31 '15 at 21:31
  • @Aksakal, I'll have to do more reading on changing the threshold, as I have read a lot about how it is mathematically incorrect to change it for logistic regression. On another note, what did you mean by 'it is possible that your logit regression itself is crappy'? – ivan7707 Sep 01 '15 at 15:23
  • @ivan7707, I meant that the model specification may be bad. For instance, your variables may not capture the drivers of default; there could be missing variables, etc. The fact that you use logit doesn't mean that the logit model is specified well or that your variables describe the behavior of a borrower. You may have 100 variables but miss the one that matters, e.g. a bankruptcy indicator: no matter what your FICO or debt-to-income is, if you're in bankruptcy the default rate is going to be high. – Aksakal Sep 01 '15 at 17:32
  • @Aksakal, thanks for the clarification. I am actually using the Lending Club dataset in case you are interested. So far, I am not coming up with anything useful from it though. – ivan7707 Sep 01 '15 at 18:06
  • @ivan7707, I could guess that you are :) Everyone's on this data set trying to come up with strategies for trading notes on the secondary market – Aksakal Sep 01 '15 at 19:20
  • @Aksakal It doesn't matter if the examples in the data set are commonly distributed for the problem domain in question; my statement holds true nonetheless. Furthermore, if you change the decision boundary/cut-off, you will have a hard time interpreting the probabilities you generated, so you could just disregard them in the first place and use a more fitting algorithm instead. This shouldn't be read as saying you can't use logistic regression at all, but you will likely get better model performance by using something else. – JimBoy Sep 01 '15 at 19:40
  • @JimBoy, probabilities do not change because you fit logit into data, and the threshold does not enter the equation at all. – Aksakal Sep 01 '15 at 20:01
  • @Aksakal I never said that. I said that you can't simply interpret the probabilities anymore if you change the cut-off; you can't just disregard that step as if nothing happened! In addition, the normalization of the output values to the range [0, 1] tends to clutter the cases of the minority class very tightly together. As a result, you lose some of the power to distinguish between the classes. – JimBoy Sep 01 '15 at 20:23
0

Well, without more information it's hard to say, but by the definition of logistic regression you are saturating on the fitted data: in the equation $p = 1/(1 + e^{-t})$, the $e^{-t}$ term is going to 0, so the predicted probability goes to 1. The first place to look would be the actual coefficients.

This could also be due to poorly scaled variables. A column that is huge in numerical value compared to the others can cause the fit to go wrong.
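One common fix is to standardise the columns before fitting; the sketch below uses a standard scikit-learn pipeline on a made-up dataset with one deliberately huge column.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[:, 0] *= 1e6   # simulate one column on a wildly different scale

# Standardising inside a pipeline stops the huge column from dominating
# the solver and keeps the scaling parameters tied to the training data.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
probs = model.predict_proba(X)[:, 1]
```

Putting the scaler in the pipeline (rather than scaling by hand) means cross-validation and prediction on new data reuse the training-set means and variances automatically.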

Tim Felty
  • @TimFelty, thanks for the response. Can you please expand on what I would be looking for regarding the coefficients and how this relates to saturation (or point me to a resource to read)? Also, I was under the impression that poorly scaled variables would not have a negative effect on logistic regression. [link](http://stats.stackexchange.com/questions/18916/how-do-i-handle-predictor-variables-from-different-distributions-in-logistic-reg) – ivan7707 Aug 26 '15 at 20:34
0

You may use SMOTE to balance the unbalanced dataset. A good paper for reference is:

Lifeng Zhou and Hong Wang, "Loan Default Prediction on Large Imbalanced Data Using Random Forests", TELKOMNIKA Indonesian Journal of Electrical Engineering, Vol. 10, No. 6, October 2012, pp. 1519-1525, link.
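The core idea behind SMOTE (synthesising minority examples by interpolating between a minority point and one of its nearest minority neighbours) can be sketched in plain NumPy. The `smote_sketch` helper below is an illustrative toy, not the algorithm from the paper or a production implementation; in practice the `imbalanced-learn` package provides a tested one.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between a random minority point and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(n)
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                        # interpolation weight in [0, 1)
        synthetic[row] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic
```

Each synthetic point lies on a segment between two real minority points, so the oversampled class stays inside the region the minority already occupies rather than being duplicated verbatim.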

  • Could you add a full citation/reference (including author, date, publisher etc) as you would in an academic paper? This would make it easier for future readers to track it down if the link stopped working – Silverfish Sep 07 '16 at 22:24