
I'm working with a dataset of bank loans, trying to predict which loans are going to default based on some pre-loan-subscription features (for instance, what's the credit grade of the borrower, or the amount of the loan, or the borrower's annual income...).

There are roughly 800,000 data points (one per loan), and about $7\%$ of them are in default (default is a boolean True/False label).

I'm using regression algorithms to output a probability of default for each loan. For instance, given its feature values, a particular loan will be assigned a probability of $p = 0.234$ of defaulting. This works reasonably well: for instance, of all loans in the test set whose predictions satisfy $p \in [0.20, 0.30)$, roughly $25\%$ have actually defaulted, and so on.
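That binning check can be sketched like this (a minimal illustration on simulated data, not the actual loan features; the outcomes are generated to be calibrated by construction):

```python
import numpy as np

# Hypothetical calibration check: bin predicted probabilities and
# compare each bin's range to the observed default rate inside it.
rng = np.random.default_rng(0)
p_pred = rng.uniform(0.0, 1.0, size=10_000)   # predicted default probabilities
defaulted = rng.random(10_000) < p_pred       # simulated outcomes

edges = np.arange(0.0, 1.01, 0.1)             # bins [0.0, 0.1), [0.1, 0.2), ...
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (p_pred >= lo) & (p_pred < hi)
    if mask.any():
        print(f"[{lo:.1f}, {hi:.1f}): n={mask.sum():4d}, "
              f"observed default rate = {defaulted[mask].mean():.3f}")
```

For a well-calibrated model, the observed rate in each bin should land near the middle of the bin's range, as in the $[0.20, 0.30)$ example above.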

The problem I'm facing is finding a suitable error function to estimate the algorithms' accuracy. Currently I'm using mean absolute error (MAE), which has the following issue: since most loans did not default, if I arbitrarily assign a probability of $p = 0.0$ to every loan, the mean absolute error ends up being very low, because $p = 0.0$ is the perfect prediction for the $93\%$ of loans that did not default.
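For concreteness, here is a small simulation of that degenerate behaviour (the numbers are synthetic, assuming only the 7% base rate): under MAE, the all-zero predictor actually beats the honest constant base-rate predictor.

```python
import numpy as np

# Synthetic labels with a ~7% default rate, as in the dataset.
rng = np.random.default_rng(42)
y = (rng.random(800_000) < 0.07).astype(float)

mae_all_zero = np.abs(y - 0.0).mean()    # trivial "no loan defaults" predictor
mae_base_rate = np.abs(y - 0.07).mean()  # constant base-rate predictor

print(mae_all_zero)    # ~0.07
print(mae_base_rate)   # ~0.13, worse under MAE despite being calibrated
```

The all-zero predictor scores roughly $0.07$ while the calibrated constant predictor scores roughly $0.93 \times 0.07 + 0.07 \times 0.93 \approx 0.13$, so MAE rewards exactly the behaviour I want to avoid.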

What would be a suitable error function whose minimum is not reached by arbitrarily setting all predictions as low as possible, but which accurately reflects the quality of the predictions?

To illustrate the algorithms' output, here are some results showing the number of loans predicted in each estimated probability range, and the corresponding proportion that actually defaulted.

[Figure: number of loans in each predicted-probability bin, with the observed default rate per bin]

– Jivan
  • See the answer here, it should be helpful: http://stats.stackexchange.com/questions/222558/classification-evaluation-metrics-for-highly-imbalanced-data – Ujjwal Kumar Jan 12 '17 at 11:53
  • BTW, you shouldn't use salary as a predictor, because you don't want to model the "capability", but the "intention" of the customer. – Ujjwal Kumar Jan 12 '17 at 11:54
  • @UjjwalKumar Yet salary and intention are most likely linked, so salary is relevant input. – Nikolas Rieble Jan 12 '17 at 12:15
  • @UjjwalKumar The link is indeed related; however, it concerns classification exclusively. I'm using regression and therefore can't apply the recommendations of the linked answer. – Jivan Jan 12 '17 at 12:57
  • @UjjwalKumar Besides, why do you say I should model intention and not capability? – Jivan Jan 12 '17 at 12:58
  • Refer to the part of the answers that discusses AUC, TPR, and recall at different quantiles of the probability scores. Ideally your high-probability cases should contain all the defaulters, and as you go lower, more and more non-defaulters creep in. A model that separates TP and TN perfectly has an AUC of 1; any time a TN appears higher in probability than a TP, the AUC falls. – Ujjwal Kumar Jan 12 '17 at 13:12
  • Let's say you've trained a regression model with a positive coefficient for salary: the higher the salary (absolute capability), the higher the chance of paying up. Is that really the case? Shouldn't we instead work with regressors like loan-amount/salary? I think the latter relates more to the business objective than the former. Also, dumping in all permutations of variables wouldn't fare well with regression models, due to collinearity issues. – Ujjwal Kumar Jan 12 '17 at 13:18
  • @UjjwalKumar Interesting, but how can I compute a TPR or recall for a particular probability quantile, since the predictions are not "True/False" but, say, "40% chance of being True"? – Jivan Jan 13 '17 at 14:13
  • The probability of being true is a continuous quantity. Treat it as a score; you can calculate which percentile of all such scores a specific score lies in. I'm assuming you have supervised training data. – Ujjwal Kumar Jan 13 '17 at 14:39
  • I think your model would be heavily biased against defaulted loans. Essentially, if the good/defaulted split in the test set were 50/50, you'd get a 50% error, as all defaulted loans would be classified as good. Try resampling from your dataset so that the proportion of defaults is higher; 7% of 800K is still not too bad. – Alex Jan 13 '17 at 22:26
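The AUC idea from the comments above can be computed directly from the probability scores, without first thresholding them into True/False, via the rank-sum (Mann-Whitney) formulation: the probability that a randomly chosen defaulted loan gets a higher score than a randomly chosen good one. A sketch on hypothetical scores (ties are ignored for simplicity):

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC via the rank-sum formulation: the probability that a random
    positive (defaulted) outranks a random negative (good) loan."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # rank 1 = lowest score
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical predicted probabilities; defaulted loans score higher here,
# so separation is perfect and AUC is 1.0.
scores = [0.05, 0.10, 0.40, 0.60, 0.80]
labels = [False, False, False, True, True]
print(auc_from_scores(scores, labels))  # 1.0
```

Unlike MAE, this metric is unaffected by the 93/7 class imbalance, since it only looks at the relative ordering of defaulted versus non-defaulted loans.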

0 Answers