Predict probabilities for continuous variable

Question

Usually, with a continuous dependent variable, we can apply linear regression and then predict values based on new data.

For instance, defaults on loans: let's say we know an individual will default on his loan, and we want to estimate how long it takes him to default (1 year, 2 years, 3 years... after he took the loan).

With linear regression, we can predict for a new individual that, based on his characteristics, he will default after X years.

But what I'm looking for is a model which will give me probabilities for each of the values.

Here, it would be: for a new individual that we know is going to default, what is the probability he will default after 1 year vs the probability he will default after 2 years...

One possibility would be to treat the dependent variable as categorical and fit a logit / probit model to get probabilities.

But 1) there is some loss of information. Multinomial logit does not consider the categories as related. At best, ordered logit will order them. But we still don't take into account that the increment is the same between all categories (1 year).

And 2) if we want to consider defaults on more than a few years, the number of categories of the dependent variable quickly increases, which will affect the performance of the predictions.

So if anyone has an idea on how to tackle this problem, I'd like to know your thoughts! I feel like I'm not approaching it right at the moment, and maybe I need another kind of modeling altogether.

Thank you very much!

score 3 · Accepted Answer · answered Aug 03 '18 at 10:31


If you want to predict things like probability of default as a function of time, then you are interested in survival analysis models, so check the questions tagged as survival-analysis.

As for your general question: with binary data we use logistic regression, which predicts the probability of success by assuming a Bernoulli distribution; with multiple categories we assume a multinomial distribution; and for continuous data we assume an appropriate continuous distribution. In the case of linear regression, the probabilistic model behind it assumes a normal distribution, so if you know the parameters of the distribution, you can compute the probability density of a particular outcome given the estimated parameters. The same holds for other distributions, so basically all you need is a probabilistic model.
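To make the linear regression case concrete, here is a minimal sketch in plain Python. It assumes a single predictor, normally distributed errors with constant variance, and made-up data; the helper names `fit_simple_ols` and `interval_probabilities` are hypothetical, not from any package. It plugs the fitted mean and the residual standard deviation into a normal distribution and reads off the probability of each one-year interval. It ignores the uncertainty in the estimated coefficients, so treat the numbers as approximate.

```python
from statistics import NormalDist, mean


def fit_simple_ols(x, y):
    """Fit y = a + b*x by least squares; return (a, b, sigma)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sigma = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5
    return a, b, sigma


def interval_probabilities(a, b, sigma, x_new, edges):
    """P(edges[i] < Y <= edges[i+1]) under Y ~ Normal(a + b*x_new, sigma)."""
    dist = NormalDist(mu=a + b * x_new, sigma=sigma)
    return [dist.cdf(hi) - dist.cdf(lo) for lo, hi in zip(edges, edges[1:])]


# Made-up illustrative data: x is some borrower characteristic,
# y is the number of years until default.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 2.4, 3.1, 3.3, 4.2, 4.4, 5.1]

a, b, sigma = fit_simple_ols(x, y)
edges = [0, 1, 2, 3, 4, 5, 6]
probs = interval_probabilities(a, b, sigma, x_new=4.5, edges=edges)
for lo, hi, p in zip(edges, edges[1:], probs):
    print(f"P({lo} < years <= {hi}) = {p:.3f}")
```

A normal distribution puts some mass on negative times, which is one reason a survival model is usually the better fit for time-to-default data.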


Tim


Note also ordered (ordinal) logit or probit, although that seems a little artificial in this case. – Nick Cox Aug 03 '18 at 10:44
Thanks a lot for your answers! About survival analysis models, is the probability of survival necessarily decreasing with time? Ideally I'd rather be able to model the peak for example in year 3, in general not necessarily year 1. About linear regression, when you say parameters do you mean the normal distribution parameters (mean and variance)? Can I get them through the usual packages (in R for instance) or would I have to compute them myself? – ALF Aug 03 '18 at 12:08
@ALF survival analysis models model cumulative probabilities. They do not assume that probability increases over time, but obviously the cumulative probabilities increase (you can extract the non-cumulative probabilities from this). As for R, there's a `survival` package. – Tim Aug 03 '18 at 12:12
@Tim I see, with cumulative probabilities it should correspond to what I want, and indeed it would be easy to extract the non-cumulative probabilities from there. Thanks again! – ALF Aug 03 '18 at 12:31
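
The step discussed in these comments, going from cumulative to per-year probabilities, can be sketched as follows. This is a from-scratch Kaplan-Meier estimate in plain Python on made-up data, not the API of the R `survival` package; the function names `kaplan_meier` and `per_period_probabilities` are hypothetical, and times are assumed to be whole years.

```python
def kaplan_meier(durations, events, horizon):
    """Kaplan-Meier survival estimates S(1), ..., S(horizon) for whole-year times.

    durations: year in which each loan defaulted or was last observed.
    events: True if the loan defaulted at that time, False if censored.
    """
    survival = []
    s = 1.0
    for t in range(1, horizon + 1):
        at_risk = sum(1 for d in durations if d >= t)
        defaults = sum(1 for d, e in zip(durations, events) if d == t and e)
        if at_risk > 0:
            s *= 1.0 - defaults / at_risk
        survival.append(s)
    return survival


def per_period_probabilities(survival):
    """P(default in year k) = S(k-1) - S(k), with S(0) = 1."""
    previous = [1.0] + survival[:-1]
    return [p - s for p, s in zip(previous, survival)]


# Made-up data: the year of default for ten loans.
durations = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5]
events = [True] * len(durations)  # every loan defaults, as in the question

survival = kaplan_meier(durations, events, horizon=5)
probs = per_period_probabilities(survival)
for year, (s, p) in enumerate(zip(survival, probs), start=1):
    print(f"year {year}: S = {s:.2f}, P(default in this year) = {p:.2f}")
```

The survival curve only goes down, but the per-year probabilities are free to peak anywhere; with this toy data they peak in year 3. Conditioning on borrower characteristics would need a regression-type survival model, which this sketch does not cover.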
