Logistic regression - Sample period longer than outcome period

Question

In a credit risk model, the probability of defaulting 18 months in the future is estimated as a function of a set of characteristics $x$. We index individuals with $i$ and periods with $t$.

An outcome variable $y^i_t$ is constructed so that:

$y^i_t = 1$ if $Default^i_{t+18m} = 1$

$y^i_t = 0$ if $Default^i_{t+18m} = 0$

Then the regression has the form: $logit(y^i_t) = \beta_0 + \beta_1 x^i_t +...$

However, because there are very few observations, data from 60 months are aggregated and the regression is run on the aggregated sample. My main concern is that the new sample will contain observations for the same clients at periods $t$ and $t+18$, so that:

$logit(y^i_t) = \beta_0 + \beta_1 x^i_t +...$

$logit(y^i_{t+18}) = \beta_0 + \beta_1 x^i_{t+18} +...$

Could this be a cause of endogeneity and other problems?

For example, since $y^i_t = Default^i_{t+18}$, if default in any period depends on the characteristics for that period, so that $Default^i_{t+18} = f(x^i_{t+18})$, then we would be estimating

$f(x^i_{t+18}) = \beta_0 + \beta_1 x^i_t +...$

Furthermore, the default flag is highly correlated across time, i.e. once an individual defaults at $t$, they will still be in default at $t+18$, so that:

if $Default^i_t = 1$ then $Default^i_{t+18m} = 1$
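
To make the concern concrete, here is a minimal sketch of how one client contributes rows to the aggregated sample when the default flag is absorbing. The client, the 60-month window and the default month are hypothetical values chosen for illustration, and `build_outcomes` is an illustrative helper, not part of any library.

```python
HORIZON = 18  # months ahead at which default is observed

def build_outcomes(default_flags, horizon=HORIZON):
    """Return y_t = Default_{t+horizon} for every t where t+horizon is observed."""
    return [default_flags[t + horizon] for t in range(len(default_flags) - horizon)]

# Hypothetical client: 60 monthly flags, first default in month 30 (0-indexed).
# The flag stays at 1 afterwards, i.e. default is an absorbing state.
first_default = 30
flags = [0] * first_default + [1] * (60 - first_default)

y = build_outcomes(flags)
n_rows = len(y)                # rows this single client adds to the aggregated sample
n_positive = sum(y)            # rows with y = 1, all tracing back to one default event
first_positive_t = y.index(1)  # earliest t whose 18-month-ahead flag is 1

print(n_rows, n_positive, first_positive_t)
```

Under these assumptions a single default event shows up as many positive rows for the same client, so the rows of the aggregated sample are clearly not independent.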

Any reference to relevant literature will be highly appreciated. Thanks!

score 1 · Answer 1 · edited Apr 13 '17 at 12:44

It seems that the default, if it happens, happens only once for an individual, at a particular time. This sounds like it would be better handled by a survival model than by a logistic model.

For a survival model, your outcome variable is a table with both an outcome indicator (as you have already produced) and an associated time for each individual, representing either the time of the default (for $y=1$) or the last time at which it was known that there was no default (for $y=0$). Survival models can take time-varying predictors into account and can even use data from individuals whose last observation was taken before your 18-month time point. If 18 months is particularly important for you, you can then use the fitted survival model to make predictions about the outcome specifically at 18 months.
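
As a rough sketch of that outcome table, the monthly default flags can be collapsed into one (time, event) pair per individual. The client IDs and flag series below are made up for illustration, and `to_survival_record` is a hypothetical helper, not a function from any survival package.

```python
def to_survival_record(default_flags):
    """Collapse a monthly 0/1 default series into (time, event).

    time  = month of first default if one occurred, else the last observed month
    event = 1 if a default was observed, 0 if the individual is censored
    """
    if not default_flags:
        raise ValueError("need at least one observed month")
    for month, flag in enumerate(default_flags):
        if flag == 1:
            return month, 1
    return len(default_flags) - 1, 0

# Hypothetical panel: months are 0-indexed from each client's first observation.
panel = {
    "A": [0] * 30 + [1] * 30,  # defaults in month 30
    "B": [0] * 60,             # never defaults within the 60 months
    "C": [0] * 10,             # observed for only 10 months, fewer than 18
}
records = {cid: to_survival_record(flags) for cid, flags in panel.items()}
print(records)
```

Client C still contributes information even though it was followed for fewer than 18 months. Given an estimated survival function $S(\cdot)$, the probability that an individual who has not defaulted by $t$ defaults by $t+18$ is $1 - S(t+18)/S(t)$.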

This recent question and detailed answer go into more discussion about the advantages of survival models over logistic models for this type of study. Statistical software packages, like the survival and rms packages in R, provide tools for doing survival analysis. Setting up the analysis for time-varying predictors needs some careful attention but is relatively straightforward.
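
One common way to set up time-varying predictors is the counting-process layout, with one row per individual per interval, which is the shape expected by `Surv(start, stop, event)` in R's survival package. The sketch below builds such rows in plain Python; the helper name, the client and the covariate values are assumptions made for illustration only.

```python
def to_counting_process(client_id, covariates, default_month=None):
    """Split one client's history into (id, start, stop, event, x) intervals.

    covariates    : covariates[m] is the value of x observed during month m
    default_month : 0-indexed month in which default occurred, or None if censored
    """
    if default_month is not None and default_month >= len(covariates):
        raise ValueError("default_month must fall inside the observed history")
    last = len(covariates) - 1 if default_month is None else default_month
    rows = []
    for m in range(last + 1):
        # The event indicator is 1 only on the interval in which default happens.
        event = 1 if m == default_month else 0
        rows.append((client_id, m, m + 1, event, covariates[m]))
    return rows

# Hypothetical client with four months of data who defaults in month 2.
rows = to_counting_process("A", [0.2, 0.3, 0.5, 0.9], default_month=2)
for row in rows:
    print(row)
```

Each individual leaves the risk set after the interval containing the default, so the same default is never counted twice, in contrast to repeating $y=1$ across overlapping 18-month windows.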
