In a credit risk model, the probability of default 18 months in the future is estimated as a function of a set of characteristics $x$. We index individuals with $i$ and periods (months) with $t$.
An outcome variable $y^i_t$ is constructed so that:
$y^i_t = 1$ if $Default^i_{t+18} = 1$
$y^i_t = 0$ if $Default^i_{t+18} = 0$
Then the regression has the form: $\operatorname{logit}(P(y^i_t = 1)) = \beta_0 + \beta_1 x^i_t + \dots$
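For concreteness, here is a minimal sketch of how I build $y^i_t$ and run the pooled logit (Python with pandas/statsmodels; the column names `id`, `month`, `default`, `x` and the synthetic panel are placeholders of mine standing in for the real data, and months are assumed consecutive within each client):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# toy stand-in for the real panel: one row per client-month
df = pd.DataFrame({
    "id": np.repeat(np.arange(200), 60),   # 200 clients
    "month": np.tile(np.arange(60), 200),  # 60 months each
    "x": rng.normal(size=200 * 60),        # one characteristic
})
df["default"] = (rng.random(len(df)) < 0.05).astype(int)
df = df.sort_values(["id", "month"])

# y^i_t = default flag observed 18 months ahead for the same client
df["y"] = df.groupby("id")["default"].shift(-18)
sample = df.dropna(subset=["y"])  # last 18 months have no label

# pooled logit of the 18-months-ahead flag on current characteristics
fit = smf.logit("y ~ x", data=sample).fit()
print(fit.summary())
```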
However, because there are very few observations, data from 60 months are pooled and the regression is run on the pooled sample. My main concern is that the pooled sample will contain observations for the same clients at periods $t$ and $t+18$, so that:
$\operatorname{logit}(P(y^i_t = 1)) = \beta_0 + \beta_1 x^i_t + \dots$
$\operatorname{logit}(P(y^i_{t+18} = 1)) = \beta_0 + \beta_1 x^i_{t+18} + \dots$
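The extent of this overlap is easy to measure on the pooled sample (continuing the sketch above):

```python
# how many clients contribute rows 18 months apart in the pooled sample?
months_by_client = sample.groupby("id")["month"].apply(set)
has_overlap = months_by_client.apply(lambda ms: any(m + 18 in ms for m in ms))
print(f"{has_overlap.mean():.1%} of clients appear at both t and t+18")
```

(If this were only an inference problem I suppose I could cluster standard errors by client, but my worry below is about bias in the $\beta$'s, not just the standard errors.)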
Could this be a cause of endogeneity and other problems?
For example, since $y^i_t = Default^i_{t+18}$, if default in any period depends on the characteristics of that same period, so that $Default^i_{t+18} = f(x^i_{t+18})$, then we would effectively be estimating
$f(x^i_{t+18}) = \beta_0 + \beta_1 x^i_t + \dots$
i.e. regressing an outcome driven by $x^i_{t+18}$ on the earlier characteristics $x^i_t$.
Furthermore, the default flag is highly correlated across time, i.e. once an individual defaults at $t$ they will also be in default at $t+18$, so that:
if $Default^i_t = 1$ then $Default^i_{t+18} = 1$
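This persistence is easy to check empirically (reusing the assumed columns from the sketch above):

```python
# empirical persistence check: P(Default_{t+18} = 1 | Default_t = 1)
df["default_18_later"] = df.groupby("id")["default"].shift(-18)
persistence = df.loc[df["default"] == 1, "default_18_later"].mean()
print(f"P(default at t+18 | default at t) = {persistence:.2f}")
```

If that probability is close to 1, the two stacked equations above share essentially the same left-hand side for defaulted clients.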
Any references to relevant literature would be highly appreciated. Thanks!