
First off, I am aware that there are some problems with stepwise regression, as described for instance here ;) I mention this so that the discussion does not drift into whether stepwise is an appropriate technique or not.

Let me now describe my problem.

Financial institutions have to estimate customers' default risk; i.e. the probability that a customer will not pay back his debt in full. Typically, this is done using logistic regression.

When there is a lot of (internal) information about customers not paying back their debt in full, the target variable, Y, is binary; e.g. Y = 1 if the customer did not pay the debt back in full and Y = 0 if the customer did.

When there is no or hardly any (internal) data about customers not paying back their debt in full, the target variable, Y, can be the rank of an internal rating denoting default risk/creditworthiness. E.g. a financial institution could use ratings similar to S&P's, like AAA, AA, A, BBB, ..., which convey the default risk ranks 1, 2, 3, 4, ... These ranks are then the Y's. In that case, we are in an ordinal logistic regression set-up.

In the ordinal logistic regression case, some financial institutions proceed as follows:

  1. Estimate logit[Pr(Y <= j|X)] = alpha_j + X' * beta, where j is the rank of the rating, alpha_j is a rating-specific intercept, beta is a column vector of coefficients and X is the vector of covariates for a customer.
  2. Drop the alpha_j, which is rating specific, and use X' * beta as a scoring function. The resulting customer scores reflect a default risk ranking.
  3. Determine a mapping function that maps the scores to the ratings.
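
To make the three steps concrete, here is a minimal sketch in Python (purely illustrative, not the R workflow I actually use). It assumes the usual cumulative logit link, and every number and name in it (BETA, ALPHAS, CUTS, the four ratings, the three customers) is made up. It shows that ordering customers by X' * beta gives the same ordering as Pr(Y <= j|X) for a fixed j, which is why dropping alpha_j does not change the ranking.

```python
from bisect import bisect_right
from math import exp

# All numbers below are hypothetical, chosen only for illustration.
BETA = [0.8, -1.2]                   # slope coefficients shared by all ratings
ALPHAS = [-2.0, 0.0, 1.5]            # rating-specific intercepts, increasing in j
CUTS = [-1.0, 0.0, 1.0]              # assumed score cut points for step 3
RATINGS = ["BBB", "A", "AA", "AAA"]  # worst to best: a higher score means a better rating


def score(x):
    """Step 2: the linear predictor X' * beta, without any intercept."""
    return sum(b * v for b, v in zip(BETA, x))


def cumulative_probs(x):
    """Step 1: Pr(Y <= j | X) for j = 1..3 under a cumulative logit link."""
    s = score(x)
    return [1.0 / (1.0 + exp(-(a + s))) for a in ALPHAS]


def map_to_rating(s):
    """Step 3: map a score to a rating via the assumed cut points."""
    return RATINGS[bisect_right(CUTS, s)]


customers = {"c1": [1.0, 0.5], "c2": [0.2, 1.0], "c3": [2.0, 0.1]}

# Rank customers from best to worst by score.
by_score = sorted(customers, key=lambda c: score(customers[c]), reverse=True)

# For a fixed j, alpha_j shifts every customer by the same amount, so ranking
# by Pr(Y <= j | X) gives the same order as ranking by the score.
by_prob = sorted(customers, key=lambda c: cumulative_probs(customers[c])[0], reverse=True)

for name in by_score:
    print(name, round(score(customers[name]), 2), map_to_rating(score(customers[name])))
```

In practice the cut points in step 3 would be calibrated on data rather than fixed by hand as they are here.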

The purpose of the scoring function is thus to rank customers properly in terms of default risk. In this context, I was wondering what a proper stopping rule would be when selecting the covariates for the model via stepwise regression.

I am currently using the fastbw() function from the rms package. Initially I used the AIC as the stopping criterion, but I am wondering whether this is appropriate. The AIC is based on the likelihood function, which measures goodness of fit rather than the model's ranking capability. Would a p-value based stopping rule be more appropriate?
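
To illustrate what I mean by "ranking capability" as opposed to likelihood-based fit, here is a minimal sketch in Python of a rank-based measure: the concordance index and Somers' D between scores and rating ranks. The function name, the convention (rank 1 is the best rating and a higher score should go with a better rating) and the toy data are my own assumptions for illustration, not anything produced by fastbw().

```python
from itertools import combinations


def concordance(scores, ranks):
    """Return (c_index, somers_d) over all pairs of customers with different ranks.

    Assumed convention: rank 1 is the best rating, and a higher score
    should go with a better (lower) rank.
    """
    concordant = discordant = tied = 0
    for (s1, r1), (s2, r2) in combinations(zip(scores, ranks), 2):
        if r1 == r2:
            continue  # pairs with the same rating carry no ranking information
        # Order the pair so s_better belongs to the better-rated customer.
        s_better, s_worse = (s1, s2) if r1 < r2 else (s2, s1)
        if s_better > s_worse:
            concordant += 1
        elif s_better < s_worse:
            discordant += 1
        else:
            tied += 1
    n_pairs = concordant + discordant + tied
    if n_pairs == 0:
        raise ValueError("need at least two customers with different ranks")
    c_index = (concordant + 0.5 * tied) / n_pairs
    # Somers' D equals (concordant - discordant) / n_pairs, i.e. 2 * c_index - 1.
    return c_index, 2.0 * c_index - 1.0


# Toy data: the first two customers are scored in the wrong order.
toy_scores = [0.5, 1.0, 0.0, -0.3]
toy_ranks = [1, 2, 3, 3]
c_index, somers_d = concordance(toy_scores, toy_ranks)
print(c_index, somers_d)
```

A measure like this depends only on the ordering of the scores, so two models with different likelihoods can have identical values.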

Edit: if a p-value based stopping rule is not appropriate, as one of the commenters below suggests, what would be the best stopping rule given that only ranking is important?

koteletje
  • *"some problems"*!! That's putting it mildly. As for *"Would a p-value based stopping rule be more appropriate?"*, this may be of help: https://stats.stackexchange.com/questions/89214/equivalence-of-aic-and-p-values-in-model-selection – Robert Long Sep 22 '20 at 09:13
  • Thanks for the link, Robert. I'll update my question. – koteletje Sep 22 '20 at 09:38
  • I'm still not sure why you want to do backwards stepwise? A lot of statisticians on here may view the question like *"Doctor, I know there are issues with electro-convulsive therapy to treat anxiety but please can you tell me how to do it to myself anyway, in particular how will I know when to stop?"* ;) – Robert Long Sep 22 '20 at 13:12
  • Well, the data is complex and I am not an expert on the topic of the data. So, who knows what the underlying data generating model is? I can build a model manually (which I am actually doing as well), but due to the complexity of the relations between target variable and predictors, I would like to use a more automated model selection process as well, as a means of seeing whether I have omitted important predictors or linear combinations of predictors. (1/2) – koteletje Sep 22 '20 at 14:08
  • I am interested in a mix of inference and prediction – it is not clear where to draw the line. In any case, I have a hold-out sample to test the models on, so I am less concerned about overfitting through stepwise model selection. (2/2) – koteletje Sep 22 '20 at 14:09
  • Obviously you do not want to discuss stepwise regression. Still you should be aware that you can do ordinal regression using LASSO: https://cran.r-project.org/web/packages/glmnetcr/vignettes/glmnetcr.pdf – Bernhard Sep 23 '20 at 13:30
  • Thank you @Bernhard, your advice led me to the ordinalNet package (https://cran.r-project.org/web/packages/ordinalNet/ordinalNet.pdf). The package fits ordinal regression models with an elastic net penalty. It seems better than the package you suggested for my purpose: I can estimate the cumulative probability model with coefficients (except the intercepts) equal for all classes. – koteletje Oct 20 '20 at 14:17
  • Glad to hear. You're welcome. I'd suggest you make that an answer to your own question so people can see this question has an answer without reading through all the comments. Yes, you can and should do that: https://stats.stackexchange.com/help/self-answer – Bernhard Oct 20 '20 at 14:38

0 Answers