Big-picture issues
You are confusing the probability-modeling aspect of your problem with its decision-threshold aspect.
Accuracy, sensitivity, and specificity depend on a particular choice of a probability cutoff along your receiver operating characteristic (ROC) curve. That choice is often hidden within the software, which typically defaults to a predicted-probability cutoff of 0.5. Values of accuracy, sensitivity, or specificity based on a single cutoff are thus poor measures of model quality. The AUC, the area under the ROC curve, isn't perfect, but as a measure of model quality it at least takes all potential probability cutoffs into account. See the link above for the importance of proper scoring rules like the log-loss or the Brier score, which evaluate the full probability model. My guess is that your model with the highest AUC will be the best overall model for predicting class probabilities, unless it's overfit.
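As a minimal sketch with scikit-learn (the arrays `y_test` and `p_hat` are hypothetical stand-ins for your held-out labels and one model's predicted probabilities), you can put the threshold-free measures next to accuracy at the hidden 0.5 cutoff:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss, brier_score_loss

# Hypothetical held-out labels and predicted probabilities of the positive class
# from one fitted model; substitute your own arrays.
y_test = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
p_hat  = np.array([0.1, 0.3, 0.8, 0.2, 0.4, 0.9, 0.6, 0.1, 0.7, 0.2])

# Threshold-free measures of the probability model itself
print("AUC:        ", roc_auc_score(y_test, p_hat))    # ranking quality over all cutoffs
print("log-loss:   ", log_loss(y_test, p_hat))         # proper scoring rule
print("Brier score:", brier_score_loss(y_test, p_hat)) # proper scoring rule (mean squared error of probabilities)

# Contrast: accuracy at the hidden default cutoff of 0.5
print("accuracy @0.5:", np.mean((p_hat >= 0.5) == y_test))
```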
For a decision threshold based on estimated probability, the default choice of 0.5 makes an implicit assumption that the costs of false-positive and false-negative class assignments are identical. Is that the case for your application? If not, then you shouldn't be using a cutoff of p = 0.5. You should be using a different probability cutoff appropriate to the relative misclassification costs. See this answer, this page, and their links for ways to incorporate cost estimates into your choice of a decision threshold.
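As a sketch of that calculation, with purely illustrative costs (here a false negative is assumed to be five times as costly as a false positive; substitute your own estimates), the standard decision-theoretic cutoff is cost_FP / (cost_FP + cost_FN):

```python
import numpy as np

# Hypothetical relative misclassification costs; replace with your own estimates.
cost_fp = 1.0   # cost of a false positive
cost_fn = 5.0   # cost of a false negative (assumed 5x as costly)

# Flag a case as positive when the expected cost of calling it negative
# (p * cost_fn) exceeds the expected cost of calling it positive ((1 - p) * cost_fp),
# i.e. when p >= cost_fp / (cost_fp + cost_fn).
threshold = cost_fp / (cost_fp + cost_fn)
print("cost-based probability cutoff:", threshold)    # ~0.167 instead of 0.5

p_hat = np.array([0.10, 0.25, 0.60, 0.05, 0.40])      # predicted probabilities
decisions = (p_hat >= threshold).astype(int)
print(decisions)
```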
I would recommend focusing on the models with high AUC values and seeing how well they perform when you take your misclassification costs into account. That said, you might well be close to "as good as it gets" based on the information that you have. For example, an AUC of 0.67 means that if you randomly choose one member of each class, your class-membership probability predictions will be in the wrong order for that pair about 1/3 of the time. Is that good enough for your application?
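If it helps to see that interpretation concretely, here is a small simulation on synthetic data (chosen only to give an AUC in roughly that range) comparing the AUC to the fraction of correctly ordered positive/negative pairs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels and predicted probabilities, only to illustrate the interpretation.
y = rng.binomial(1, 0.12, size=20000)
p = np.clip(0.12 + 0.15 * (y - 0.12) + rng.normal(0, 0.2, size=y.size), 0.001, 0.999)

auc = roc_auc_score(y, p)

# Draw random (positive, negative) pairs and check how often they are ordered correctly.
pos, neg = p[y == 1], p[y == 0]
pairs_pos = rng.choice(pos, 100000)
pairs_neg = rng.choice(neg, 100000)
concordant = np.mean(pairs_pos > pairs_neg) + 0.5 * np.mean(pairs_pos == pairs_neg)

print("AUC:", round(auc, 3), " pairwise concordance:", round(concordant, 3))
# The two numbers agree: an AUC of 0.67 would mean about 1/3 of such pairs
# are ranked in the wrong order.
```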
Other things to try
You've already tried many of the modeling approaches used for this type of problem, so you might be stuck with being "as good as it gets" for your data. Your data aren't that badly unbalanced at 12%/88%, and you seem to have over 6000 cases in the minority class, giving you a lot of room to include available predictors. See this thread among others on this site for further discussion of imbalance; changes in data sampling are unlikely to help.
One thing that you might be able to take more advantage of, given the size of your data set, is interactions among the predictors in your models. I don't know how many predictor interactions you have allowed for in your models to date. A common rule of thumb is about one estimated parameter per 15 members of the minority class, so with over 6000 such cases you might be able to include up to about 400 predictors and predictor combinations/interactions in your model without overfitting, leaving plenty of room for interaction terms.
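As one sketch of how that could look with a standard logistic regression, assuming a hypothetical predictor matrix `X` and outcome `y` (here 20 numeric predictors, so all pairwise interactions still stay well under a 400-parameter budget):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical data: replace X, y with your predictors and 12%/88% outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(50000, 20))
y = rng.binomial(1, 0.12, size=50000)

# PolynomialFeatures with interaction_only=True adds all pairwise interaction
# terms: 20 main effects + 190 pairwise interactions = 210 parameters,
# well under a ~400-parameter budget for >6000 minority-class cases.
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(max_iter=5000),
)
model.fit(X, y)
print("number of fitted coefficients:", model[-1].coef_.size)
```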
Gradient boosting can allow for a large number of potential interactions, with the maximum interaction order set by the depth you allow for the trees. Gradient boosting thus can let the data determine which predictor interactions are most important, a potential advantage over defining all the candidate interactions in advance for a standard logistic regression model. Use a slow learning rate to minimize the chance of overfitting the data. Basing the pseudo-residuals on a log-loss criterion makes the process follow the same loss criterion as logistic regression. That avoids the dangers discussed above of the implicit probability cutoff in accuracy-based cost criteria that might otherwise be used in boosting (and that probably were used in your neural net as well). See this answer and its links for more details.
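Here is a minimal sketch with scikit-learn's HistGradientBoostingClassifier, on synthetic stand-in data, showing the pieces mentioned above: a slow learning rate, limited tree depth to cap the interaction order, a log-loss criterion, and evaluation with proper scoring rules rather than accuracy:

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, log_loss

# Hypothetical stand-in data with an interaction and ~12% positives;
# replace with your own predictors and outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(50000, 20))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] * X[:, 1] + 0.5 * X[:, 2] - 2))))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Slow learning rate plus early stopping to limit overfitting; max_depth caps
# how high-order the data-driven interactions can be (depth 3 allows up to
# 3-way splits per tree). The default classification loss here is the log-loss,
# so boosting follows the same criterion as logistic regression rather than an
# accuracy-style criterion with a hidden 0.5 cutoff.
gbm = HistGradientBoostingClassifier(
    learning_rate=0.05,
    max_depth=3,
    max_iter=2000,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=0,
)
gbm.fit(X_tr, y_tr)

p_hat = gbm.predict_proba(X_te)[:, 1]
print("AUC:     ", roc_auc_score(y_te, p_hat))
print("log-loss:", log_loss(y_te, p_hat))
```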