Conflict between predicted outcomes in logistic regression

Question

I am using caret in R to fit a logistic regression for prediction. I have only one predictor, named OEI, and the outcome variable is pass/fail. I was able to fit the model and get a confusion matrix, etc., but I am stuck on a conceptual question about how logistic regression converts everything into probabilities between 0 and 1.

When I print out the predicted probabilities for 'pass', some are clearly above 0.5 and some below, and these are then translated into pass/fail based on a 0.5 cutoff.

However, when I tried to plot the relationship between the only predictor (OEI) and its associated probability of 'pass', all of the predicted probabilities were above 0.5, which would mean every case should result in a 'pass' outcome. That is clearly not true based on the result from the previous step.

Here is my R code to plot the relationship above:

Can anyone please help resolve this conflict?

[Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) is not an exact duplicate, but points you in the right direction. Don't discretize your perfectly usable probabilistic prediction using a threshold. — Stephan Kolassa, Apr 19 '19 at 20:42
@StephanKolassa, thank you. But my question is not about the threshold. It is about why, for the same set of predictor values, the probabilities for pass differ between the 1st and 2nd plots. In the 2nd plot, regardless of the OEI value, there are only pass results, while in the 1st plot different OEI values lead to probabilities for both pass and fail. — Edward Lin, Apr 19 '19 at 20:57
Double-check your formula for a_logits. The intercept is 0.01 and all values in X1_range are positive, so all logits are necessarily > 0 and thus your probabilities are necessarily all > 0.5. Make sure that the intercept is the one returned by your model; that doesn't seem likely given the predicted values in the first figure. — EdM, Apr 19 '19 at 21:00
@EdM, thanks, that's what I thought, too. But the intercept is indeed 0.01 in my fitted model. I did a few things such as cross validation and upsampling, but I don't think that is why. Do you have other ideas? — Edward Lin, Apr 19 '19 at 21:11
[UPDATE] @EdM, I also centered and scaled the variable when pre-processing the data. Without that, the intercept is -1 and the coefficient of OEI is 0.27. Could that be why? — Edward Lin, Apr 19 '19 at 21:36

score 2 · Accepted Answer · answered Apr 20 '19 at 15:36

Although pre-processing data by centering and scaling is important for some approaches, it's not necessary for simple logistic regressions like this. As you note in a comment, without centering and scaling you get an intercept of -1 and a coefficient of 0.27 for your predictor OEI. With an intercept below 0 and a positive coefficient for this necessarily positive predictor, you now will have predicted probabilities both above and below 0.5 (logit of 0) if the OEI values extend both above and below a cutoff of about 3.7. If you used those coefficients for producing your plot I suspect that you would reproduce the values returned by predict() on your model and data, as shown in your table.
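As a rough numerical check of the cutoff mentioned above, here is a minimal sketch in Python (not the R/caret code from the question). The intercept of -1 and the slope of 0.27 come from the comments. The OEI values and `slope_scaled` are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(logit):
    """Inverse logit: map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

# Coefficients quoted in the thread for the model on the raw OEI scale.
intercept, slope = -1.0, 0.27

# The predicted probability is exactly 0.5 where the logit is zero.
oei_cutoff = -intercept / slope  # about 3.7

# Hypothetical raw OEI values on both sides of the cutoff.
oei_values = [1.0, 3.0, 5.0, 8.0]
probs_raw = [sigmoid(intercept + slope * x) for x in oei_values]

# Intercept of 0.01 (reported for the centered-and-scaled fit) paired with
# raw, positive OEI values and a hypothetical positive slope: every logit
# is positive, so every probability ends up above 0.5.
slope_scaled = 0.5  # hypothetical; the thread does not report this value
probs_mismatch = [sigmoid(0.01 + slope_scaled * x) for x in oei_values]

print(round(oei_cutoff, 2))
print([round(p, 3) for p in probs_raw])
print([round(p, 3) for p in probs_mismatch])
```

With the raw-scale coefficients, the probabilities fall on both sides of 0.5. With an intercept just above 0 and only positive inputs, they cannot.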

Two more thoughts for going forward.

First, with respect to centering and scaling, these are sometimes helpful and sometimes necessary depending on the modeling approach you are using. For example, in your model on untransformed data the intercept represents the log odds of passing when OEI = 0. If that's far outside the usual range of OEI values, it could be helpful to pre-center your data so that the intercept represents the log-odds for a more typical OEI value. That easy interpretability of the intercept can be even more dramatic when there are interaction terms in a model. In methods like principal-components, ridge or LASSO regressions that depend on comparable scales among all the predictors, it's usually necessary to pre-center and scale data before determining the principal components or applying the penalties for ridge or LASSO.
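To make the point about the intercept concrete, here is a short derivation for centering alone. The symbols are introduced here for illustration: $\beta_0$ and $\beta_1$ are the intercept and slope on the original OEI scale, $x$ is OEI, and $\bar{x}$ is its mean.

$$
\operatorname{logit}(p) = \beta_0 + \beta_1 x = (\beta_0 + \beta_1 \bar{x}) + \beta_1 (x - \bar{x})
$$

So centering leaves the slope unchanged, and the intercept of the centered model, $\beta_0 + \beta_1 \bar{x}$, is the log-odds of passing at the mean OEI rather than at OEI = 0.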

But in any case of centering and scaling you need to keep track of whether the reported coefficients represent the data in the original or in the transformed scales. Some software will by default scale all predictors in ridge or LASSO modeling but then adjust the reported coefficients back to the original data scales. I don't know how the caret package handles such situations. If you do the centering and scaling yourself you need to keep track yourself. I suspect that in your case the coefficients you used to produce your graph were for the centered and scaled values, but you then tried to use them with the original OEI values.
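The bookkeeping described above can be sketched as follows, in Python rather than R, and under stated assumptions. The intercept of 0.01 is from the thread. The mean, standard deviation, scaled slope and OEI values are hypothetical, picked so that the back-transformed coefficients land near the -1 and 0.27 quoted in the comments. `to_original_scale` is an illustrative helper, not a caret function.

```python
import math

def sigmoid(logit):
    """Inverse logit: map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical summary statistics of OEI (not reported in the thread).
oei_mean, oei_sd = 3.74, 2.0

# Coefficients on the centered-and-scaled predictor z = (x - mean) / sd.
# The intercept 0.01 is from the thread; the slope is hypothetical.
b0_scaled, b1_scaled = 0.01, 0.54

def to_original_scale(b0, b1, mean, sd):
    """Convert coefficients fitted on z = (x - mean) / sd to the x scale."""
    return b0 - b1 * mean / sd, b1 / sd

b0_raw, b1_raw = to_original_scale(b0_scaled, b1_scaled, oei_mean, oei_sd)

oei = [1.0, 3.0, 5.0, 8.0]  # hypothetical raw OEI values

# Consistent: scaled coefficients applied to scaled inputs.
p_scaled = [sigmoid(b0_scaled + b1_scaled * (x - oei_mean) / oei_sd) for x in oei]
# Equivalent: back-transformed coefficients applied to raw inputs.
p_raw = [sigmoid(b0_raw + b1_raw * x) for x in oei]
# Mismatch: scaled coefficients applied directly to raw inputs.
p_wrong = [sigmoid(b0_scaled + b1_scaled * x) for x in oei]

print(round(b0_raw, 3), round(b1_raw, 3))
print([round(p, 3) for p in p_scaled])
print([round(p, 3) for p in p_wrong])
```

The first two sets of probabilities agree and straddle 0.5. The mismatched set is entirely above 0.5, which is the pattern described for the second plot.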

Second, you note in a comment that you did some oversampling. That is not typically a good idea. There is extensive discussion on this site about oversampling. Oversampling might seem to improve accuracy in some situations, but accuracy is not a good measure of the quality of a logistic regression model. A proper scoring rule like the Brier score is the best way to evaluate a model that returns probability estimates. Once you have a good model for probabilities you can proceed to use information about your application to set criteria for classification if necessary.
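To illustrate why accuracy can hide differences that the Brier score picks up, here is a minimal sketch in Python with made-up outcomes and predictions. The function names and data are hypothetical and are not taken from caret or from the question.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

def accuracy(probs, outcomes, threshold=0.5):
    """Fraction of cases classified correctly at the given probability cutoff."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(outcomes)

# Hypothetical outcomes (1 = pass, 0 = fail) and two sets of predictions.
outcomes = [1, 1, 1, 0, 0, 0]
sharp = [0.9, 0.8, 0.9, 0.1, 0.2, 0.1]
hesitant = [0.55, 0.6, 0.55, 0.45, 0.4, 0.45]

acc_sharp, acc_hesitant = accuracy(sharp, outcomes), accuracy(hesitant, outcomes)
brier_sharp, brier_hesitant = brier_score(sharp, outcomes), brier_score(hesitant, outcomes)

print(acc_sharp, acc_hesitant)
print(round(brier_sharp, 4), round(brier_hesitant, 4))
```

Both sets of predictions classify every case correctly at a 0.5 cutoff, so accuracy cannot tell them apart. The Brier score is lower for the predictions that sit closer to the observed outcomes.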

Thank you very much @EdM, all good points here. I learned a lot. Appreciate it. — Edward Lin, Apr 23 '19 at 14:23
