
Let's suppose I have a dataset with classes A and B, with class A occurring in 1% of cases and class B occurring in 99% of cases. Perhaps class A is a loan default.

Suppose I want to "understand" what factors make one class A vs B by fitting a logistic regression on independent variables X and then looking at the model coefficients. Does it make sense to put a higher class_weight on class A, such as a weight of 99 on class A and 1 on class B? Or does the intercept already take care of this?

What if the logistic regression has a regularization parameter? Would class imbalance matter more in that scenario, because the model would be more inclined towards a constant "predict B" model to reduce the penalty on coefficient size?
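To make this concrete, here is a minimal sketch of the comparison I have in mind, assuming scikit-learn's LogisticRegression; the synthetic data and the 99:1 weights below are just illustrative stand-ins for the real loan data:

```python
# Minimal sketch (assumptions: scikit-learn, synthetic data standing in
# for the real loan data). Compare coefficients/intercept with and
# without class weights.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# ~1% of samples in the rare class (class 1 plays the role of "class A")
X, y = make_classification(n_samples=20000, n_features=5, n_informative=3,
                           weights=[0.99, 0.01], random_state=0)

# Unweighted fit: the intercept absorbs the ~1% base rate of class A.
unweighted = LogisticRegression(C=1e6).fit(X, y)   # very large C ~= unregularized

# Weighted fit: 99:1 in favour of the rare class, as in the question.
weighted = LogisticRegression(C=1e6, class_weight={0: 1, 1: 99}).fit(X, y)

print("unweighted coef:", unweighted.coef_.ravel(), "intercept:", unweighted.intercept_)
print("weighted   coef:", weighted.coef_.ravel(), "intercept:", weighted.intercept_)
```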

I've seen many economics papers in which an unregularized logistic regression is run on the data and the authors then interpret coefficient sizes and significance; I'm wondering how valid this is.

1 Answer


I assume by "factors" you mean which features or variables are "most" important for distinguishing between the two classes? I think you first need to establish a performance metric to check whether the results make sense. E.g., on such a dataset, always predicting the majority class already gives you a classification accuracy of 99% (or an error of 1%). Depending on the size of the dataset, you may want to look at ROC AUC or precision-recall AUC (maybe F1) to get an idea of how well your model discriminates between the classes.
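As a rough sketch of what I mean (assuming scikit-learn, with make_classification as a placeholder for your imbalanced data), you could compare your model against a majority-class baseline on these metrics:

```python
# Sketch only: scikit-learn assumed, synthetic placeholder for your data.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)  # always predicts class B
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

print("baseline accuracy:", accuracy_score(y_te, baseline.predict(X_te)))  # ~0.99 already
print("model accuracy:   ", accuracy_score(y_te, model.predict(X_te)))
print("ROC AUC:          ", roc_auc_score(y_te, proba))
print("PR AUC:           ", average_precision_score(y_te, proba))
print("F1:               ", f1_score(y_te, model.predict(X_te)))
```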

In addition (also depending on the size of your dataset), you may want to cross-validate, e.g., via 0.632+ bootstrapping, 10-fold CV, etc., to get an idea of the stability of your model. If you have a sufficiently large dataset, regularization (L1 or L2) may give you a good idea of how important certain features are, answering your question "Suppose I want to 'understand' what factors make one class A vs B" -- but I'd say this only makes sense if a linear model is appropriate given the data, and if your model is sufficiently stable across CV folds.
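For example (again just a sketch, with scikit-learn assumed and synthetic data as a placeholder for yours), you could look at how much the L1-penalized coefficients move across CV folds; features whose coefficients are consistently non-zero are the ones the model reliably relies on:

```python
# Sketch only: scikit-learn assumed, synthetic data as a placeholder.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=20000, n_features=10, n_informative=4,
                           weights=[0.99, 0.01], random_state=0)
X = StandardScaler().fit_transform(X)  # standardize so coefficient sizes are comparable

coefs = []
for train_idx, _ in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    coefs.append(clf.fit(X[train_idx], y[train_idx]).coef_.ravel())

coefs = np.array(coefs)
print("mean |coef| per feature:", np.abs(coefs).mean(axis=0).round(3))
print("std across folds:       ", coefs.std(axis=0).round(3))  # large std -> unstable "importance"
```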

  • Thanks Sebastian. I've read many papers in which the implicit assumption is that the underlying data was generated from a linear model, y = Xb + e, and inference is then done by looking at the coefficient sizes and standard errors. But this doesn't quite make sense to me: why should we expect the true model to be linear? Doesn't it make more sense to try to find a model with a high AUC and look at the coefficients of that model? – convolutedstatistic Sep 06 '16 at 13:03
  • "why should we expect for the true model to be linear?" That's the assumption of logistic regression since it is a generalized linear model. Let's say your data is not linear, then your model will perform poorly but you still have the linear relationship between inputs and weights z = w_0 + x_1*w_1 + ... + x_n*w_n. That's not going to change, but if your model doesn't capture the structure of the data, the weight coefficient's magnitudes may be meaningless for an importance interpretation -- they still tell you about which features are important for the classifier to make the decision though. –  Sep 06 '16 at 14:24
  • " Doesn't it make more sense to try to find a model with a high AUC and look at the coefficients of that model?" A (standard) logistic regression model is always linear, so if you have a non-linear problem, it maximizing accuracy or AUC via model selection is not going to help. You could try algorithms for non-linear hypothesis spaces, but then you can't simply read the weight coefficients (if its parametric) as importances. –  Sep 06 '16 at 14:27
  • But very few models in reality are perfectly linear, so I was thinking of interpreting the model more in the sense of a good linear approximation to the true conditional expectation E[Y|X]. – convolutedstatistic Sep 06 '16 at 16:10
  • I absolutely agree with you. What I was trying to explain is that it is not about the "goodness" of the linear approximation; it is more that "the feature coefficients tell you how important each feature is for the logistic regression output" rather than "the coefficients tell you whether a feature is generally a good predictor, whether or not you assume a linear model". In other words, the coefficients tell you how important the features are under a linear assumption. –  Sep 06 '16 at 18:31
  • But don't the parameters of a model with very good goodness of fit become more meaningful? Similar to how we "explain" things using classical mechanics even though those equations are only approximations to the truth. For example, let's say the true conditional expectation function is f(X) = E[Y|X] and our fitted linear function is g(X). If some independent variable has a large influence on g(X), can't we say it has a large influence on f(X) if g is very "close" to f (according to our desired metric)? – convolutedstatistic Sep 06 '16 at 19:43
  • "parameters of a model with very good goodness of fit become more meaningful?" Yeah, they are more meaningful then under the assumption of a linear relationship. A "large" feature weight has always a "large" influence on "g(x)" -- given that the features are standardized/normalized of course. And if a linear model is reasonable for the given dataset, we can then also conclude that it has a large influence on f(x) if this was what you were asking. –  Sep 06 '16 at 20:26
  • How do you tell if a linear model is "reasonable"? If a linear model has an extremely high goodness of fit, doesn't that imply that a linear model is reasonable? (Is there an example in which a linear model has very high goodness of fit but a linear model is not reasonable?) Of course, there could be a model such as y = x + e where var(e) >> var(x) and e is Gaussian noise, in which case I guess checking that E[e|x] = 0 would be more appropriate. – convolutedstatistic Sep 06 '16 at 22:38
  • "is there an example in which a linear model has very high goodness of fit but a linear model is not reasonable?" -> I think this is not a large problem with linear models, but I would say that good generalization performance would be an indicator. E.g., you may determine a good fit when you'd do resubstitution evaluation, but the model doesn't perform well on new data due to overfitting. The coefficients tell you then which coefficients are important for yielding the classifier output, but it wouldn't tell you that much about which features are really important (on a population level) –  Sep 07 '16 at 00:25
  • Right, but let's say we have a linear model that has a very high goodness of fit out of sample. Given only this fact, would you say that the linear model is "reasonable"? In that case the coefficients would tell us something about the importance of different variables at a population level. – convolutedstatistic Sep 07 '16 at 12:12
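A small simulation sketch (scikit-learn assumed, toy two-feature data) illustrates the point this thread circles around: when the linear logistic fit generalizes well out of sample, its coefficients track the true influences, whereas for an XOR-like truth the fit is poor and the coefficients say little about which features matter:

```python
# Toy simulation sketch (scikit-learn assumed), not from the original thread.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 2))

# Case 1: the true log-odds really are linear in X (weights 2.0 and 0.2).
p_lin = 1 / (1 + np.exp(-(2.0 * X[:, 0] + 0.2 * X[:, 1])))
y_lin = rng.binomial(1, p_lin)

# Case 2: XOR-like truth; no linear function of X approximates E[Y|X] well.
y_xor = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

for name, y in [("linear truth", y_lin), ("XOR truth", y_xor)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression().fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: test AUC = {auc:.2f}, coefficients = {clf.coef_.ravel().round(2)}")
```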