
I have a dataset with a binary dependent variable. There are 40,000 observations, of which 30,000 are from group 1 and 10,000 from group 2. I have a set of predictors. I tried running logistic regression, but ran into trouble: because of the size of the data, the p-values were always significant. I then tried looking at effect sizes to choose the variables for the final model. Then it occurred to me to try classification models to help choose the variables. But my data are unbalanced, and I know that this masks effects. My question is twofold:

1. Can I even use logistic regression with an unbalanced design?
2. I know that I can't use decision trees with a very unbalanced design, as the imbalance masks the classification effect. However, I read here a bit about balancing the data and then correcting the conclusions a posteriori. Can you please explain what this means?
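Regarding point 2, one common reading of "balancing the data and correcting a posteriori" is: undersample the majority class to a balanced subset, fit the logistic regression there, and then shift the intercept back to the known population prevalence (the prior correction described by King & Zeng, 2001). Below is a minimal Python sketch on simulated data that mimics the 10,000/30,000 split; the simulated dataset and all variable names are illustrative, not from the question:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-in for the 40,000-row data set: ~25% positives,
# mirroring the 10,000 / 30,000 split in the question.
n, p = 40_000, 5
X = rng.normal(size=(n, p))
beta = np.array([0.8, -0.5, 0.3, 0.0, 0.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta - 1.3))))

tau = y.mean()  # prevalence in the full data (the "prior")

# Undersample the majority class down to a 1:1 balance.
pos = np.flatnonzero(y == 1)
neg = rng.choice(np.flatnonzero(y == 0), size=pos.size, replace=False)
idx = np.concatenate([pos, neg])

fit = sm.Logit(y[idx], sm.add_constant(X[idx])).fit(disp=0)

# Prior correction: the balanced sample has ybar = 0.5, so the fitted
# intercept is too large by the difference in log odds between ybar
# and the true prevalence tau. Shift it back before predicting.
ybar = y[idx].mean()  # 0.5 by construction
b0 = fit.params[0] - (np.log(ybar / (1 - ybar)) - np.log(tau / (1 - tau)))

# Predicted probabilities on the full data with the corrected intercept.
p_hat = 1.0 / (1.0 + np.exp(-(b0 + X @ fit.params[1:])))
print(p_hat.mean())  # should be close to tau (~0.25)
```

Note that the slope coefficients are unaffected by case-control style sampling on the outcome; only the intercept (and hence the predicted probabilities) needs the correction.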

asked by BlueSigma, edited by kjetil b halvorsen
    You need to explain a bit more what you actually hope to achieve with this analysis. It sounds like you're trying to do variable selection. Why? (How many predictors do you have? Are they continuous or categorical - if the latter, how many levels?) – Ben Bolker Dec 24 '16 at 21:29
  • If I use logistic regression, then I am indeed trying to select which variables should be included in the model. Traditionally I use Collett's algorithm, which relies on p-values; in this case it's not appropriate. Another approach could be to use data-mining methods for prediction, and then find which variables are related to the outcome. However, the data are not balanced. I know that balancing by undersampling is better than doing nothing at all, but there is a problem with the prior and posterior probabilities and I am not sure how to correct it (see the sketch below). – BlueSigma Dec 25 '16 at 07:05
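An equivalent, a-posteriori way to state the correction this comment asks about: if majority-class cases were kept with probability beta during undersampling, a probability predicted by the model trained on the balanced sample can be mapped back to the original base rate. A small sketch with a hypothetical helper (not from the thread):

```python
def correct_probability(p_s, beta):
    """Map a probability predicted on the undersampled (balanced) data
    back to the original base rate, where beta is the fraction of
    majority-class cases that were kept in the training sample."""
    return beta * p_s / (beta * p_s + 1.0 - p_s)

# Keeping 10,000 of 30,000 majority cases gives beta = 1/3; a
# balanced-sample probability of 0.5 then maps back to 0.25,
# the minority-class prevalence in the full data.
print(correct_probability(0.5, 1 / 3))  # 0.25
```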

0 Answers