To reduce the load on my machine, I want to take advantage of undersampling. Here are a few facts about my data:
The data set has on the order of 20 million rows, possibly more.
The event rate is around 0.8%.
Using undersampling, I want to reduce the share of non-events so that the non-event to event ratio is either 70:30 or 50:50. This will reduce the load on my machine.
I need accurate estimates of the predicted probabilities themselves, not just a correct rank ordering of the observations. Hence, I definitely want to calibrate the probabilities back to the original event rate.
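For concreteness, here is a rough sketch of the sample sizes these ratios would imply. It assumes exactly 20 million rows, an event rate of exactly 0.8%, that every event is kept, and that a ratio such as 70:30 means non-events to events. The function name `undersample_plan` is made up for illustration.

```python
def undersample_plan(n_total, event_rate, nonevent_share):
    """Return (events, non-events kept, fraction of non-events retained).

    Assumes all events are kept and non-events are sampled at random
    until they make up `nonevent_share` of the reduced data set.
    """
    n_events = round(n_total * event_rate)
    n_nonevents = n_total - n_events
    # Non-events needed so that non-events : events = share : (1 - share)
    kept = round(n_events * nonevent_share / (1.0 - nonevent_share))
    return n_events, kept, kept / n_nonevents


for share in (0.7, 0.5):
    events, kept, fraction = undersample_plan(20_000_000, 0.008, share)
    print(f"non-event share {share}: {events} events, "
          f"{kept} non-events kept, sampling fraction {fraction:.4f}")
```

Under these assumptions there are about 160,000 events, so either ratio cuts the data to well under a million rows.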
Now, two questions:
Will Firth logistic regression (along with the weight statement) help in this case?
Does undersampling help build a better model, in the sense of more accurate regression coefficients? I think it should change only the intercept and leave the other coefficients the same.
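To make explicit what I mean by calibrating the probabilities back: as I understand it, if non-events are dropped at random and the logistic model is correctly specified, only the intercept is shifted by a known offset, so predictions can be adjusted after fitting (often called prior correction). Below is a minimal sketch of that adjustment, not a definitive implementation. The names `tau` (event rate in the full data), `ybar` (event rate in the undersampled data) and the function name are my own labels.

```python
import math


def prior_corrected_probability(p_sample, tau, ybar):
    """Map a probability fitted on undersampled data back to the full-data scale.

    p_sample: predicted probability from the model fitted on the subsample
    tau:      event rate in the full data (assumed known, e.g. 0.008)
    ybar:     event rate in the undersampled data (e.g. 0.5 or 0.3)
    """
    # Offset by which undersampling inflates the intercept on the logit scale
    offset = math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))
    logit_sample = math.log(p_sample / (1.0 - p_sample))
    # Subtract the offset, then transform back to a probability
    return 1.0 / (1.0 + math.exp(-(logit_sample - offset)))


# Example: a score of 0.5 from a model fitted on a 50:50 subsample
print(prior_corrected_probability(0.5, tau=0.008, ybar=0.5))
```

With a 50:50 subsample, a fitted probability of 0.5 corresponds to the average case, so it should map back to roughly the original 0.8% event rate.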