
I used a binary logit model with a lasso regularization term for prediction on an unbalanced dataset. The minority class makes up 2% of observations, so I undersampled the majority class to get a 50/50 split of the classes.

Now I want to estimate the model coefficients, but most of the estimates are statistically insignificant when I use the whole (unbalanced) dataset. After downsampling, the estimated coefficients become statistically significant and make sense in light of past literature on this topic.

Is it valid to downsample in order to get the coefficient estimates, or will this bias the coefficients somehow? The downsampled dataset ends up with about 50,000 records.

I have read about choice-based sampling, but I can't seem to figure out whether it applies to my problem.

Thanks

1 Answer


will this bias the coefficients somehow

Yes, this will bias the coefficients, especially the intercept. Additionally, lasso is a penalized model, so each coefficient is biased towards 0. Unless you're using post-selection methods à la Ryan Tibshirani, your approach is completely invalid.
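To make the intercept bias concrete, here is a minimal sketch of the standard prior correction used for choice-based (case-control) samples. It assumes an unpenalized, correctly specified logit fitted on the downsampled data, so it does nothing about the lasso shrinkage. The function name and arguments are hypothetical, chosen only for illustration.

```python
import math


def corrected_intercept(b0_sampled, pop_rate, sample_rate):
    """Shift an intercept fitted on a resampled dataset back to the population scale.

    b0_sampled  : intercept estimated on the downsampled data (hypothetical input)
    pop_rate    : share of the positive class in the full dataset, e.g. 0.02
    sample_rate : share of the positive class after downsampling, e.g. 0.5
    """
    # Log of the ratio between the sampled odds and the population odds.
    offset = math.log(((1 - pop_rate) / pop_rate) * (sample_rate / (1 - sample_rate)))
    return b0_sampled - offset


# Example: an intercept of 0 on the 50/50 sample implies a 50% baseline probability there.
b0 = corrected_intercept(0.0, pop_rate=0.02, sample_rate=0.5)
baseline_probability = 1 / (1 + math.exp(-b0))
print(b0, baseline_probability)
```

With a 2% population rate and a 50/50 sample, the offset is ln(49), so the corrected baseline probability maps back to the 2% population rate.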

Demetri Pananos
Thanks for the clarification! Do you have any recommendations on how I should go about estimating the effects of the independent variables? I have experimented with permutation importance, but it does not show the direction of an effect (does the variable affect the DV positively or negatively?). – John Locke May 21 '21 at 21:26