
I have a dataset with a count response variable, 50 observations, and 260 independent variables. Because the variance exceeds the mean, I want to use a negative binomial (NB) distribution. My objective is to build a model that can be used for future prediction, so model validation is required. Given that my p exceeds n, I need to use penalized regression, such as the lasso or elastic net. With the R package "mpath", I can find the best subset but not the coefficient estimates and p-values. The "glmnet" package has neither a negative binomial family nor p-values. The "BeSS" package gives p-values, but not for an NB model. The "glmmLasso" package looks promising, but again only for Poisson, not NB.

Can anyone suggest which method I should use for subset selection, model building, and validation? Is there a package in R or SAS? Thanks in advance.
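For reference, here is a minimal sketch of the kind of fit I am after, assuming "mpath" exposes glmregNB/cv.glmregNB with a formula interface (the data frame and variable names below are placeholders, and the argument names should be checked against the package help):

    library(mpath)   # install.packages("mpath") if needed

    ## df: 50 rows, one count response y, 260 candidate predictors (placeholder names)

    ## Lasso-penalized negative binomial regression path
    ## (alpha = 1 is assumed to give the pure lasso penalty, elastic net otherwise)
    fit <- glmregNB(y ~ ., data = df, alpha = 1, nlambda = 100)

    ## 10-fold cross-validation over lambda; with only 50 observations the folds
    ## are tiny, so the chosen lambda may be unstable across repeated runs
    cvfit <- cv.glmregNB(y ~ ., data = df, alpha = 1, nfolds = 10)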

MSilvy
  • Once you find the best subset of predictors, can't you just use the standard glm package to get the coefficients? A p-value is only useful when testing a hypothesis and will be misleading in this case, since all of the remaining variables will necessarily have inflated significance. It is better to use cross-validation to test the predictive power of your model. – deasmhumnha Mar 23 '18 at 05:05
  • I was going through other answers and found this: https://stats.stackexchange.com/questions/269949/lasso-regression-coefficients-values. Using ordinary regression after the lasso is not recommended. – MSilvy Mar 23 '18 at 15:56
  • I don't quite agree with that explanation. While post-selection regression does invalidate p-values, so do all penalized selection methods, including the LASSO (https://stackoverflow.com/a/17725220/4143644). That is why packages don't provide these statistics: they don't mean anything. Secondly, regularization can easily be added to the second regression as well. Ultimately, if prediction is your goal, the only thing that matters is applicability to unseen data, which is why I suggested cross-validation as the best way to test your model. – deasmhumnha Mar 23 '18 at 16:53
  • As far as which package to use, you might have to code your own NB + LASSO algorithm using the general model-fitting tools available in R. The two-step method seems interesting, however. – deasmhumnha Mar 23 '18 at 16:55
  • Thanks a lot. mpath has lasso + NB and penalized has lasso + Poisson. Let me try both and see how things come out. Thanks again for your input. By the way, have you come across any paper (application, not theory) that used the lasso and then took the selected variables into a GLM? – MSilvy Mar 23 '18 at 18:17
  • Honestly, I can't think of any off the cuff. Most of the papers I read just use the LASSO coefficients. I proposed a post-LASSO glm to get the coefficients that "mpath" does not provide, but on second thought it seems strange that it would not provide them. Have you tried using the summary() function on the model output? – deasmhumnha Mar 23 '18 at 23:15
  • Looking at the mpath package description, you get all of the coefficients along the path in the $beta variable of the model object. You can figure out which lambda is best using cv.glmregNB. That function also seems to fit the model on the full data, stored in the $fit variable (see the sketch after these comments). – deasmhumnha Mar 23 '18 at 23:30
  • Thanks a lot, Dezmond. You have been really helpful. I will try both mpath and penalized. – MSilvy Mar 25 '18 at 03:07
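
A rough sketch of the post-selection workflow discussed in these comments, using the $fit and $beta slots mentioned above; the lambda.which slot and the exact argument names are assumptions and should be checked against the cv.glmregNB help page:

    library(mpath)

    ## df: the same 50 x 261 data frame used for the original fit (placeholder name)
    ## Cross-validate the lasso-penalized NB path to choose lambda
    cvfit <- cv.glmregNB(y ~ ., data = df, nfolds = 10)

    ## Per the package description quoted above, the full-data fit is kept in
    ## cvfit$fit and its coefficients along the lambda path in $beta.
    full_fit <- cvfit$fit

    ## Assumption: the index of the best lambda is exposed as $lambda.which;
    ## check names(cvfit) for the exact slot name.
    best <- cvfit$lambda.which

    coefs    <- full_fit$beta[, best]                # coefficients at the chosen lambda
    selected <- rownames(full_fit$beta)[coefs != 0]  # the selected subset of predictors

Out-of-sample predictive performance, rather than p-values, can then be assessed on held-out data, as suggested in the comments.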

0 Answers