For the initial factor predictors, it is debatable whether the insignificant levels should be "merged". Note that your approach seems to simply drop the insignificant levels, which is incorrect.
For a significant factor predictor (one where at least one level is significant), the number of significant levels depends on which level is chosen as the base level, because each level's estimate is the difference between that level and the base level.
For example, suppose a significant factor has 4 levels: A, B, C, D.
If we choose level A as the base level, we may get a result like the one below (only level D is significant):
B .
C .
D ****
However, when we choose level D as the base level, we will find that all the remaining levels are significant:
A ****
B ****
C ****
This happens because levels A, B, and C are similar to one another, while level D differs from all of them.
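This base-level dependence is easy to demonstrate with a small simulation. The sketch below is not from the question; the data are simulated and the `dummy_fit` helper is a hypothetical plain-NumPy OLS fit under treatment (dummy) coding. It fits the same one-way model twice, once per base level, and reports the t-statistics of the non-base levels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: levels A, B, C share a similar mean, level D differs.
levels = np.repeat(["A", "B", "C", "D"], 50)
means = {"A": 0.0, "B": 0.1, "C": -0.1, "D": 3.0}
y = np.array([means[l] for l in levels]) + rng.normal(0, 1, size=levels.size)

def dummy_fit(y, levels, base):
    """OLS with treatment (dummy) coding relative to `base`;
    returns the t-statistic of each coefficient."""
    others = [l for l in sorted(set(levels)) if l != base]
    X = np.column_stack([np.ones(levels.size)] +
                        [(levels == l).astype(float) for l in others])
    beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    df = levels.size - X.shape[1]
    sigma2 = rss[0] / df
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return dict(zip(["(Intercept)"] + others, beta / se))

print(dummy_fit(y, levels, base="A"))  # D's |t| dwarfs those of B and C
print(dummy_fit(y, levels, base="D"))  # A, B and C all have large |t|
```

Only the coding changes between the two calls; the fitted model, and hence which levels look "significant", is a property of the chosen contrasts, not of the factor itself.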
As a result, simply dropping insignificant levels does not make sense. Many researchers think we should keep all the levels as long as at least one of them is significant; this is the convention R follows, and it is simple.
Researchers of another school think we can "merge" the insignificant levels to reduce the number of parameters. This idea, however, requires a more sophisticated procedure that tests the candidate merges step by step.
For the example above, we can first try merging AB, AC, and BC, fit the three resulting models, and keep the best one (say AB, which leaves the levels AB, C, D). We can then try merging AB with C and test that model, because at each step we should drop only one parameter and test it.
We should also try merging the significant levels, for the reason given at the beginning: their significance depends on the base level.
So if we follow this school, the workload increases a lot, because we have to try all candidate pairs of levels step by step, both significant and insignificant ones.
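The stepwise merging idea can be sketched as a greedy search: at each step, merge the pair of groups whose merge most improves a model-selection criterion, and stop when no merge helps. The sketch below uses AIC on a one-way means model; the simulated data, the choice of AIC, and the `aic` helper are my own illustrative assumptions, not a standard procedure from any particular package:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Simulated data as before: A, B, C are similar, D is different.
levels = np.repeat(["A", "B", "C", "D"], 50)
means = {"A": 0.0, "B": 0.1, "C": -0.1, "D": 3.0}
y = np.array([means[l] for l in levels]) + rng.normal(0, 1, size=levels.size)

def aic(y, groups):
    """AIC of a one-way means model; `groups` maps each level of
    `levels` (taken from the enclosing scope) to its merged group."""
    labels = np.array([groups[l] for l in levels])
    fitted = np.zeros_like(y)
    for g in set(labels):
        fitted[labels == g] = y[labels == g].mean()
    n = y.size
    rss = ((y - fitted) ** 2).sum()
    k = len(set(labels)) + 1          # one mean per group + error variance
    return n * np.log(rss / n) + 2 * k

# Start with every level in its own group; repeatedly merge the pair of
# groups whose merge lowers AIC the most; stop when no merge lowers it.
groups = {l: l for l in "ABCD"}
while True:
    current = aic(y, groups)
    best = None
    for g1, g2 in combinations(sorted(set(groups.values())), 2):
        trial = {l: (g1 + g2) if g in (g1, g2) else g
                 for l, g in groups.items()}
        score = aic(y, trial)
        if score < current and (best is None or score < best[0]):
            best = (score, trial)
    if best is None:
        break
    groups = best[1]

print(groups)  # D typically remains in its own group
```

Note how much work even this tiny example does: every surviving pair of groups is refitted at every step, which is exactly the extra workload described above.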
For the initial continuous variables, we may split them into factors/groups, but this also needs a sophisticated testing procedure. We should first treat the numeric variable as a one-level factor, then try splitting it into a 2-level factor at every candidate split point and choose the best point, then try splitting one of the resulting levels into two again, and so on. This idea is similar to CART (classification and regression trees), which also splits numeric variables into discrete groups/nodes in order to model non-linear effects.
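The first splitting step can be sketched as an exhaustive search over cut points, exactly what a single CART stump does. The data are simulated and the `best_split` helper is a hypothetical illustration, not code from any tree package:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data with a step in the mean at x = 4.
x = rng.uniform(0, 10, 300)
y = np.where(x < 4, 1.0, 3.0) + rng.normal(0, 0.5, 300)

def best_split(x, y):
    """Try every candidate cut point and return the one that minimises
    the residual sum of squares around the two group means."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_rss, best_cut = np.inf, None
    for i in range(1, xs.size):
        left, right = ys[:i], ys[i:]
        rss = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if rss < best_rss:
            best_rss, best_cut = rss, (xs[i - 1] + xs[i]) / 2
    return best_cut

cut = best_split(x, y)
print(cut)  # close to the true change point at 4
```

Repeating this search inside each resulting level, and testing whether each extra split is worth its extra parameter, gives the stepwise procedure described above.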
Besides, we can use splines and similar tools to model non-linear effects, which may be easier in some cases than splitting into factors.
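As a minimal sketch of the spline alternative: a cubic spline can be written as an ordinary linear model in a truncated-power basis, so no stepwise search is needed, only a fixed set of knots. The simulated data, the knot positions, and the `cubic_spline_basis` helper below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)

def cubic_spline_basis(x, knots):
    """Truncated-power basis for a cubic spline:
    1, x, x^2, x^3, and (x - k)_+^3 for each knot k."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

# Three interior knots chosen by eye; in practice knot placement matters.
X = cubic_spline_basis(x, knots=[2.5, 5.0, 7.5])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rmse = np.sqrt(((y - X @ beta) ** 2).mean())
print(rmse)  # root-mean-square residual of the spline fit
```

One least-squares fit captures the smooth non-linear effect that the factor-splitting approach would need many tested splits to approximate.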