Much of this will depend on exactly how you decide which factor levels to pool.
Parameter estimates (e.g., for prediction) may even improve with pooling. Essentially, you are building a more parsimonious model: your parameter estimates will have lower variance (because of the pooling) but higher bias. This is the ubiquitous bias-variance tradeoff. Your pooling reminds me of trees, which are certainly standard methods for prediction (with extensions, like random forests).
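Here is a minimal sketch of what such pooling could look like, assuming a one-way layout where the OLS coefficients are just the per-level means; the merging rule (greedily pooling levels whose sorted estimates are within a tolerance `tol`) and all the simulation settings are illustrative choices, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(0)

# A factor with 6 levels but only 3 distinct true means, so levels
# {0,1}, {2,3} and {4,5} "should" end up pooled.
true_means = np.array([0.0, 0.0, 1.0, 1.0, 2.5, 2.5])
n_per_level = 15
levels = np.repeat(np.arange(6), n_per_level)
y = true_means[levels] + rng.normal(scale=1.0, size=levels.size)

# Unpooled estimates: per-level sample means (the cell-means OLS fit).
est = np.array([y[levels == k].mean() for k in range(6)])

def pool_levels(estimates, tol):
    """Greedily merge levels whose sorted estimates are less than tol apart.

    Returns an array mapping each original level to a pooled group id.
    """
    order = np.argsort(estimates)
    groups = np.empty(len(estimates), dtype=int)
    g = 0
    groups[order[0]] = g
    for prev, cur in zip(order[:-1], order[1:]):
        if estimates[cur] - estimates[prev] >= tol:
            g += 1  # gap too big: start a new pooled group
        groups[cur] = g
    return groups

groups = pool_levels(est, tol=0.5)  # tol is a tuning choice
pooled = groups[levels]             # the recoded, coarser factor

# Pooled estimates: means within each pooled group. Fewer parameters,
# hence lower variance, at the price of some bias (the tradeoff above).
pooled_est = np.array([y[pooled == g].mean() for g in range(groups.max() + 1)])
print("unpooled estimates:", est.round(2))
print("pooled grouping:   ", groups)
print("pooled estimates:  ", pooled_est.round(2))
```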
Inference will probably be a little trickier. Standard theory is no longer valid, since you are transforming your data after estimation, so don't expect to read $p$ values off the usual $t$ tables. However, you are not filtering on low $p$ values, nor hunting for "optimal cutpoints", but explicitly pooling levels with similar parameter estimates, so the result may actually not be too far from the standard tables.
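The first part of that warning can be made concrete with a small simulation, reusing `pool_levels` and the setup from the sketch above (and assuming SciPy for the $F$ test): under a global null where all level means are equal, pooling first and then naively testing the pooled groups against the standard tables rejects far more often than the nominal 5%, precisely because the surviving groups were selected to look different.

```python
from scipy import stats

# Global null: all six level means are equal. Pool first, then naively
# F-test the pooled groups against the standard tables.
n_sim, rejections, tested = 2000, 0, 0
for _ in range(n_sim):
    y0 = rng.normal(size=levels.size)
    e0 = np.array([y0[levels == k].mean() for k in range(6)])
    g0 = pool_levels(e0, tol=0.5)
    if g0.max() == 0:
        continue  # everything pooled into one group: nothing left to test
    samples = [y0[g0[levels] == g] for g in range(g0.max() + 1)]
    _, p = stats.f_oneway(*samples)
    tested += 1
    rejections += p < 0.05
print(f"tested in {tested}/{n_sim} replicates; "
      f"naive rejection rate among those: {rejections / max(tested, 1):.2f}")
```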
If you do decide to go this way (as I said, it could actually improve your model's predictive performance), I recommend that you do some bootstrapping to assess how variable your parameter pooling is.
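A minimal version of that bootstrap, again reusing the objects from the first sketch: resample rows with replacement, redo the estimate-then-pool step, and record how often each pair of levels lands in the same pooled group. Pairs that are pooled in, say, only half of the replicates signal that the grouping is fragile; the pairwise co-pooling frequency is just one possible stability summary.

```python
# Bootstrap the estimate-then-pool pipeline and track pairwise co-pooling.
n_boot, K, n = 500, 6, levels.size
together = np.zeros((K, K))
kept = 0
for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    yb, lb = y[idx], levels[idx]
    if len(np.unique(lb)) < K:
        continue  # a level vanished from this resample; skip it
    est_b = np.array([yb[lb == k].mean() for k in range(K)])
    gb = pool_levels(est_b, tol=0.5)
    together += gb[:, None] == gb[None, :]
    kept += 1
# Proportion of bootstrap replicates in which levels i and j were pooled.
print((together / max(kept, 1)).round(2))
```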