
I am estimating a logistic regression on a subset of my data and predicting the outcome for the whole sample.

  1. In the regression, only some regressors are statistically significant at the chosen level. Should I include only those in the prediction, or all of them?
  2. I cannot use my preferred model specification, as it includes year fixed effects and the years do not fully overlap between the two samples. Should I just go with my second-best specification, without the fixed effects?
  3. How do I go about the first point in R?

For reference, my code in R looks like this:

db_tr <- data[data$group == "1", ]  # training subsample
db_pr <- data[data$group == "2", ]  # subsample to predict on

f <- formula(y ~ x1 + x2 + factor(year))
logit.training <- glm(f, family = binomial(link = "logit"), data = db_tr)
logit.predict <- predict(logit.training, newdata = db_pr, type = "response")
MCS
  • I'm having trouble understanding your description in point 2. Is it possible to clarify this? – user20160 Jan 05 '21 at 19:39
  • My toy database has "year" among its variables, which I use for time fixed effects. However, if the training database (i.e. `db_tr`) covers, say, years 1 to 5 and the prediction dataset (i.e. `db_pr`) covers years 6 to 10 with no overlap, then, despite getting more precise estimates when I include time fixed effects, I cannot include them in the formula because I get the following error in R: `Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor factor(year) has new levels 6, 7, 8, 9, 10` – MCS Jan 06 '21 at 16:35
  • "Which variables to include?" I would refer you to Hosmer & Lemenshow's book "Applied Logistic Regression", chapter 4: Model-Building Strategies and Methods for Logistic Regression, and in particular Section 4.2 Purposeful Selection of Covariates. – ColorStatistics Jan 06 '21 at 18:17

2 Answers


Not sure what will constitute "reputable" in your eyes, but here is an approach.

  1. There is no need to remove predictors that fail to achieve statistical significance. I suspect the thinking here is that statistical significance is a sign of importance, but that isn't completely true. Statistically significant effects can be small and have little impact on the predicted probabilities, while large effects can come with large uncertainty. Removing the latter may hurt the model.

  2. I'm not quite sure what you mean here. I see you're modelling year as a factor, which I would advise against. Time should be modelled as a continuous variable, not least because otherwise the model will not be usable in future years. For example, if your data had years 2015 through 2020 and you passed the model data from 2021, then 2021 would not be recognized as a factor level and thus would have no regression coefficient. Modelling year as continuous avoids this and allows prediction for years not in the training data. You can easily verify this in R, as in the sketch below.
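A quick toy check (the data and object names below are invented just for this illustration):

set.seed(1)
toy <- data.frame(year = rep(2015:2020, each = 20))
toy$y <- rbinom(nrow(toy), 1, plogis(-2 + 0.3 * (toy$year - 2015)))
new_year <- data.frame(year = 2021)

# year as a factor: prediction on an unseen year throws an error
m_factor <- glm(y ~ factor(year), family = binomial, data = toy)
# predict(m_factor, newdata = new_year)  # error: factor(year) has a new level

# year as a continuous variable: the unseen year can be predicted
m_numeric <- glm(y ~ year, family = binomial, data = toy)
predict(m_numeric, newdata = new_year, type = "response")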

Demetri Pananos
  • 1. My struggle lies specifically with whether to include large effects with large uncertainty (not statistically significant) that would skew my predicted probabilities. – MCS Jan 06 '21 at 18:38
  • 2. Your suggestion certainly solves the error I receive, but I was hoping to include year-specific fixed effects in my model using dummy variables (an LSDV model). Would modelling years as a continuous variable not fail to do that? – MCS Jan 06 '21 at 18:42

This started as a comment to Demetri's answer, but it got too long so I made it a proper answer.

If you are trying to capture non-linear effects of the time variable, you should model it non-linearly. Fixed effects as you propose do that perfectly, since each year gets its own orthogonal contribution as a parameter. But if you try to predict an out-of-time sample (a data point whose year is not among the years in the training data), that point will not have a corresponding column in the model matrix. Remember, if you have a categorical variable with 3 categories and, say, 4 samples/observations, this is how you write the model matrix, where $A$ is the reference category:

$$ \begin{bmatrix} A \\ B \\ C \\ C \end{bmatrix}_{\text{samples}} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}_{\text{model matrix}} $$

In your approach each category is a year, i.e. a fixed effect. Now, what do you do if a new sample has category $D$? One of two things happens: either you get no prediction at all (R throws an error), or the new category is erroneously mapped onto the reference level or onto some other category's parameter. You can inspect the encoding directly in R, as shown below.
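Here is what that dummy coding looks like via model.matrix (the variable name grp is made up for this illustration):

grp <- factor(c("A", "B", "C", "C"))
model.matrix(~ grp)  # A is the reference level
#   (Intercept) grpB grpC
# 1           1    0    0
# 2           1    1    0
# 3           1    0    1
# 4           1    0    1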

If your focus is on in-sample prediction, this isn't a problem at all, and I would argue that, if you have enough data, it is the best approach. Even if the standard errors are large (which indicates that this information, year in our case, is noisy and that other variables most likely explain the variation in the outcome better), the estimates can still be useful in a prediction setting.

However, if out-of-time prediction is a goal, or if you can't spare the degrees of freedom (another way of saying you don't have enough data), a continuous modelling approach is the way to go. By continuous we don't mean linear. There are many ways of doing this; the simplest is a polynomial transformation of the year variable. Another is splines, which come in many flavours (natural cubic splines, B-splines, etc.) and can be included in your model via the splines or mgcv packages in R; see the sketch just below. This class of modelling techniques is also called GAMs, or generalized additive models, in case you'd like to read more about it.
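As a sketch, the two simplest continuous specifications, reusing the toy variable names from the question:

library(splines)
f_poly <- y ~ x1 + x2 + poly(year, 2)     # quadratic polynomial in year
f_ns   <- y ~ x1 + x2 + ns(year, df = 3)  # natural cubic spline in year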


I just saw your comment about the years in the prediction dataset. Your problem is solved by splines. To use splines (without penalization/regularization) with the base R glm function, you'll need the splines package, which comes installed with R. This is how to fit the model:

library(splines)
db_tr <- data[data$group == "1", ]
db_pr <- data[data$group == "2", ]

# bs(year) replaces factor(year) with a B-spline basis in year
f <- formula(y ~ x1 + x2 + bs(year))
logit.training <- glm(f, family = binomial(link = "logit"), data = db_tr)
logit.predict <- predict(logit.training, newdata = db_pr, type = "response")

bs stands for B-spline; it is the R function that creates the B-spline basis expansion of the variable passed to it, inside the model matrix built from the formula.

However, the preferred way to estimate these types of models is through gam in the mgcv package, as sketched below. Here you can find more about GAMs.
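A minimal sketch of that mgcv alternative, reusing the data frames defined above:

library(mgcv)
# s(year) fits a penalized smooth of year; the degree of smoothing is chosen automatically
gam.training <- gam(y ~ x1 + x2 + s(year),
                    family = binomial(link = "logit"), data = db_tr)
gam.predict <- predict(gam.training, newdata = db_pr, type = "response")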

Guilherme Marthe