In my dataset, I have one dependent variable and 6 explanatory variables, and I am interested in which predictor is the best. The relationship between the explanatory variables and the dependent variable is linear.

My initial idea was to fit one linear model for each of the predictors and compare the resulting RMSEs. To take the uncertainty into account, my idea was to obtain quantiles for the RMSE by bootstrapping, as in the sketch below.
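To make the idea concrete, here is a minimal sketch of what I have in mind (the toy DataFrame below just stands in for my actual data; the column names y, x1, ..., x6 are placeholders):

```python
# Minimal sketch of the bootstrap idea: for a single predictor, refit the
# one-predictor linear model on bootstrap resamples and collect the RMSEs.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Toy stand-in for my data: outcome "y" and predictors "x1".."x6"
df = pd.DataFrame(rng.normal(size=(200, 6)),
                  columns=[f"x{i}" for i in range(1, 7)])
df["y"] = 2 * df["x1"] + df["x2"] + rng.normal(size=200)

def bootstrap_rmse(data, predictor, outcome="y", n_boot=1000):
    """Bootstrap RMSE samples for a linear model with a single predictor."""
    rmses = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(data), size=len(data))  # resample rows with replacement
        boot = data.iloc[idx]
        X = boot[[predictor]].to_numpy()
        y = boot[outcome].to_numpy()
        fit = LinearRegression().fit(X, y)
        rmses.append(np.sqrt(mean_squared_error(y, fit.predict(X))))
    return np.array(rmses)

# 95% bootstrap interval for the RMSE of the model that uses only "x1"
samples = bootstrap_rmse(df, "x1")
print(np.quantile(samples, [0.025, 0.975]))
```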

Now I was told that I can compare the predictive accuracy of the covariates by fitting one multiple regression model that contains all 6 explanatory variables and looking at the p-values from the t-tests. In my case this resulted in only one significant test result. However, I believe that this is not really the correct way to proceed, as I have read (e.g. here) that p-values should not be used to assess feature importance.

My questions are:

  1. Which of the two ways I suggested is legitimate?
  2. What would be another (maybe even standard) procedure to solve this task?
Sebastian
  • I have used the "leave-one-out" technique, where each predictor is iteratively removed from the regression one at a time to determine if it has any effect on the fitting results. This can sometimes be useful to weed out the less useful predictors that are not contributing to the regression model. – James Phillips Dec 14 '19 at 17:23

1 Answer

One-at-a-time regressions can suffer from omitted-variable bias. This is true in linear regressions when the omitted variable is correlated both with the outcome and with the included predictors, as discussed on this page.*

So one should be highly skeptical of results from one-at-a-time analyses.

Of the 2 methods you propose, multiple regression is the better way to go, provided that you have enough cases that you aren't overfitting. When your multiple regression returns only 1 coefficient significantly different from 0, that means that only 1 predictor is significantly associated with the outcome, given the size of your data sample, once the other predictors are taken into account by the multiple regression.
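Here is a small simulated illustration (not your data; just two correlated predictors with known coefficients) of how the one-at-a-time approach misleads while the multiple regression recovers the true coefficients:

```python
# Simulated example of omitted-variable bias: x1 and x2 are correlated and
# both affect y, with true coefficients 1 and 2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)    # correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

# One-at-a-time regression, omitting x2: the slope on x1 comes out near 2.6,
# far from its true value of 1 (bias = 2 * cov(x1, x2) / var(x1)).
print(sm.OLS(y, sm.add_constant(x1)).fit().params)

# Multiple regression with both predictors recovers roughly [0, 1, 2];
# fit.summary() shows the t-tests / p-values discussed above.
fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.params)
```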

That does not mean, however, that the single "significant" predictor is the only (or even most) "important" feature. In particular, you shouldn't just go ahead blindly with a model based solely on that predictor. When predictors are correlated (as they generally are) the issue of which are "most important" becomes quite tricky and the particular choice can depend heavily on the particular data sample at hand. This page discusses the problems with trying to automate feature selection.

There are 2 other approaches that you might consider.

One is LASSO, which provides a principled way to identify a set of features most useful for prediction. Coefficients of the retained features are shrunk toward 0, to smaller absolute values than they would have in a standard regression on those features, which reduces overfitting. The retained features might not be the "most important" in some theoretical sense, but they can often work well for prediction.
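A rough sketch with scikit-learn's LassoCV (the toy X, y, and feature_names below are placeholders standing in for your 6 predictors and outcome):

```python
# LASSO with a cross-validated penalty; features are standardized first
# because the L1 penalty is not scale-invariant.
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)   # toy stand-in data
feature_names = [f"x{i}" for i in range(1, 7)]

lasso = make_pipeline(StandardScaler(), LassoCV(cv=5))
lasso.fit(X, y)

for name, coef in zip(feature_names, lasso.named_steps["lassocv"].coef_):
    print(f"{name}: {coef:.3f}")   # coefficients shrunk exactly to 0 are dropped
```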

The second is boosted regression trees. That approach can allow for non-linearities and for interactions among features. Measures of feature importance are then based on the difference that omitting a feature makes in terms of model performance. Those measures of importance can be difficult to interpret, however, as they include both direct and interaction terms involving each feature. And again, the importance measure is only in the context of the entire model.
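A corresponding sketch of boosted regression trees, using permutation importance (the drop in held-out performance when a feature's values are shuffled) as one such omission-based importance measure, again with placeholder data:

```python
# Gradient-boosted regression trees with permutation importance, which
# reflects both a feature's direct effects and its interactions.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       noise=10.0, random_state=0)   # toy stand-in data
feature_names = [f"x{i}" for i in range(1, 7)]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gbm = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

result = permutation_importance(gbm, X_test, y_test, n_repeats=30, random_state=0)
for name, imp in zip(feature_names, result.importances_mean):
    print(f"{name}: {imp:.3f}")
```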

So think carefully about what you mean by "which predictor is the best." There might be no single, simple answer to that question.


*For other types of regressions, like logistic or Cox proportional hazards regressions, omitting any predictor associated with the outcome will lead to bias in the regression coefficients for the included predictors, regardless of correlations with the included predictors. See this page for a nice analytic proof in the case of probit regressions.

EdM
  • See the Diebold-Mariano test for a totally different way of looking at your problem. It's non-parametric in the sense that, given the forecasts of each model, you don't need to know anything about the original models. Note that the test is purely for evaluating predictive ability. Nothing else. – mlofton Dec 14 '19 at 19:30
  • @mlofton note that Diebold-Mariano was designed for comparing forecasts, not comparing models per se. See [Diebold 2012](https://www.nber.org/papers/w18391) for extensive discussion of application to model comparison, which is what the OP seems to be getting at. – EdM Dec 14 '19 at 19:41
  • 1
    yes, if he wants to do "model" comparisons, DM test is not the thing to use. I tried to stress that but you did a better job. It's purely a tool for saying that "this set of forecasts is better than that set of forecasts". The forecasts could come from anywhere including a crayon.Thanks. – mlofton Dec 14 '19 at 19:45