Explanatory multiple linear regression

Question

I have a data set consisting of 104 responses from an online survey (the data have been cleaned), and I want to test the relation between 14 independent variables (12 variables about which hypotheses have been generated, plus 2 "control" variables) and 1 dependent variable. The hypotheses were formulated on the basis of indications in the literature.

I chose to do a regression analysis even though the sample size is not optimal. After running the "lm()" function in R I got an F-statistic with a p-value below 5%, indicating that at least one of the independent variables is related to the dependent variable. What is more, looking at the p-values of the individual independent variables, 2 of them are below 5%.

What I want to ask is: should I keep all the variables in the model and report the results, or should I re-run the analysis including only the 2 variables with the low p-values? Or is it better to do variable selection using automated methods, e.g. stepwise selection?

My goal is to draw causal conclusions (explanatory regression) about the relation between independent variables and the dependent variable, so using stepwise methods for variable selection does not seem appropriate for my case, as I am not aiming for predictions.

You can if your goal is to better explain the relationships, but not to perform additional tests. — user2974951, Jan 30 '19 at 11:57

score 2 · Accepted Answer · answered Jan 30 '19 at 03:02

Stepwise methods for model selection are generally considered to be a bad idea (see e.g., here), and they have been rendered defunct by modern methods. These days it is usual to perform model selection either with some form of penalty method (e.g., LASSO or ridge regression) or by fitting all possible models and using partial F-tests to determine the appropriate number of model terms. All of these methods can be done in R without too much difficulty, though the latter is only computationally feasible if you have a relatively small number of regressor variables (which you do).
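As a rough illustration of the partial F-test mentioned above, the sketch below compares a reduced model against a fuller one with `anova()`. The data are simulated and the variable names (`x1` to `x4`, `y`) are hypothetical stand-ins, not the survey data from the question.

```r
# Simulated stand-in data (hypothetical; not the asker's survey)
set.seed(42)
n <- 104
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 0.8 * dat$x1 + 0.5 * dat$x2 + rnorm(n)

# Two nested models: the reduced model drops x3 and x4
reduced <- lm(y ~ x1 + x2, data = dat)
full    <- lm(y ~ x1 + x2 + x3 + x4, data = dat)

# Partial F-test: the null hypothesis is that the coefficients
# of the dropped terms (x3 and x4) are all zero
cmp <- anova(reduced, full)
print(cmp)

# The same statistic computed by hand from the residual sums-of-squares
rss_reduced <- sum(residuals(reduced)^2)
rss_full    <- sum(residuals(full)^2)
q <- 2  # number of terms dropped from the full model
f_stat <- ((rss_reduced - rss_full) / q) / (rss_full / df.residual(full))
```

A large p-value in the `anova()` table would suggest the extra terms add little beyond the reduced model; it does not by itself justify dropping variables that were included for theoretical reasons.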

All possible models: In your data you have $m = 14$ variables, which gives $M \equiv 2^m = 2^{14} = 16{,}384$ possible models. (For simplicity I will assume that all models have an intercept term.) It would be feasible to compute all possible models in this case and then narrow your model selection down to those models that minimise the residual sum-of-squares for a fixed number of regressors. To implement this method you can use the leaps package or use custom code here. Although this approach is preferred to stepwise methods, it is itself somewhat controversial, and some consider it to constitute 'dredging'.
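The all-possible-models search can be sketched in base R as follows. This is a minimal illustration on simulated data with only 6 hypothetical predictors (`x1` to `x6`) so that it runs instantly; it is not the custom code referred to above, and for 14 predictors the leaps package would normally be the more efficient route.

```r
# Simulated stand-in data (hypothetical; the question has m = 14 and n = 104)
set.seed(1)
n <- 104
m <- 6
X <- as.data.frame(matrix(rnorm(n * m), n, m))
names(X) <- paste0("x", seq_len(m))
dat <- cbind(y = 1 + 2 * X$x1 - 1.5 * X$x2 + rnorm(n), X)
predictors <- names(X)

# For each model size k, fit every subset of k predictors (intercept
# always included) and keep the subset with the smallest residual
# sum-of-squares
best_by_size <- do.call(rbind, lapply(seq_len(m), function(k) {
  subsets <- combn(predictors, k, simplify = FALSE)
  rss <- vapply(subsets, function(vars) {
    fit <- lm(reformulate(vars, response = "y"), data = dat)
    sum(residuals(fit)^2)
  }, numeric(1))
  best <- which.min(rss)
  data.frame(size  = k,
             terms = paste(subsets[[best]], collapse = " + "),
             rss   = rss[best])
}))

print(best_by_size)
```

The resulting table lists the best model of each size. Because the residual sum-of-squares can only fall as terms are added, choosing among sizes still needs a separate criterion, such as the partial F-tests mentioned earlier.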

LASSO and ridge regression: These regression methods impose a penalty function on the size of the regression coefficients (see e.g., here). The objective function used in the model minimises the residual sum-of-squares plus a penalty applied to the coefficient estimators. The LASSO penalty can shrink coefficients exactly to zero, so it engages in variable selection automatically; the ridge penalty shrinks coefficients towards zero but does not remove terms from the model. There is a tuning parameter that you can adjust to penalise terms more or less heavily, and in the LASSO this affects the number of terms that remain in the model.
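To make the penalised objective concrete, one common way of writing the LASSO criterion is $\frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1$, where $\lambda$ is the tuning parameter (the $\frac{1}{2n}$ scaling is an assumption here; implementations differ). The sketch below minimises this criterion by coordinate descent in base R on simulated data. The function `lasso_cd` and all variable names are hypothetical and written only for illustration; in practice a dedicated package such as glmnet would usually be used instead.

```r
# Simulated stand-in data (hypothetical; not the asker's survey)
set.seed(7)
n <- 104
p <- 6
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)

# Soft-thresholding operator: shrinks z towards zero by gamma
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Coordinate descent for (1 / (2n)) * RSS + lambda * sum(abs(beta)),
# with standardised predictors and a centred response (so no intercept)
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  Xs <- scale(X)
  yc <- y - mean(y)
  n <- nrow(Xs)
  beta <- setNames(numeric(ncol(Xs)), colnames(Xs))
  for (iter in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Residual with predictor j left out of the current fit
      partial_resid <- yc - Xs[, -j, drop = FALSE] %*% beta[-j]
      z <- sum(Xs[, j] * partial_resid) / n
      beta[j] <- soft_threshold(z, lambda) / (sum(Xs[, j]^2) / n)
    }
  }
  beta
}

# Heavier penalties leave fewer non-zero coefficients
for (lambda in c(0, 0.1, 0.5, 10)) {
  b <- lasso_cd(X, y, lambda)
  cat(sprintf("lambda = %4.1f  non-zero terms: %s\n", lambda,
              paste(names(b)[b != 0], collapse = ", ")))
}
```

With `lambda = 0` the fit reduces to ordinary least squares on the standardised predictors, and a sufficiently large `lambda` sets every coefficient to zero. The value used in an analysis is typically chosen by cross-validation.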
