
When using forward selection for multiple linear regression, I've seen several metrics:

(1) Using MSE - at each step, try adding each variable one at a time, see which variable reduces the MSE the most, add that variable to the multiple linear regression, and repeat.

e.g., say we're trying to predict the weight of a person. Our features are height, age, gender, waist size, and chest size. At the first step, we have just the intercept, so we fit five simple linear regression models, one each for height, age, gender, waist size, and chest size. We find which one results in the smallest MSE (say it's height here), add height to our model, and go to the next step. Now we have age, gender, waist size, and chest size left. At the next step, we add age and find the MSE, remove age and add gender, find the MSE, and so on. Then we find which of the 4 remaining variables results in the smallest MSE and add that variable.
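To make the procedure in (1) concrete, here is a minimal sketch of how I understand it, using only NumPy. The data are synthetic, and the helper names `fit_mse` and `forward_select` are made up for illustration; the MSE here is the in-sample (training) MSE, which is an assumption about what the metric means.

```python
import numpy as np

def fit_mse(X, y):
    """Least-squares fit with an intercept; return the training MSE."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(np.mean(resid ** 2))

def forward_select(X, y, names, n_steps):
    """Greedy forward selection: at each step add the variable
    whose inclusion gives the smallest training MSE."""
    selected, remaining = [], list(range(X.shape[1]))
    history = []
    for _ in range(n_steps):
        # Try each remaining variable on top of the current model.
        scores = {j: fit_mse(X[:, selected + [j]], y) for j in remaining}
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        history.append((names[best], scores[best]))
    return history

# Synthetic stand-in for the weight example (not real data):
# the response depends only on "height" and "waist size".
rng = np.random.default_rng(0)
n = 200
names = ["height", "age", "gender", "waist size", "chest size"]
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] + 1.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

history = forward_select(X, y, names, n_steps=3)
for name, mse in history:
    print(f"added {name}: training MSE = {mse:.3f}")
```

Because the models are nested, the training MSE can only go down (or stay the same) at each step, so this sketch needs a separate stopping rule; here it simply stops after a fixed number of steps.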

(2) Using $R^2$ - at each step, try adding each variable one at a time, see which variable increases the $R^2$ the most, add that variable to the multiple linear regression, and repeat. The procedure is the same as above, except we use $R^2$ instead of MSE.
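For reference, here is how the two metrics in (1) and (2) relate, assuming MSE means the in-sample error $\mathrm{SSE}/n$ and $R^2$ is the usual coefficient of determination:

$$
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, \qquad \mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = n \cdot \mathrm{MSE}, \qquad \mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2 .
$$

Since $\mathrm{SST}$ and $n$ depend only on the response and not on which variables are in the model, the candidate with the smallest MSE at a given step is also the one with the largest $R^2$, so under these assumptions (1) and (2) should pick the same variable.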

(3) Using p-values and t-statistics - at each step, see which candidate variable's coefficient has the smallest p-value and add that variable. This one makes less sense to me because there is no normalization across the candidate models. e.g., if we already have height in our model, and we're trying to figure out which one of the following models to use:

(height, age)

(height, gender)

(height, waist size)

(height, chest size)

The p-values for age, gender, waist size, and chest size don't appear to be comparable? In addition, the p-value for height may be different across the 4 models as well.
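One identity that may bear on whether these p-values are comparable, assuming ordinary least squares and that each candidate adds exactly one coefficient to the same current model (so gender is a single dummy variable): the squared t-statistic of the newly added coefficient equals the partial F-statistic,

$$
t_j^2 = F_j = \frac{\mathrm{SSE}_{\text{current}} - \mathrm{SSE}_j}{\mathrm{SSE}_j / (n - p - 1)},
$$

where $\mathrm{SSE}_{\text{current}}$ is the residual sum of squares of the current model, $\mathrm{SSE}_j$ is that of the model with candidate $j$ added, and $p$ is the number of predictors in the larger model (notation introduced here, not from any particular textbook). Because $\mathrm{SSE}_{\text{current}}$, $n$, and $p$ are the same for all four candidates, the p-value of the new coefficient is a decreasing function of $F_j$, and $F_j$ is a decreasing function of $\mathrm{SSE}_j$. If that is right, the smallest p-value for the added variable would coincide with the smallest MSE at that step, although the p-value for height can still differ across the 4 models.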

So (3) doesn't seem like a great option. How about (1) and (2) or other methods to perform forward selection?

student010101
  • Please read the discussion on [this page](https://stats.stackexchange.com/q/20836/28500), or the many related threads on this site, and re-think your approach. Forward selection is perhaps the worst of all methods for automated model building. None of your proposed methods is a "great option"; there are many better ones. – EdM Apr 07 '21 at 16:26
  • @EdM But I'm specifically trying to learn about (and make sense of) forward selection, and not why it's not an ideal approach. – student010101 Apr 07 '21 at 16:48
