
Suppose there are $p = 3$ variables in total and suppose the forward stepwise procedure selects the third variable. The forward stepwise procedure will assign it a positive coefficient if and only if the following two conditions hold: $$X_3^Ty/||X_3||_2 \geq \pm X_1^Ty/||X_1||_2$$ and $$X_3^Ty/||X_3||_2 \geq \pm X_2^Ty/||X_2||_2,$$ where $X_j$ is the $j$th column vector of the design matrix $X \in \mathbb{R}^{N \times p}$ and $y \in \mathbb{R}^N$ is the response vector.

My question is: how does "the 3rd variable minimizes the residual sum of squares" translate to the above two conditions?

From my understanding, the procedure selects the third variable $X_3$ if $\sum_{i=1}^N (y_i - \hat{\beta}_3X_{3i})^2$ is smaller than both $\sum_{i=1}^N (y_i - \hat{\beta}_1X_{1i})^2$ and $\sum_{i=1}^N (y_i - \hat{\beta}_2X_{2i})^2$, where $\hat{\beta}_j = \left(X_j^TX_j\right)^{-1}X_j^Ty$ is the OLS estimate for the $j$th variable alone. How does this translate to the two conditions listed above? I think those conditions are saying the following:

"$\hat{\beta}_3 = \frac{\sum_{i=1}^N X_{3i}y_i}{\sum_{i=1}^N X_{3i}^2}$ is positive iff $\frac{\sum_{i=1}^N X_{3i}y_i}{\sqrt{\sum_{i=1}^N X_{3i}^2}}\geq \pm \frac{\sum_{i=1}^N X_{1i}y_i}{\sqrt{\sum_{i=1}^N X_{1i}^2}}$ and $\frac{\sum_{i=1}^N X_{3i}y_i}{\sqrt{\sum_{i=1}^N X_{3i}^2}}\geq \pm \frac{\sum_{i=1}^N X_{2i}y_i}{\sqrt{\sum_{i=1}^N X_{2i}^2}}$"

"$\hat{\beta}_3 = \frac{\sum_{i=1}^N X_{3i}y_i}{\sum_{i=1}^N X_{3i}^2}$ is negative iff $\frac{-\sum_{i=1}^N X_{3i}y_i}{\sqrt{\sum_{i=1}^N X_{3i}^2}}\geq \pm \frac{\sum_{i=1}^N X_{1i}y_i}{\sqrt{\sum_{i=1}^N X_{1i}^2}}$ and $\frac{-\sum_{i=1}^N X_{3i}y_i}{\sqrt{\sum_{i=1}^N X_{3i}^2}}\geq \pm \frac{\sum_{i=1}^N X_{2i}y_i}{\sqrt{\sum_{i=1}^N X_{2i}^2}}$"

But why are these two conditions true?

Adrian

1 Answer


Those conditions are only true for the first step in the procedure.

In the first step, there is only going to be a single variable in the model.

The variable that results in the lowest residual sum of squares is the one with the largest absolute correlation with the response $y$, i.e. the one maximizing $|X_j^Ty|/||X_j||_2$.
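
To see why, plug the single-variable OLS estimate $\hat{\beta}_j = X_j^Ty/||X_j||_2^2$ into the residual sum of squares:

$$||y - \hat{\beta}_j X_j||_2^2 = ||y||_2^2 - 2\hat{\beta}_j X_j^Ty + \hat{\beta}_j^2 ||X_j||_2^2 = ||y||_2^2 - \frac{(X_j^Ty)^2}{||X_j||_2^2}.$$

Since $||y||_2^2$ does not depend on $j$, minimizing the RSS over $j$ is the same as maximizing $(X_j^Ty)^2/||X_j||_2^2$, i.e. maximizing $|X_j^Ty|/||X_j||_2$. The condition "$X_3^Ty/||X_3||_2 \geq \pm X_j^Ty/||X_j||_2$" is shorthand for $X_3^Ty/||X_3||_2 \geq |X_j^Ty|/||X_j||_2$, and a quantity that dominates an absolute value is itself nonnegative, which is why the selected coefficient $\hat{\beta}_3 = X_3^Ty/||X_3||_2^2$ is positive (for a negative coefficient, the same conditions hold with $-X_3^Ty$ on the left).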

The first step in forward stepwise regression selects the same variable as LASSO. You may find the graphical explanation of this first step in this question helpful: What is the smallest $\lambda$ that gives a 0 component in lasso?
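
(For reference, and assuming the usual parametrization $\tfrac{1}{2}||y - X\beta||_2^2 + \lambda||\beta||_1$, which the linked post may state differently: the first coefficient to become nonzero as $\lambda$ decreases from $\lambda_{\max} = \max_j |X_j^Ty|$ is the one attaining that maximum, so with columns standardized to equal norm it coincides with the forward stepwise choice above.)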

Sextus Empiricus
  • Can you explain how choosing the lowest residual sum of squares is equivalent to choosing the variable with the highest correlation with the output $y$? And why does the sign matter? – Adrian Jan 19 '21 at 01:36
  • I think what I don't understand is how "the 3rd variable minimizes the residual sum of squares" translates to $X_3^Ty/||X_3||_2 \geq \pm X_1^Ty/||X_1||_2$ and $X_3^Ty/||X_3||_2 \geq \pm X_2^Ty/||X_2||_2$ – Adrian Jan 19 '21 at 02:06
  • @Adrian I hope that this downvote was not yours. I was working on improving my answer. – Sextus Empiricus Jan 20 '21 at 15:57