
Let's assume the general linear model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon$, where $\mathbf{y} \in \mathbb{R}^N$, $\mathbf{X}$ is an $N \times (p+1)$ matrix (with $p+1 < N$) with all entries in $\mathbb{R}$, $\boldsymbol\beta \in \mathbb{R}^{p+1}$, and $\boldsymbol\epsilon$ is an $N$-dimensional vector of real-valued random variables with $\mathbb{E}[\boldsymbol\epsilon] = \mathbf{0}_{N \times 1}$.

In the development of ridge regression, Introduction to Statistical Learning (p. 215) and Elements of Statistical Learning (p. 64) mention that $\beta_0$ is estimated by $\bar{y} = \dfrac{1}{N}\sum_{i=1}^{N}y_i$ after the columns of $\mathbf{X}$ have been centered, and that each component of $\mathbf{y}$ is then centered using $\bar{y}$ prior to performing ridge regression.

Under OLS estimation, $$\hat{\boldsymbol\beta}_{\mathbf{X}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\text{.}$$
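
As a quick numerical aside (synthetic $\mathbf{X}$ and $\mathbf{y}$ in NumPy, seed and sizes purely illustrative; just a sanity check, not part of the argument), the closed-form estimator above agrees with a generic least-squares solver:

```python
# Check that (X^T X)^{-1} X^T y matches a generic least-squares solve.
import numpy as np

rng = np.random.default_rng(42)
N, p = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # design with intercept column
y = rng.normal(size=N)

beta_closed_form = np.linalg.solve(X.T @ X, X.T @ y)        # (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_closed_form, beta_lstsq))            # True
```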

It can be shown that the matrix $$\tilde{\mathbf{X}} = \left(\mathbf{I}_{N \times N}-\dfrac{1}{N}\mathbf{1}_{N \times N}\right)\mathbf{X}$$ centers the columns of $\mathbf{X}$, where $\mathbf{1}_{N \times N}$ is the $N \times N$ matrix of all $1$s, and $\mathbf{I}_{N \times N}$ is the $N \times N$ identity matrix.
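
Here is a minimal NumPy sketch (the random matrix and seed are purely illustrative) checking that this left-multiplication does center the columns:

```python
# Verify that left-multiplying X by I - (1/N) * ones_{NxN} centers each column of X.
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p + 1))               # generic matrix for the check

C = np.eye(N) - np.ones((N, N)) / N           # the centering matrix
X_tilde = C @ X                               # left-multiplication

print(np.allclose(X_tilde.mean(axis=0), 0.0))      # column means are all 0
print(np.allclose(X_tilde, X - X.mean(axis=0)))    # same as subtracting column means
```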

I am interested in showing that $\hat{\beta}_0$ (i.e., the first component of $\hat{\boldsymbol\beta}$) is equal to $\bar{y}$ under these assumptions. I thought a previous question might help, but that question deals with the case where $\mathbf{X}$ is right-multiplied by the centering matrix, rather than left-multiplied.

Using the above, $$\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} = \mathbf{X}^{T} \left(\mathbf{I}_{N \times N}-\dfrac{1}{N}\mathbf{1}_{N \times N}\right)\mathbf{X}$$

since the matrix $\mathbf{I}_{N \times N}-\dfrac{1}{N}\mathbf{1}_{N \times N}$ is symmetric and idempotent.
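
A small numerical check of this identity (again with an arbitrary matrix, just for illustration):

```python
# Check that C = I - (1/N) * ones is symmetric and idempotent,
# so (CX)^T (CX) = X^T C^T C X = X^T C X.
import numpy as np

rng = np.random.default_rng(1)
N = 40
X = rng.normal(size=(N, 4))
C = np.eye(N) - np.ones((N, N)) / N

print(np.allclose(C, C.T))                              # symmetric
print(np.allclose(C @ C, C))                            # idempotent
print(np.allclose((C @ X).T @ (C @ X), X.T @ C @ X))    # the identity used above
```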

Let's suppose that $$\mathbf{X} = \begin{bmatrix} \mathbf{1}_{N \times 1} & \mathbf{x}_1 & \cdots & \mathbf{x}_p \end{bmatrix}$$ so that $$\mathbf{X}^{T} = \begin{bmatrix} \mathbf{1}_{N \times 1}^{T} \\ \mathbf{x}_1^{T} \\ \vdots \\ \mathbf{x}_p^{T} \end{bmatrix}\text{.}$$ We also have $$\mathbf{I}_{N \times N}-\dfrac{1}{N}\mathbf{1}_{N \times N} = \begin{bmatrix} 1-\frac{1}{N} & -\frac{1}{N} & \cdots & -\frac{1}{N} \\ -\frac{1}{N} & 1 - \frac{1}{N} & \ddots & -\frac{1}{N} \\ \vdots & \ddots & \ddots & -\frac{1}{N} \\ -\frac{1}{N} & \cdots & -\frac{1}{N} & 1 - \frac{1}{N} \end{bmatrix} $$

Once I started working through the multiplication and thinking about computing an inverse, I hit what seems like a dead end. Any suggestions?

Clarinetist
  • Not a duplicate, but the answer is here: https://stats.stackexchange.com/questions/220566/things-that-i-am-not-sure-about-lasso-regression-method/220573#220573 – Matthew Drury Nov 02 '17 at 17:21
  • Your last matrix is a projection orthogonal to the constant vector and therefore is not invertible. But why complicate things? You know the constant term will be the mean of $y$ without any calculation at all: since $\mathbf{1}=(1,1,\ldots,1)$ is orthogonal to all the centered $x_i$ (that's what centering does!), the constant term must equal the coefficient in the regression of $y$ against $\mathbf{1}$ alone. – whuber Nov 02 '17 at 17:55
  • I demonstrated it in my answer to [Why does the y-intercept of a linear model disappear when I standardize variables?](https://stats.stackexchange.com/questions/43036/why-does-the-y-intercept-of-a-linear-model-disappear-when-i-standardize-variable/243856#243856), though that question also implied centering of the DV, so the demonstration there goes one line beyond what is being asked here. – Firebug Nov 02 '17 at 19:35

2 Answers


By assumption, your design matrix $X$ can be partitioned as $$X = \begin{bmatrix} 1 & \tilde{X} \end{bmatrix},$$ where $\tilde{X} \in \mathbb{R}^{N \times p}$ satisfies $1^T\tilde{X} = \mathbf{0}^T$ due to the centering. Partition $\beta$ into $[\beta_0, \tilde{\beta}^T]^T$ accordingly. Straightforward calculation shows \begin{align*} \hat{\beta} & = \begin{bmatrix}\hat{\beta}_0 \\ \hat{\tilde{\beta}}\end{bmatrix} \\ & = (X^TX)^{-1}X^Ty \\ & = \begin{bmatrix} 1^T1 & 1^T\tilde{X} \\ \tilde{X}^T1 & \tilde{X}^T\tilde{X} \end{bmatrix}^{-1} \begin{bmatrix}1^T \\ \tilde{X}^T\end{bmatrix} y \\ & = \begin{bmatrix}N^{-1} & 0^T \\ 0 & (\tilde{X}^T\tilde{X})^{-1}\end{bmatrix} \begin{bmatrix}1^Ty \\ \tilde{X}^Ty\end{bmatrix} \\ & = \begin{bmatrix} \bar{y} \\ (\tilde{X}^T\tilde{X})^{-1}\tilde{X}^Ty \end{bmatrix}. \end{align*} The step from the third to the fourth line uses $1^T1 = N$ and $1^T\tilde{X} = \mathbf{0}^T$: the off-diagonal blocks of $X^TX$ vanish, so it is block diagonal and can be inverted block by block.
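
A quick numerical illustration of this result (synthetic data in NumPy; the variable names, sizes, and seed are only for the check):

```python
# With centered predictors plus an intercept column, the OLS intercept equals mean(y),
# and the slopes agree with (X_tilde^T X_tilde)^{-1} X_tilde^T y.
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 4
X_raw = rng.normal(loc=5.0, scale=2.0, size=(N, p))
X_tilde = X_raw - X_raw.mean(axis=0)              # center each predictor column
X = np.column_stack([np.ones(N), X_tilde])        # design matrix [1, X_tilde]
y = 3.0 + X_raw @ rng.normal(size=p) + rng.normal(size=N)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.isclose(beta_hat[0], y.mean()))          # True: intercept estimate equals ybar

slopes = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)
print(np.allclose(beta_hat[1:], slopes))          # True: matches the block formula
```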

Zhanxiong

Let $\tilde{X}$ be an $N \times p$ matrix with $N$ observations and $p$ features:

$$\mathbf{X} = (1 \quad\tilde{X})= \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix} $$

Given that the columns of $\tilde{X}$ are centred, we have:

$$\sum_{i=1}^{N}x_{ij} = 0 \quad \forall\, j\in \{1,\ldots,p\} $$

$$\mathbf{X}^T\mathbf{X}= \begin{bmatrix} N &0^T \\ 0 & \Sigma\end{bmatrix}$$

where $\Sigma = \tilde{X}^T\tilde{X}$, which, because the columns of $\tilde{X}$ are centred, is $N$ times the matrix of pairwise (sample) covariances between the columns of $\tilde{X}$.

$$\hat{\boldsymbol\beta}=(\mathbf{X}^T\mathbf{X})^{-1}(\mathbf{X}^TY) = \begin{bmatrix} \frac{1}{N} &0^T \\ 0 & \Sigma^{-1}\end{bmatrix}\begin{bmatrix}N\overline{Y} \\ \tilde{X}^TY\end{bmatrix} $$

$$\hat{\beta}_0=\frac{1}{N}\cdot N\overline{Y}=\overline{Y} $$
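
A short NumPy check of this block structure, using synthetic centred predictors (names, sizes, and seed are illustrative only):

```python
# X^T X = [[N, 0^T], [0, X_tilde^T X_tilde]] when the predictor columns are centred,
# and the first component of the OLS estimate is Ybar.
import numpy as np

rng = np.random.default_rng(3)
N, p = 60, 3
X_tilde = rng.normal(size=(N, p))
X_tilde -= X_tilde.mean(axis=0)               # enforce sum_i x_ij = 0 for every j
X = np.column_stack([np.ones(N), X_tilde])
Y = rng.normal(size=N)

XtX = X.T @ X
print(np.isclose(XtX[0, 0], N))               # top-left block is N
print(np.allclose(XtX[0, 1:], 0.0))           # off-diagonal blocks vanish
beta_hat = np.linalg.solve(XtX, X.T @ Y)
print(np.isclose(beta_hat[0], Y.mean()))      # intercept estimate equals Ybar
```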

Moss Murderer