
It's intuitive, but I'm having a hard time proving it mathematically.

The claim is that the non-intercept entries of

\begin{align} \hat{\beta} = (X^TX)^{-1}X^Ty \quad\text{and}\quad \hat{\beta}_L = (L^TL)^{-1}L^Ty \end{align}

coincide, where $L$ is $X$ with every column except the column of ones centered.

\begin{align} L &= X - C \\ C &= \begin{bmatrix} 0_n & \mu_2 1_n & \ldots & \mu_p 1_n \end{bmatrix} \\ \mu_i &= \frac{1}{n}\sum_{j=1}^n X_{ji} \\ L^TL &= (X - C)^T(X-C) \\ &= (X^T - C^T)(X - C) \\ &= X^TX - X^TC - C^TX + C^TC \end{align}

I stopped here because I know this is going to get very messy. Is there a simpler mathematical approach to showing this?
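
I can verify the claim numerically, which is what makes me confident it holds in general. Here is a minimal NumPy sketch (arbitrary simulated data, not a proof): the non-intercept coefficients agree, while the intercept generally does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                                                 # arbitrary sizes
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # column of ones + predictors
y = rng.normal(size=n)

# L: same design, but every non-intercept column is centered
L = X.copy()
L[:, 1:] -= L[:, 1:].mean(axis=0)

beta_X, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_L, *_ = np.linalg.lstsq(L, y, rcond=None)

print(np.allclose(beta_X[1:], beta_L[1:]))    # True: slopes unchanged
print(np.isclose(beta_X[0], beta_L[0]))       # generally False: intercepts differ
```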

student010101
  • Are you asking about whether it affects $\hat\beta$ or $X\hat\beta$? – user257566 Apr 09 '21 at 15:36
  • @user257566 I'm asking for a mathematical proof of why it doesn't affect $\hat{\beta}$ . Isn't that clear from the title? If not I can change it. – student010101 Apr 09 '21 at 15:49
  • Apologies, I thought that it wasn't true for $\hat\beta$ but was true for the prediction. A quick way to show is just by rewriting the expressions, shown below. – user257566 Apr 09 '21 at 16:19
  • But it *will*. The design matrix $X$ includes a column for the intercept term, hence the first entry of $\hat{\beta}$ will be the intercept. – AdamO Apr 09 '21 at 16:35
  • @AdamO I edited the OP to reflect that I'm only referring to the *non-intercept* terms in $\hat{\beta}$. – student010101 Apr 09 '21 at 16:37

2 Answers


Notation

Let $X$ denote the design matrix without the intercept column, and let $C = I - 1 1^T/n$ be the centering matrix, so that your $L = CX$ (for the non-intercept columns). Further, define the original OLS estimators $$(\hat\beta_0, \hat\beta) = \arg\min_{\beta_0,\beta}\|y-\beta_0 1 - X\beta\|^2$$ and the "new" OLS estimators as $$(\tilde\beta_0, \tilde\beta) = \arg\min_{\beta_0,\beta}\|y-\beta_0 1 - CX\beta\|^2.$$ Our task is to show that $\tilde\beta = \hat\beta$.

Core calculation

Notice that \begin{align} \|y-\beta_0 1 - CX\beta\|^2 = & \|y-\beta_0 1 - X\beta + \left(1 1^T/n\right)X\beta\|^2 \\ = & \|y- \left( \beta_0 - \frac{1^T X \beta}{n}\right) 1 - X\beta \|^2 \end{align} since $\left(1 1^T/n\right)X\beta = \left(1^T X \beta/n\right) 1$ is a constant vector, i.e. an intercept-like term. At this point the answer can be read off, but more details are below.
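
If it helps, the rewriting above can be checked numerically. A NumPy sketch (the simulated data and the particular $(\beta_0, \beta)$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 2
X = rng.normal(size=(n, p))              # predictors only (no intercept column)
y = rng.normal(size=n)
one = np.ones(n)
C = np.eye(n) - np.outer(one, one) / n   # centering matrix

b0, b = 0.7, rng.normal(size=p)          # an arbitrary candidate (beta_0, beta)

lhs = np.sum((y - b0 * one - C @ X @ b) ** 2)
rhs = np.sum((y - (b0 - one @ X @ b / n) * one - X @ b) ** 2)
print(np.isclose(lhs, rhs))              # True: same objective after shifting the intercept
```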

Details for the curious

Therefore, by definition of $\tilde\beta_0$ and $\tilde\beta$, it follows that \begin{align} \|y- \left( \tilde\beta_0 - \frac{1^T X \tilde\beta}{n}\right) 1 - X\tilde\beta \|^2 & \leq \|y- \left( \beta_0 - \frac{1^T X \beta}{n}\right) 1 - X\beta \|^2 \, \text{for all } \beta_0,\beta. \end{align} Further, since for any $\beta$ we can choose $\beta_0$ so that $\beta_0 - \frac{1^T X \beta}{n}$ equals any number, we can re-express the intercept on the RHS so that \begin{align} \|y- \left( \tilde\beta_0 - \frac{1^T X \tilde\beta}{n}\right) 1 - X\tilde\beta \|^2 & \leq \|y- \beta_0 1 - X\beta \|^2 \, \text{for all } \beta_0,\beta, \end{align} which shows that $\tilde\beta_0 - \frac{1^T X \tilde\beta}{n}$ and $\tilde\beta$ are OLS estimators for the intercept and slopes, respectively, of the original problem. In the case that the augmented design matrix $[1 \, X]$ has full column rank, the OLS estimator is unique, so $\hat\beta_0 = \tilde\beta_0 - \frac{1^T X \tilde\beta}{n}$ and $\hat\beta = \tilde\beta$.
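
To see this conclusion concretely, here is the same kind of numerical check (arbitrary simulated data) for the estimators themselves: the slopes agree and the intercepts differ by exactly $1^TX\tilde\beta/n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 60, 3
X = rng.normal(size=(n, p))              # predictors without the intercept column
y = rng.normal(size=n)
one = np.ones(n)
C = np.eye(n) - np.outer(one, one) / n   # centering matrix

orig_fit, *_ = np.linalg.lstsq(np.column_stack([one, X]), y, rcond=None)
new_fit,  *_ = np.linalg.lstsq(np.column_stack([one, C @ X]), y, rcond=None)

beta0_hat, beta_hat = orig_fit[0], orig_fit[1:]
beta0_til, beta_til = new_fit[0],  new_fit[1:]

print(np.allclose(beta_hat, beta_til))                            # slopes agree
print(np.isclose(beta0_hat, beta0_til - one @ X @ beta_til / n))  # intercept relation
```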

user257566
  • This is great! I didn't think about starting from the original objective, but it's pretty clear when using this approach. Do you know a simple way to manipulate the analytical expression $\hat{\beta} = (X^TX)^{-1}X^Ty$ to show the same thing? – student010101 Apr 09 '21 at 16:37
  • I'm unfortunately not sure. I think you'd have to do a blockwise decomposition of the inverse gram matrix $(X^TX)^{-1}$. It's more complicated IMO. – user257566 Apr 09 '21 at 16:48
  • Is it readily observed that $\hat{\beta}_0 = \tilde{\beta}_0 - \frac{1^TX\tilde{\beta}}{n}$ and $\hat{\beta} = \tilde{\beta}$ without differentiating, or did you just skip that part of the derivation? – user5965026 Apr 09 '21 at 16:56
  • @user5965026 It's just from arithmetic (rewriting the last summand), I've included an edit which says that. Please let me know if it's not more clear. – user257566 Apr 09 '21 at 18:04
  • Right, I understand that (not sure if we're referring to the same thing here). What I was asking about is how do you tell that $\tilde{\beta} = \hat{\beta}$ from simply looking at the cost functions? Wouldn't you have to take partial derivatives, and find expressions for $\tilde{\beta}$ and $\hat{\beta}$ and see that they're identical? – user5965026 Apr 09 '21 at 18:10
  • @user5965026 Oh, I see. Thanks for elaborating. I basically did skip it, but you don't need to take derivatives. The key idea is that for any $\beta$ there exists some $\beta_0$ so that the intercept in the last line can equal anything. I'll add a sentence or two in a few hours writing that mathematically. – user257566 Apr 09 '21 at 18:12
  • Just saw your updated answer. Why do you refer to $(11^T/n)X\beta$ as an "intercept" term? – student010101 Apr 09 '21 at 19:40
  • @student010101 Intercepts have the property that they have a constant contribution to the mean for each observation, i.e. don't depend on covariates. Because that term is some number (i.e. $1^T X \beta/n$) times the vectors of 1s, it too is constant for each observation. – user257566 Apr 09 '21 at 20:14
  • @student010101 I saw you mentioned the analytical expression, so I added an auxiliary (lengthy) answer doing that. It's based on simple Algebra, but block matrix inversion is necessary. – Firebug Apr 09 '21 at 20:28

The claim (for the non-intercept coefficients):

$$\hat{\beta} = (X^TX)^{-1}X^Ty = (L^TL)^{-1}L^Ty$$

We have that both $X$ and $L$ are concatenations of a column of ones and the rest of the predictors:

$$\begin{cases} X=\matrix{[\mathbf {1}_n & X^*]}\\ L=\matrix{[\mathbf {1}_n & L^*]} \end{cases}$$

$L^*$ is given in terms of $X^*$:

$$L^*=X^*-\frac{\mathbf {1_n1_n}^T}{n} X^*=\overbrace{\left(\mathbb I_n - \frac{\mathbf {1_n1_n}^T}{n}\right)}^CX^*=CX^*$$

So

$$L=\matrix{[\mathbf {1}_n & CX^*]}$$

$$ (L^TL)^{-1}= \left(\matrix{\left[\matrix{\mathbf {1}_n^T \\ X^{*T}C}\right]}\matrix{[\mathbf {1}_n & CX^*]}\right)^{-1}\\ =\left(\matrix{ \mathbf {1}_n^T\mathbf {1}_n & \mathbf {1}_n^TCX^*\\ X^{*T}C\mathbf {1}_n & X^{*T}C^2X^*}\right)^{-1} $$

Notice however that $\mathbf {1}_n^TCX^* = \mathbf 0_p^T$, a row matrix of zeros. This is easy to see, because multiplying by the row matrix of ones results in the column-wise sums of a matrix (we used this fact to build $C$), but each column in $CX^*$ sums to 0. Also, notice that $\mathbf {1}_n^T\mathbf {1}_n = n$. And lastly $C^2 = C$ (i.e., centering a matrix twice has no additional effect): $$C^2 = \left(\mathbb I_n - \frac{\mathbf {1_n1_n}^T}{n}\right)^2\\ =\left(\mathbb I_n - \frac{\mathbf {1_n1_n}^T}{n}\right)\left(\mathbb I_n - \frac{\mathbf {1_n1_n}^T}{n}\right)\\ =C - \frac{\mathbf {1_n1_n}^T}{n} + \frac{\mathbf {1_n}\color{red}{\mathbf {1_n}^T\mathbf {1_n}}\mathbf {1_n}^T}{\color{red}{n}\cdot n}\\ =C$$ Putting these together:

$$ (L^TL)^{-1} =\left(\matrix{ n & \mathbf 0_p^T\\ \mathbf 0_p & X^{*T}CX^*}\right)^{-1} $$
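
These identities, and the resulting block-diagonal structure of $L^TL$, are easy to confirm numerically. A small NumPy sketch with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 2
Xstar = rng.normal(size=(n, p))          # the non-intercept predictors X*
one = np.ones(n)
C = np.eye(n) - np.outer(one, one) / n   # centering matrix

print(np.allclose(one @ C @ Xstar, 0))   # 1ᵀ C X* = 0: centered columns sum to zero
print(np.allclose(C @ C, C))             # C² = C: centering twice changes nothing

L = np.column_stack([one, C @ Xstar])
block_diag = np.block([[np.array([[n]]),  np.zeros((1, p))],
                       [np.zeros((p, 1)), Xstar.T @ C @ Xstar]])
print(np.allclose(L.T @ L, block_diag))  # LᵀL is block diagonal
```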

Since the off-diagonal blocks vanish, $L^TL$ is block diagonal, and a block-diagonal matrix inverts block by block:

$$\left(\matrix{ n & 0 \\ 0 & B} \right)^{-1}= \left(\matrix{ n^{-1} & 0 \\ 0 & B^{-1}} \right) $$

Substituting $B = X^{*T}CX^*$:

$$ (L^TL)^{-1} =\left(\matrix{ n^{-1} & \mathbf 0_p^T\\ \mathbf 0_p & (X^{*T}CX^*)^{-1}}\right) $$

Our coefficients become:

$$\hat{\beta_L} = (L^TL)^{-1}L^Ty\\ =\left(\matrix{ n^{-1} & \mathbf 0_p^T\\ \mathbf 0_p & (X^{*T}CX^*)^{-1}}\right) \left[\matrix{\mathbf {1}_n^T \\ X^{*T}C}\right]y\\ = \left[\matrix{n^{-1}\mathbf {1}_n^T \\ (X^{*T}CX^*)^{-1}X^{*T}C}\right]y $$
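
As a sanity check, this closed form for $\hat{\beta_L}$ matches a direct least-squares fit on $[\mathbf{1}_n \; CX^*]$ (NumPy sketch, arbitrary data):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 2
Xstar = rng.normal(size=(n, p))
y = rng.normal(size=n)
one = np.ones(n)
C = np.eye(n) - np.outer(one, one) / n
L = np.column_stack([one, C @ Xstar])

closed_form = np.concatenate([
    [one @ y / n],                                          # intercept block: mean of y
    np.linalg.solve(Xstar.T @ C @ Xstar, Xstar.T @ C @ y),  # slope block
])
lstsq_fit, *_ = np.linalg.lstsq(L, y, rcond=None)
print(np.allclose(closed_form, lstsq_fit))                  # True
```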

Similarly for $X$:

$$ (X^TX)^{-1}= \left(\matrix{\left[\matrix{\mathbf {1}_n^T \\ X^{*T}}\right]}\matrix{[\mathbf {1}_n & X^*]}\right)^{-1}\\ =\left(\matrix{ n & \mathbf {1}_n^TX^*\\ X^{*T}\mathbf {1}_n & X^{*T}X^* }\right)^{-1} $$

Block inversion (via the Schur complement of the top-left entry) leads us to

$$ (X^TX)^{-1}= \left(\matrix{ A & B \\ C & D }\right) $$

where the blocks are as follows (note that this block $C$ is unrelated to the centering matrix $C$):

$$\begin{cases} A = n^{-1}+n^{-2} \mathbf {1}_n^TX^*\color{red}{(X^{*T}X^*-n^{-1}X^{*T}\mathbf {1}_n\mathbf {1}_n^TX^*)}^{-1}X^{*T}\mathbf {1}_n\\ B = -n^{-1}\mathbf {1}_n^TX^*\color{red}{(X^{*T}X^*-n^{-1}X^{*T}\mathbf {1}_n\mathbf {1}_n^TX^*)}^{-1}\\ C = -n^{-1}\color{red}{(X^{*T}X^*-n^{-1}X^{*T}\mathbf {1}_n\mathbf {1}_n^TX^*)}^{-1}X^{*T}\mathbf {1}_n\\ D = \color{red}{(X^{*T}X^*-n^{-1}X^{*T}\mathbf {1}_n\mathbf {1}_n^TX^* )}^{-1} \end{cases}$$

This isn't necessarily the nightmare it appears to be. Notice the terms in red: they repeat. And the part in blue below is exactly the centering matrix $C$!

$$\left(X^{*T}X^*-X^{*T}\frac{\mathbf {1}_n\mathbf {1}_n^T}{n}X^* \right)\\ =X^{*T}\left(\color{blue}{\mathbb I_n-\frac{\mathbf {1}_n\mathbf {1}_n^T}{n} }\right)X^*=X^{*T}CX^*$$

Substituting it back into $A,B,C,D$ so we don't lose track:

$$\begin{cases} A = n^{-1}+n^{-2} \mathbf {1}_n^TX^*(X^{*T}CX^*)^{-1}X^{*T}\mathbf {1}_n\\ B = -n^{-1}\mathbf {1}_n^TX^*(X^{*T}CX^*)^{-1}\\ C = -n^{-1}(X^{*T}CX^*)^{-1}X^{*T}\mathbf {1}_n\\ D = (X^{*T}CX^*)^{-1} \end{cases}$$

The coefficients are then:

$$\hat{\beta_X} = (X^TX)^{-1}X^Ty\\ =\left(\matrix{ n^{-1}+n^{-2} \mathbf {1}_n^TX^*(X^{*T}CX^*)^{-1}X^{*T}\mathbf {1}_n & -n^{-1}\mathbf {1}_n^TX^*(X^{*T}CX^*)^{-1} \\ -n^{-1}(X^{*T}CX^*)^{-1}X^{*T}\mathbf {1}_n & (X^{*T}CX^*)^{-1} }\right) \left[\matrix{\mathbf {1}_n^T \\ X^{*T}}\right]y \\= \left[\matrix{ (n^{-1}+n^{-2} \mathbf {1}_n^TX^*(X^{*T}CX^*)^{-1}X^{*T}\mathbf {1}_n)\mathbf {1}_n^T - n^{-1}(\mathbf {1}_n^TX^*(X^{*T}CX^*)^{-1}X^{*T}) \\ -n^{-1}(X^{*T}CX^*)^{-1}X^{*T}\mathbf {1}_n\mathbf {1}_n^T + (X^{*T}CX^*)^{-1}X^{*T} }\right]y \\= \left[\matrix{ n^{-1}\mathbf {1}_n^T+n^{-1} (\color{green}{\mathbf {1}_n^TX^*(X^{*T}CX^*)^{-1}X^{*T}})\frac{\mathbf {1}_n\mathbf {1}_n^T}{n} - n^{-1}(\color{green}{\mathbf {1}_n^TX^*(X^{*T}CX^*)^{-1}X^{*T}}) \\ (X^{*T}CX^*)^{-1}X^{*T}\left(\color{blue}{\mathbb I_n - \frac{\mathbf {1}_n\mathbf {1}_n^T}{n}}\right) }\right]y \\= \left[\matrix{ n^{-1}\mathbf {1}_n^T - n^{-1} (\mathbf {1}_n^TX^*(X^{*T}CX^*)^{-1}X^{*T})\left(\color{blue}{\mathbb I_n - \frac{\mathbf {1}_n\mathbf {1}_n^T}{n}}\right) \\ (X^{*T}CX^*)^{-1}X^{*T}C }\right]y \\= \left[\matrix{ n^{-1}\mathbf {1}_n^T - n^{-1} (\mathbf {1}_n^TX^*(X^{*T}CX^*)^{-1}X^{*T})C \\ (X^{*T}CX^*)^{-1}X^{*T}C }\right]y$$

Now compare:

$$\begin{matrix} \hat{\beta_L} = \left[\matrix{n^{-1}\mathbf {1}_n^T \\ (X^{*T}CX^*)^{-1}X^{*T}C}\right]y & \hat{\beta_X} = \left[\matrix{ n^{-1}\mathbf {1}_n^T - n^{-1} (\mathbf {1}_n^TX^*(X^{*T}CX^*)^{-1}X^{*T})C \\ (X^{*T}CX^*)^{-1}X^{*T}C }\right]y \end{matrix}$$

The slope blocks are identical, $(X^{*T}CX^*)^{-1}X^{*T}Cy$, so centering changes only the intercept: it equals $\bar y = n^{-1}\mathbf{1}_n^Ty$ for the centered design, versus $\bar y - n^{-1}\mathbf{1}_n^TX^*\hat\beta^*$ for the original one, where $\hat\beta^* = (X^{*T}CX^*)^{-1}X^{*T}Cy$ is the common slope vector.
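
Both closed forms can be verified against direct least-squares fits; a NumPy sketch with arbitrary data, which also confirms that only the intercept entry differs:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 2
Xstar = rng.normal(size=(n, p))
y = rng.normal(size=n)
one = np.ones(n)
C = np.eye(n) - np.outer(one, one) / n
S_inv = np.linalg.inv(Xstar.T @ C @ Xstar)           # (X*ᵀ C X*)⁻¹

slopes = S_inv @ Xstar.T @ C @ y                     # common slope block
beta_L = np.concatenate([[one @ y / n], slopes])
beta_X = np.concatenate([[one @ y / n - one @ Xstar @ slopes / n], slopes])

fit_X = np.linalg.lstsq(np.column_stack([one, Xstar]), y, rcond=None)[0]
fit_L = np.linalg.lstsq(np.column_stack([one, C @ Xstar]), y, rcond=None)[0]
print(np.allclose(beta_X, fit_X))                    # True: closed form for uncentered design
print(np.allclose(beta_L, fit_L))                    # True: closed form for centered design
print(np.allclose(beta_X[1:], beta_L[1:]))           # slopes agree
```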

Firebug