One way to understand that (1) and (2) accomplish the same thing is to recognize that adding a column with a single $1$ at row $i$ (which is (2)) enables you to column-reduce the model matrix $X$, zeroing out all its entries in row $i$ (which is (1)). Thus the two approaches are really the same model for the leave-one-out dataset. Since the reparameterization accomplished by the column reduction does not affect the original variables, the coefficients associated with the original variables must be the same.
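If it helps to see this equivalence concretely, here is a minimal numerical sketch (simulated data and arbitrary dimensions, chosen purely for illustration): fit (2) by appending the indicator column, fit (1) by deleting the row, and compare the coefficients of the original variables.

```python
import numpy as np

# Illustrative simulated data; the dimensions and coefficients are arbitrary.
rng = np.random.default_rng(0)
n, p, i = 20, 3, 7                     # sample size, predictors, row to "delete"
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# (1) Ordinary least squares after removing observation i.
beta_loo, *_ = np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(y, i), rcond=None)

# (2) Ordinary least squares on all n observations, with an extra column
#     that is 1 at row i and 0 elsewhere.
u = np.zeros((n, 1)); u[i] = 1.0
beta_aug, *_ = np.linalg.lstsq(np.hstack([X, u]), y, rcond=None)

print(np.allclose(beta_loo, beta_aug[:p]))   # True: same coefficients on X
```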
The calculations are easily done using block matrix notation. To keep the focus on the concepts, suppose that any matrices we wish to invert are, in fact, invertible. Then
$$\hat \beta = (X^\prime X)^{-1} X^\prime y$$
is the original estimate. Without any loss of generality (we may always permute the rows of $X$ and $y$ in parallel), assume $i=n$, so that we're contemplating deleting the last observation. Bordering $X$ on the right by $u_i = u_n$ gives the $n\times (p+1)$ block matrix
$$X_0 = \pmatrix{X & u_n} = \pmatrix{X_{(n)} & 0 \\ 0 & 1} \pmatrix{\mathbb{I}_{p} & 0 \\ x_n & 1} = \pmatrix{X_{(n)} & 0 \\ 0 & 1} W.$$
I have written $X_{(n)}$ for $X$ with its last row removed, $x_n$ for that last row, and $\mathbb{I}_{p}$ for the $p\times p$ identity matrix. (Later on, I will also write $y_{(n)}$ for the response with component $n$ removed.) $W$ is, in effect, a record of the column operations needed to put $X_0$ into the block-diagonal form seen at the right. Its inverse is obtained by negating the $x_n$ block in its bottom left.
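Here is a small sketch of that factorization and of the claimed form of $W^{-1}$, again on made-up numbers:

```python
import numpy as np

# Illustrative check of the block factorization X0 = D W (hypothetical data).
rng = np.random.default_rng(1)
n, p = 6, 2
X = rng.normal(size=(n, p))
X_del, x_n = X[:-1], X[-1:]                    # X_(n) and the last row x_n

u_n = np.zeros((n, 1)); u_n[-1] = 1.0
X0 = np.hstack([X, u_n])                       # X bordered by the indicator column

D = np.block([[X_del, np.zeros((n - 1, 1))],
              [np.zeros((1, p)), np.ones((1, 1))]])    # block-diagonal factor
W = np.block([[np.eye(p), np.zeros((p, 1))],
              [x_n, np.ones((1, 1))]])                 # record of column operations

W_inv = np.block([[np.eye(p), np.zeros((p, 1))],
                  [-x_n, np.ones((1, 1))]])            # negate x_n to invert

print(np.allclose(X0, D @ W))                  # True: X0 = D W
print(np.allclose(W @ W_inv, np.eye(p + 1)))   # True: W^{-1} as claimed
```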
The new estimates based on $X_0$ are
$$\pmatrix{\hat\beta_0 \\ \hat\theta} = (X_0^\prime X_0)^{-1} X_0^\prime y.$$
Substituting $\pmatrix{X_{(n)} & 0 \\ 0 & 1} W$ for $X_0$ throughout, the factors $(W^\prime)^{-1}$ and $W^\prime$ cancel, leaving $W^{-1}$ times the least-squares operator of the block-diagonal factor:
$$\pmatrix{\hat\beta_0 \\ \hat\theta} = W^{-1}\pmatrix{(X_{(n)}^\prime X_{(n)})^{-1}X_{(n)}^\prime & 0 \\ 0 & 1} y = \pmatrix{(X_{(n)}^\prime X_{(n)})^{-1}X_{(n)}^\prime & 0 \\ -x_n (X_{(n)}^\prime X_{(n)})^{-1}X_{(n)}^\prime & 1} y.$$
By comparing the first $p$ coefficients, and noting that the zero block in the top row annihilates the last component $y_n$ of $y$, it is immediate that
$$\hat \beta_0 = (X_{(n)}^\prime X_{(n)})^{-1}X_{(n)}^\prime y_{(n)} = \hat \beta_{(n)},$$
the estimate obtained upon removing the last observation, QED.
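As a sanity check, one can confirm numerically that the block operator displayed above coincides with $(X_0^\prime X_0)^{-1}X_0^\prime$ computed directly (simulated data, purely illustrative):

```python
import numpy as np

# Illustrative check that the explicit block operator equals (X0' X0)^{-1} X0'.
rng = np.random.default_rng(2)
n, p = 10, 3
X = rng.normal(size=(n, p))
X_del, x_n = X[:-1], X[-1:]                    # X_(n) and x_n (as a row)

u_n = np.zeros((n, 1)); u_n[-1] = 1.0
X0 = np.hstack([X, u_n])

A = np.linalg.solve(X_del.T @ X_del, X_del.T)  # (X_(n)' X_(n))^{-1} X_(n)'
block = np.block([[A, np.zeros((p, 1))],
                  [-x_n @ A, np.ones((1, 1))]])

direct = np.linalg.solve(X0.T @ X0, X0.T)      # (X0' X0)^{-1} X0'
print(np.allclose(block, direct))              # True
```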
Incidentally, since $x_n (X_{(n)}^\prime X_{(n)})^{-1}X_{(n)}^\prime y_{(n)} = x_n \hat \beta_{(n)} = \hat y_n$ is the prediction for $x_n$ based on all the other data,
$$\hat \theta = y_n - x_n\hat\beta_{(n)} = y_n - \hat y_n = e_n,$$
the residual from the fit that omits observation $n$. This is intuitively clear: the new variable gives the model the flexibility to fit $y_n$ exactly, which it does by absorbing that leave-one-out residual into its coefficient.
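A final numerical sketch, again on simulated data, confirming that the coefficient of the indicator column is exactly this leave-one-out residual:

```python
import numpy as np

# Illustrative check: the dummy-variable coefficient equals y_n - x_n beta_(n).
rng = np.random.default_rng(3)
n, p = 15, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

u_n = np.zeros((n, 1)); u_n[-1] = 1.0
coef, *_ = np.linalg.lstsq(np.hstack([X, u_n]), y, rcond=None)
theta_hat = coef[-1]                                  # coefficient of the indicator

beta_loo, *_ = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)
e_n = y[-1] - X[-1] @ beta_loo                        # leave-one-out residual

print(np.isclose(theta_hat, e_n))                     # True
```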