One way to understand that (1) and (2) accomplish the same thing is to recognize that adding a column with a single $1$ at row $i$ (which is (2)) enables you to column-reduce the model matrix $X$, zeroing out all its entries in row $i$ (which is (1)). Thus the two approaches are really the same model for the leave-one-out dataset. Since the reparameterization accomplished by the column reduction does not affect the original variables, the coefficients associated with the original variables must be the same.
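If it helps to see this equivalence concretely, here is a minimal numerical sketch (simulated data and arbitrary dimensions, chosen purely for illustration): fit (2) by appending the indicator column, fit (1) by deleting the row, and compare the coefficients of the original variables.

```python
import numpy as np

# Illustrative simulated data; the dimensions and coefficients are arbitrary.
rng = np.random.default_rng(0)
n, p, i = 20, 3, 7                     # sample size, predictors, row to "delete"
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# (1) Ordinary least squares after removing observation i.
beta_loo, *_ = np.linalg.lstsq(np.delete(X, i, axis=0), np.delete(y, i), rcond=None)

# (2) Ordinary least squares on all n observations, with an extra column
#     that is 1 at row i and 0 elsewhere.
u = np.zeros((n, 1)); u[i] = 1.0
beta_aug, *_ = np.linalg.lstsq(np.hstack([X, u]), y, rcond=None)

print(np.allclose(beta_loo, beta_aug[:p]))   # True: same coefficients on X
```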
The calculations are easily done using block matrix notation. To keep the focus on the concepts, suppose that any matrices we wish to invert are, in fact, invertible. Then
$$\hat \beta = (X^\prime X)^{-1} X^\prime y$$
is the original estimate. Without any loss of generality (we may always permute the rows of $X$ and $y$ in parallel), assume $i=n$, so that we're contemplating deleting the last observation. Bordering $X$ on the right by $u_i = u_n$ gives the $n\times (p+1)$ block matrix
$$X_0 = \pmatrix{X & u_n} = \pmatrix{X_{(n)} & 0 \\ 0 & 1} \pmatrix{\mathbb{I}_{p} & 0 \\ x_n & 1} = \pmatrix{X_{(n)} & 0 \\ 0 & 1} W.$$
I have written $X_{(n)}$ for $X$ with its last row removed, $x_n$ for that last row, and $\mathbb{I}_{p}$ for the $p\times p$ identity matrix. (Later on, I will also write $y_{(n)}$ for the response with component $n$ removed.) $W$ is, in effect, a record of the column operations needed to put $X_0$ into the block-diagonal form seen at the right. Its inverse is obtained by negating the $x_n$ block in its bottom left.
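Here is a small sketch of that factorization and of the claimed form of $W^{-1}$, again on made-up numbers:

```python
import numpy as np

# Illustrative check of the block factorization X0 = D W (hypothetical data).
rng = np.random.default_rng(1)
n, p = 6, 2
X = rng.normal(size=(n, p))
X_del, x_n = X[:-1], X[-1:]                    # X_(n) and the last row x_n

u_n = np.zeros((n, 1)); u_n[-1] = 1.0
X0 = np.hstack([X, u_n])                       # X bordered by the indicator column

D = np.block([[X_del, np.zeros((n - 1, 1))],
              [np.zeros((1, p)), np.ones((1, 1))]])    # block-diagonal factor
W = np.block([[np.eye(p), np.zeros((p, 1))],
              [x_n, np.ones((1, 1))]])                 # record of column operations

W_inv = np.block([[np.eye(p), np.zeros((p, 1))],
                  [-x_n, np.ones((1, 1))]])            # negate x_n to invert

print(np.allclose(X0, D @ W))                  # True: X0 = D W
print(np.allclose(W @ W_inv, np.eye(p + 1)))   # True: W^{-1} as claimed
```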
The new estimates based on $X_0$ are
$$\pmatrix{\hat\beta_0 \\ \hat\theta} = (X_0^\prime X_0)^{-1} X_0^\prime y.$$
Substituting $\pmatrix{X_{(n)} & 0 \\ 0 & 1} W$ for $X_0$ throughout, the factors $(W^\prime)^{-1}$ and $W^\prime$ cancel, leaving $W^{-1}$ times the least-squares operator of the block-diagonal factor:
$$\pmatrix{\hat\beta_0 \\ \hat\theta} = W^{-1}\pmatrix{(X_{(n)}^\prime X_{(n)})^{-1}X_{(n)}^\prime & 0 \\ 0 & 1} y = \pmatrix{(X_{(n)}^\prime X_{(n)})^{-1}X_{(n)}^\prime & 0 \\ -x_n (X_{(n)}^\prime X_{(n)})^{-1}X_{(n)}^\prime & 1} y.$$
By comparing the first $p$ coefficients, and noting that the zero block in the top row annihilates the last component $y_n$ of $y$, it is immediate that
$$\hat \beta_0 = (X_{(n)}^\prime X_{(n)})^{-1}X_{(n)}^\prime y_{(n)} = \hat \beta_{(n)},$$
the estimate obtained upon removing the last observation, QED.
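As a sanity check, one can confirm numerically that the block operator displayed above coincides with $(X_0^\prime X_0)^{-1}X_0^\prime$ computed directly (simulated data, purely illustrative):

```python
import numpy as np

# Illustrative check that the explicit block operator equals (X0' X0)^{-1} X0'.
rng = np.random.default_rng(2)
n, p = 10, 3
X = rng.normal(size=(n, p))
X_del, x_n = X[:-1], X[-1:]                    # X_(n) and x_n (as a row)

u_n = np.zeros((n, 1)); u_n[-1] = 1.0
X0 = np.hstack([X, u_n])

A = np.linalg.solve(X_del.T @ X_del, X_del.T)  # (X_(n)' X_(n))^{-1} X_(n)'
block = np.block([[A, np.zeros((p, 1))],
                  [-x_n @ A, np.ones((1, 1))]])

direct = np.linalg.solve(X0.T @ X0, X0.T)      # (X0' X0)^{-1} X0'
print(np.allclose(block, direct))              # True
```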
Incidentally, since $x_n (X_{(n)}^\prime X_{(n)})^{-1}X_{(n)}^\prime y_{(n)} = x_n \hat \beta_{(n)} = \hat y_n$ is the prediction for $x_n$ based on all the other data,
$$\hat \theta = y_n - x_n\hat\beta_{(n)} = y_n - \hat y_n = e_n,$$
the residual from the fit that omits observation $n$. This is intuitively clear: the new variable gives the model the flexibility to fit $y_n$ exactly, which it does by absorbing that leave-one-out residual into its coefficient.
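A final numerical sketch, again on simulated data, confirming that the coefficient of the indicator column is exactly this leave-one-out residual:

```python
import numpy as np

# Illustrative check: the dummy-variable coefficient equals y_n - x_n beta_(n).
rng = np.random.default_rng(3)
n, p = 15, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

u_n = np.zeros((n, 1)); u_n[-1] = 1.0
coef, *_ = np.linalg.lstsq(np.hstack([X, u_n]), y, rcond=None)
theta_hat = coef[-1]                                  # coefficient of the indicator

beta_loo, *_ = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)
e_n = y[-1] - X[-1] @ beta_loo                        # leave-one-out residual

print(np.isclose(theta_hat, e_n))                     # True
```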