I'd vote for singular values/eigenvalues/eigenvectors over determinants and adjugates as the way to approach this.
TL;DR: standard errors increase as the eigenvalues of $X^TX$ get increasingly small, and this corresponds to the formation of valleys in the loss surface, reflecting our increasing inability to separate out candidate $\hat\beta$ values.
We're looking to minimize $\|y - Xb\|^2$ over $b\in\mathbb R^p$. Let $X = UDV^T$ be the SVD of $X$. As $X$ gets increasingly close to reduced rank we'll have $d_p\to 0$ (at least), where $d_p$ is the smallest singular value. This reflects the fact that $X$ is getting closer and closer to having a non-trivial null space, which would include (at least) $\text{span}(v_p)$, with $v_p$ being the right singular vector corresponding to $d_p$, or equivalently the eigenvector of $X^TX$ with the smallest eigenvalue.
This means that once we've got $\hat\beta$ we could get an almost identical loss by replacing $\hat\beta$ with $\hat\beta + \alpha v_p$ for $\alpha \in \mathbb R$: since $Xv_p = d_p u_p$ and the residual $y - X\hat\beta$ is orthogonal to the column space of $X$, the loss at $\hat\beta + \alpha v_p$ exceeds the minimum by exactly $\alpha^2 d_p^2$. So there is a whole affine subspace of almost equal loss (at least for modest values of $\alpha$), and as $d_p\to 0$ the loss becomes increasingly flat over that subspace until, in the limit, we are truly unable to pick an element from it since all elements have identical loss.
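Here's a quick numerical check of that claim (a minimal sketch with made-up data; the near-collinear design is just an arbitrary way to force a tiny $d_p$, not anything from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
X[:, 2] = X[:, 1] + 1e-3 * rng.normal(size=n)  # near-collinear columns force a tiny d_p
y = rng.normal(size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
v_p, d_p = Vt[-1], d[-1]                       # smallest singular value and its direction
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

loss = lambda b: np.sum((y - X @ b) ** 2)
for alpha in [0.0, 1.0, 10.0]:
    # the loss increase matches alpha^2 * d_p^2 up to floating point error
    print(alpha, loss(beta_hat + alpha * v_p) - loss(beta_hat), (alpha * d_p) ** 2)
```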
This is one way to picture high variance: when very different values of $b$ lead to almost identical loss, slight perturbations in the data can produce very different $\hat\beta$s, which is basically what high variance means.
This analysis also tells us that, while some individual coordinates of $\hat\beta$ may have high variance, the phenomenon is really about the coordinates of $\hat\beta$ expressed with respect to the basis given by $V$.
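To make that concrete: since $\text{Cov}(\hat\beta) = \sigma^2 (X^TX)^{-1} = \sigma^2 VD^{-2}V^T$, rotating into the $V$ basis gives $\text{Cov}(V^T\hat\beta) = \sigma^2 D^{-2}$, a diagonal matrix with the excess variance loaded entirely on the $v_p$ coordinate. Here's a simulation sketch of that (made-up design, noise level, and "true" $\beta$; none of these numbers come from the question):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 1.0
X = rng.normal(size=(n, 3))
X[:, 2] = X[:, 1] + 1e-2 * rng.normal(size=n)  # near-collinear, so d_3 is small
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta = np.array([1.0, -1.0, 2.0])              # arbitrary "true" coefficients

# refit under fresh noise many times and rotate each beta_hat into the V basis
draws = np.array([
    Vt @ np.linalg.lstsq(X, X @ beta + sigma * rng.normal(size=n), rcond=None)[0]
    for _ in range(5000)
])

print(np.cov(draws.T).round(2))                # approximately diag(sigma^2 / d_i^2)
print((sigma**2 / d**2).round(2))
```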
Here's an example. I'll build $X$ by picking $U$, $D$, and $V$.
Let
$$
V = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1/\sqrt 2 & 1/\sqrt 2 \\
0 & 1/\sqrt 2 & -1/\sqrt 2
\end{bmatrix}
$$
$$
D = \text{diag}(2, 1.7, 0.01)
$$
and let $U$ be any matrix in $\mathbb R^{n\times 3}$ with orthonormal columns. This leads to
$$
(X^TX)^{-1} = VD^{-2}V^T \approx \begin{bmatrix} 1/4 & 0 & 0 \\ 0 & 5000 & -5000 \\ 0 & -5000 & 5000\end{bmatrix}
$$
so $\hat\beta_1$ will have a very modest variance but $\hat\beta_2$ and $\hat\beta_3$ will have huge variances (recall $\text{Var}(\hat\beta) = \sigma^2(X^TX)^{-1}$), and this is because $Xv_3 \approx \mathbf 0$, so $\hat\beta$ can be perturbed along $(0,1,-1)^T$ with only a small change in loss.
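If you want to verify this numerically, here's a sketch (the sample size and the random draw of $U$ are arbitrary, since any $U$ with orthonormal columns gives the same $X^TX$):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100                                        # arbitrary sample size
V = np.array([[1, 0, 0],
              [0, 1 / np.sqrt(2), 1 / np.sqrt(2)],
              [0, 1 / np.sqrt(2), -1 / np.sqrt(2)]])
D = np.diag([2.0, 1.7, 0.01])
U = np.linalg.qr(rng.normal(size=(n, 3)))[0]   # any orthonormal columns will do
X = U @ D @ V.T

# approximately [[1/4, 0, 0], [0, 5000, -5000], [0, -5000, 5000]]
print(np.linalg.inv(X.T @ X).round(2))
```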
So it is true that the individual variances get large, but I think the view in terms of the basis $V$ is the more fundamental phenomenon.