By minimizing the MSE, we get that the unique solution to this optimization problem is $\beta = (X^TX)^{-1}X^Ty$. Why is its variance matrix $Var(\beta)=(X^TX)^{-1}\sigma^{2}$, where we assume that the observations $y_{i}$ have constant variance $\sigma^{2}$?
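(For reference, the stated solution follows from the first-order condition of the least-squares objective, assuming $X^TX$ is invertible:

$$\min_{\beta}\,\|y - X\beta\|^{2} \;\Rightarrow\; \nabla_{\beta}\,\|y - X\beta\|^{2} = -2X^{T}(y - X\beta) = 0 \;\Rightarrow\; X^{T}X\beta = X^{T}y \;\Rightarrow\; \beta = (X^{T}X)^{-1}X^{T}y.)$$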
Take a look at page 5 of http://cs229.stanford.edu/summer2020/BiasVarianceAnalysis.pdf. You can ignore the $\lambda I$ term, since it becomes $0$ if you are not doing ridge regression. The key idea is that constants "factor out" of a variance or covariance, whether that is a constant scalar times a one-dimensional random variable or a constant matrix times a random vector ($y$ in your case). In linear regression, the matrix $X$ is treated as constant, or you can regard it as implicitly conditioned on, $Var(\beta \mid X)$. – MathFoliage Feb 27 '21 at 10:40
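To spell out the step the comment describes, here is a sketch of the derivation, assuming $Var(y) = \sigma^{2} I$ and a fixed design matrix $X$, and using the rule $Var(Ay) = A\,Var(y)\,A^{T}$ for a constant matrix $A$:

$$Var(\beta) = Var\!\big((X^{T}X)^{-1}X^{T}y\big) = (X^{T}X)^{-1}X^{T}\,Var(y)\,X(X^{T}X)^{-1} = \sigma^{2}(X^{T}X)^{-1}X^{T}X(X^{T}X)^{-1} = \sigma^{2}(X^{T}X)^{-1}.$$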
@MathFoliage Thank you very much for the cs229 file and the clarification. – XXX Feb 27 '21 at 12:28
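As an illustrative sanity check (not from the original thread), a short Monte Carlo simulation can confirm the formula empirically; the design matrix, $\beta$, and $\sigma$ below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, sigma = 200, 3, 2.0
X = rng.normal(size=(n, p))            # fixed design matrix
beta_true = np.array([1.0, -2.0, 0.5])

# Draw many datasets y = X beta + noise and refit each time
fits = []
for _ in range(20000):
    y = X @ beta_true + sigma * rng.normal(size=n)
    fits.append(np.linalg.solve(X.T @ X, X.T @ y))  # (X^T X)^{-1} X^T y

empirical = np.cov(np.array(fits), rowvar=False)    # covariance of the estimates
theoretical = sigma**2 * np.linalg.inv(X.T @ X)     # sigma^2 (X^T X)^{-1}

print(np.round(empirical, 4))
print(np.round(theoretical, 4))                     # the two should be close
```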