Consider the following model.
Assume $(x_i, u_i)$ is a sequence of independent, identically distributed random vectors in $\mathbf{R}^{d+1}:$
- $x_i$ are $\mathbf{R}^d$-valued random vectors, which will represent the "independent" variables.
- $u_i$ are random variables that represent the "random disturbances."
- The index $i$ represents the observation and we assume different observations are independent.
- We assume that $(x_i, u_i)$ have a common distribution with finite second moment such that $\mathbf{E}(u_i x_i) = 0,$ while leaving open the possibility that $\mathbf{E}(u_i) \neq 0.$
- Let $X_n$ be the "data matrix" of type $(n, d)$ ($n$ "rows" and $d$ "columns") filled with the "independent" variables, so that $X_n^\intercal = [x_1, \ldots, x_n],$ and let $v_n = [u_1, \ldots, u_n]^\intercal$ be the "vector of disturbances" or "random error." Again, I am interested in the mathematics; if you prefer to call these by different names because of intuition, be my guest, I only care about the maths.
- Assume that $X_n$ has full rank $d.$ Under this assumption, the square matrix $X_n^\intercal X_n$ (of order $d$) is invertible.
Consider the following linear model $$ y_n = X_n \beta + v_n, $$ where $\beta \in \mathbf{R}^d$ is a vector of parameters to be estimated.
I assume that both $y_n$ and $X_n$ are observed; the task is to estimate $\beta.$ To do this, I will use Ordinary Least Squares (OLS). In other words, I want the vector $\beta \in \mathbf{R}^d$ that minimises the quadratic form $$ \beta \mapsto (y_n - X_n \beta)^\intercal (y_n - X_n \beta). $$ Since this is a convex quadratic function of $\beta,$ any $\hat \beta$ at which its derivative vanishes is a global minimiser. Differentiating (w.r.t. $\beta$) gives the so-called "normal equations" $$ -2 X_n^\intercal(y_n - X_n \beta) = 0, $$ which, by virtue of the hypothesis of full rank of $X_n,$ have a unique solution $$ \hat \beta_n = (X_n^\intercal X_n)^{-1} X_n^\intercal y_n. $$ This is the OLS estimate of $\beta,$ and obtaining it only requires $X_n$ to have full rank.
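To make the computation concrete, here is a minimal numerical sketch (Python with NumPy; the sample size, distributions, and names such as `beta_hat` are my own choices for illustration, not part of the question) of forming $\hat \beta_n$ from the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 500, 3
beta = np.array([1.0, -2.0, 0.5])   # "true" parameter vector

X = rng.normal(size=(n, d))         # data matrix X_n (full rank with probability 1 here)
u = rng.normal(size=n)              # disturbances, independent of X so E(u_i x_i) = 0
y = X @ beta + u                    # linear model y_n = X_n beta + v_n

# OLS via the normal equations: (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate via a least-squares solver (numerically preferable in practice)
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat, beta_hat_lstsq)
```

In practice one would use the least-squares solver rather than forming $X_n^\intercal X_n$ explicitly, but that makes no difference to the question.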
Then, $$ \hat \beta_n = (X_n^\intercal X_n)^{-1} X_n^\intercal y_n = \beta + (X_n^\intercal X_n)^{-1} X_n^\intercal v_n. $$ Now, consider $$ X_n^\intercal X_n = [x_1, \ldots, x_n] \begin{bmatrix} x_1^\intercal \\ \vdots \\ x_n^\intercal \end{bmatrix} = \sum_{i = 1}^n x_i x_i^\intercal. $$ Thus, by the Strong Law of Large Numbers (SLLN), we find $$ \dfrac{1}{n} X_n^\intercal X_n \to \Sigma_x := \mathbf{E}(x_1 x_1^\intercal) \quad \mathrm{a.s.}, $$ and since the function $f \mapsto f^{-1}$ is continuous (from the space of invertible linear maps onto itself), assuming $\Sigma_x$ is invertible we see that $$ n(X_n^\intercal X_n)^{-1} \to \Sigma_x^{-1} \quad \mathrm{a.s.} $$ Next, $$ \dfrac{1}{n} X_n^\intercal v_n = \dfrac{1}{n} \sum_{i = 1}^n u_i x_i \to \mathbf{E}(u_1 x_1) \quad \mathrm{a.s.}, $$ again by the SLLN, since the sequence $(u_i x_i)$ is independent and identically distributed. As we assume $\mathbf{E}(u_i x_i) = 0,$ the product $(X_n^\intercal X_n)^{-1} X_n^\intercal v_n = \big(n (X_n^\intercal X_n)^{-1}\big)\big(\tfrac{1}{n} X_n^\intercal v_n\big) \to \Sigma_x^{-1} \cdot 0 = 0$ a.s., so $\hat \beta_n$ is a sequence of estimators that converges a.s. to $\beta.$
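Purely as a sanity check (again my own illustration, not part of the argument): in a simulation where $(x_i, u_i)$ are i.i.d. with finite second moments and $u_i$ is independent of $x_i,$ so that $\mathbf{E}(u_i x_i) = 0,$ the error $\|\hat \beta_n - \beta\|$ does shrink as $n$ grows, consistent with the a.s. convergence above:

```python
import numpy as np

rng = np.random.default_rng(1)

d = 3
beta = np.array([1.0, -2.0, 0.5])

def ols_error(n):
    """Draw one sample of size n and return ||beta_hat_n - beta||."""
    X = rng.normal(size=(n, d))          # x_i i.i.d. with Sigma_x = E(x_1 x_1^T) = I_d (invertible)
    u = rng.standard_t(df=5, size=n)     # u_i i.i.d., finite variance, independent of x_i
    y = X @ beta + u
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return np.linalg.norm(beta_hat - beta)

for n in (10**2, 10**3, 10**4, 10**5):
    print(n, ols_error(n))
```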
But this baffles me, since I am proving that the sequence of OLS estimators converges almost surely, and a fortiori in probability, to the "true" value of $\beta.$ Why do we stop at convergence in probability? Am I missing something? I suppose one can redo the proof above assuming only that different observations are uncorrelated rather than independent; then my applications of the SLLN break down, and presumably some control on the dispersion matrix of $x$ or on the data matrix $X_n$ rescues the convergence, but no longer a.s., only in probability.
P.S. After posting this here and seeing how it was received, I think I realised I should keep using math.stackexchange for questions that are mathematical in nature, as opposed to intuition or reference questions. Apologies if this seems too off-topic.