
Consider the following model.

Assume $(x_i, u_i)$ is a sequence of independent, identically distributed random vectors in $\mathbf{R}^{d+1}$:

  • $x_i$ are $\mathbf{R}^d$-valued random vectors, which will represent the "independent" variables.
  • $u_i$ are random variables that represent the "random disturbances."
  • The index $i$ represents the observation and we assume different observations are independent.
  • We assume that $(x_i, u_i)$ have a common distribution with finite second moment such that $\mathbf{E}(u_i x_i) = 0,$ but leaving the possibility $\mathbf{E}(u_i) \neq 0$ open.
  • Let $X_n^\intercal = [x_1, \ldots, x_n]$ be the "data matrix" of type $(n, d)$ ($n$ "rows" and $d$ "columns") filled with the "independent" variables, and let $v_n = [u_1, \ldots, u_n]^\intercal$ be the "vector of disturbances" or "random error." Again, I am interested in the mathematics; if you prefer to call these by different names on intuitive grounds, be my guest, I only care about the maths.
  • Assume that $X_n$ has full rank $d.$ Under this assumption, the square matrix $X_n^\intercal X_n$ (of order $d$) is invertible.

Consider the following linear model $$ y_n = X_n \beta + v_n, $$ where $\beta \in \mathbf{R}^d$ is a vector of parameters to be estimated.

I assume that both $y_n$ and $X_n$ are observed; the task is to estimate $\beta.$ To do this, I will use Ordinary Least Squares (OLS). In other words, I want the vector $\beta \in \mathbf{R}^d$ that minimises the quadratic form $$ \beta \mapsto (y_n - X_n \beta)^\intercal (y_n - X_n \beta). $$ Since this is a convex quadratic form, any $\hat \beta$ that makes its derivative zero is a global minimiser. Differentiating (w.r.t. $\beta$) and setting the derivative to zero gives the so-called "normal equations" $$ X_n^\intercal(y_n - X_n \beta) = 0, $$ which, by virtue of the hypothesis of full rank of $X_n,$ have the unique solution $$ \hat \beta_n = (X_n^\intercal X_n)^{-1} X_n^\intercal y_n. $$ This is the OLS estimate of $\beta,$ and obtaining it only requires $X_n$ to have full rank.
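
As a quick numerical sanity check (purely illustrative; the dimensions, distributions and variable names below are arbitrary choices of mine), the closed form above agrees with a generic least-squares solver:

```python
import numpy as np

# Illustrative sketch only: generate data satisfying the assumptions above
# (E(u_i x_i) = 0 holds because u_i is independent of x_i and E(x_i) = 0,
# while E(u_i) = 1 != 0 is deliberately allowed) and compute the OLS estimate
# from the normal equations.
rng = np.random.default_rng(0)
n, d = 500, 3
beta = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(n, d))                  # rows are the x_i
u = rng.normal(size=n) + 1.0                 # disturbances with nonzero mean
y = X @ beta + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # (X'X)^{-1} X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # generic least-squares solver

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))            # True: same minimiser
```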

Then, $$ \hat \beta_n = (X_n^\intercal X_n)^{-1} X_n^\intercal y_n = \beta + (X_n^\intercal X_n)^{-1} X_n^\intercal v_n. $$ Now, consider $$ X_n^\intercal X_n = [x_1, \ldots, x_n] \begin{bmatrix} x_1^\intercal \\ \vdots \\ x_n^\intercal \end{bmatrix} = \sum_{i = 1}^n x_i x_i^\intercal. $$ Thus, by the Strong Law of Large Numbers (SLLN), we find $$ \dfrac{1}{n} X_n^\intercal X_n \to \Sigma_x := \mathbf{E}(x_1 x_1^\intercal) \quad \mathrm{a.s.}, $$ and, assuming the second-moment matrix $\Sigma_x$ is invertible, since the map $A \mapsto A^{-1}$ is continuous (from the space of invertible matrices onto itself), we see that $$ n(X_n^\intercal X_n)^{-1} \to \Sigma_x^{-1} \quad \mathrm{a.s.} $$ Next, $$ \dfrac{1}{n} X_n^\intercal v_n = \dfrac{1}{n} \sum_{i = 1}^n u_i x_i \to \mathbf{E}(u_1 x_1) \quad \mathrm{a.s.}, $$ again by the SLLN, since the sequence $(u_i x_i)$ is independent and identically distributed. As we assume $\mathbf{E}(u_i x_i) = 0,$ we conclude that $\hat \beta_n$ is a sequence of estimators that converges a.s. to $\beta.$
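
A small simulation (again just a sketch, with distributions of my own choosing) makes this visible: along a single growing sample, $\hat \beta_n$ drifts towards $\beta$:

```python
import numpy as np

# Sketch: watch the OLS estimate computed on the first n observations of one
# long sample approach the true beta as n grows; the a.s. statement is about
# exactly this kind of single realisation.
rng = np.random.default_rng(1)
d = 3
beta = np.array([1.0, -2.0, 0.5])

N = 200_000
X = rng.normal(size=(N, d))
u = rng.normal(size=N) + 1.0                 # E(u_i) != 0, but E(u_i x_i) = 0 here
y = X @ beta + u

for n in (100, 1_000, 10_000, 100_000, 200_000):
    Xn, yn = X[:n], y[:n]
    beta_hat_n = np.linalg.solve(Xn.T @ Xn, Xn.T @ yn)
    print(n, np.linalg.norm(beta_hat_n - beta))
# the printed distances shrink, roughly like n^{-1/2}
```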

But this baffles me, since I am proving that the sequence of OLS estimators converges almost surely, and a fortiori in probability, to the "true" value of $\beta.$ Why do we stop at convergence in probability? Am I missing something? I suppose one can redo the proof above assuming only that different observations are uncorrelated rather than independent; then my applications of the SLLN break down, and presumably some control on the dispersion matrix of $x$ (or on the data matrix $X_n$) rescues the convergence, but no longer a.s., only in probability.
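
To sketch what I have in mind (a rough argument only, and I am adding the extra assumptions that the $u_i x_i$ are pairwise uncorrelated with mean zero and uniformly bounded second moments, say $\mathbf{E}\lVert u_i x_i \rVert^2 \le C$): $$ \mathbf{E}\left\lVert \dfrac{1}{n} X_n^\intercal v_n \right\rVert^2 = \dfrac{1}{n^2} \sum_{i = 1}^n \mathbf{E}\lVert u_i x_i \rVert^2 \le \dfrac{C}{n} \to 0, $$ so $\frac{1}{n} X_n^\intercal v_n \to 0$ in $L^2$ and hence in probability (by Chebyshev), while the term $\frac{1}{n} X_n^\intercal X_n$ would need a separate assumption; the SLLN is no longer available, so this route only delivers convergence in probability.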

P.S. After posting this here and seeing how it was received, I think I realised I should continue to use math.stackexchange for questions that are mathematical in nature, as opposed to intuition or reference questions. Apologies if this seems too off-topic.

Richard Hardy
William M.
  • You seem to assume (1) the means of the variables are all zero and (2) that the variables have a defined, finite covariance matrix. Most of this doesn't matter conceptually because OLS focuses on the *conditional* distribution of the response: you just have to rule out the possibility that the explanatory variables might have outliers so extreme as to ruin the OLS estimates. – whuber Jul 08 '21 at 16:33
  • (1) Only assume $u_1x_1$ (hence all of the $u_ix_i$) are random vectors with mean zero; (2) yes I assume that the $x_i$ have finite dispersion matrix; I don't think I assume that for the $u_i.$ – William M. Jul 08 '21 at 16:38
  • You assume at least that the expectation of $u_i x_i$ exists, and without assuming that $u_i$ has expectation zero (or at least that it is somehow centered), your model will not be identifiable. And, please edit to remove the superfluous dimension $q$, and otherwise make all assumptions explicit in an edit! – kjetil b halvorsen Jul 08 '21 at 19:05
  • I am rather new to statistics (I am a mathematician), but why is q superfluous? – William M. Jul 08 '21 at 19:07

2 Answers


Yep. The OLS estimator $\hat{\beta}$ is a linear estimator, so we can invoke the SLLN by expressing it as a sample average.

$$\hat{\beta} = \frac{\sum_{i=1}^n(X_i - \bar{X})Y_i}{\sum_{i=1}^n(X_i - \bar{X})^2}$$

With some clever algebra it is not too hard to express the above display in terms of sample averages of the usual form $\sum_i T_i / n$.
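
For concreteness, here is a sketch of one way that algebra can go, in the simple-regression-with-intercept case the display is written for: $$ \hat{\beta} = \frac{\tfrac{1}{n}\sum_{i=1}^n (X_i - \bar{X}) Y_i}{\tfrac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\overline{XY} - \bar{X}\,\bar{Y}}{\overline{X^2} - \bar{X}^2}, $$ a continuous function of the sample averages $\overline{XY}, \bar{X}, \bar{Y}, \overline{X^2}$, so the SLLN applied to each average, together with the continuous mapping theorem, gives almost sure convergence of $\hat{\beta}$ (provided $\operatorname{Var}(X) > 0$).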

AdamO
  • I don't mean to be rude, but how does your answer relate to my question? – William M. Jul 08 '21 at 16:41
  • SLLN - strong law of large numbers - gives us that the sample mean converges a.s. to its expectation. – AdamO Jul 08 '21 at 16:45
  • Did you even read my question? The "independent" variables $x_i$ are random vectors in $\mathbf{R}^q$! And I am not asking whether or not $\hat \beta$ is linear in the observations $y_i,$ but I am asking if my approach is right and, if so, why people talk about consistency in OLS as being convergence in probability when convergence a.s. is so much easier to prove. – William M. Jul 08 '21 at 17:11
  • @WillM. The X aren't random in OLS. – AdamO Jul 08 '21 at 17:33
  • The setting that I am thinking of is that $X$ is random but assumed "given", so you assume the error vector $u$ satisfies $\mathbf{E}(u \mid X) = 0$ and $\mathbf{V}(u \mid X) = \sigma^2 I.$ – William M. Jul 08 '21 at 17:41
  • In any case, I wrote all the hypotheses of the model I am considering. – William M. Jul 08 '21 at 17:49
  • @AdamO, OLS is an estimation technique that can be applied without regard to how $X$ came about (such as whether $X$ are random or not). So I do not think it is correct to say $X$ are not random in OLS. – Richard Hardy Jul 08 '21 at 18:03
  • @RichardHardy would you mind showing how, if $X$ is random, the expression of the variance of $\hat{\beta}$ as $\hat{\sigma}^2\left(X^\intercal X\right)^{-1}$ is consistent? The law of total variance would give us an additional contribution to the variance with $X$ random. – AdamO Jul 08 '21 at 18:17
  • @AdamO: I believe there is much confusion about these things! As I have understood it (and written at https://stats.stackexchange.com/questions/215230/what-are-the-differences-between-stochastic-and-fixed-regressors-in-linear-regre/417324#417324 and posts linked in there), **for purposes of (exact, finite-sample) inference about the regression coefficients** we can condition on ("assume fixed") the regressors. That doesn't inhibit the regressors from being random! ... – kjetil b halvorsen Jul 08 '21 at 18:38
  • ... And, for questions about consistency and limit distributions, we cannot evade some assumptions about how the regressors $x_i$ change with $i$, be that in a random or deterministic way. If, for instance, $x_i$ is proportional to $i$ (which might happen in a time series context), then that must be taken account of in any proof of consistency. – kjetil b halvorsen Jul 08 '21 at 18:38
  • @kjetilbhalvorsen I think we are using the term OLS with some loss of specificity. I agree that in general *regression* can consider fixed or stochastic regressors as part of a general class of models, with many estimation routines and many variance estimates. However OLS seems to be a codified set of terms and expressions that exclude random predictors. – AdamO Jul 08 '21 at 18:43
  • @AdamO I am using OLS to mean "ordinary least squares", a.k.a. the parameter $\beta$ that minimises the quadratic form $(y - X \beta)^\intercal (y - X \beta).$ This quadratic form has a unique minimiser when $X$ has full rank, and this minimiser is precisely $(X^\intercal X)^{-1} X^\intercal y.$ – William M. Jul 08 '21 at 18:48
  • @AdamO: But even if we should agree on that, you still need to assume something about how the $x$'s behave when new samples arrive, when discussing asymptotic properties. And outside of designed experiments it seems strange to assume they are deterministic! So what do you propose? – kjetil b halvorsen Jul 08 '21 at 18:48
  • @kjetilbhalvorsen do we? OP's question has nothing other than fixed $n$. I think there's a striking lack of clarity in it as written, and even you and Will are seemingly disagreeing with me and with each other. I think this needs some clarity (and fewer insults from Will) to get meaningful answers. – AdamO Jul 08 '21 at 18:50
  • @AdamO how am I insulting you? – William M. Jul 08 '21 at 18:51
  • @WillM. do you suppose I don't understand the initialism OLS? Have some tact. – AdamO Jul 08 '21 at 18:51
  • @AdamO I asked you if you read my question because you are squaring the $X_i - \bar{X}$ but $X$ is a random vector. – William M. Jul 08 '21 at 18:52
  • @AdamO: More clarity would be better ... but still, the question is about asymptotic properties, and so how can $n$ be fixed? – kjetil b halvorsen Jul 08 '21 at 18:56
  • For me, OLS is not denoting a *model*, it is just a numerical algorithm that by itself assumes nothing about randomness or otherwise of the $x$'s. – kjetil b halvorsen Jul 08 '21 at 19:06
  • @kjetilbhalvorsen now I'm really confused by your take. When we speak of the asymptotic properties of an estimator, we don't consider observations 1 to n as fixed, and then the next n+1 to infinity to be random. Rather, we consider the actual size of $n$ to be arbitrary, and the frequentist properties of the estimators for $n$ as it gets arbitrarily larger, not sequentially. Lyapunov's triangular array presentation could shed some light https://www.stat.berkeley.edu/users/pitman/s205f02/lecture10.pdf. But for the usual OLS presentation, the X is fixed for any $n$. – AdamO Jul 08 '21 at 19:14
  • @WillM. well 1. X may or may not be a random vector in my expression, 2. you can square any quantity, random or not, and 3. my display is the expression for $\hat{\beta}$ as a linear estimator, from which its expression as a sample average easily follows. – AdamO Jul 08 '21 at 19:15

You are correct: the convergence holds almost surely as well. In this case there's essentially no extra effort in getting an almost sure result.

Now, a question as to why the mathematical statistics community is often happy to work with convergence in probability is fundamentally a sociological question, not a mathematical question, so you can't expect to get a completely mathematical answer.

Most of the time (with a few important exceptions) statistics is happy with convergence in probability. Some contributing reasons:

  • it's true under weaker conditions, especially as regards independence
  • the proofs are simpler
  • the asymptotic conclusions of most interest are about the behaviour of a single large $n$ rather than the 'infinitely often' behaviour of a whole sequence, in part because the asymptotics is often used to reassure about the behaviour of estimators for a single $n$
  • when almost sure convergence is useful as a step in a proof it can often be obtained using (Skorohod/Wichura/Dudley) almost sure representation theorems

There are definitely exceptions, both of sub-fields where almost-sure properties are important and of individuals who are interested in almost sure results, but it's also true that 'in probability' is often enough.

Thomas Lumley
  • Thank you, my question definitely goes along these lines. I put the OLS example, but so far I have always seen that the MLE estimators (for Poisson, Binomial, Geometric, Normal, etc.) converge a.s. to the "true" value of the parameter, yet the results always state them only as convergence in probability, which, being a mathematician, causes me confusion. Why be content with a weak result when no effort (beyond accepting the SLLN) gives a much more useful result (convergence in probability need not imply that the sequence you found actually converges to the "true" value, while a.s. convergence does)? – William M. Jul 09 '21 at 14:50
  • Statistics is more interested in whether the sequence is close to the truth for a single large $n$ than in whether it converges to the truth, because a single (hopefully large enough) $n$ is what we have; the exceptions, such as sequential analysis or stochastic processes, where the data are sequences, are exactly where a.s. properties are widely studied. – Thomas Lumley Jul 10 '21 at 03:25