
Consider the multiple linear regression model $\mathbf Y=\mathbf X\boldsymbol\beta+\boldsymbol\epsilon$, where $\boldsymbol\epsilon\sim N(\mathbf 0, \sigma^2\mathbf I)$.

The least squares estimator is $\hat{\boldsymbol\beta}=(\mathbf X'\mathbf X)^{-1}\mathbf X'\mathbf Y$.

The variance of the estimator is $$\operatorname{Var}(\hat{\boldsymbol\beta})=\sigma^2(\mathbf X'\mathbf X)^{-1}=\frac{\sigma^2}{n}\left(\frac{1}{n}\mathbf X'\mathbf X\right)^{-1}.$$
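For reference, this follows, conditional on $\mathbf X$, from $\hat{\boldsymbol\beta}$ being linear in $\mathbf Y$:

$$\operatorname{Var}(\hat{\boldsymbol\beta})=(\mathbf X'\mathbf X)^{-1}\mathbf X'\operatorname{Var}(\boldsymbol\epsilon)\,\mathbf X(\mathbf X'\mathbf X)^{-1}=\sigma^2(\mathbf X'\mathbf X)^{-1}.$$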

Lecture notes by a renowned professor (reference: p. 16) state that:

$\operatorname{Var}(\hat{\boldsymbol\beta})=\frac{\sigma^2}{n}\left(\frac{1}{n}\mathbf X'\mathbf X\right)^{-1}$ is $O\left(\frac{1}{n}\right)$ and the convergence is at the rate $\frac{1}{\sqrt n}$.

I do not understand:

(1) How is $\operatorname{Var}(\hat{\boldsymbol\beta})=\frac{\sigma^2}{n}\left(\frac{1}{n}\mathbf X'\mathbf X\right)^{-1}$ of order $O\left(\frac{1}{n}\right)$? Why is it not $O(1)$, given that the $n$'s appear to cancel in $\frac{\sigma^2}{n}\left(\frac{1}{n}\mathbf X'\mathbf X\right)^{-1}$, leaving only a constant term?

(2) How does one calculate the convergence rate, which here is $\frac{1}{\sqrt n}$?

user 31466

2 Answers


\begin{align*} \mathrm{Var}\left(\hat{\beta} \right) &= \frac{\sigma^2}{n} \left( \frac{1}{n} X'X \right) ^{-1} \\ &= \frac{\sigma^2}{n} \left( \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i' \right) ^{-1} \end{align*}

And as $n \rightarrow \infty$ you have $\frac{1}{n}\sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i' $ converging in probability (by Kolmogorov's Law of Large Numbers for iid data) to the constant matrix $A = \mathrm{E}[\mathbf{x} \mathbf{x}']$ where $\mathbf{x}$ is a random vector, an observation drawn from the same distribution as your data.

Loosely, you can think of it as:

$$\mathrm{Var}\left(\hat{\beta} \right) \approx \frac{1}{n}\sigma^2 A^{-1}$$

I think what you may have been missing is that there's an $n$ embedded in $X'X$.
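Not from the original answer, but a quick Monte Carlo sketch of both facts, under an assumed design (intercept plus one iid standard-normal regressor, $\sigma = 1$, so $A = \mathrm{E}[\mathbf x \mathbf x'] = I$ and $\sigma^2 (A^{-1})_{22} = 1$; the sample sizes, seed, and coefficients are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
beta = np.array([1.0, 2.0])   # hypothetical true coefficients (intercept, slope)
n_reps = 2000                 # Monte Carlo replications per sample size

for n in (100, 400, 1600):
    # Law of large numbers: (1/n) X'X settles down to A = E[x x'] (= I here)
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    print(f"n = {n}, (1/n) X'X =\n{X.T @ X / n}")

    # Monte Carlo variance of the slope estimate across replications;
    # it should track sigma^2 (A^{-1})_{22} / n = 1/n, i.e. O(1/n)
    slopes = np.empty(n_reps)
    for r in range(n_reps):
        Xr = np.column_stack([np.ones(n), rng.normal(size=n)])
        y = Xr @ beta + sigma * rng.normal(size=n)
        slopes[r] = np.linalg.lstsq(Xr, y, rcond=None)[0][1]
    print(f"empirical Var(slope) = {slopes.var():.5f}, n * Var = {n * slopes.var():.3f}")
```

Each quadrupling of $n$ should roughly quarter the empirical variance, while $n \cdot \mathrm{Var}$ stays near $1$.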


Notation notes:

$\mathbf{x}_i$ is a $k \times 1$ column vector denoting the $i$th observation. The data matrix is then the $n \times k$ matrix:

$$ X = \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{bmatrix}$$

I use bold letters to denote vectors.

Matthew Gunn
  • I think you have some typos: $\mathrm{E}[\mathbf{x}_i \mathbf{x}_i']$ is not a particularly meaningful expression in the context in which it is being used here. – cardinal Dec 12 '16 at 03:41
  • @cardinal Could you clarify what you mean? This is a commonly used notation: $\mathbf{x}_i$ is a vector denoting the $i$th observation. $\mathbf{x}_i \mathbf{x}_i'$ is the outer product. – Matthew Gunn Dec 12 '16 at 03:47
  • That is, as $\left( \frac{1}{n}\mathbf X' \mathbf X\right) ^{-1}$ converges in probability to $\mathrm{E}[\mathbf{x}_i \mathbf{x}_i']$ (or equivalently, $\mathrm{E}[\mathbf X' \mathbf X]$, isn't it?), so it becomes constant and thus $\mathrm{Var}\left(\hat{\beta} \right)$ is $O(1/n)$. It is now clear. Thank you. But can I extract the result of the convergence rate $1/\sqrt n$ from here? – user 31466 Dec 12 '16 at 03:51
  • Matthew: $i$ is the index in your original summation. It is, at least, unconventional to have some indeterminate $i$ floating around in a statement regarding a limit. (And, in any case, in this particular setting, the theory would not even require the expectation of any particular $\mathbf x_i$ to exist.) – cardinal Dec 12 '16 at 03:52

If the variance of the estimator $\text{Var}(\hat{\boldsymbol\beta})=\sigma^2(\mathbf X'\mathbf X)^{-1}$ is $O(1/n)$ (implicitly assuming deterministic regressors, or conditioning on them), then, according to the big-O definition (adjusted for matrices), it must be the case that

$$\lim_{n \to \infty} \left [n\cdot \sigma^2(\mathbf X'\mathbf X)^{-1}\right] $$

exists and is a finite, non-zero matrix. We have

$$\lim_{n \to \infty} \left [n\cdot \sigma^2(\mathbf X'\mathbf X)^{-1}\right] = \sigma^2 \cdot \lim_{n \to \infty} \left (\frac 1n \mathbf X'\mathbf X\right )^{-1} $$

Part of the standard assumptions of the model are the Grenander conditions, which among other things guarantee that the above limit exists and is what we want it to be. So the variance is indeed $O(1/n)$, which also implies that it is not $O(1)$: multiplying and dividing by $n$ does not make the $n$ cancel, because the factor $\left(\frac 1n \mathbf X'\mathbf X\right)^{-1}$ tends to a constant matrix, leaving the leading $1/n$ in place.
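A concrete scalar illustration (not in the original answer, just the simplest possible case): in the intercept-only model $y_i = \beta + \epsilon_i$, the design matrix is a column of ones, so $\mathbf X'\mathbf X = n$ and

$$\text{Var}(\hat\beta) = \frac{\sigma^2}{n}, \qquad \lim_{n \to \infty} n\cdot \sigma^2(\mathbf X'\mathbf X)^{-1} = \sigma^2,$$

which is finite and non-zero, exactly as the definition requires.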

We also have

$$ n\cdot \sigma^2(\mathbf X'\mathbf X)^{-1} = \text{Var}[\sqrt n (\hat \beta-\beta)] $$

and it is from here that the $1/\sqrt n$ "convergence rate" statement comes. One might wonder whether this is correct, since the necessary scaling for the *variance* is $n$, but the rate refers to the estimator itself: the variance is $O(1/n)$, so the standard deviation, and with it the typical size of $\hat\beta - \beta$, is $O(1/\sqrt n)$. That is why the deviation must be inflated by $\sqrt n$ to retain a non-degenerate variance.
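As a sanity check, here is my own sketch of that last point, under the same assumed design as the simulation in the other answer (intercept plus one standard-normal regressor; all constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, b0, b1 = 1.0, 1.0, 2.0   # hypothetical noise level and true coefficients
n_reps = 2000                    # Monte Carlo replications per sample size

for n in (100, 400, 1600):
    devs = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.normal(size=n)
        y = b0 + b1 * x + sigma * rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        devs[r] = np.linalg.lstsq(X, y, rcond=None)[0][1] - b1
    # sd(beta-hat - beta) shrinks like 1/sqrt(n), so sqrt(n) * sd stays stable
    print(f"n = {n}: sd = {devs.std():.4f}, sqrt(n) * sd = {np.sqrt(n) * devs.std():.3f}")
```

The unscaled standard deviation halves with each quadrupling of $n$, while $\sqrt n \cdot \text{sd}$ hovers around a constant: the $1/\sqrt n$ rate in action.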

Alecos Papadopoulos
  • (+1) In particular, as I tried to softly allude to in the other answer, note that this convergence has *nothing* to do with any expectation (or law of large numbers or central limit theorem). It is simply a limiting statement about a sequence of fixed matrices (since the standard regression treatment is conditional on $X$). For example, in cases where the experimenter explicitly designs $X$, it is easy for a mad scientist to construct cases where the individual $\mathbf x_i$ have quite pathological behavior as a function of $i$, yet still satisfy the desired limit condition. – cardinal Dec 14 '16 at 22:48
  • @cardinal I wonder when your book with all these fascinating pathological cases will come out... :) I hope you will let us know. – Alecos Papadopoulos Dec 14 '16 at 23:24