6

Let $X$ be an $n\times (p+1)$ non-stochastic design matrix. The OLS estimator is given by

$$\hat{\beta} = (X'X)^{-1}X' y$$

Thus the variance of the estimator is

$$\text{Var}\left( \hat{\beta}\right) = (X'X)^{-1} \sigma^2\, , $$

where $\text{Var}(y) = I_n \sigma^2$.

My question is: why is it true that the variance of the estimator decreases as the sample size increases? It is not obvious to me what the $i$-th diagonal entry of $(X'X)^{-1}$ is.
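For concreteness, here is how I would compute this quantity numerically (a rough numpy sketch; the design matrix and $\sigma^2$ below are made-up illustration values, not from any particular problem):

```python
import numpy as np

# Made-up example: n = 5 observations, an intercept column plus one regressor
X = np.column_stack([np.ones(5), np.arange(1.0, 6.0)])
sigma2 = 2.0  # assumed known error variance

# Var(beta_hat) = sigma^2 (X'X)^{-1}; its i-th diagonal entry is Var(beta_hat_i)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
print(np.diag(cov_beta))
```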

Ferdi
Yuki Kawabata
  • The result isn't necessarily true, because it depends on how $X$ changes with $n$. Could you tell us what you are proposing? – whuber Nov 07 '17 at 20:44
  • @whuber I think they are looking for a mathematical proof of std. error of beta hat = variance / sqrt(n)? – Mark White Nov 07 '17 at 21:09
  • @Mark That's true when the only regressor is a constant, because there's no question about what happens to it as more data are collected! However, in any other case, what values are we to give to each new regressor as we collect more data? We can choose sequences in which the variance of a coefficient estimate goes up for a while before it goes down. We can even choose sequences in which the variance oscillates forever--and never converges to zero! – whuber Nov 07 '17 at 21:23
  • I think he's assuming that you could make infinite draws from a totally stationary dataset, and that all draws would be of sufficiently large $n$ that their variance would be the same. – Josh Nov 07 '17 at 21:49
  • @Josh That sounds like a particular kind of *stochastic* design matrix. It's a reasonable interpretation, though, and helps make the desired conclusion true! – whuber Nov 07 '17 at 21:56
  • @whuber That's something that professor briefly mentioned during the lecture. I probably will ask him later to clarify this. Maybe I should close the question for now. – Yuki Kawabata Nov 07 '17 at 21:57
  • The estimator of $Var(\hat{\beta})$ is given by $\frac{RSS}{n-K}(X'X)^{-1}$ where RSS is the residual sum of squares. So according to this estimator of the stated variance, the variance decreases as $n$ increases. Does not this help to answer the question? – Snoopy Oct 03 '18 at 21:25

2 Answers

8

If we assume that $\sigma^2$ is known, the variance of the OLS estimator depends only on $X'X$, because we do not need to estimate $\sigma^2$. Here is a purely algebraic proof that, when $\sigma^2$ is known, the variance of the estimator never increases as observations are added.

Suppose $X$ is your current design matrix and you add one more observation $x$, a row vector of dimension $1\times (p+1)$. Your new design matrix is
$$X_{new} = \left(\begin{array}{c}X \\ x \end{array}\right).$$
You can check that $X_{new}'X_{new} = X'X + x'x$. Using the Sherman–Morrison formula (a rank-one special case of the Woodbury identity) we get
$$ (X_{new}'X_{new})^{-1} = (X'X + x'x)^{-1} = (X'X)^{-1} - \frac{(X'X)^{-1}x'x(X'X)^{-1}}{1+x(X'X)^{-1}x'}. $$
The matrix $(X'X)^{-1}x'x(X'X)^{-1} = \left[(X'X)^{-1}x'\right]\left[(X'X)^{-1}x'\right]'$ is positive semi-definite (it is a vector multiplied by its own transpose), and $1+x(X'X)^{-1}x'>0$, so the diagonal elements of the subtracted term are greater than or equal to zero. Hence the diagonal elements of $(X_{new}'X_{new})^{-1}$ are less than or equal to the corresponding diagonal elements of $(X'X)^{-1}$.
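A quick numerical check of this rank-one update (a sketch with numpy; the particular matrices below are randomly generated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up current design (n = 10, p + 1 = 3 columns) and one additional row
X = rng.normal(size=(10, 3))
x = rng.normal(size=(1, 3))

A_inv = np.linalg.inv(X.T @ X)                # (X'X)^{-1}
A_new_inv = np.linalg.inv(X.T @ X + x.T @ x)  # (X_new' X_new)^{-1}

# Sherman-Morrison form of the update
sm = A_inv - (A_inv @ x.T @ x @ A_inv) / (1.0 + x @ A_inv @ x.T)

print(np.allclose(A_new_inv, sm))                    # True: the identity holds
print(np.all(np.diag(A_new_inv) <= np.diag(A_inv)))  # True: diagonals never increase
```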

AAL
  • +1. Thank you for the clear explanation -- and welcome to CV. – whuber Jun 24 '20 at 16:22
  • A careful reader, without enough rep to comment, has called to my attention an error in the application of this identity. Indeed, the denominator of the right hand side of the equality cannot be correct because it does not scale properly upon a simple rescaling of the variables. (Alternatively, consider $X=(1,1)^\prime$ (for $p=0$) and $x=(1).$ It asserts $(2)^{-1}=(2)^{-1} - (2)^{-1}(1)(2)^{-1}/(1+1),$ equivalent to $1/3=3/8.$) The argument is a good one, so it would be nice to get the details right. https://stats.stackexchange.com/a/90921/919 shows what the formula ought to look like. – whuber Dec 28 '21 at 18:36
  • There was indeed a typo in the denominator: missing a factor of $(X'X)^{-1}$. I have edited the answer accordingly and checked that it fixes the counterexample. Thank you for pointing it out. – AAL Dec 28 '21 at 22:47
5

Assumptions:

(1) There exists a population from which infinite draws of $X$ and $y$ may be made, and each of those draws is characterized by the exact same distribution parameters.

(2) $n$ is sufficiently large that the variance of a sample of length $n$ is always the same, or may be approximated as such.

Let's start out like this:

$\hat{\beta}=(X'X)^{-1}X'y$

$\text{Var}(\hat{\beta})=\text{Var}\left[(X'X)^{-1}X'y\right]$

Now, let the columns of $X$ be mutually orthogonal, each with variance $\sigma^2$ and mean $0$. $X'X$ is then a $(p+1)$-dimensional diagonal matrix whose diagonal elements are $n\sigma^2$. $(X'X)^{-1}$ is just the element-by-element inversion of the diagonal of $X'X$, that is, a $(p+1)$-dimensional diagonal matrix whose diagonal elements are $1/(n\sigma^2)$.

That brings us to

$\text{Var}(\hat{\beta})=\left[1/(n\sigma^2)\right]^2 I_{p+1}\,\text{Var}[X'y]$

$\text{Var}(\hat{\beta})=\left[1/(n^2\sigma^4)\right] I_{p+1}\,\text{Var}[X'y]$

However, if $y$ is just a univariate response with variance $\sigma^2$ and mean $0$, then there's no need for the identity matrix in specifying its variance; its variance is a scalar. As specified in the first paragraph, each of the columns of $X$ also has variance $\sigma^2$ and mean $0$, so the variance of $X'y$ is given by a $(p+1)$-by-$1$ column vector whose elements are $n\sigma^4$, i.e., $n\sigma^4\,\mathbf{1}_{p+1}$. The presence of the $n$ term seems strange until you realize that we are actually talking about the variance of the sum of $n$ random variables, each with variance $\sigma^4$ (the product of two random variables, each with variance $\sigma^2$ and mean $0$). That is,

$\text{Var}(\hat{\beta})=\left[1/(n^2\sigma^4)\right] I_{p+1}\, n\sigma^4\,\mathbf{1}_{p+1}$

So we have a $(p+1)$-by-$(p+1)$ diagonal matrix multiplying a $(p+1)$-by-$1$ vector, each of whose elements is

$\text{Var}(\hat{\beta}_i)=\left[1/(n^2\sigma^4)\right] n\sigma^4 = 1/n$

Note the absence of $\sigma^2$, which is due to our specification that all the vectors have the same variance. The summation of the $p+1$ elements of the variance vector therefore scales linearly with $p+1$, which we also expect. This is essentially the variance of $\hat{y}$, which tends to exhibit proportionality to $(p+1)/n$.
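Here is a small simulation of the $1/n$ scaling under these assumptions (a rough numpy sketch; $\sigma^2$, the sample sizes, and the number of replications are made-up illustration values):

```python
import numpy as np

rng = np.random.default_rng(0)
p_plus_1, sigma2, n_reps = 3, 1.0, 2000

for n in (50, 200, 800):
    betas = np.empty((n_reps, p_plus_1))
    for r in range(n_reps):
        # Fresh draw of X and y; every column and the response have mean 0, variance sigma^2
        X = rng.normal(scale=np.sqrt(sigma2), size=(n, p_plus_1))
        y = rng.normal(scale=np.sqrt(sigma2), size=n)
        betas[r] = np.linalg.solve(X.T @ X, X.T @ y)  # OLS estimate for this draw
    # Empirical Var(beta_hat_i) across draws vs. the predicted 1/n
    print(n, betas.var(axis=0).round(5), round(1.0 / n, 5))
```

Under these assumptions the empirical variances should track $1/n$, shrinking as $n$ grows.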

Here is a resource I've found useful; it extends this explanation to regularized (ridge) regression.

Josh
  • This answer outlines one way in which to make the conclusion true. There are a lot of assumptions here--perhaps it would clarify the exposition if you were to lay them out explicitly. Given that the question specifies that $X$ is "non-stochastic," could you explain the sense(s) in which you are referring to their "variance" in this answer? – whuber Nov 07 '17 at 21:54
  • I added some assumptions. Do you think there are other stipulations that must be made in order for my explanation to hold? – Josh Nov 07 '17 at 21:59
  • Thank you for undertaking that listing of assumptions. However, it's now almost certain that the columns of $X$ will not be orthogonal! That doesn't ruin your basic idea, but it does indicate that this problem may be more subtle than it might appear. – whuber Nov 07 '17 at 21:59
  • Why not? Unless they're perfectly collinear, they can always be orthogonalized, can't they? – Josh Nov 07 '17 at 22:00
  • Yes--but that's an extra operation you have to undertake and when you do it, the coefficients no longer have the same meaning from one draw to the next. – whuber Nov 07 '17 at 22:01
  • Hmm, I don't fully understand. Why do my assumptions make it unlikely for the columns of X to be mutually orthogonal? On another note, I think a more rigorous explanation here would show that the $1/n$ relationship tends to hold regardless of orthogonality of the predictors. But that's a little above my pay grade. I do not dispute the unaddressed subtleties of the problem. ;) – Josh Nov 07 '17 at 22:04
  • Because you have explicitly supposed they are random! If you have a particular joint distribution in mind that guarantees orthogonal columns (for any number of observations $n$), then it would be nice to see it. – whuber Nov 07 '17 at 22:06
  • Fair point. If they're random, then orthogonality can only be guaranteed by employing some operation on the columns, and if they're orthogonal, then it's almost impossible that they're truly random. Is that about right? Then again, I'm pretty sure there is some set of $p$ natural processes tending to exhibit mutual orthogonality. So while $n$ observations of those $p$ processes may not be *exactly* orthogonal, in general their expected correlations will be zero. – Josh Nov 07 '17 at 22:08
  • That's right. Because of that, in the short run the standard error of any particular coefficient may increase as you add new data (it's easy to produce examples), but *asymptotically* the SEs ought to vary like $n^{-1/2}$. – whuber Nov 07 '17 at 22:13
  • Thanks @whuber for expounding on this, as I am merely an ML practitioner with little knowledge of stats theory. – Josh Nov 07 '17 at 22:15
  • (Answering questions on this site is a great way to increase that knowledge!) – whuber Nov 07 '17 at 22:16
  • Hence my presence! – Josh Nov 07 '17 at 22:18
  • I know you've been in this community for some time, but your recent activity suggests a renewed interest--so welcome! – whuber Nov 07 '17 at 22:19
  • @whuber, it occurred to me that my explanation is valid only if each of the $n$ elements of $y$ and the columns of $X$ are themselves random variables, otherwise I don't think the assumption that $\text{Var}[X'y]=nσ^4$ would make any sense. Does this relate to the implicit assumption of a stochastic design matrix (like you mentioned in the comments on the OP)? I did a couple of quick searches for "stochastic design matrix" and didn't get a whole lot of information. I'd appreciate any resources you can recommend for clarification on this point. – Josh Nov 08 '17 at 11:02
  • "Stochastic design matrix" merely means to consider the data as a random sample from some joint distribution of $(X,y)$ rather than as a random sample of $y$ with some fixed or prespecified set of regressors $X$. – whuber Nov 08 '17 at 14:39