
This question uses the derivations found here.

The short version

Consider a regression model. If the error variance is a known function of the data (rather than a constant), under what conditions can we draw conclusions about the OLS estimates?

The long version

Notation

Denote:

  • $X = \left[\matrix{ X_{11} & \dots & X_{1p} \\ \vdots & \ddots & \vdots \\ X_{n1} & \dots & X_{np} \\ }\right]$
  • $\beta = \left(\beta_1, \dots, \beta_p\right)$
  • $Y = \left(Y_1, \dots, Y_n\right)$
  • $\epsilon = \left(\epsilon_1, \dots, \epsilon_n\right)$

Assume:

  • $Y= X \beta + \epsilon$
  • $\operatorname{E}\left(\epsilon\,|\,X\right)=0$ so that $\operatorname{E}\left(Y\,|\,X\right) = X \beta$
  • $\operatorname{Var}\left(\epsilon\right)$ is diagonal.
  • $X$ is deterministic so we can drop the "$\left(\cdot\,|\,X\right)$".

Define:

  • $\hat{\beta}$: the OLS estimate of $\beta$ in the model $Y=X \beta + \epsilon$
  • $\tilde{\beta}$: an arbitrary competing linear unbiased estimate $\tilde{\beta} = A'Y$ (unbiasedness requires $A'X = I$)
  • $B = X \left(X'X\right)^{-1}$

Background

We derive $\operatorname{Var}\left(\hat{\beta}\right)$ by assuming that $\operatorname{E}\left(\epsilon\epsilon'\right) = \sigma^2 I$. Then we can conclude that: $$\begin{align} \operatorname{Var}\left(\hat{\beta}\right) &= \left(X'X\right)^{-1} X' \underbrace{\operatorname{E}\left(\epsilon\epsilon'\right)}_{=\sigma^2 I} X \left(X'X\right)^{-1} \\ &= \sigma^2 \left(X'X\right)^{-1} X' X \left(X'X\right)^{-1} \\ &= \sigma^2 \left(X'X\right)^{-1} \\ \end{align}$$
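For concreteness, here is a minimal Monte Carlo sketch of that identity, assuming numpy; the design matrix, $\beta$, and $\sigma$ below are arbitrary illustrative choices, not anything from the linked derivations.

```
# Check Var(beta_hat) = sigma^2 (X'X)^{-1} under homoskedastic errors by
# comparing the formula with the empirical covariance of the OLS estimates
# across many simulated datasets sharing the same (fixed) design X.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # fixed design
beta = np.array([1.0, -0.5, 2.0])

theory = sigma**2 * np.linalg.inv(X.T @ X)

reps = 5000
estimates = np.empty((reps, p))
for r in range(reps):
    y = X @ beta + sigma * rng.normal(size=n)
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.round(theory, 4))
print(np.round(np.cov(estimates, rowvar=False), 4))  # close to `theory`
```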

This in turn is used to show that $\hat{\beta}$ is efficient among linear unbiased estimators: $$\begin{align} \operatorname{Var}\left(\tilde{\beta}\right) - \operatorname{Var}\left(\hat{\beta}\right) &= \sigma^2 A'A - \sigma^2 \left(X'X\right)^{-1} \\ &= \sigma^2 A' M A \\ &\geq 0 \end{align}$$ where $M = I - X\left(X'X\right)^{-1} X'$ and the second step uses $A'X = I$.
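That step can also be checked numerically. The sketch below (again assuming numpy) constructs an arbitrary unbiased $A$ of the form $B + MD$, which satisfies $A'X = I$, and confirms both the identity $A'A - (X'X)^{-1} = A'MA$ and its positive semidefiniteness.

```
# Gauss-Markov step: for any linear unbiased estimator A'Y (i.e. A'X = I),
# A'A - (X'X)^{-1} equals A'MA with M = I - X(X'X)^{-1}X', which is PSD.
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
XtX_inv = np.linalg.inv(X.T @ X)
B = X @ XtX_inv
M = np.eye(n) - X @ XtX_inv @ X.T

# Any A of the form B + M D satisfies A'X = I, hence A'Y is unbiased.
D = rng.normal(size=(n, p))
A = B + M @ D

lhs = A.T @ A - XtX_inv
rhs = A.T @ M @ A
print(np.allclose(lhs, rhs))                      # True
print(np.linalg.eigvalsh(rhs).min() > -1e-10)     # PSD up to rounding
```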

The question

What if $\operatorname{Var}\left(\epsilon\right) = h\left(X\right)$ for a known function $h$?

This leaves us with $$ \operatorname{Var}\left(\hat{\beta}\right) = B' h\left(X\right) B $$ which is nice, but $$ \operatorname{Var}\left(\tilde{\beta}\right) - \operatorname{Var}\left(\hat{\beta}\right) = A' h\left(X\right) A - B' h\left(X\right) B $$ doesn't tell us anything.
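To see that the difference really can fail to be $\geq 0$, the sketch below (assuming numpy, with an arbitrary diagonal $h$) takes $\tilde{\beta}$ to be the inverse-variance weighted estimator; the resulting $\operatorname{Var}(\tilde{\beta}) - \operatorname{Var}(\hat{\beta})$ comes out negative semidefinite, i.e. OLS is no longer efficient in this setting.

```
# Under a non-constant diagonal h(X), another linear unbiased estimator can
# beat OLS, so Var(beta_tilde) - Var(beta_hat) need not be >= 0.
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
h = np.diag(np.exp(X[:, 1]))              # an arbitrary diagonal h(X) > 0

B = X @ np.linalg.inv(X.T @ X)            # as defined above
var_ols = B.T @ h @ B                     # Var(beta_hat) = B' h(X) B

# Competing unbiased estimator beta_tilde = A'Y with A'X = I,
# here the inverse-variance weighted (WLS) choice.
W = np.linalg.inv(h)
A = W @ X @ np.linalg.inv(X.T @ W @ X)
var_tilde = A.T @ h @ A                   # = (X'WX)^{-1}

diff = var_tilde - var_ols
print(np.linalg.eigvalsh(diff))           # all <= 0 (up to rounding)
```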

What conditions on $h$ will allow us to learn something about $\operatorname{Var}\left(\hat{\beta}\right)$ and $\operatorname{Var}\left(\tilde{\beta}\right) - \operatorname{Var}\left(\hat{\beta}\right)$? Or (as per AdamO's comment) about the relative efficiency?

For instance, this reduces to generalized least squares when $h(X) = X' \Omega X$. But I'm mainly still interested in the case (as per the assumptions at the beginning) where $h(X)$ is diagonal.

Similarly, consider $$ h\left(X\right) = \left[\matrix{f(X_{1\cdot}\,\beta) & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & f(X_{n\cdot}\,\beta) \\}\right] $$

where $X_{i\cdot}$ is the $i$-th row of $X$ (so $X_{i\cdot}\,\beta$ is the conditional mean of $Y_i$), and $f(z) = z$ (implied if $Y_i$ is Poisson) or $f(z) \propto z^2$ (implied if $Y_i$ is lognormal or gamma). This looks suspiciously like iteratively reweighted least squares.
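A rough numerical sketch of that connection, assuming numpy; the identity-link Poisson setup and all constants are illustrative choices.

```
# Iterate WLS with weights 1/f(fitted mean), where f is the variance function.
# With f(z) = z and Poisson responses this mirrors IRLS for an identity-link GLM.
import numpy as np

def irls_identity_link(X, y, f, n_iter=10):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # start from plain OLS
    for _ in range(n_iter):
        mu = X @ beta
        w = 1.0 / f(np.clip(mu, 1e-8, None))       # guard against mu <= 0
        XtW = X.T * w                              # X'W with W = diag(w)
        beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(1, 3, size=n)])
beta_true = np.array([2.0, 1.5])
y = rng.poisson(X @ beta_true).astype(float)       # Var(Y_i) = E(Y_i), so f(z) = z

print(irls_identity_link(X, y, f=lambda z: z))     # close to beta_true
```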

kjetil b halvorsen
shadowtalker
  • In $\operatorname{E}(\epsilon\epsilon')=\sigma^2$, is the left hand side a matrix and the right hand side a scalar? Also, in $\operatorname{E}(\epsilon\epsilon')=h(X)$, can you still say *observations are still iid*? – Richard Hardy Mar 27 '15 at 18:48
  • @RichardHardy whoops, good point. Sloppy notation, I'll edit. – shadowtalker Mar 27 '15 at 18:54
  • I wonder what you expect to learn about $\operatorname{Var}\left(\hat{\beta}\right)$ and $\operatorname{Var}\left(\tilde{\beta}\right) - \operatorname{Var}\left(\hat{\beta}\right)$? That the latter is negative semidefinite or the like? Also, I am still not comfortable with *iid* in *the iid case ... where $h(X)$ is diagonal*. But perhaps I am just not getting it. – Richard Hardy Mar 27 '15 at 19:43
  • Do you know that $\mbox{Cov}(Y_i, Y_j) = 0$ for $i \neq j$? – AdamO Mar 27 '15 at 19:45
  • @RichardHardy that would be worth knowing. Can we derive bounds on it? Does it depend on $X$? The motivation for this question was originally "what happens if we fit a GLM with OLS?" but it took on a life of its own once I started writing out equations. – shadowtalker Mar 27 '15 at 19:46
  • @AdamO yes, I made substantial edits for accuracy and added that as one of the assumptions. – shadowtalker Mar 27 '15 at 19:48
  • @ssdecontrol okay. usually you compare the variance of two estimators with their relative efficiency, i.e. a ratio, which should turn out a lot nicer. – AdamO Mar 27 '15 at 19:51
  • @AdamO okay, I was just following the format of the page I linked at the top – shadowtalker Mar 27 '15 at 19:53
  • You are writing that "the observations are i.i.d.". Do you mean that the unconditional variance of the error term is common and constant, and it is only conditionally heteroskedastic, or you meant to write that the observations are _only_ "ind.d.", i.e. independently but not identically distributed, in which case you assume that the _unconditional_ variance also differs per observation? The way you treat the estimator variance indicates that what you have in mind is conditional heteroskedasticity, although not explicitly stated. – Alecos Papadopoulos Mar 27 '15 at 19:57
  • @AlecosPapadopoulos I mean that the $Y_i$'s have the same parametric form but different parameters. I guess that's more properly called "conditional heteroskedasticity" but that isn't the perspective I originally had. However looking at AdamO's answer it seems that's _precisely_ the right way to think of this setup. – shadowtalker Mar 27 '15 at 20:02
  • Statistical inference depends crucially on the assumptions made regarding the stochastic, probabilistic framework of a model. So, ok, "conditional heteroskedasticity" then. – Alecos Papadopoulos Mar 27 '15 at 20:09
  • What is the purpose of formulating the condition as $\operatorname{Var}(\epsilon)=h(X)$ rather than the simpler but apparently equivalent condition $\operatorname{Var}(\epsilon)=\Sigma$ (for a given matrix $\Sigma$, evidently computed as $h(X)$)? The latter indicates you're doing generalized least squares. Is there some aspect of your situation you're trying to capture that hasn't made it into your question? – whuber Oct 08 '17 at 18:52
  • @whuber $h$ is known ex-ante. – shadowtalker Oct 08 '17 at 23:44
  • @ssdecontrol Does that imply anything other than a different notation? May I ask again how, if at all, this might differ from GLS? – whuber Oct 09 '17 at 13:59
  • @whuber I suppose it doesn't. Good catch. – shadowtalker Oct 11 '17 at 06:29

1 Answer


It's an easy derivation to show that the least squares estimator:

$$ \hat{\beta} = \left( \mathbf{X}^T\mathbf{X} \right)^{-1} \mathbf{X}^T Y $$

has variance:

$$ \mbox{var} \left(\hat{\beta} \right)= \left( \mathbf{X}^T\mathbf{X} \right)^{-1} \mathbf{X}^T \mbox{var} \left(Y\right)\mathbf{X} \left( \mathbf{X}^T\mathbf{X} \right)^{-1} $$

If $h(X)$ is known, then the inverse-variance weighted least squares estimator $\hat{\beta}_{wls} = (X^T W X)^{-1} X^T W Y$, with $W = \operatorname{diag}\left(h(X)\right)^{-1}$, is unbiased and efficient.

The variance of the WLS estimator becomes:

$$ \mbox{var} (\hat{\beta}_{wls}) = (X^T W X)^{-1}$$
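A quick numerical check of that collapse, assuming numpy and an arbitrary known diagonal $h(X)$: plugging $W = h(X)^{-1}$ into the sandwich formula gives back $(X^T W X)^{-1}$.

```
# Plugging W = h(X)^{-1} into the sandwich variance collapses it to (X'WX)^{-1}.
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
h = np.diag(rng.uniform(0.5, 4.0, size=n))    # arbitrary known diagonal h(X)
W = np.linalg.inv(h)

XtWX_inv = np.linalg.inv(X.T @ W @ X)
sandwich = XtWX_inv @ X.T @ W @ h @ W @ X @ XtWX_inv

print(np.allclose(sandwich, XtWX_inv))        # True
```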

It's easy to show that if the mean model is correctly specified, the unweighted OLS estimator is NOT BIASED. It's NOT BIASED. It's NOT BIASED. That always bears repeating, because many people don't understand it: weighting here only buys you better efficiency.
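A minimal simulation sketch of that point, assuming numpy (the heteroskedasticity pattern is an arbitrary choice): the average of the unweighted OLS estimates across replications sits on top of the true $\beta$ even though the error variance changes sharply with $X$.

```
# OLS stays unbiased under heteroskedasticity when the mean model is correct.
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
sd = np.exp(X[:, 1])                      # error sd varies strongly with X

est = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + sd * rng.normal(size=n)
    est[r] = np.linalg.lstsq(X, y, rcond=None)[0]

print(est.mean(axis=0))                   # approximately [1.0, 2.0]
```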

How much better?

The relative efficiency of the two estimators is not too hard to work out, but WLS is uniformly better. Seber and Lee have more details if you're interested.
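For a concrete sense of "how much better", here is a sketch (assuming numpy, with an arbitrary heteroskedasticity pattern) of the per-coefficient relative efficiency: the ratio of the OLS sandwich variance to the WLS variance, which is at least 1 on every coordinate.

```
# Per-coefficient relative efficiency: diag of the OLS sandwich variance over
# diag of the WLS variance; each ratio is >= 1 when h(X) is not constant.
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
h = np.diag(np.exp(1.5 * X[:, 1]))          # arbitrary heteroskedasticity
W = np.linalg.inv(h)

XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ h @ X @ XtX_inv   # sandwich variance of OLS
var_wls = np.linalg.inv(X.T @ W @ X)

print(np.diag(var_ols) / np.diag(var_wls))  # each entry >= 1
```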

AdamO
  • Nothing like reinventing the wheel! I never quite understood WLS before now. Also, thanks for the book reference. – shadowtalker Mar 27 '15 at 20:05
  • @ssdecontrol, was this essentially what you were looking for? – Richard Hardy Mar 27 '15 at 20:06
  • @RichardHardy yes – shadowtalker Mar 27 '15 at 20:32
  • +1 Well, there's at least a little something more than merely better efficiency; if $h()$ is known, those more efficient weighted estimates also have correct small-sample conditional CIs and PIs and so on. – Glen_b Mar 28 '15 at 03:57
  • @Glen_b I did neglect to mention that, and that's entirely correct. Those conclusions are implied when we state that both OLS and WLS are unbiased estimators (when the mean model is correctly specified), but the WLS is more efficient. – AdamO Mar 30 '15 at 19:41
  • Thanks AdamO. Is that small-sample correctness of intervals implied by unbiasedness and greater efficiency? I'm not sure that's quite enough to get us there. – Glen_b Mar 30 '15 at 20:42
  • @Glen_b You're right: we need the added assumption that the conditional $Y$s are normally distributed. I think it's much more compelling to in fact say that, regardless of the distribution of $Y$s the *coverage* of the nominal 95% CIs for WLS is better than those of the OLS, since heteroscedasticity may lead to interval estimates that may be conservative or anticonservative, depending on how the error-trend is constructed. – AdamO Mar 30 '15 at 22:24