
This is a follow-up to a question from a few years ago: What are the consequences of "copying" a data set for OLS?. I've been confused by kjetil's statement about the variance for the past couple of months.

We know that for a generic $X \in \mathbb{R}^{n \times p}, Y \in \mathbb{R}^{n \times 1}$ and IID uncorrelated errors with variance $\sigma^2$, that $$ \operatorname{var}(Y) = \sigma^2 I_{n \times n} \\ \operatorname{var} \left(\hat{\beta}_{OLS} \right) = \sigma^2(X^T X)^{-1} $$
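As a sanity check on the generic formula, here is a small Monte Carlo sketch in numpy (the design matrix, coefficients, and $\sigma^2$ below are arbitrary choices for illustration, not anything from the linked question): the empirical covariance of $\hat{\beta}_{OLS}$ over repeated draws of $Y$ should match $\sigma^2(X^TX)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 50, 3, 4.0          # arbitrary choices for the illustration
X = rng.normal(size=(n, p))        # a generic fixed design matrix
beta = np.array([1.0, -2.0, 0.5])  # true coefficients

# Theoretical covariance of the OLS estimator: sigma^2 (X^T X)^{-1}
theory = sigma2 * np.linalg.inv(X.T @ X)

# Monte Carlo: refit OLS on many independent draws of Y = X beta + eps
betas = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    betas.append(np.linalg.lstsq(X, y, rcond=None)[0])
empirical = np.cov(np.array(betas), rowvar=False)

print(np.round(theory, 3))
print(np.round(empirical, 3))      # should agree up to Monte Carlo error
```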

Consider a specific dataset $X_s \in \mathbb{R}^{n \times p}, Y_s \in \mathbb{R}^{n \times 1}$ with $\operatorname{var}(Y_s) = \sigma_s^2 I_{n \times n}$. We have $$ \operatorname{var}(Y_s) = \begin{bmatrix} \sigma_s^2 & 0 & \ldots & 0 \\ 0 & \sigma_s^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_s^2 \end{bmatrix} = \sigma_s^2 I_{n \times n}, \\ \operatorname{var} \left( \hat{\beta}_{OLS} \right)= \sigma_s^2(X_s^T X_s)^{-1} $$

If $X_d = [X_s \ \ X_s]^T \in \mathbb{R}^{2n \times p}$ and $Y_d = [Y_s \ \ Y_s]^T \in \mathbb{R}^{2n \times 1}$ denote the dataset with copies, then I believe the variance matrix looks like the following, because the duplication means the errors are no longer IID (strictly, no longer independent, though still identically distributed by symmetry):

$$ \operatorname{var}(Y_d) = \sigma_s^2 \begin{bmatrix} I_{n\times n} & I_{n\times n} \\ I_{n\times n} & I_{n\times n} \\ \end{bmatrix} $$

which is no longer a diagonal matrix.
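A quick simulation sketch (arbitrary small $n$ and $\sigma_s^2$, purely for illustration) shows this block structure empirically: stacking the same draw of $Y_s$ twice and estimating the covariance over many repetitions gives $\sigma_s^2$ times the all-ones block pattern rather than a diagonal matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2_s = 4, 2.0                                    # small n so the covariance prints nicely

draws = []
for _ in range(50000):
    y_s = rng.normal(scale=np.sqrt(sigma2_s), size=n)   # var(Y_s) = sigma_s^2 I
    draws.append(np.concatenate([y_s, y_s]))            # Y_d = [Y_s, Y_s]
cov_Yd = np.cov(np.array(draws), rowvar=False)

# Expected: sigma_s^2 * [[I, I], [I, I]], which is not diagonal
I = np.eye(n)
expected = sigma2_s * np.block([[I, I], [I, I]])
print(np.round(cov_Yd, 2))
print(np.round(expected, 2))
```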

I am not seeing where the factor of 2 comes from in the accepted answer. The only way I can see it arising is if they plugged $X = [X_s \ \ X_s]^T$, $Y = [Y_s \ \ Y_s]^T$, and $\sigma^2 = \sigma_s^2$ into the generic variance formula above:

$$ \operatorname{var}(\hat{\beta}) = \sigma_s^2 \left([X_s^T \ \ X_s^T] [X_s \ \ X_s]^T \right)^{-1} \\ = \sigma_s^2(2X_s^T X_s )^{-1} = \frac{\sigma_s^2}{2}(X_s^T X_s )^{-1} $$
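A quick numeric check of that plug-in (with an arbitrary $X_s$, chosen only for illustration) confirms the algebra $(X_d^T X_d)^{-1} = \tfrac{1}{2}(X_s^T X_s)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3
X_s = rng.normal(size=(n, p))           # arbitrary specific design
X_d = np.vstack([X_s, X_s])             # duplicated design

inv_s = np.linalg.inv(X_s.T @ X_s)
inv_d = np.linalg.inv(X_d.T @ X_d)

# (X_d^T X_d)^{-1} = (2 X_s^T X_s)^{-1} = 0.5 * (X_s^T X_s)^{-1}
print(np.allclose(inv_d, 0.5 * inv_s))  # True
```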

If this is what was actually done, I don't understand why plugging in $\sigma^2 = \sigma_s^2$ is valid. Furthermore, $\operatorname{var}(\hat{\beta}) = \sigma^2(X^T X)^{-1}$ was derived under $\operatorname{var}(Y) = \sigma^2 I$, a diagonal matrix, but $\operatorname{var}(Y_d)$ is not diagonal.

If I rederive $\operatorname{var}(\hat{\beta}_d)$ from scratch, I see the following

\begin{align}
\operatorname{var} \left(\hat{\beta}_d \right) &= \operatorname{var} \left( \left([X_s^T \ \ X_s^T][X_s \ \ X_s]^T \right)^{-1} [X_s^T \ \ X_s^T] [Y_s \ \ Y_s]^T \right) \\
&= \operatorname{var} \left( \left(2X_s^TX_s \right)^{-1} 2X_s^TY_s \right) \\
&= \operatorname{var} \left( \left(X_s^TX_s \right)^{-1} X_s^TY_s \right) \\
&= \left(X_s^TX_s \right)^{-1} X_s^T \operatorname{var} \left( Y_s \right) X_s \left(X_s^TX_s \right)^{-1} \\
&= \left(X_s^TX_s \right)^{-1} X_s^T \, \sigma_s^2 I_{n \times n} \, X_s \left(X_s^TX_s \right)^{-1} \\
&= \sigma_s^2 \left(X_s^TX_s \right)^{-1}
\end{align}

which is the same as the case without duplicates. This doesn't make sense: all I used was linear algebra to arrive at the final expression, and nowhere was $\operatorname{var} \left( Y_d \right)$ used.

I think I'm missing something obvious, but I'm not sure what it is.


Edit: I think I see what is wrong with the derivation I just did. It doesn't make sense to plug in $X = [X_s \ \ X_s]^T$ before deriving the formula for the variance. Instead, I should derive it for a generic $X$, i.e.,

\begin{align}
\operatorname{var} \left(\hat{\beta} \right) &= \operatorname{var} \left( \left(X^T X \right)^{-1} X^T Y \right) \\
&= \left(X^T X \right)^{-1} X^T \operatorname{var} \left( Y \right) X \left(X^T X \right)^{-1}
\end{align}
If $\operatorname{var} \left( Y \right)$ were equal to $\sigma^2 I$, we could keep going and arrive at $\operatorname{var} \left(\hat{\beta} \right) = \sigma^2(X^TX)^{-1}$, but because $\operatorname{var}(Y_d)$ is not of that form, we have to stop here.

So for the variance of the estimator computed from the dataset with duplicates, we now plug in the values of $X$, $Y$, and $\operatorname{var}(Y)$, and we see

\begin{align}
\operatorname{var} \left(\hat{\beta}_d \right) &= \left(2X_s^TX_s \right)^{-1} [X_s^T \ \ X_s^T] \, \sigma_s^2 \begin{bmatrix} I_{n\times n} & I_{n\times n} \\ I_{n\times n} & I_{n\times n} \end{bmatrix} [X_s \ \ X_s]^T \left(2X_s^TX_s \right)^{-1} \\
&= \left(2X_s^TX_s \right)^{-1} \sigma_s^2 \, [2X_s^T \ \ 2X_s^T] [X_s \ \ X_s]^T \left(2X_s^TX_s \right)^{-1} \\
&= \left(2X_s^TX_s \right)^{-1} \sigma_s^2 \, 4X_s^TX_s \left(2X_s^TX_s \right)^{-1} \\
&= \sigma_s^2 (X_s^T X_s)^{-1}
\end{align}

...hmm, I arrive at exactly the same conclusion, and now I have accounted for the dependent errors, so I'm even more confused. I don't see why the variance of the estimator is halved for the duplicated dataset.
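For what it's worth, a small numpy sketch (the design $X_s$ and $\sigma_s^2$ below are arbitrary choices, not from the linked question) confirms both computations numerically: the sandwich with the dependent block covariance reproduces the original variance, while the naive formula $\sigma_s^2(X_d^T X_d)^{-1}$ halves it.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2_s = 30, 3, 2.0
X_s = rng.normal(size=(n, p))
X_d = np.vstack([X_s, X_s])

I = np.eye(n)
V_d = sigma2_s * np.block([[I, I], [I, I]])   # var(Y_d) with perfectly dependent copies

XtX_d_inv = np.linalg.inv(X_d.T @ X_d)
sandwich = XtX_d_inv @ X_d.T @ V_d @ X_d @ XtX_d_inv   # the derivation above
naive = sigma2_s * XtX_d_inv                           # treats the 2n rows as independent
original = sigma2_s * np.linalg.inv(X_s.T @ X_s)

print(np.allclose(sandwich, original))        # True: dependence accounted for, no gain
print(np.allclose(naive, 0.5 * original))     # True: the "factor of 2" in the accepted answer
```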

24n8
  • Does this answer your question? [What are the consequences of "copying" a data set for OLS?](https://stats.stackexchange.com/questions/216003/what-are-the-consequences-of-copying-a-data-set-for-ols) – kurtosis Aug 25 '20 at 00:16
    @kurtosis No, on the contrary, my question is regarding the accepted answer. I linked that post in the first sentence of my post. – 24n8 Aug 25 '20 at 00:33

1 Answer


The distinction is simply that the question you link to is asking about duplicating data but fitting an ordinary regression ("use OLS" - i.e. treating the new values as if they were a new set of values independent of the first), by which lights the variance indeed reduces.

If you treat them as perfectly dependent, as here, then conditionally on the existing data the new data adds no information, so the variance would not then reduce.
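To illustrate the distinction numerically, here is a minimal sketch in numpy (the data, coefficients, and the helper `ols_reported_cov` are illustrative assumptions, not anything from the question): refitting an ordinary regression on the duplicated rows, as standard OLS software would, roughly halves the reported covariance of $\hat{\beta}$ even though no new information has been added.

```python
import numpy as np

def ols_reported_cov(X, y):
    """Covariance of beta-hat as ordinary OLS software would report it,
    i.e. assuming the rows are independent: sigma_hat^2 (X^T X)^{-1}."""
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)
    return sigma2_hat * np.linalg.inv(X.T @ X)

rng = np.random.default_rng(4)
n, p = 100, 3
X_s = rng.normal(size=(n, p))
y_s = X_s @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

cov_orig = ols_reported_cov(X_s, y_s)
cov_dup = ols_reported_cov(np.vstack([X_s, X_s]), np.concatenate([y_s, y_s]))

# Duplicating rows roughly halves the reported covariance (exactly half as n grows),
# even though no new information was added.
print(np.round(np.diag(cov_dup) / np.diag(cov_orig), 3))
```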

Glen_b
  • So mathematically, the variance that was computed in the answer to the linked question was from the formula $var(\hat{\beta}) = \sigma^2 (X^{T}X)^{-1}$, which assumes independent errors/samples? – 24n8 Aug 25 '20 at 20:10
  • Yes, that's my understanding of both the question and answer at the link -- and usually what people intend when they talk about replicating rows of data one or more times. – Glen_b Aug 25 '20 at 23:40