
Suppose I have a sample $\{X_i,Y_i\}_{i=1}^n$. Then the OLS estimator of the slope coefficient in the simple linear regression of $Y$ on $X$ is given by $$\hat{\beta}=\frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}.$$

Now suppose I take my data set and replicate a subset of it, so that there are $m_i$ copies of each pair $(X_i,Y_i)$. How does this affect the OLS estimate? I suspect the result is a weighted OLS estimator, but I can't prove it.
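For concreteness, here is a small simulated example of this setup (the data and the replication counts $m_i$ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# OLS slope on the original sample: Cov(X, Y) / Var(X)
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)

# replicate a subset: m_i copies of each (x_i, y_i) pair
m = np.array([3, 1, 1, 2, 1, 1, 5, 1, 1, 1])
x_dup, y_dup = np.repeat(x, m), np.repeat(y, m)
slope_dup = np.cov(x_dup, y_dup, bias=True)[0, 1] / np.var(x_dup)

print(slope, slope_dup)  # the two slopes generally differ
```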

user27808
  • It differs from a weighted estimator because it is wildly optimistic about the precision of its estimates. At bottom, the problem is that you are feeding strongly correlated data to a procedure that is justified by an assumption that the errors are completely independent. – whuber Mar 25 '21 at 14:17
  • Thanks @whuber for your response. I see your point about the precision of the estimator, but what about the parameter estimator? Is it a weighted OLS estimator on the original sample? – user27808 Mar 25 '21 at 14:23

1 Answer


Suppose the original design matrix and response vector are \begin{align*} X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times p}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n, \end{align*} respectively. Then the "duplicated" design matrix and response vector, according to your description, are \begin{align*} X^\dagger = \begin{pmatrix} e_1 \otimes x_1^T \\ e_2 \otimes x_2^T \\ \vdots \\ e_n \otimes x_n^T \end{pmatrix} \in \mathbb{R}^{(m_1 + \cdots + m_n) \times p}, \quad y^\dagger = \begin{pmatrix} e_1 \otimes y_1 \\ e_2 \otimes y_2 \\ \vdots \\ e_n \otimes y_n \end{pmatrix} \in \mathbb{R}^{m_1 + \cdots + m_n}, \end{align*} where $e_i$ is the $m_i \times 1$ column vector of all ones, $i = 1, \ldots, n$, and "$\otimes$" denotes the Kronecker product.
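In NumPy this construction is just a row-wise `np.repeat` (a minimal sketch; the shapes and counts below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2
X = rng.normal(size=(n, p))    # original design matrix, n x p
y = rng.normal(size=n)         # original response vector
m = np.array([2, 1, 3, 1, 2])  # replication counts m_1, ..., m_n

# Repeating row i of X (and entry i of y) m_i times stacks exactly
# the blocks e_i ⊗ x_i^T (and e_i ⊗ y_i) from the display above.
X_dag = np.repeat(X, m, axis=0)  # shape (m_1 + ... + m_n) x p
y_dag = np.repeat(y, m)          # length m_1 + ... + m_n
```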

Given that, you can apply the OLS formula to calculate the new OLS estimate based on the duplicated data $\{X^\dagger, y^\dagger\}$ (by the way, what you wrote in your question is not the general sample-level OLS estimate; the correct one is $\hat{\beta} = (X^TX)^{-1}X^Ty$): \begin{align*} \tilde{\beta} &= (X^{\dagger T}X^\dagger)^{-1}X^{\dagger T}y^\dagger \\ &= (m_1x_1x_1^T + \cdots + m_nx_nx_n^T)^{-1}(m_1y_1x_1 + \cdots + m_ny_nx_n) \\ &= (X^TWX)^{-1}X^TWy, \end{align*} where $W$ is the diagonal matrix $\mathrm{diag}(m_1, \ldots, m_n)$. The calculation above uses two properties of the Kronecker product: \begin{align*} & (A \otimes B)^T = A^T \otimes B^T, \\ & (A \otimes B)(C \otimes D) = (AC)\otimes(BD). \end{align*}
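A quick numerical check of this identity (a self-contained sketch; the data and replication counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
m = np.array([2, 1, 3, 1, 2])

# OLS on the duplicated data (X^dagger, y^dagger)
X_dag = np.repeat(X, m, axis=0)
y_dag = np.repeat(y, m)
beta_dup = np.linalg.lstsq(X_dag, y_dag, rcond=None)[0]

# weighted LS on the original data with W = diag(m_1, ..., m_n)
W = np.diag(m.astype(float))
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(np.allclose(beta_dup, beta_wls))  # True
```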

So your conjecture is correct: $\tilde{\beta}$ does have a weighted LS form (though we probably shouldn't say "weighted OLS", because "weighted" implies it is no longer "ordinary").

Zhanxiong
  • This answer appears to obscure the basic simplicity of the result. This regression is not "weighted" except in the trivial sense that *the weights are all equal.* Replicating the data by a factor of $m,$ say, merely multiplies $X^\prime X$ by $m$ and multiplies $X^\prime y$ by the same factor, so the OLS solution $(X^\prime X)^{-1}X^\prime y$ is unchanged (see the numerical check after these comments). – whuber Mar 25 '21 at 14:39
  • @whuber The OP does mention that each sample point may be duplicated a different number of times, $m_i$. If every sample point receives the same duplication factor $m$, then yes, the OLS estimate doesn't change, which can actually be viewed as a special case of the above answer. – Zhanxiong Mar 25 '21 at 14:40
  • I appreciate your clarification. It was unclear to me how to interpret the (ungrammatical) phrase "$m_i$ copies for each $(X_i,Y_i)$ pairs," but your interpretation looks like a good one (+1). – whuber Mar 25 '21 at 14:47
  • The case with equal weighting: https://stats.stackexchange.com/questions/216003/what-are-the-consequences-of-copying-a-data-set-for-ols?rq=1 – kjetil b halvorsen Mar 25 '21 at 16:27
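
A minimal numerical check of whuber's equal-replication point above (arbitrary simulated data; replicating every row the same number of times leaves the OLS estimate unchanged):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 6, 2, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]

# stack the whole dataset m times: X'X and X'y both scale by m,
# so the normal-equations solution is unchanged
X_rep = np.tile(X, (m, 1))
y_rep = np.tile(y, m)
beta_rep = np.linalg.lstsq(X_rep, y_rep, rcond=None)[0]

print(np.allclose(beta, beta_rep))  # True
```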