
Suppose I have a sample $\{X_i,Y_i\}_{i=1}^n$. Then the OLS estimator of the slope coefficient in the simple linear regression of $Y$ on $X$ is given by $$\hat{\beta}=\frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)}.$$

Now suppose I take my data set and replicate a subset of it, so that there are $m_i$ copies of each pair $(X_i,Y_i)$. How does this affect the OLS estimate? I suspect the result is a weighted OLS estimator, but I can't prove it.
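For concreteness, here is a small simulated example of this setup (the data and the replication counts $m_i$ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# OLS slope on the original sample: Cov(X, Y) / Var(X)
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)

# replicate a subset: m_i copies of each (x_i, y_i) pair
m = np.array([3, 1, 1, 2, 1, 1, 5, 1, 1, 1])
x_dup, y_dup = np.repeat(x, m), np.repeat(y, m)
slope_dup = np.cov(x_dup, y_dup, bias=True)[0, 1] / np.var(x_dup)

print(slope, slope_dup)  # the two slopes generally differ
```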

user27808
  • It differs from a weighted estimator because it is wildly optimistic about the precision of its estimates. At bottom, the problem is that you are feeding strongly correlated data to a procedure that is justified by an assumption that the errors are completely independent. – whuber Mar 25 '21 at 14:17
  • Thanks @whuber for your response. I see your point about the precision of the estimator, but what about the parameter estimator? Is it a weighted OLS estimator on the original sample? – user27808 Mar 25 '21 at 14:23

1 Answer


Suppose the original design matrix and response vector are \begin{align*} X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times p}, \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n, \end{align*} respectively. Then the "duplicated" design matrix and response vector, according to your description, are \begin{align*} X^\dagger = \begin{pmatrix} e_1 \otimes x_1^T \\ e_2 \otimes x_2^T \\ \vdots \\ e_n \otimes x_n^T \end{pmatrix} \in \mathbb{R}^{(m_1 + \cdots + m_n) \times p}, \quad y^\dagger = \begin{pmatrix} e_1 \otimes y_1 \\ e_2 \otimes y_2 \\ \vdots \\ e_n \otimes y_n \end{pmatrix} \in \mathbb{R}^{m_1 + \cdots + m_n}, \end{align*} where $e_i$ is the $m_i \times 1$ column vector of all ones, $i = 1, \ldots, n$, and "$\otimes$" denotes the Kronecker product.
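In NumPy this construction is just a row-wise `np.repeat` (a minimal sketch; the shapes and counts below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2
X = rng.normal(size=(n, p))    # original design matrix, n x p
y = rng.normal(size=n)         # original response vector
m = np.array([2, 1, 3, 1, 2])  # replication counts m_1, ..., m_n

# Repeating row i of X (and entry i of y) m_i times stacks exactly
# the blocks e_i ⊗ x_i^T (and e_i ⊗ y_i) from the display above.
X_dag = np.repeat(X, m, axis=0)  # shape (m_1 + ... + m_n) x p
y_dag = np.repeat(y, m)          # length m_1 + ... + m_n
```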

Given that, you can apply the OLS formula to calculate the new OLS estimate based on the duplicated data $\{X^\dagger, y^\dagger\}$ (by the way, what you wrote in your question is not the general sample-level OLS estimate; the correct one is $\hat{\beta} = (X^TX)^{-1}X^Ty$): \begin{align*} \tilde{\beta} &= (X^{\dagger T}X^\dagger)^{-1}X^{\dagger T}y^\dagger \\ &= (m_1x_1x_1^T + \cdots + m_nx_nx_n^T)^{-1}(m_1y_1x_1 + \cdots + m_ny_nx_n) \\ &= (X^TWX)^{-1}X^TWy, \end{align*} where $W$ is the diagonal matrix $\mathrm{diag}(m_1, \ldots, m_n)$. The calculation above uses two properties of the Kronecker product: \begin{align*} & (A \otimes B)^T = A^T \otimes B^T, \\ & (A \otimes B)(C \otimes D) = (AC)\otimes(BD). \end{align*}
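A quick numerical check of this identity (a self-contained sketch; the data and replication counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
m = np.array([2, 1, 3, 1, 2])

# OLS on the duplicated data (X^dagger, y^dagger)
X_dag = np.repeat(X, m, axis=0)
y_dag = np.repeat(y, m)
beta_dup = np.linalg.lstsq(X_dag, y_dag, rcond=None)[0]

# weighted LS on the original data with W = diag(m_1, ..., m_n)
W = np.diag(m.astype(float))
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(np.allclose(beta_dup, beta_wls))  # True
```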

So your conjecture is correct: $\tilde{\beta}$ does have a weighted LS form (though we probably shouldn't say "weighted OLS", because "weighted" implies it is no longer "ordinary").

Zhanxiong
  • This answer appears to obscure the basic simplicity of the result. This regression is not "weighted" except in the trivial sense that *the weights are all equal.* Replicating the data by a factor of $m,$ say, merely multiplies $X^\prime X$ by $m$ and multiplies $X^\prime y$ by the same factor, so the OLS solution $(X^\prime X)^{-1}X^\prime y$ is unchanged (see the numerical check after these comments). – whuber Mar 25 '21 at 14:39
  • @whuber The OP does mention that each sample point may be duplicated a different number of times, $m_i$. If every sample point receives the same duplication factor $m$, then yes, the OLS estimate doesn't change, which can actually be viewed as a special case of the above answer. – Zhanxiong Mar 25 '21 at 14:40
  • I appreciate your clarification. It was unclear to me how to interpret the (ungrammatical) phrase "$m_i$ copies for each $(X_i,Y_i)$ pairs," but your interpretation looks like a good one (+1). – whuber Mar 25 '21 at 14:47
  • The case with equal weighting: https://stats.stackexchange.com/questions/216003/what-are-the-consequences-of-copying-a-data-set-for-ols?rq=1 – kjetil b halvorsen Mar 25 '21 at 16:27
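
A minimal numerical check of whuber's equal-replication point above (arbitrary simulated data; replicating every row the same number of times leaves the OLS estimate unchanged):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 6, 2, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]

# stack the whole dataset m times: X'X and X'y both scale by m,
# so the normal-equations solution is unchanged
X_rep = np.tile(X, (m, 1))
y_rep = np.tile(y, m)
beta_rep = np.linalg.lstsq(X_rep, y_rep, rcond=None)[0]

print(np.allclose(beta, beta_rep))  # True
```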