
After a year in grad school, my understanding of "weighted least squares" is the following: let $\mathbf{y} \in \mathbb{R}^n$, let $\mathbf{X}$ be an $n \times p$ design matrix, let $\boldsymbol\beta \in \mathbb{R}^p$ be a parameter vector, and let $\boldsymbol\epsilon \in \mathbb{R}^n$ be an error vector such that $\boldsymbol\epsilon \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{V})$, where $\mathbf{V} = \text{diag}(v_1, v_2, \dots, v_n)$ and $\sigma^2 > 0$. Then the model $$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon$$ under these assumptions is called the "weighted least squares" model. The WLS problem is to find $$\begin{equation} \arg\min_{\boldsymbol \beta}\left(\mathbf{y}-\mathbf{X}\boldsymbol\beta\right)^{T}\mathbf{V}^{-1}\left(\mathbf{y}-\mathbf{X}\boldsymbol\beta\right)\text{.} \end{equation}$$ Suppose $\mathbf{y} = \begin{bmatrix} y_1 & \dots & y_n\end{bmatrix}^{T}$, $\boldsymbol\beta = \begin{bmatrix} \beta_1 & \dots & \beta_p\end{bmatrix}^{T}$, and $$\mathbf{X} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ x_{21} & \cdots & x_{2p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_{1}^{T} \\ \mathbf{x}_{2}^{T} \\ \vdots \\ \mathbf{x}_{n}^{T} \end{bmatrix}\text{.}$$ Then, observe that $\mathbf{x}_i^{T}\boldsymbol\beta\in \mathbb{R}^1$, so $$\mathbf{y}-\mathbf{X}\boldsymbol\beta = \begin{bmatrix} y_1-\mathbf{x}_{1}^{T}\boldsymbol\beta \\ y_2-\mathbf{x}_{2}^{T}\boldsymbol\beta \\ \vdots \\ y_n-\mathbf{x}_{n}^{T}\boldsymbol\beta \end{bmatrix}\text{.}$$ This gives $$\begin{align} (\mathbf{y}-\mathbf{X}\boldsymbol\beta)^{T}\mathbf{V}^{-1} &= \begin{bmatrix} y_1-\mathbf{x}_{1}^{T}\boldsymbol\beta & y_2-\mathbf{x}_{2}^{T}\boldsymbol\beta & \cdots & y_n-\mathbf{x}_{n}^{T}\boldsymbol\beta \end{bmatrix}\text{diag}(v_1^{-1}, v_2^{-1}, \dots, v_n^{-1}) \\ &= \begin{bmatrix} v_1^{-1}(y_1-\mathbf{x}_{1}^{T}\boldsymbol\beta) & v_2^{-1}(y_2-\mathbf{x}_{2}^{T}\boldsymbol\beta) & \cdots & v_n^{-1}(y_n-\mathbf{x}_{n}^{T}\boldsymbol\beta) \end{bmatrix}, \end{align}$$ thus giving $$\arg\min_{\boldsymbol \beta}\left(\mathbf{y}-\mathbf{X}\boldsymbol\beta\right)^{T}\mathbf{V}^{-1}\left(\mathbf{y}-\mathbf{X}\boldsymbol\beta\right) = \arg\min_{\boldsymbol \beta}\sum_{i=1}^{n}v_i^{-1}(y_i-\mathbf{x}^{T}_i\boldsymbol\beta)^2\text{.}$$ $\boldsymbol\beta$ is estimated using $$\hat{\boldsymbol\beta} = (\mathbf{X}^{T}\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{V}^{-1}\mathbf{y}\text{.}$$ This is the extent of what I am familiar with. I was never taught how $v_1, v_2, \dots, v_n$ should be chosen, although it seems, judging by here, that usually $\text{Var}(\boldsymbol\epsilon) = \text{diag}(\sigma^2_1, \sigma^2_2, \dots, \sigma^2_n)$, which makes intuitive sense: give observations with high variance less weight in the WLS problem, and give observations with less variability more weight.
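
To make this concrete, here is a small R sketch (with made-up variances $v_i$, purely for illustration) that computes $\hat{\boldsymbol\beta} = (\mathbf{X}^{T}\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{V}^{-1}\mathbf{y}$ directly from the formula above:

# Hypothetical simulated data; v holds the (assumed known) error variances
set.seed(1)
n <- 100
v <- runif(n, 0.5, 5)                      # error variances v_1, ..., v_n
X <- cbind(1, rnorm(n))                    # design matrix with intercept
beta <- c(2, -1)                           # true parameter vector
y <- X %*% beta + rnorm(n, sd = sqrt(v))   # epsilon_i ~ N(0, v_i), sigma^2 = 1

V_inv <- diag(1 / v)
solve(t(X) %*% V_inv %*% X, t(X) %*% V_inv %*% y)   # estimate should be near c(2, -1)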

What I am particularly curious about is how R handles weights in the lm() function when the weights are integers. From ?lm:

Non-NULL weights can be used to indicate that different observations have different variances (with the values in weights being inversely proportional to the variances); or equivalently, when the elements of weights are positive integers $w_i$, that each response $y_i$ is the mean of $w_i$ unit-weight observations (including the case that there are $w_i$ observations equal to $y_i$ and the data have been summarized).

I've re-read this paragraph several times, and it makes no sense to me. Using the framework that I developed above, suppose I have the following simulated values:

x <- c(0, 1, 2)
y <- c(0.25, 0.75, 0.85)
weights <- c(50, 85, 75)

lm(y~x, weights = weights)

Call:
lm(formula = y ~ x, weights = weights)

Coefficients:
(Intercept)            x  
     0.3495       0.2834  

Using the framework I've developed above, how are these parameters derived? Here's my attempt at doing this by hand: assuming $\mathbf{V} = \text{diag}(50, 85, 75)$, we have $$\begin{align}&\begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix} = \\ &\left(\begin{bmatrix} 1 & 1\\ 1 & 1\\ 1 & 1 \end{bmatrix}^{T}\text{diag}(1/50, 1/85, 1/75)\begin{bmatrix} 1 & 1\\ 1 & 1\\ 1 & 1 \end{bmatrix} \right)^{-1}\begin{bmatrix} 1 & 1\\ 1 & 1\\ 1 & 1 \end{bmatrix}^{T}\text{diag}(1/50, 1/85, 1/75)\begin{bmatrix} 0.25 \\ 0.75 \\ 0.85 \end{bmatrix} \end{align}$$ and doing this in R gives (note that $\mathbf{X}^{T}\mathbf{V}^{-1}\mathbf{X}$ is not invertible in this case, so I used a generalized inverse):

X <- matrix(rep(1, times = 6), byrow = T, nrow = 3, ncol = 2)
V_inv <- diag(c(1/50, 1/85, 1/75))
y <- c(0.25, 0.75, 0.85)

library(MASS)
ginv(t(X) %*% V_inv %*% X) %*% t(X) %*% V_inv %*% y

         [,1]
[1,] 0.278913
[2,] 0.278913

These don't match the values from the lm() output. What am I doing wrong?

Clarinetist

2 Answers


The matrix $X$ should be $$ \begin{bmatrix} 1 & 0\\ 1 & 1\\ 1 & 2 \end{bmatrix}, $$ not $$ \begin{bmatrix} 1 & 1\\ 1 & 1\\ 1 & 1 \end{bmatrix}. $$ Also, your V_inv should be diag(weights), not diag(1/weights).

x <- c(0, 1, 2)
y <- c(0.25, 0.75, 0.85)
weights <- c(50, 85, 75)
X <- cbind(1, x)

> solve(t(X) %*% diag(weights) %*% X, t(X) %*% diag(weights) %*% y)
       [,1]
  0.3495122
x 0.2834146
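
For comparison (a quick sanity check rather than anything new), these are the same coefficients that lm() returns with these weights, just printed to more digits:

> coef(lm(y ~ x, weights = weights))
(Intercept)           x 
  0.3495122   0.2834146 
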
mark999
  • Thank you for clearing up the incorrect design matrix, especially! I'm quite rusty on this material. So, as one last question, does this mean that $\text{Var}(\boldsymbol\epsilon) = \text{diag}(1/\text{weights})$ in the WLS assumptions? – Clarinetist Nov 20 '16 at 02:14
  • Yes, although the weights only have to be proportional to 1/variance, not necessarily equal. For example, if you use `weights – mark999 Nov 20 '16 at 02:21

To answer this more concisely: weighted least squares regression using weights in R makes the following assumptions. Suppose we have weights = c(w_1, w_2, ..., w_n). Let $\mathbf{y} \in \mathbb{R}^n$, let $\mathbf{X}$ be an $n \times p$ design matrix, let $\boldsymbol\beta\in\mathbb{R}^p$ be a parameter vector, and let $\boldsymbol\epsilon \in \mathbb{R}^n$ be an error vector with mean $\mathbf{0}$ and variance matrix $\sigma^2\mathbf{V}$, where $\sigma^2 > 0$. Then $$\mathbf{V} = \text{diag}(1/w_1, 1/w_2, \dots, 1/w_n)\text{.}$$ Following the same steps as the derivation in the original post, we have $$\begin{align} \arg\min_{\boldsymbol \beta}\left(\mathbf{y}-\mathbf{X}\boldsymbol\beta\right)^{T}\mathbf{V}^{-1}\left(\mathbf{y}-\mathbf{X}\boldsymbol\beta\right)&= \arg\min_{\boldsymbol \beta}\sum_{i=1}^{n}(1/w_i)^{-1}(y_i-\mathbf{x}^{T}_i\boldsymbol\beta)^2 \\ &= \arg\min_{\boldsymbol \beta}\sum_{i=1}^{n}w_i(y_i-\mathbf{x}^{T}_i\boldsymbol\beta)^2, \end{align}$$ and $\boldsymbol\beta$ is estimated using $$\hat{\boldsymbol\beta} = (\mathbf{X}^{T}\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{V}^{-1}\mathbf{y}$$ from the GLS assumptions.
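
A small sketch re-using the toy data from the question, illustrating the point from the comments above that the weights only need to be proportional to $1/\text{variance}$: rescaling the weights by a positive constant rescales $\mathbf{V}$, but the constant cancels in $(\mathbf{X}^{T}\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{V}^{-1}\mathbf{y}$, so $\hat{\boldsymbol\beta}$ is unchanged.

x <- c(0, 1, 2)
y <- c(0.25, 0.75, 0.85)
w <- c(50, 85, 75)

# Identical coefficient estimates: only the relative sizes of the weights matter
coef(lm(y ~ x, weights = w))
coef(lm(y ~ x, weights = w / 100))
coef(lm(y ~ x, weights = 10 * w))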

Clarinetist