5

I'm taking a class on survey sampling and I have some problem understanding the R implementation of simple random sampling (SRS). Please look at this piece of code:

library(survey)
data(api)

N <- nrow(apipop)
srs_design <- svydesign(id=~1, fpc=~fpc, data=apisrs)
a <- svymean(~api00+api99+I(api00^2)+I(api99^2), srs_design)

a
#         mean     SE
# api00 656.59 9.2497
# api99 624.68 9.5003

svycontrast(a, quote(api00 - api99))
#          nlcon     SE
# contrast  31.9 2.0905

I have no idea how the standard error of the estimation of api00 - api99 is computed. In other words, I don't know which formula svycontrast(a, quote(api00 - api99)) is estimating, $\hat{\overline{api00}} - \hat{\overline{api99}}$ or $\hat{\overline{api00 - api99}}$?


Here is my effort in answering this question:

After some searching, I came up with an idea: svycontrast constructes a new random variable before computing the value. Say, there are two random variables in interest, $X$ and $Y$. To estimate $X - Y$, it'll construct a new random variable by letting $Z = X - Y$ and $z_i = x_i - y_i$ for each observation, so that it can estimate $Z$ by averaging $z_i$, which is a SRS sample of $Z$.

But if I'm right, why is the following expression giving different results? Averaging $z_i = x_i^2 - y_i^2$ should give me the same estimation, but they are different.

svycontrast(a, quote(api00^2 - api99^2))
#          nlcon     SE
# contrast 40872 2672.7

svycontrast(a, quote(`I(api00^2)` - `I(api99^2)`))
#          nlcon     SE
# contrast 39906 2589.1

In this case, which formula is being estimated?

  • $\hat{\overline{api00}^2} - \hat{\overline{api99}^2}$, or
  • $\hat{\overline{api00^2}} - \hat{\overline{api99^2}}$, or
  • $\hat{\overline{api00^2 - api99^2}}$, or
  • something else?

To sum up, my question is: how does svycontrast(stat, contrasts, ...) perform estimations on contrasts, and which formula is it estimating?

nalzok
  • 1,385
  • 12
  • 24

1 Answers1

4

svycontrast computes "linear or nonlinear contrasts of estimates produced by survey functions (or any object with coef and vcov methods)."

That is, it takes the estimates that it is given and computes functions of them. It does not do anything with the individual data -- it does not even see the individual data (in general).

When you do

svycontrast(a, quote(api00^2 - api99^2))

you are asking for the difference between the estimate named api00, squared, and the estimate named api99, also squared, and you get (with one more digit than is printed)

> 656.585^2 - 624.685^2
[1] 40872.51

This is the difference in the squares of the means of the estimates.

If you do

svycontrast(a, quote(`I(api00^2)` - `I(api99^2)`))

you are asking for a linear contrast: the estimate named I(api00^2) minus the estimate named I(api99^2). The answer is

> 448697.875-408791.545
[1] 39906.33

Because this is a linear contrast, you could also get it with

> svycontrast(a, c(0,0,1,-1))
         contrast     SE
contrast    39906 2589.1

The tricky part is standard errors. The standard errors are computed by the delta method. That is, if the covariance matrix of the input estimates is $V$, and the input estimate vector is $\alpha$ and the output estimate vector is $\gamma$, the estimated variance of $\hat\gamma$ is $$\widehat{\mathrm{var}}[\hat\gamma] = \frac{\partial\gamma}{\partial\alpha}^TV\frac{\partial\gamma}{\partial\alpha}$$

For a linear contrast, the derivative is just the vector of coefficients. For a nonlinear contrast, the expression you give is symbolically differentiated.

Thomas Lumley
  • 21,784
  • 1
  • 22
  • 73