
I would like to generate data with "Model 1" and fit them with "Model 2". The underlying idea is to investigate robustness properties of "Model 2". I am particularly interested in the coverage rate of the 95% confidence interval (based on the normal approximation).

  • How do I set the number of iteration runs?
  • Is it true that larger than necessary replications may result in spurious biases? If so, how is that?
Macro
user7064
  • What do you mean by "coverage rate of the 95% confidence interval"? If the confidence interval is exact or a good approximate interval it covers the true value of the parameter approximately 95% of the time. – Michael R. Chernick Aug 20 '12 at 14:37
  • If you're generating a confidence interval based on Model 2 for data generated under Model 1, this seems to indicate the two models are related and contain some of the same parameters. Can you explain a bit more? Also, when you say "spurious" in your second bullet point do you mean wrong or just unimportant? Larger numbers of simulations shouldn't produce bias but it could reveal a bias that has little practical importance that you wouldn't see with a smaller number, similar to how you can detect (i.e. get statistical significance for) a very tiny effect when you have a very large sample size. – Macro Aug 20 '12 at 14:40
  • @Michael Chernick: Under-coverage, for example, may be achieved if the standard error is too small. I have edited my question to specify that I use confidence intervals based on the normal approximation. – user7064 Aug 20 '12 at 14:47
  • @Macro: "Model 1" generates normal data with heteroscedastic error terms and "Model 2" is the standard linear model. – user7064 Aug 20 '12 at 14:47

3 Answers


I often use the width of confidence intervals as a quick-and-dirty way to determine the number of iterations needed.

Let $p$ be the true coverage rate of the 95% confidence interval when data generated from "Model 1" are fitted with "Model 2". If $X$ is the number of times that the confidence interval covers the true parameter value in $n$ iterations, then $X\sim {\rm Bin}(n,p)$.

The estimator $\hat{p}=X/n$ has mean $p$ and standard deviation $\sqrt{p(1-p)/n}$. For large $n$, $\hat{p}$ is approximately normal, and $\hat{p}\pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}$ gives you an approximately 95% confidence interval for $p$. Since you know (or would guess) that $p\approx 0.95$, the width of this interval is approximately $2\cdot 1.96\sqrt{0.95\cdot 0.05/n}$.

If you think that a confidence interval with width $0.1$ (say) is acceptable, you find the approximate number of iterations $n$ needed for this by solving the equation $$0.1=2\cdot 1.96\sqrt{0.95\cdot 0.05/n}.$$

In this way you can find a reasonable $n$ by choosing the accuracy that you are looking for.
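For instance, in Python the calculation might look like this (a sketch; the helper name `runs_needed` and the widths passed to it are mine, for illustration):

```python
import math

# Width of the approximate 95% CI for the coverage rate p is
#   width(n) = 2 * 1.96 * sqrt(p_guess * (1 - p_guess) / n).
# Solving width(n) = w for n gives the number of iterations needed.
def runs_needed(w, p_guess=0.95, z=1.96):
    """Smallest n so the CI for the coverage rate has width about w."""
    return math.ceil((2 * z * math.sqrt(p_guess * (1 - p_guess)) / w) ** 2)

print(runs_needed(0.1))   # 73 iterations for width 0.1
print(runs_needed(0.01))  # 7300 iterations for width 0.01
```

Note how the required $n$ grows with the square of the inverse width: a tenfold tighter interval costs a hundredfold more iterations.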

MånsT
  • (+1) It looks like we submitted very similar answers at about the same time, but I think the different language used may be useful to some. – Macro Aug 20 '12 at 15:06
  • Yes, indeed, I still do not know which answer to accept! Anyway, +1 for both! – user7064 Aug 20 '12 at 15:07
  • @Macro: +1 to you as well. Variance and interval width are of course more or less equivalent here. Great minds think alike - and so do ours. ;) – MånsT Aug 20 '12 at 15:13
  • @MånsT Am I correct to assume that if my CI width is 0.01, then for a coverage rate of 90% the number of iterations required would be $n=(2\cdot 1.65 \sqrt{0.95\cdot 0.05}/0.01)^2$ for a 95% CI? Let's say this CI is for a proportion estimate. How does the sample size of my binomial model (and then choosing quantiles to find the CI) affect the coverage probability? – A Gore Feb 16 '16 at 12:45

Based on your follow up comment it sounds like you are trying to estimate the coverage probability of a confidence interval when you assume constant error variance when the true error variance is not constant.

The way I think about this is that, for each run, the confidence interval either covers the true value or it doesn't. Define an indicator variable:

$$ Y_i = \begin{cases} 1 & {\rm if \ the \ interval \ covers} \\ 0 & {\rm if \ it \ does \ not } \end{cases}$$

Then the coverage probability you're interested in is $E(Y_i) = p$, which you can estimate by the sample proportion, which I think is what you're proposing.

How do I set the number of iteration runs?

We know that the variance of a Bernoulli trial is $p(1-p)$, and your simulations will generate IID Bernoulli trials, so the variance of your simulation-based estimate of $p$ is $p(1-p)/n$, where $n$ is the number of simulations. You can choose $n$ to shrink this variance as much as you want. Since $p(1-p)$ attains its maximum of $1/4$ at $p=1/2$, it is a fact that $$p(1-p)/n \leq 1/(4n).$$

So, if you want the variance to be less than some pre-specified threshold $\delta$, you can ensure this by choosing $n \geq 1/(4\delta)$.
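As a quick sketch in Python (the function name and the example threshold are mine):

```python
import math

# Worst case of p(1 - p) is 1/4 (at p = 1/2), so
# Var(p_hat) = p(1 - p)/n <= 1/(4n) <= delta whenever n >= 1/(4*delta).
def runs_for_variance(delta):
    """Smallest n guaranteeing Var(p_hat) <= delta for any p."""
    return math.ceil(1 / (4 * delta))

print(runs_for_variance(1e-4))  # 2500 runs cap the variance at 1e-4
```

This bound is conservative: near $p = 0.95$ the actual variance $p(1-p)/n$ is about a fifth of $1/(4n)$, so fewer runs would suffice for the same precision.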

In a more general setting, if you're trying to investigate properties of the sampling distribution of an estimator by simulation (e.g. its mean and variance), you can choose the number of simulations based on how much precision you want to achieve, in a fashion analogous to that described here.

Also note that, when the mean (or some other moment) of a variable is the object of interest, as it is here, you can construct a confidence interval for it based on the simulations using the normal approximation (i.e. the central limit theorem), as discussed in MansT's nice answer. This normal approximation is better as the number of samples grows, so, if you plan on constructing a confidence interval by appealing to the central limit theorem, you will want $n$ to be large enough for that to apply. For the binary case, as you have here, it appears this approximation is good even when $np$ and $n(1-p)$ are pretty moderate - say, $20$.
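For example, such an interval for the coverage probability might be computed as follows (a sketch; the counts plugged in are made up for illustration):

```python
import math

def coverage_ci(hits, n, z=1.96):
    """Normal-approximation 95% CI for the coverage probability,
    given `hits` covering intervals out of `n` simulation runs."""
    p_hat = hits / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# e.g. suppose 930 of 1000 simulated intervals covered the true value:
lo, hi = coverage_ci(hits=930, n=1000)
print(round(lo, 3), round(hi, 3))  # 0.914 0.946
```

In that hypothetical case the interval excludes 0.95, which would suggest genuine under-coverage rather than Monte Carlo noise.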

Is it true that larger than necessary replications may result in spurious biases? If so, how is that?

As I mentioned in a comment, this depends on what you mean by "spurious". Larger numbers of simulations will not produce bias in the statistical sense, but they may reveal an unimportant bias that is only noticeable with an astronomically large number of runs. For example, suppose the true coverage probability of the misspecified confidence interval were $94.9999\%$. That isn't really a problem in a practical sense, but you would only pick up the difference if you ran a ton of simulations.
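Putting the pieces together, here is a minimal sketch of the whole simulation in Python. It assumes one specific "Model 1" (a straight line with error standard deviation proportional to $x$) and takes "Model 2" to be ordinary least squares with the usual constant-variance standard error; all the particular numbers are illustrative, not from the question:

```python
import math, random

random.seed(42)

def coverage_sim(n_sims=2000, n_obs=50, beta0=1.0, beta1=2.0):
    """Estimate the coverage rate of the OLS normal-approximation 95% CI
    for the slope when the true errors are heteroscedastic."""
    x = [1 + 9 * i / (n_obs - 1) for i in range(n_obs)]
    xbar = sum(x) / n_obs
    sxx = sum((xi - xbar) ** 2 for xi in x)
    hits = 0
    for _ in range(n_sims):
        # "Model 1": heteroscedastic errors, sd grows with x
        y = [beta0 + beta1 * xi + random.gauss(0, 0.5 * xi) for xi in x]
        ybar = sum(y) / n_obs
        # "Model 2": OLS fit assuming constant error variance
        b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
        b0 = ybar - b1 * xbar
        rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
        se_b1 = math.sqrt(rss / (n_obs - 2) / sxx)
        # Y_i = 1 if the normal-approximation CI covers the true slope
        hits += abs(b1 - beta1) <= 1.96 * se_b1
    return hits / n_sims

# Under heteroscedasticity the estimated coverage need not equal 95%:
print(coverage_sim())
```

The returned proportion is exactly the $\hat{p}$ discussed above, so the variance bound and interval-width calculations apply to it directly.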

Macro

If you are doing a simulation the minimum number of required runs depends on your objective (What are you trying to estimate and with what accuracy?). If you are trying to estimate the average response then the standard deviation of the sample average is the $\dfrac{\text{Population Standard Deviation}}{\sqrt{n}}$. So if $d$ is the required half-width for $95\%$ confidence interval for the mean you want $d= 1.96 \times \dfrac{\text{Pop.Std.Dev}}{\sqrt{n}}$ or $n=\dfrac{ (1.96 \times\text{Pop.Std.Dev})^2}{d^2}$.
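In Python this might look like the following (a sketch; the population standard deviation and half-width values are placeholders):

```python
import math

# n = (1.96 * sd / d)^2, where d is the desired half-width of the
# 95% confidence interval for the mean response.
def runs_for_mean(sd, d, z=1.96):
    """Runs needed so the 95% CI for the mean has half-width about d."""
    return math.ceil((z * sd / d) ** 2)

print(runs_for_mean(sd=2.0, d=0.1))  # 1537 runs
```

In practice the population standard deviation is unknown, so one would estimate it from a pilot batch of runs before applying the formula.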

Doing more simulations (assuming all samples are generated by a random process) does nothing to hurt the estimation in terms of accuracy or bias.

The coverage of an approximate confidence interval will differ from the exact $95\%$ coverage desired, and the error in coverage should decrease with increasing $n$. As mentioned by Macro and MansT, you can bound the Monte Carlo estimate of coverage based on the variance of the binomial proportion being $\dfrac{p(1-p)}{n}$.

Michael R. Chernick
  • Hi @Michael. I think this answer misses the point. The OP is trying to investigate how the coverage properties of a confidence interval are changed when you assume constant variance but the true variance is not constant. – Macro Aug 20 '12 at 14:51
  • @Macro: You are right. I deliberately put the question in a broader context to avoid answers that are specific to the problem of assuming constant variance. – user7064 Aug 20 '12 at 14:56
  • @Macro That was not part of the question that I answered. Apparently that was clarified later. It also appears that what was of interest was the accuracy of a confidence interval that uses the normal approximation. This does not seem to be addressed in any of the answers. – Michael R. Chernick Aug 20 '12 at 15:20
  • @Michael, yes I know - my point was more that you (and I) asked for clarification but you didn't wait for the clarification before posting your answer. Re: your second comment, you can investigate the coverage properties of any interval in this way, regardless of whether it was based on the normal approximation or not. If you think there's something distinct to add that is missed by the existing answers then please edit your answer so we can all learn. – Macro Aug 20 '12 at 15:22
  • @Macro Of course I agree with you. I edited my answer for the benefit of the OP. I suspect that there is nothing in the content that you wouldn't already know. – Michael R. Chernick Aug 20 '12 at 15:26