I am doing a Monte Carlo sampling from the null hypothesis in order to estimate a p-value of my data. Normally I would do, say, 1000 Monte Carlo runs, count the number of runs where I get values of my statistic at least as extreme (let's say 3), and report a p-value of p=0.003. However, in this particular case all Monte Carlo runs are very far away from my actual value, so I am pretty confident that the p-value is many orders of magnitude smaller than 0.001.
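
In code, that standard procedure is roughly the following sketch (`simulate_null` is a placeholder for my actual null-hypothesis simulation, and `observed_stat` is the statistic computed from the real data):

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_p_value(observed_stat, simulate_null, n_runs=1000):
    """Fraction of null simulations at least as extreme as the observed statistic."""
    null_stats = np.array([simulate_null(rng) for _ in range(n_runs)])
    n_extreme = int(np.sum(null_stats >= observed_stat))
    # e.g. 3 extreme runs out of 1000 gives p = 0.003; the common
    # (n_extreme + 1) / (n_runs + 1) variant avoids reporting exactly zero,
    # but either way 1000 runs cannot resolve p-values far below 1/1000.
    return n_extreme / n_runs
```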

Can I do the following: calculate the mean $F_0$ and standard deviation $\sigma$ across the values of statistic obtained with Monte Carlo runs; normalize the deviation of my actual statistic $F$ (calculated with the real data) by computing a z-score $z=(F-F_0)/\sigma$; report the z-score as the measure of how extreme my data are? Note that I am getting extreme values (like z=40), so I would like to adopt the language of high energy physics and report that my statistic is "40 sigmas away from expected under the null hypothesis" instead of converting z=40 to an astronomically small p-value (a standard software like Matlab would simply give p=0 in this case).
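
In other words, something like this sketch (with `null_stats` the array of 1000 Monte Carlo values and `observed_stat` the statistic from the real data; the names are just placeholders):

```python
import numpy as np
from scipy import stats

def monte_carlo_z_score(observed_stat, null_stats):
    F0 = np.mean(null_stats)             # null mean
    sigma = np.std(null_stats, ddof=1)   # null standard deviation
    return (observed_stat - F0) / sigma

# Converting a huge z back into a p-value underflows in double precision:
# stats.norm.sf(40) returns exactly 0.0. The log of the tail is still computable,
# stats.norm.logsf(40) is about -804.6, i.e. p ~ 1e-349 -- *if* the Gaussian
# tail could be trusted that far out.
```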

I guess this only makes sense if one can assume that the null statistic distribution is Gaussian. Looking at my 1000 Monte Carlo runs, the histogram does look pretty much Gaussian, but I am not sure it will withstand a formal normality test.
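
For what it's worth, such a formal check would be something like the sketch below (again with the hypothetical `null_stats` array); I realize that with 1000 draws such tests only probe the bulk of the distribution, not the far tail.

```python
from scipy import stats

def check_null_normality(null_stats):
    # Shapiro-Wilk and D'Agostino-Pearson normality tests; a small p-value
    # means the Gaussian shape is rejected for the bulk of the 1000 draws.
    _, p_shapiro = stats.shapiro(null_stats)
    _, p_dagostino = stats.normaltest(null_stats)
    return p_shapiro, p_dagostino
```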

What is the common practice in such cases? Is there one?

Update: whuber gave a useful reference in the comments; it seems that the appropriate search terms are "rare events" and "importance sampling". I googled for that and briefly looked through a couple of papers. I am still confused as to why nobody even discusses the method I suggested above. If this procedure is inappropriate because one should not rely on the normality assumption, then I don't understand why it is considered OK to rely on it when doing e.g. a t-test. Here is an example: LHC people measure their Higgs statistic, calculate its standard error, and report that the result is 5 sigmas away from the no-Higgs value. This is fine. But my procedure as outlined above is allegedly not fine. What is the important difference then?
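
To make the "rare events / importance sampling" idea concrete to myself, here is a toy sketch: estimate a tiny standard-normal tail probability by sampling from a proposal shifted into the tail and reweighting by the density ratio. This is only an illustration of the general technique, not my actual statistic or the method of the papers.

```python
import numpy as np
from scipy import stats

def gaussian_tail_by_importance_sampling(t, n=100_000, seed=0):
    """Estimate P(X >= t) for X ~ N(0,1) by sampling from N(t, 1) and reweighting."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=t, scale=1.0, size=n)                    # proposal centred in the tail
    log_w = stats.norm.logpdf(x) - stats.norm.logpdf(x, loc=t)  # log of the weight f(x)/g(x)
    return float(np.mean(np.exp(log_w) * (x >= t)))

# P(X >= 7) is about 1.28e-12: plain Monte Carlo with 1000 draws would almost
# surely report 0, whereas the reweighted estimate recovers it accurately.
# print(gaussian_tail_by_importance_sampling(7.0), stats.norm.sf(7.0))
```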

amoeba
  • I think in both cases, a better approach is to report confidence intervals for your simulations. If you use a two-sided test, normality of the samples is not required (what mostly matters there is the symmetry of the distribution), provided an empirical approach can be carried out, i.e. calculating the sample percentiles corresponding to your confidence level (a short sketch of this appears after the comments). For the repeated experiment, apply the same idea and use a **Bonferroni correction** [link](http://en.wikipedia.org/wiki/Bonferroni_correction) for a conservative estimate. – honeychip Aug 01 '13 at 17:04
  • To report a confidence interval is a good idea. However, let me reiterate my question. Let's say the average Monte Carlo statistic is 1 and the standard deviation is 0.1. I can report the 99% CI as roughly [0.74, 1.26]. Now my actual statistic is, say, 2. So I want to report the p-value corresponding to z=10, which is on the order of $10^{-23}$. Is this legitimate? – amoeba Aug 01 '13 at 21:42
  • See Rick Picard and Brian Williams, *Rare Event Estimation for Computer Models*, The American Statistician, Feb 2013, Vol. 67, No. 1, pp. 22–32. It's very readable and well explained. But why is it important to report, say, 40 sigmas and not just explain that the result is obviously significant and give the actual effect size? After all, when the effect is that large, 40 sigmas merely says you spent way too much time and money collecting data--you might not want to make that so blatantly obvious :-). – whuber Aug 01 '13 at 22:31
  • Thanks a lot for the link, I will take a look and comment here on whether it answered my question. Regarding effect size: would not the effect size in my case be the same 40? So it's reporting the same number, but with a different name; or do you mean something else? I actually spent zero money collecting the data, I am analyzing some publicly available data in this case... :) – amoeba Aug 01 '13 at 22:38
  • I'm glad you didn't spend anything :-). The effect size that I referred to is expressed in *the natural units of your data,* such as a difference in proportions, lengths, or whatever, rather than as a standardized score. *That* is the number that is important and interpretable in your study. – whuber Aug 01 '13 at 22:42
  • Ah, I see. Still, my question remains (even if it's practically not so important). I took a very brief look at Picard & Williams (2013); I will read it more carefully tomorrow, but do I interpret them correctly as saying that computing a z-score from Monte Carlo runs (as I described in the original question) can be very misleading? And the reason is that the null distribution can have very non-Gaussian tails? What if I look at my Monte Carlo distribution and it does look Gaussian? Would this justify my method? – amoeba Aug 01 '13 at 22:58
  • No, because it's impossible to characterize a tail that is 40 SDs out just from 1000 iterations. Picard & Williams explicitly address exactly this issue by showing how you can draw most of the simulated values from relatively far out in the tail. You won't get out to 40 SDs but you might be able to demonstrate that you are more than, say, six or seven SDs into the tail. (That's a p-value around $10^{-9}$ or less.) – whuber Aug 01 '13 at 23:23
  • Fair enough. But here is what I do not get then. In a more standard situation (let's say Higgs detection at the LHC) there is an observed statistic with a standard error and a theoretical null value (so the variance is on the side of the observed value, not on the side of the null as in my case). Then the significance is calculated by how far the observed value is from the null prediction, and the z-score is reported. This is standard practice. But why is it legitimate in that situation? Doesn't your argument apply there as well? Maybe the experimental distribution is not Gaussian either? – amoeba Aug 02 '13 at 00:03
  • The update is very closely related to the thread at http://stats.stackexchange.com/questions/11812/sanity-check-how-low-can-a-p-value-go, which you might like to review. – whuber Aug 02 '13 at 14:09
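
Here is the kind of empirical-percentile summary suggested in the first comment, sketched with made-up numbers along the lines of the example above (null mean around 1, SD around 0.1, observed statistic 2). The percentile interval needs no normality assumption, but 1000 draws still cannot say anything about a 10-sigma tail.

```python
import numpy as np

rng = np.random.default_rng(0)
null_stats = rng.normal(loc=1.0, scale=0.1, size=1000)  # stand-in for the 1000 Monte Carlo values
observed_stat = 2.0

lo, hi = np.percentile(null_stats, [0.5, 99.5])  # empirical 99% interval of the null
print(f"99% null interval: [{lo:.3f}, {hi:.3f}]; observed statistic: {observed_stat}")
```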

0 Answers