14

This question doesn't specifically pertain to R, but I chose to use R to illustrate it.

Consider the code for producing confidence bands round a (normal) qq-line:

library(car)
library(MASS)
b0<-lm(deaths~.,data=road)
qqPlot(b0$resid,pch=16,line="robust")

I'm looking for an explanation of (or alternative a link to a paper/online document explaining) how these confidence bands are constructed (I have seen a reference to Fox 2002 in the help files of R but sadly I don't have this book handy).

My question will be made more precise with an example. Here's how R computes these particular CI's (I've shortened/simplified the code used in car::qqPlot)

x<-b0$resid
good<-!is.na(x)
ord<-order(x[good])
ord.x<-x[good][ord]
n<-length(ord.x)
P<-ppoints(n)
z<-qnorm(P)
plot(z,ord.x,type="n")
coef<-coef(rlm(ord.x~z))
a<-coef[1]
b<-coef[2]
abline(a,b,col="red",lwd=2)
conf<-0.95
zz<-qnorm(1-(1-conf)/2)
SE<-(b/dnorm(z))*sqrt(P*(1-P)/n)     #[WHY?]
fit.value<-a+b*z
upper<-fit.value+zz*SE
lower<-fit.value-zz*SE
lines(z,upper,lty=2,lwd=2,col="red")
lines(z,lower,lty=2,lwd=2,col="red")

The question is: what is the justification for the formula used to compute these SE (e.g. the line SE<-(b/dnorm(z))*sqrt(P*(1-P)/n)).

FWIW this formula is very different from the formula of the usual confidence bands used in linear regression

Glen_b
  • 257,508
  • 32
  • 553
  • 939
user603
  • 21,225
  • 3
  • 71
  • 135
  • 2
    I expect it's to do with [the distribution of order statistics](http://en.wikipedia.org/wiki/Order_statistic#The_joint_distribution_of_the_order_statistics_of_an_absolutely_continuous_distribution) $$f_{X_{(k)}}(x) =\frac{n!}{(k-1)!(n-k)!}[F_X(x)]^{k-1}[1-F_X(x)]^{n-k} f_X(x)$$ and more particularly [the asymptotic result](http://en.wikipedia.org/wiki/Order_statistic#Large_sample_sizes): $$X_{(\lceil np \rceil)} \sim AN\left(F^{-1}(p),\frac{p(1-p)}{n[f(F^{-1}(p))]^2}\right)$$ – Glen_b Aug 10 '14 at 05:37
  • 4
    @Glen_b is right. [John Fox](http://goo.gl/jtroHD) writes on pages 35-36: "The standard error of the order statistic $X_{(i)}$ is $$\mathrm{SE}(X_{(i)})=\frac{\hat{\sigma}}{p(z_i)}\sqrt{\frac{P_i(1-P_i)}{n}}$$ where $p(z)$ is the probability density function corresponding to the CDF $P(z)$. The values along the fitted line are given by $\widehat{X}_{(i)}=\hat{\mu}+\hat{\sigma}z_{i}$. An approximate 95% confidence "envelope" around the fitted line is, therefore, $\widehat{X}_{(i)}\pm 2\times \mathrm{SE}(X_{(i)})$." – COOLSerdash Aug 10 '14 at 08:26
  • 2
    I think the only thing that remains to see is that $f(F^{-1}(p))$ is estimated by $(p(z_i)/\hat{\sigma})$ in the equation COOLSerdash gave. – Glen_b Aug 10 '14 at 15:38

1 Answers1

6

It has to do with the distribution of order statistics $$f_{X_{(k)}}(x) =\frac{n!}{(k-1)!(n-k)!}[F_X(x)]^{k-1}[1-F_X(x)]^{n-k} f_X(x)$$ and more particularly the asymptotic result: $$X_{(\lceil np \rceil)} \sim AN\left(F^{-1}(p),\frac{p(1-p)}{n[f(F^{-1}(p))]^2}\right)$$

As COOLSerdash mentions in comments, John Fox [1] writes on pages 35-36:

The standard error of the order statistic $X_{(i)}$ is $$\mathrm{SE}(X_{(i)})=\frac{\hat{\sigma}}{p(z_i)}\sqrt{\frac{P_i(1-P_i)}{n}}$$ where $p(z)$ is the probability density function corresponding to the CDF $P(z)$. The values along the fitted line are given by $\widehat{X}_{(i)}=\hat{\mu}+\hat{\sigma}z_{i}$. An approximate 95% confidence "envelope" around the fitted line is, therefore, $\widehat{X}_{(i)}\pm 2\times \mathrm{SE}(X_{(i)})$.

Then we just need to recognize that $f(F^{-1}(p))$ is estimated by $(p(z_i)/\hat{\sigma})$.

[1] Fox, J. (2008),
Applied Regression Analysis and Generalized Linear Models, 2nd Ed.,
Sage Publications, Inc

Glen_b
  • 257,508
  • 32
  • 553
  • 939