
To make this chart I generated random samples of different sizes from a normal distribution with mean=0 and sd=1. Confidence intervals were then calculated with the t.test() function using alpha cutoffs ranging from .001 to .999 (red lines). The profile likelihood, shown by the blue lines, was calculated using the code below, which I adapted from lecture notes posted online (I can't find the link at the moment. Edit: found it). Green lines show the normalized density from the R density() function, and the data are shown by the boxplots at the bottom of each chart. On the right is a caterpillar plot of the 95% confidence intervals (red) and the 1/20th-of-max-likelihood intervals (blue).

R Code used for profile likelihood:

  # dat is the sample; low and high set the grid limits (all defined in the full code below)
  mn <- mean(dat)   # sample mean, where the likelihood is maximized
  n <- length(dat)
  muVals <- seq(low, high, length = 1000)
  likVals <- sapply(muVals,
                    function(mu){
                      # likelihood at mu divided by the likelihood at the sample mean,
                      # so the curve is scaled to a maximum of 1
                      (sum((dat - mu)^2) /
                         sum((dat - mn)^2)) ^ (-n/2)
                    }
  )
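
In formula form, the code evaluates $$\left( \frac{\sum_i (x_i - \mu)^2}{\sum_i (x_i - \bar x)^2} \right)^{-n/2},$$ a curve that equals 1 at $\mu = \bar x$ (the sample mean, mn in the code) and falls off as $\mu$ moves away from it.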

[Figure: normalized likelihood (blue), confidence curves (red), and sample density (green) for each sample size, with boxplots of the data below each panel; on the right, a caterpillar plot of the 95% CIs (red) and 1/20th-max-likelihood intervals (blue)]

My specific question is whether there is a known relationship between these two types of intervals, and why the confidence interval appears to be more conservative in all cases except n=3. Comments/answers about whether my calculations are valid (and whether there is a better way to do this), and about the general relationship between these two types of intervals, are also welcome.

R code:

samp.size=c(3,4,5,10,20,1000)   # sample sizes to simulate
cnt2<-1
ints=matrix(nrow=length(samp.size),ncol=4)   # interval endpoints for the caterpillar plot
layout(matrix(c(1,2,7,3,4,7,5,6,7),nrow=3,ncol=3, byrow=T))   # 6 panels plus the caterpillar plot
par(mar=c(5.1,4.1,4.1,4.1))
for(j in samp.size){


  #set.seed(200)
  dat<-rnorm(j,0,1)   # sample of size j from N(0,1)
  vals<-seq(.001,.999, by=.001)   # confidence levels tracing out the confidence curve
  cis<-matrix(nrow=length(vals),ncol=3)
  cnt<-1
  for(ci in vals){
    # CI endpoints at each confidence level
    x<-t.test(dat,conf.level=ci)$conf.int[1:2]
    cis[cnt,]<-cbind(ci,x[1],x[2])
    cnt<-cnt+1
  }


  mn=mean(dat)
  n=length(dat)
  high<-max(c(dat,cis[970,3]), na.rm=T)   # plot limits from the data and the 97% CI
  low<-min(c(dat,cis[970,2]), na.rm=T)
  #high<-max(abs(c(dat,cis[970,2],cis[970,3])), na.rm=T)
  #low<--high


  # normalized likelihood over a grid of candidate means (the snippet shown above)
  muVals <- seq(low,high, length = 1000)
  likVals <- sapply(muVals,
                    function(mu){
                      (sum((dat - mu)^2) /
                         sum((dat - mn)^2)) ^ (-n/2)
                    }
  )


  plot(muVals, likVals, type = "l", lwd=3, col="Blue", xlim=c(low,high),
       ylim=c(-.1,1), ylab="Likelihood/Alpha", xlab="Values",
       main=c(paste("n=",n), 
              "True Mean=0 True sd=1", 
              paste("Sample Mean=", round(mn,2), "Sample sd=", round(sd(dat),2)))
  )
  axis(side=4,at=seq(0,1,length=6),
       labels=round(seq(0,max(density(dat)$y),length=6),2))
  mtext(4, text="Density", line=2.2,cex=.8)

  lines(density(dat)$x,density(dat)$y/max(density(dat)$y), lwd=2, col="Green")   # density scaled to max 1
  lines(range(muVals[likVals>1/20]), c(1/20,1/20), col="Blue", lwd=4)   # 1/20th-max-likelihood interval
  lines(cis[,2],1-cis[,1], lwd=3, col="Red")   # lower CI endpoints vs alpha
  lines(cis[,3],1-cis[,1], lwd=3, col="Red")   # upper CI endpoints vs alpha
  lines(cis[which(round(cis[,1],3)==.95),2:3],rep(.05,2), 
        lty=3, lwd=4, col="Red")   # the 95% confidence interval
  abline(v=mn, lty=2, lwd=2)   # sample mean
  #abline(h=.05, lty=3, lwd=4, col="Red")
  abline(h=0, lty=1, lwd=3)
  abline(v=0, lty=3, lwd=1)   # true mean

  boxplot(dat,at=-.1,add=T, horizontal=T, boxwex=.1, col="Green")
  stripchart(dat,at=-.1,add=T, pch=16, cex=1.1)

  legend("topleft", legend=c("Likelihood"," Confidence Interval", "Sample Density"),
         col=c("Blue","Red", "Green"), lwd=3,bty="n")

  # store both intervals for the caterpillar plot
  ints[cnt2,]<-cbind(range(muVals[likVals>1/20])[1],range(muVals[likVals>1/20])[2],
                     cis[which(round(cis[,1],3)==.95),2],cis[which(round(cis[,1],3)==.95),3])
  cnt2<-cnt2+1
}
par(mar=c(5.1,4.1,4.1,2.1))


# caterpillar plot: 95% CIs (red) and 1/20th-max-likelihood intervals (blue)
plot(0,0, type="n", ylim=c(1,nrow(ints)+.5), xlim=c(min(ints),max(ints)), 
     yaxt="n", ylab="Sample Size", xlab="Values")
for(i in 1:nrow(ints)){
  segments(ints[i,1],i+.2,ints[i,2],i+.2, lwd=3, col="Blue")
  segments(ints[i,3],i+.3,ints[i,4],i+.3, lwd=3, col="Red")
}
axis(side=2, at=seq(1.25,nrow(ints)+.25,by=1), samp.size)
Flask
  • In your lecture notes, `mn` is a typo for `mu`, not `mean(dat)`. As I told you in the comments to [your other question](http://stats.stackexchange.com/questions/78339/how-to-use-profile-likelihood), this should be clear from the definitions on page 23. – Elvis Dec 02 '13 at 21:10
  • @Elvis I don't think so. mn is defined on page 18 of the notes. – Flask Dec 02 '13 at 21:31
  • I tried to clarify the concept of profile likelihood. Can you comment a bit further on what you are doing in the above code? – Elvis Dec 02 '13 at 21:32
  • Page 18 of the notes is a $t$-test; it has nothing to do with the profile likelihood. Please read pages 22-24, which are relevant for profile likelihood. – Elvis Dec 02 '13 at 21:35
  • In fact, I don’t understand the code on page 25. It seems all wrong to me. – Elvis Dec 02 '13 at 21:37
  • @Elvis Neither do I understand. A confidence interval based on the profile likelihood should be constructed with the help of the $\chi^2$ percentiles, which appear nowhere. – Stéphane Laurent Dec 02 '13 at 21:58
  • @StéphaneLaurent I am not sure the original code _is_ providing confidence intervals; rather, 1/20-max-likelihood intervals. I believe the name for the confidence intervals in my plot is "Wald-type" confidence intervals, and the red lines on the plots are "confidence curves" described [on this wikipedia page](https://en.wikipedia.org/wiki/Confidence_distribution#Using_CD_to_make_inference) – Flask Dec 02 '13 at 22:07
  • What does it mean: "1/20 max-likelihood intervals"? Does that mean the interval supports 95% of the area defined by the curve? – Stéphane Laurent Dec 02 '13 at 22:13
  • @StéphaneLaurent it is the region that contains likelihoods > 1/20th the max likelihood. This is my own term for what I observed on pages 25-26 of the lecture notes. In the notes horizontal lines for 1/8 and 1/16 were drawn. – Flask Dec 02 '13 at 22:16

3 Answers


I will not give a complete answer (I have a hard time understanding what you are doing exactly), but I will try to clarify how the profile likelihood is built. I may complete my answer later.

The full likelihood for a normal sample of size $n$ is $$L(\mu, \sigma^2) = \left( \sigma^2 \right)^{-n/2} \exp\left( - \sum_i (x_i-\mu)^2/2\sigma^2 \right).$$

If $\mu$ is your parameter of interest and $\sigma^2$ is a nuisance parameter, one solution for making inference on $\mu$ alone is to define the profile likelihood $$L_P(\mu) = L\left(\mu, \widehat{\sigma^2}(\mu) \right)$$ where $\widehat{\sigma^2}(\mu)$ is the MLE for fixed $\mu$: $$\widehat{\sigma^2}(\mu) = \text{argmax}_{\sigma^2} L(\mu, \sigma^2).$$

One checks that $$\widehat{\sigma^2}(\mu) = {1\over n} \sum_k (x_k - \mu)^2.$$
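
To verify this, set the derivative of the log-likelihood with respect to $\sigma^2$ (for fixed $\mu$) to zero: $$\frac{\partial}{\partial\sigma^2} \log L(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{\sum_i (x_i-\mu)^2}{2\sigma^4} = 0 \quad\Longleftrightarrow\quad \sigma^2 = \frac{1}{n}\sum_i (x_i-\mu)^2.$$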

Hence the profile likelihood is $$L_P(\mu) = \left( {1\over n} \sum_k (x_k - \mu)^2 \right)^{-n/2} \exp( -n/2 ).$$

Here is some R code to compute and plot the profile likelihood (I removed the constant term $\exp(-n/2)$):

data(sleep)
difference <- sleep$extra[11:20]-sleep$extra[1:10]
Lp <- function(mu, x) {n <- length(x); mean( (x-mu)**2 )**(-n/2) }
mu <- seq(0,3, length=501)
plot(mu, sapply(mu, Lp, x = difference), type="l")

[Figure: the profile likelihood $L_P(\mu)$ for the sleep data]

Link with the likelihood: I’ll try to highlight this link with the following graph.

First define the likelihood:

L <- function(mu,s2,x) {n <- length(x); s2**(-n/2)*exp( -sum((x-mu)**2)/2/s2 )}

Then do a contour plot:

sigma <- seq(0.5,4, length=501)   # grid of variance values (the s2 argument of L)
mu <- seq(0,3, length=501)

z <- matrix( nrow=length(mu), ncol=length(sigma))
for(i in 1:length(mu))
  for(j in 1:length(sigma))
    z[i,j] <- L(mu[i], sigma[j], difference)

# shorter version
# z <- outer(mu, sigma, Vectorize(function(a,b) L(a,b,difference)))

contour(mu, sigma, z, levels=c(1e-10,1e-6,2e-5,1e-4,2e-4,4e-4,6e-4,8e-4,1e-3,1.2e-3,1.4e-3))

And then superpose the graph of $\widehat{\sigma^2}(\mu)$:

hats2mu <- sapply(mu, function(mu0) mean( (difference-mu0)**2 ))
lines(mu, hats2mu, col="red", lwd=2)

[Figure: contour plot of $L(\mu, \sigma^2)$ with the curve $\widehat{\sigma^2}(\mu)$ in red]

The values of the profile likelihood are the values taken by the likelihood along the red parabola.

You can use the profile likelihood just as a univariate classical likelihood (cf @Prokofiev’s answer). For example, the MLE $\hat\mu$ is the same.

For your confidence interval, the results will differ a little because of the curvature of the function $\widehat{\sigma^2}(\mu)$, but as long as you deal only with a short segment of it, it is almost linear, and the difference will be very small.

You can also use the profile likelihood to build score tests, for example.
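
For instance, here is a minimal sketch of a likelihood-ratio test of $H_0: \mu = 0$ built from the `Lp` function above (the omitted constant $\exp(-n/2)$ cancels in the ratio, so it does not matter):

# LR test of H0: mu = 0 using the profile likelihood defined above;
# -2 * log of the profile likelihood ratio is approximately chi-squared(1)
lr <- -2 * (log(Lp(0, difference)) - log(Lp(mean(difference), difference)))
pchisq(lr, df = 1, lower.tail = FALSE)   # approximate p-value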

Elvis
  • mu in the code is a sequence of values from low to high; the likelihood at each of these values is divided by the likelihood at the sample mean (mn). So it is a normalized likelihood. – Flask Dec 02 '13 at 21:35
  • I think this is the same thing but not normalized. Can you put it in R code or otherwise plot the function for some data so we can compare? – Flask Dec 02 '13 at 21:44
  • Here it is. At first I thought `mn` was a typo, now I think the R code is all wrong. I’ll double check it tomorrow – it is late where I live. – Elvis Dec 02 '13 at 21:51
  • You may be right. I don’t understand how the code manages to normalize it. Oh, I get it, the "normalization" is just dividing by the maximum? – Elvis Dec 02 '13 at 21:55
  • the 1/n term cancels and the likelihoods you are calculating are simply divided by the maximum likelihood (which occurs at the sample mean). – Flask Dec 02 '13 at 21:59
  • Usually "normalization" means area under curve = 1. I don't get the point of this kind of normalization. – Elvis Dec 02 '13 at 22:08
  • I think it is to make it easy to see when the likelihood ratio is less than some threshold (e.g. 1/20th max) at some null hypothesis (e.g. zero). – Flask Dec 02 '13 at 22:13
  • Ok. This makes sense in a Bayesian approach. – Elvis Dec 02 '13 at 22:21
  • I think that final plot may be exactly what I am looking for, but I will need some time to study it. Ironically, your "professional" R code is more difficult for me to understand since it uses functions I am not familiar with. – Flask Dec 04 '13 at 13:49
  • :) the use of outer and Vectorize is kind of dirty. I'll correct it to something simpler. – Elvis Dec 04 '13 at 14:05
  • done. I may complete with a CI interval computation in the next few days. – Elvis Dec 05 '13 at 21:13

In a general framework, profile likelihood intervals are approximate confidence intervals. The proof of this result is essentially the same as proving that the likelihood ratio statistic is (asymptotically) approximately distributed as a $\chi^2_k$ distribution. The idea consists of inverting the hypothesis test obtained from a likelihood ratio statistic.

For example, a $0.147$-level profile likelihood interval has an approximate confidence of $95\%$.
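
The cutoff comes from inverting the $\chi^2_1$ approximation to $-2\log$ of the likelihood ratio: the relative-likelihood threshold is $\exp(-\chi^2_{1,0.95}/2)$. A quick check in R:

exp(-qchisq(0.95, df = 1)/2)   # approximately 0.147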

These are classical results and therefore I will simply provide some references on this:

http://www.jstor.org/stable/2347496

http://www.stata-journal.com/sjpdf.html?articlenum=st0132

http://www.unc.edu/courses/2010fall/ecol/563/001/docs/lectures/lecture11.htm

http://en.wikipedia.org/wiki/Likelihood-ratio_test

http://en.wikipedia.org/wiki/Likelihood_function#Profile_likelihood

The following R code shows that, even for small samples, the intervals obtained with the two approaches are similar (I am re-using Elvis's example):

Note that you have to use the normalised profile likelihood.

data(sleep)
x <- sleep$extra[11:20]-sleep$extra[1:10]
n <- length(x)
# normalized (relative) profile likelihood; equals 1 at mu = mean(x)
Rp <- function(mu) {mean( (x-mean(x))^2 )^(n/2)/mean( (x-mu)^2 )^(n/2) }
Rp(mean(x))  # check: should be 1

mu <- seq(0,3, length=501)
plot(mu, sapply(mu, Rp), type="l")


Rpt <- function(mu) Rp(mu) - 0.147  # helper: its roots are the interval endpoints

# Likelihood-confidence interval of 95% level

c(uniroot(Rpt,c(0.5,1.5))$root,uniroot(Rpt,c(1.51,3))$root)

# t confidence interval

t.test(x,conf.level=0.95)$conf.int

If we use a larger sample size, the confidence intervals are even closer:

set.seed(123)
x <- rnorm(100)
n <- length(x)
Rp <- function(mu) {mean( (x-mean(x))^2 )^(n/2)/mean( (x-mu)^2 )^(n/2) }
Rp(mean(x))

mu <- seq(-0.5,0.5, length=501)
plot(mu, sapply(mu, Rp), type="l")


Rpt <- function(mu) Rp(mu) - 0.147  # helper: its roots are the interval endpoints

# Likelihood-confidence interval of 95% level

c(uniroot(Rpt,c(-0.4,0))$root,uniroot(Rpt,c(0,0.4))$root)

# t confidence interval

t.test(x,conf.level=0.95)$conf.int

AN IMPORTANT POINT:

Note that for specific samples, different kinds of confidence intervals may differ in their length or location; what really matters is their coverage. In the long run, all of them should provide the same coverage, regardless of how much they differ for specific samples.
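
As a rough illustration of the coverage claim, here is a minimal simulation sketch (the sample size, replication count, and seed are arbitrary choices; Rp is as defined above, inlined for mu = 0):

# empirical coverage of the 95% t interval and the 0.147
# relative-likelihood interval for a N(0,1) sample, true mu = 0
set.seed(1)
covers <- replicate(5000, {
  x <- rnorm(10)
  n <- length(x)
  Rp0 <- (mean((x - mean(x))^2)/mean((x - 0)^2))^(n/2)  # relative likelihood at mu = 0
  c(t = prod(t.test(x)$conf.int) < 0,    # is 0 inside the t interval?
    lik = Rp0 > 0.147)                   # is 0 inside the likelihood interval?
})
rowMeans(covers)  # both near 0.95; the likelihood interval slightly below for n = 10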

Prokofiev
  • @Prokofiev if there is some simple relationship between the confidence intervals calculated with the R t.test() function and those calculated by the likelihood function code above, can you post it? I am especially interested in the n=3 case. Unfortunately I have little background in math, so many papers lead me down the rabbit hole looking up the names for symbols and what they represent etc., when a few lines of code (easiest is R) could explain it to me. – Flask Dec 02 '13 at 23:01
  • @Flask Are you interested in obtaining confidence intervals for the parameters of a normal distribution or a more general framework? – Prokofiev Dec 02 '13 at 23:04
  • @Prokofiev specifically for the mean of a normal distribution as shown in my example in the question. I am especially wondering why the confidence intervals are more conservative except in the n=3 case. – Flask Dec 02 '13 at 23:07
  • @Flask What level of confidence are you interested in? $95\%$? – Prokofiev Dec 02 '13 at 23:11
  • "All of them". In the charts you can see the red curves are confidence intervals at levels from .999 to .001, I would like to see how the relationship is affected by the choice of "confidence level". – Flask Dec 02 '13 at 23:13
  • @Flask Then, this is just a programming issue. A better venue for this sort of question is [stack overflow](http://stackoverflow.com/). Sorry if I can't be of more help on providing support with your code. – Prokofiev Dec 02 '13 at 23:28
  • I think you were of great help, I am looking through your links. It just takes me extra effort to understand the answer. But for example I believe [this](http://www.unc.edu/courses/2010fall/ecol/563/001/docs/lectures/lecture11.htm#comparing) is a more complicated example of what I am asking. – Flask Dec 02 '13 at 23:34
  • @Flask Indeed. For moderate samples, say at least 30 observations, both approaches should provide very similar intervals. I am afraid that there is an error in your calculation of the likelihood-confidence intervals. Moreover, I think this is related to a wrong choice of the profile likelihood level. – Prokofiev Dec 02 '13 at 23:37
  • I am beginning to believe I should be multiplying the likelihood intervals by some quantile of either the normal or chi-squared distribution to get the corresponding confidence interval. – Flask Dec 02 '13 at 23:53
  • @Flask Please, take a look at my examples. You will see how, in the case of $95\%$ c.i., both approaches provide similar results. I have to leave now, but you can try to figure out how to extend this method to other confidence levels. I hope this helps. – Prokofiev Dec 02 '13 at 23:55
  • @Prokofiev Can you please say where you got 0.147. I understand this is related to 95% confidence somehow. It appears to come from some obvious background knowledge I do not have – Flask Dec 03 '13 at 13:28

I will not give an overly mathematical answer, but I would like to address your central question about the relationship between CIs and profile likelihood intervals. As the other respondents have pointed out, CIs can be constructed from a profile likelihood by using the $\chi^2$ approximation to the *normalized* likelihood ratio. The accuracy of this approach depends on one of two things being approximately true:

  1. The profile log-likelihood is approximately quadratic.
  2. There exists a parameter transformation that makes the profile log-likelihood approximately quadratic.

The quadratic shape is important because it corresponds to a normal distribution on the log scale: the more nearly quadratic the profile log-likelihood is, the better the approximation and the resulting CIs. Your choice of a 1/20th cutoff for the likelihood intervals is equivalent to more than a 95% CI in the asymptotic limit, which is why the blue intervals are generally longer than the red ones.
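
A one-line check of this equivalence (using the asymptotic $\chi^2_1$ approximation discussed in the other answers):

# asymptotic confidence level implied by a 1/20 relative-likelihood cutoff
pchisq(-2*log(1/20), df = 1)   # approximately 0.986, i.e. more than 95%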

Now, there is another issue with profile likelihood that needs some attention. If you have a lot of variables that you are profiling over, then if the number of data points per dimension is low, the profile likelihood can be very biased and optimistic. Marginal, conditional, and modified profile likelihoods are then used to reduce this bias.

So, the answer to your question is yes: the connection is the asymptotic normality of most maximum likelihood estimators, as manifested in the chi-squared distribution of the likelihood ratio.

  • "_If you have a lot of variables that you are profiling over, then if the number of data points per dimension is low, the profile likelihood can be very biased and optimistic_" Optimistic compared to what? – Flask Dec 03 '13 at 14:11
  • @Flask By optimistic I mean it will be too narrow to provide the nominal coverage probability when treating it as a confidence interval. –  Dec 03 '13 at 14:18
  • I see, thanks, but in my specific case it is actually pessimistic? I am confused on this point as to whether we are talking about the likelihood intervals or confidence intervals derived from the likelihoods. – Flask Dec 03 '13 at 14:20
  • @Flask I think your intervals appear pessimistic because you are comparing a 1/20th likelihood interval (5% relative likelihood) with a 95% CI. As stated by others here, you would really want to compare it to a 15% relative likelihood interval to have apples to apples... at least asymptotically. Your likelihood interval as it stands is considering more options as plausible. –  Dec 03 '13 at 17:20
  • I have detailed the actual problem I wish to apply what I am learning to [here](http://stats.stackexchange.com/questions/78435/how-can-i-use-likelihoods-to-compare-these-three-groups-should-i-want-to-do-thi?lq=1). I worry that in the case where the sampling distribution is unknown (but probably not normal) and complex, your two requirements may not hold. Yet the profile likelihoods I calculated do appear to be normal and reasonable. Is it that the sampling distribution of the _mean_ should be normally distributed? – Flask Dec 04 '13 at 14:02
  • @Flask Yes. The distribution of your underlying data is relevant only in the sense that it will affect the *convergence rate* of our statistic's sampling distribution to the normal distribution. The nice thing with likelihood statistics is that you don't need to guess about convergence... if the log-likelihood is quadratic, or nearly quadratic, then normality holds. A detailed paper outlining this "asymptotic-free" property of the likelihood ratio can be found at: http://www.stat.umn.edu/geyer/lecam/simple.pdf –  Dec 04 '13 at 14:23