tl;dr I have some $\beta$ estimates from a logistic regression model. One of the predictors is a categorical variable with effects coding (sum-to-zero contrasts) applied. I want to compute confidence limits for all levels of that variable, including the reference level whose contrasts are all set to $-1$. I'm accustomed to using the profile method of `MASS::confint.glm()`, but that function doesn't calculate limits for the reference level. I'm attempting to recreate the output of `confint.glm()` using a Z-approximation, but I'm failing somewhere. I'm not sure where I'm going off the rails exactly, but I suspect it has something to do with my choice of the $N$ used to calculate the CI width. In fact, I've come fairly close to the profile limits by using $N=1$, but I want to be sure I'm making statistically valid choices, not just finding numbers that happen to match.
So my questions are:
- Is $N=1$ the way to go here? If so, why is that the case? My intuition is that $N$ might be the number of cases falling into the particular category for which the estimate is being made.
- If $N=1$ is not the way to go, where am I going wrong?
First, set up the environment and create some play data:
Setup
library(MASS) # for `confint.glm()`
library(plyr) # for `mapvalues()`
library(tidyverse) # for the forward-pipe and friends
set.seed(1138)
N <- 2000 # establish sample size
alpha <- 0.05 # define desired confidence level
fruits <- c("papaya", "pomegranate",
"soursop", "guava", "durian") # possible fruits
animals <- c("tribble", "ewok") # experimental species (Empire- and Federation-approved)
eaten <- c("no", "yes") # possible treatments
location <- c("Tatooine", "Endor") # environmental variable
# Craft the data
.df <- data.frame(ill = rbinom(n=N, size=1, prob=0.267),
                  fruit = fruits[ round( runif(n=N, min=1, max=length(fruits) ) ) ],
                  animal = animals[ round( runif(n=N, min=1, max=length(animals) ) ) ],
                  eaten = eaten[ round( runif(n=N, min=1, max=length(eaten) ) ) ],
                  planet = location[ round( runif(n=N, min=1, max=length(location) ) ) ],
                  stringsAsFactors = TRUE ) # keep character columns as factors so contrasts() below works (this was the default in R < 4.0)
# Use effects coding/sum-to-zero contrasts on `.df$fruit`
.df <- within( .df, {
contrasts(fruit) <- contr.sum( levels(fruit) )
})
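A quick sanity check I added (optional): inspect the contrast matrix to confirm the sum-to-zero coding took. Each of the first four levels gets its own column, and the last (reference) level should be a row of $-1$s.
contrasts(.df$fruit) # rows are the fruit levels; the last row should be all -1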
The data then look as follows, with a binary response (`ill`), three binary predictors (`animal`, `eaten`, and `planet`), and one categorical predictor (`fruit`):
## ill fruit animal eaten planet
## 1 0 guava tribble yes Endor
## 2 0 papaya tribble no Tatooine
## 3 0 papaya tribble yes Endor
## 4 0 pomegranate ewok yes Tatooine
## 5 0 guava ewok yes Endor
## 6 0 soursop ewok yes Tatooine
Now I define the model...
im1 <- glm( ill ~ fruit + animal + eaten + planet,
data = .df, family = binomial(link="logit") )
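(One thing worth noting for what follows: the fitted model only contains coefficients for the contrasts `fruit1` through `fruit4`; there is no `fruit5` term to pull limits for, which is the root of my problem. A quick way to see that:)
names( coef(im1) ) # only fruit1 through fruit4 appear; the reference level has no coefficient of its own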
Profile limits
...and calculate the odds ratios and 95% confidence limits using the profile method of `MASS::confint.glm()`. These are what I'm trying to replicate.
im1.OR_profile <- exp( cbind( OR = coef(im1), # exponentiate the parameter-estimate coefficients...
confint(im1) ) ) %>% # ...together with the conf limits to get odds-ratio estimates
as.data.frame.matrix( . ) %>%
mutate( Parameter = rownames( . ) ) %>%
filter( grepl( "Intercept|fruit", Parameter ) ) %>% # for simplicity, I'm restricting this to the variable under sum-to-zero contrasts
select( Parameter, everything() )
That gives me the following estimates and profile confidence limits:
## Parameter OR 2.5 % 97.5 %
## 1 (Intercept) 0.3082855 0.2492946 0.3794741
## 2 fruit1 1.2874891 1.0121429 1.6273995
## 3 fruit2 0.9159739 0.7543939 1.1079502
## 4 fruit3 0.8640049 0.6672064 1.1066674
## 5 fruit4 0.8450893 0.6919834 1.0273585
But it only gives me limits for four of the five levels of `fruit`; it leaves out `fruit5`, the level whose contrasts are all set to $-1$. I can calculate the $\beta$ estimate for that level myself, but to get confidence limits I then need to use a Z approximation. So I try that:
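(The reason I can recover that estimate at all is the sum-to-zero constraint: the omitted level's coefficient is determined by the other four,

$$\hat\beta_{\text{fruit5}} \;=\; -\left(\hat\beta_{\text{fruit1}} + \hat\beta_{\text{fruit2}} + \hat\beta_{\text{fruit3}} + \hat\beta_{\text{fruit4}}\right),$$

which is what the second line of the code below computes.)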
Z-approximation limits:
beta <- coef(im1) # extract parameter estimates from the model
beta[["fruit5"]] <- -sum( beta[ grepl("^fruit[1-4]", names(beta)) ] ) # compute beta estimate for reference level of 'fruit'
vcmat <- vcov(im1) # extract variance-covariance matrix from model
Rinds <- grepl("^fruit", rownames(vcmat)) # find relevant rows of vcmat (corresponding to levels of sum-to-zero factor)
Cinds <- grepl("^fruit", colnames(vcmat)) # find relevant cols of vcmat (corresponding to levels sum-to-zero factor)
vcmat_tiny <- vcmat[ Rinds, Cinds ] # restrict vcmat to relevant submatrix
This gives me the following variance-covariance (sub)matrix in `vcmat_tiny`:
## fruit1 fruit2 fruit3 fruit4
## fruit1 0.014639932 -0.003104047 -0.005432924 -0.003282766
## fruit2 -0.003104047 0.009599889 -0.003754821 -0.001598346
## fruit3 -0.005432924 -0.003754821 0.016606235 -0.003944584
## fruit4 -0.003282766 -0.001598346 -0.003944584 0.010146859
Next I compute the $\beta$ estimate and the standard deviation of that estimate for `fruit5`:
beta.s2 <- diag(vcmat) # extract variances for each parameter from vcmat
beta.s2[["fruit5"]] <- sum(vcmat_tiny) # compute variance of reference level as
# Var(f1 + f2 + f3 + f4) = Var(f1) + Var(f2) + Var(f3) + Var(f4) +
# 2Cov(f1,f2) + 2Cov(f1,f3) + 2Cov(f1,f4) + 2Cov(f2,f3) +
# 2Cov(f2,f4) + 2Cov(f3,f4)
beta.s <- sqrt(beta.s2) # calculate SD of parameter estimates as square root of variances
beta <- beta[ grepl("Intercept|fruit", names(beta)) ] # for simplicity, I will limit this to the factor
beta.s <- beta.s[ grepl("Intercept|fruit", names(beta.s)) ] # under sum-to-zero contrasts
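As a check I added on that `sum(vcmat_tiny)` shortcut: since $\hat\beta_{\text{fruit5}} = \mathbf{c}^\top \hat{\boldsymbol\beta}_{\text{fruit}}$ with $\mathbf{c} = (-1,-1,-1,-1)^\top$, its variance is the quadratic form $\mathbf{c}^\top \Sigma \mathbf{c}$, which for this particular $\mathbf{c}$ is just the sum of all the entries of $\Sigma$. The two computations agree:
cvec <- rep(-1, 4)                      # contrast vector that builds fruit5 from fruit1..fruit4
drop( t(cvec) %*% vcmat_tiny %*% cvec ) # same number as sum(vcmat_tiny)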
This is where I really start to lose confidence in my calculations. I know (read: I think I know) that a confidence interval is computed as $\bar{x} \pm z\frac{s}{\sqrt{n}}$, so its half-width is $z\frac{s}{\sqrt{n}}$.
I have $\bar{x}$: that's the vector of $\beta$ coefficients, and I just calculated their standard deviations, $s$. $z$ is an easy calculation. That leaves $n$. Is this $n = N$, the size of the entire sample (2000)? Or does each level get its own $n$, the number of cases in that level in the data? Based on what I've seen so far while trying to replicate the output of `MASS::confint()`, it seems that neither is the case: I get closest to the values produced by that function's profile method if I don't divide $s$ by anything at all (effectively, $n = 1$).
I'll go through all cases here.
Population $N$:
z <- qnorm(1 - (alpha / 2)) # compute the critical z quantile corresponding to the desired confidence level
CIWidth.N <- z * beta.s / sqrt(N) # compute the half-width of the confidence interval (N=2000 for all parameters)
CIlo.N <- beta - CIWidth.N # calculate lower limit of CI
CIhi.N <- beta + CIWidth.N # calculate upper limit of CI
im1.OR_popN <- exp( cbind( OR = beta, # put it all together and exponentiate
`2.5 %` = CIlo.N,
`97.5 %` = CIhi.N ) ) %>%
as.data.frame.matrix( . ) %>%
mutate( Parameter = rownames( . ) ) %>%
select( Parameter, everything() )
Comparing these estimates and limits to those computed earlier, I have:
## Parameter OR 2.5 % 97.5 % Parameter OR 2.5 % 97.5 %
## 1 (Intercept) 0.3082855 0.2492946 0.3794741 (Intercept) 0.3082855 0.3068414 0.3097363
## 2 fruit1 1.2874891 1.0121429 1.6273995 fruit1 1.2874891 1.2806799 1.2943345
## 3 fruit2 0.9159739 0.7543939 1.1079502 fruit2 0.9159739 0.9120491 0.9199156
## 4 fruit3 0.8640049 0.6672064 1.1066674 fruit3 0.8640049 0.8591390 0.8688983
## 5 fruit4 0.8450893 0.6919834 1.0273585 fruit4 0.8450893 0.8413667 0.8488284
## 6 <NA> NA NA NA fruit5 1.1613271 1.1565738 1.1661000
Here, the original profile (`confint()`) estimates and limits are shown on the left; the corresponding Z-approximation ($n=N$) limits are on the right. Clearly, the latter intervals are far too narrow; $N$ is too big. Maybe the more appropriate $n$ to use is the $n_i$ in each level of the `fruit` factor?
n_i <- table(.df$fruit) # determine the number of cases for each level of `fruit`
fruitlevels <- c(fruit1="durian", fruit2="guava", # draft a map of factor levels --> contrast labels
fruit3="papaya", fruit4="pomegranate",
fruit5="soursop" )
names(n_i) <- mapvalues( names(n_i), # rename the table of cases according to that map
fruitlevels, names(fruitlevels) )
n_i[["(Intercept)"]] <- mean(n_i) # add an `n` for the 'average' case as the 'average'
n_i <- n_i[ match( names(beta.s), names(n_i) ) ] # of the `n`s????; rearrange to match beta.s
CIWidth.n_i <- z * beta.s / sqrt(n_i) # compute the CI half-width using level-specific `n`s
CIlo.n_i <- beta - CIWidth.n_i # calculate the lower confidence limit
CIhi.n_i <- beta + CIWidth.n_i # calculate the upper confidence limit
im1.OR_n_i <- exp( cbind( OR = beta, # put everything together and exponentiate
`2.5 %` = CIlo.n_i,
`97.5 %` = CIhi.n_i ) ) %>%
as.data.frame.matrix( . ) %>%
mutate( Parameter = rownames( . ) ) %>%
select( Parameter, everything() )
Again, I'll compare these to the limits estimated using profile methods:
## Parameter OR 2.5 % 97.5 % Parameter OR 2.5 % 97.5 %
## 1 (Intercept) 0.3082855 0.2492946 0.3794741 (Intercept) 0.3082855 0.3050658 0.3115392
## 2 fruit1 1.2874891 1.0121429 1.6273995 fruit1 1.2874891 1.2680903 1.3071846
## 3 fruit2 0.9159739 0.7543939 1.1079502 fruit2 0.9159739 0.9081644 0.9238505
## 4 fruit3 0.8640049 0.6672064 1.1066674 fruit3 0.8640049 0.8504203 0.8778065
## 5 fruit4 0.8450893 0.6919834 1.0273585 fruit4 0.8450893 0.8375471 0.8526995
## 6 <NA> NA NA NA fruit5 1.1613271 1.1519695 1.1707607
Still too narrow! Again, the `confint()` limits are on the left; the corresponding $n = n_i$ limits are on the right.
I guess in the extreme case, $n=1$. I'll try that.
Singular $n$:
CIWidth.1 <- z * beta.s / sqrt(1) # compute the half-width of the confidence interval using n=1
CI.lo1 <- beta - CIWidth.1 # calculate the lower confidence limit
CI.hi1 <- beta + CIWidth.1 # calculate the upper confidence limit
im1.OR_1 <- exp( cbind( OR = beta, # put everything together and exponentiate
`2.5 %` = CI.lo1,
`97.5 %` = CI.hi1 ) ) %>%
as.data.frame.matrix( . ) %>%
mutate( Parameter = rownames( . ) ) %>%
select( Parameter, everything() )
One last time! Compare:
## Parameter OR 2.5 % 97.5 % Parameter OR 2.5 % 97.5 %
## 1 (Intercept) 0.3082855 0.2492946 0.3794741 (Intercept) 0.3082855 0.2498973 0.380316
## 2 fruit1 1.2874891 1.0121429 1.6273995 fruit1 1.2874891 1.0156684 1.632056
## 3 fruit2 0.9159739 0.7543939 1.1079502 fruit2 0.9159739 0.7559328 1.109898
## 4 fruit3 0.8640049 0.6672064 1.1066674 fruit3 0.8640049 0.6711598 1.112260
## 5 fruit4 0.8450893 0.6919834 1.0273585 fruit4 0.8450893 0.6936808 1.029545
## 6 <NA> NA NA NA fruit5 1.1613271 0.9667092 1.395126
Much closer; now the CIs are just barely too wide, and appear to be slightly up-shifted.
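(For one more point of comparison, which I'm not treating as authoritative: `stats::confint.default()` computes Wald-type limits directly from the coefficient standard errors, so its output for `fruit1` through `fruit4` can be lined up against my $n=1$ columns above. It still says nothing about `fruit5`, of course.)
exp( confint.default(im1) ) # Wald limits from R itself, for the coefficients the model actually estimates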
To reiterate my questions:
- Is $N=1$ the way to go here? If so, why is that the case? My intuition is that $N$ might be the number of cases falling into the particular category for which the estimate is being made.
- If $N=1$ is not the way to go, where am I going wrong?
(If you've made it this far, thank you for reading!)