tl;dr I have some $\beta$ estimates from a logistic regression model. One of the predictors is a categorical variable with effects coding (sum-to-zero contrasts) applied. I want to compute confidence limits for all levels of that variable, including the reference level whose contrasts are all set to $-1$. I'm accustomed to using the profile method of `MASS::confint.glm()`, but that function doesn't calculate limits for the reference level. I'm attempting to recreate the output of `confint.glm()` using a Z-approximation, but I'm failing somewhere. I'm not sure where I'm going off the rails exactly, but I suspect it has something to do with my choice of the $N$ used to calculate the CI width. In fact, I've come fairly close to the profile limits by using $N=1$, but I want to be sure I'm making statistically valid choices, not just finding numbers that happen to match.
So my questions are:
- Is $N=1$ the way to go here? If so, why is that the case? My intuition is that $N$ might be the number of cases falling into the particular category for which the estimate is being made.
- If $N=1$ is not the way to go, where am I going wrong?
First, set up the environment and create some play data:
Setup
library(MASS) # for `confint.glm()`
library(plyr) # for `mapvalues()`
library(tidyverse) # for the forward-pipe and friends
set.seed(1138)
N <- 2000 # establish sample size
alpha <- 0.05 # define desired confidence level
fruits <- c("papaya", "pomegranate",
"soursop", "guava", "durian") # possible fruits
animals <- c("tribble", "ewok") # experimental species (Empire- and Federation-approved)
eaten <- c("no", "yes") # possible treatments
location <- c("Tatooine", "Endor") # environmental variable
# Craft the data
.df <- data.frame(ill = rbinom(n=N, size=1, prob=0.267),
                  fruit = fruits[ round( runif(n=N, min=1, max=length(fruits) ) ) ],
                  animal = animals[ round( runif(n=N, min=1, max=length(animals) ) ) ],
                  eaten = eaten[ round( runif(n=N, min=1, max=length(eaten) ) ) ],
                  planet = location[ round( runif(n=N, min=1, max=length(location) ) ) ],
                  stringsAsFactors = TRUE ) # keep character columns as factors so contrasts() below works (this was the default in R < 4.0)
# Use effects coding/sum-to-zero contrasts on `.df$fruit`
.df <- within( .df, {
contrasts(fruit) <- contr.sum( levels(fruit) )
})
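A quick sanity check I added (optional): inspect the contrast matrix to confirm the sum-to-zero coding took. Each of the first four levels gets its own column, and the last (reference) level should be a row of $-1$s.
contrasts(.df$fruit) # rows are the fruit levels; the last row should be all -1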
The data then look as follows, with a binary response (`ill`), three binary predictors (`animal`, `eaten`, and `planet`), and one categorical predictor (`fruit`):
## ill fruit animal eaten planet
## 1 0 guava tribble yes Endor
## 2 0 papaya tribble no Tatooine
## 3 0 papaya tribble yes Endor
## 4 0 pomegranate ewok yes Tatooine
## 5 0 guava ewok yes Endor
## 6 0 soursop ewok yes Tatooine
Now I define the model...
im1 <- glm( ill ~ fruit + animal + eaten + planet,
data = .df, family = binomial(link="logit") )
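(One thing worth noting for what follows: the fitted model only contains coefficients for the contrasts `fruit1` through `fruit4`; there is no `fruit5` term to pull limits for, which is the root of my problem. A quick way to see that:)
names( coef(im1) ) # only fruit1 through fruit4 appear; the reference level has no coefficient of its own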
Profile limits
...and calculate the odds ratios and 95% confidence limits using the profile method of `MASS::confint.glm()`. These are what I'm trying to replicate.
im1.OR_profile <- exp( cbind( OR = coef(im1), # exponentiate the parameter-estimate coefficients...
confint(im1) ) ) %>% # ...together with the conf limits to get odds-ratio estimates
as.data.frame.matrix( . ) %>%
mutate( Parameter = rownames( . ) ) %>%
filter( grepl( "Intercept|fruit", Parameter ) ) %>% # for simplicity, I'm restricting this to the variable under sum-to-zero contrasts
select( Parameter, everything() )
That gives me the following estimates and profile confidence limits:
## Parameter OR 2.5 % 97.5 %
## 1 (Intercept) 0.3082855 0.2492946 0.3794741
## 2 fruit1 1.2874891 1.0121429 1.6273995
## 3 fruit2 0.9159739 0.7543939 1.1079502
## 4 fruit3 0.8640049 0.6672064 1.1066674
## 5 fruit4 0.8450893 0.6919834 1.0273585
But it only gives me limits for four of the five levels of `fruit`; it leaves out `fruit5`, the level whose contrasts are all set to $-1$. I can calculate the $\beta$ estimate for that level myself, but to get confidence limits I then need to use a Z approximation. So I try that:
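(The reason I can recover that estimate at all is the sum-to-zero constraint: the omitted level's coefficient is determined by the other four,

$$\hat\beta_{\text{fruit5}} \;=\; -\left(\hat\beta_{\text{fruit1}} + \hat\beta_{\text{fruit2}} + \hat\beta_{\text{fruit3}} + \hat\beta_{\text{fruit4}}\right),$$

which is what the second line of the code below computes.)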
Z-approximation limits:
beta <- coef(im1) # extract parameter estimates from the model
beta[["fruit5"]] <- -sum( beta[ grepl("^fruit[1-4]", names(beta)) ] ) # compute beta estimate for reference level of 'fruit'
vcmat <- vcov(im1) # extract variance-covariance matrix from model
Rinds <- grepl("^fruit", rownames(vcmat)) # find relevant rows of vcmat (corresponding to levels of sum-to-zero factor)
Cinds <- grepl("^fruit", colnames(vcmat)) # find relevant cols of vcmat (corresponding to levels sum-to-zero factor)
vcmat_tiny <- vcmat[ Rinds, Cinds ] # restrict vcmat to relevant submatrix
This gives me the following variance-covariance (sub)matrix in `vcmat_tiny`:
## fruit1 fruit2 fruit3 fruit4
## fruit1 0.014639932 -0.003104047 -0.005432924 -0.003282766
## fruit2 -0.003104047 0.009599889 -0.003754821 -0.001598346
## fruit3 -0.005432924 -0.003754821 0.016606235 -0.003944584
## fruit4 -0.003282766 -0.001598346 -0.003944584 0.010146859
Next I compute the $\beta$ estimate and the standard deviation of that estimate for `fruit5`:
beta.s2 <- diag(vcmat) # extract variances for each parameter from vcmat
beta.s2[["fruit5"]] <- sum(vcmat_tiny) # compute variance of reference level as
# Var(f1 + f2 + f3 + f4) = Var(f1) + Var(f2) + Var(f3) + Var(f4) +
# 2Cov(f1,f2) + 2Cov(f1,f3) + 2Cov(f1,f4) + 2Cov(f2,f3) +
# 2Cov(f2,f4) + 2Cov(f3,f4)
beta.s <- sqrt(beta.s2) # calculate SD of parameter estimates as square root of variances
beta <- beta[ grepl("Intercept|fruit", names(beta)) ] # for simplicity, I will limit this to the factor
beta.s <- beta.s[ grepl("Intercept|fruit", names(beta.s)) ] # under sum-to-zero contrasts
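As a check I added on that `sum(vcmat_tiny)` shortcut: since $\hat\beta_{\text{fruit5}} = \mathbf{c}^\top \hat{\boldsymbol\beta}_{\text{fruit}}$ with $\mathbf{c} = (-1,-1,-1,-1)^\top$, its variance is the quadratic form $\mathbf{c}^\top \Sigma \mathbf{c}$, which for this particular $\mathbf{c}$ is just the sum of all the entries of $\Sigma$. The two computations agree:
cvec <- rep(-1, 4)                      # contrast vector that builds fruit5 from fruit1..fruit4
drop( t(cvec) %*% vcmat_tiny %*% cvec ) # same number as sum(vcmat_tiny)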
This is where I really start to lose confidence in my calculations. I know (read: I think I know) that a confidence interval is computed as $\bar{x} \pm z\frac{s}{\sqrt{n}}$, so its half-width is $z\frac{s}{\sqrt{n}}$.
I have $\bar{x}$: that's the vector of $\beta$ coefficients, and I just calculated their standard deviations, $s$. $z$ is an easy calculation. That leaves $n$. Is this $n = N$, the size of the entire sample (2000)? Or does each level get its own $n$, the number of cases in that level in the data? Based on what I've seen so far while trying to replicate the output of `MASS::confint()`, it seems that neither is the case: I get closest to the values produced by that function's profile method if I don't divide $s$ by anything at all (effectively, $n = 1$).
I'll go through all cases here.
Population $N$:
z <- qnorm(1 - (alpha / 2)) # compute the critical z quantile corresponding to the desired confidence level
CIWidth.N <- z * beta.s / sqrt(N) # compute the half-width of the confidence interval (N=2000 for all parameters)
CIlo.N <- beta - CIWidth.N # calculate lower limit of CI
CIhi.N <- beta + CIWidth.N # calculate upper limit of CI
im1.OR_popN <- exp( cbind( OR = beta, # put it all together and exponentiate
`2.5 %` = CIlo.N,
`97.5 %` = CIhi.N ) ) %>%
as.data.frame.matrix( . ) %>%
mutate( Parameter = rownames( . ) ) %>%
select( Parameter, everything() )
Comparing these estimates and limits to those computed earlier, I have:
## Parameter OR 2.5 % 97.5 % Parameter OR 2.5 % 97.5 %
## 1 (Intercept) 0.3082855 0.2492946 0.3794741 (Intercept) 0.3082855 0.3068414 0.3097363
## 2 fruit1 1.2874891 1.0121429 1.6273995 fruit1 1.2874891 1.2806799 1.2943345
## 3 fruit2 0.9159739 0.7543939 1.1079502 fruit2 0.9159739 0.9120491 0.9199156
## 4 fruit3 0.8640049 0.6672064 1.1066674 fruit3 0.8640049 0.8591390 0.8688983
## 5 fruit4 0.8450893 0.6919834 1.0273585 fruit4 0.8450893 0.8413667 0.8488284
## 6 <NA> NA NA NA fruit5 1.1613271 1.1565738 1.1661000
Here, the original profile (`confint()`) estimates and limits are shown on the left; the corresponding Z-approximation ($n=N$) limits are on the right. Clearly, the latter intervals are far too narrow; $N$ is too big. Maybe the more appropriate $n$ to use is the $n_i$ in each level of the `fruit` factor?
n_i <- table(.df$fruit) # determine the number of cases for each level of `fruit`
fruitlevels <- c(fruit1="durian", fruit2="guava", # draft a map of factor levels --> contrast labels
fruit3="papaya", fruit4="pomegranate",
fruit5="soursop" )
names(n_i) <- mapvalues( names(n_i), # rename the table of cases according to that map
fruitlevels, names(fruitlevels) )
n_i[["(Intercept)"]] <- mean(n_i) # add an `n` for the 'average' case as the 'average'
n_i <- n_i[ match( names(beta.s), names(n_i) ) ] # of the `n`s????; rearrange to match beta.s
CIWidth.n_i <- z * beta.s / sqrt(n_i) # compute the CI half-width using level-specific `n`s
CIlo.n_i <- beta - CIWidth.n_i # calculate the lower confidence limit
CIhi.n_i <- beta + CIWidth.n_i # calculate the upper confidence limit
im1.OR_n_i <- exp( cbind( OR = beta, # put everything together and exponentiate
`2.5 %` = CIlo.n_i,
`97.5 %` = CIhi.n_i ) ) %>%
as.data.frame.matrix( . ) %>%
mutate( Parameter = rownames( . ) ) %>%
select( Parameter, everything() )
Again, I'll compare these to the limits estimated using profile methods:
## Parameter OR 2.5 % 97.5 % Parameter OR 2.5 % 97.5 %
## 1 (Intercept) 0.3082855 0.2492946 0.3794741 (Intercept) 0.3082855 0.3050658 0.3115392
## 2 fruit1 1.2874891 1.0121429 1.6273995 fruit1 1.2874891 1.2680903 1.3071846
## 3 fruit2 0.9159739 0.7543939 1.1079502 fruit2 0.9159739 0.9081644 0.9238505
## 4 fruit3 0.8640049 0.6672064 1.1066674 fruit3 0.8640049 0.8504203 0.8778065
## 5 fruit4 0.8450893 0.6919834 1.0273585 fruit4 0.8450893 0.8375471 0.8526995
## 6 <NA> NA NA NA fruit5 1.1613271 1.1519695 1.1707607
Still too narrow! Again, the `confint()` limits are on the left; the corresponding $n = n_i$ limits are on the right.
I guess in the extreme case, $n=1$. I'll try that.
Singular $n$:
CIWidth.1 <- z * beta.s / sqrt(1) # compute the half-width of the confidence interval using n=1
CI.lo1 <- beta - CIWidth.1 # calculate the lower confidence limit
CI.hi1 <- beta + CIWidth.1 # calculate the upper confidence limit
im1.OR_1 <- exp( cbind( OR = beta, # put everything together and exponentiate
`2.5 %` = CI.lo1,
`97.5 %` = CI.hi1 ) ) %>%
as.data.frame.matrix( . ) %>%
mutate( Parameter = rownames( . ) ) %>%
select( Parameter, everything() )
One last time! Compare:
## Parameter OR 2.5 % 97.5 % Parameter OR 2.5 % 97.5 %
## 1 (Intercept) 0.3082855 0.2492946 0.3794741 (Intercept) 0.3082855 0.2498973 0.380316
## 2 fruit1 1.2874891 1.0121429 1.6273995 fruit1 1.2874891 1.0156684 1.632056
## 3 fruit2 0.9159739 0.7543939 1.1079502 fruit2 0.9159739 0.7559328 1.109898
## 4 fruit3 0.8640049 0.6672064 1.1066674 fruit3 0.8640049 0.6711598 1.112260
## 5 fruit4 0.8450893 0.6919834 1.0273585 fruit4 0.8450893 0.6936808 1.029545
## 6 <NA> NA NA NA fruit5 1.1613271 0.9667092 1.395126
Much closer; now the CIs are just barely too wide, and appear to be slightly up-shifted.
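(For one more point of comparison, which I'm not treating as authoritative: `stats::confint.default()` computes Wald-type limits directly from the coefficient standard errors, so its output for `fruit1` through `fruit4` can be lined up against my $n=1$ columns above. It still says nothing about `fruit5`, of course.)
exp( confint.default(im1) ) # Wald limits from R itself, for the coefficients the model actually estimates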
To reiterate my questions:
- Is $N=1$ the way to go here? If so, why is that the case? My intuition is that $N$ might be the number of cases falling into the particular category for which the estimate is being made.
- If $N=1$ is not the way to go, where am I going wrong?
(If you've made it this far, thank you for reading!)