I have a linear model of the form y ~ x + z + x:z.
My data are unbalanced and also have a few missing cells (the data would be unbalanced even without those missing cells).
My understanding of the different types of sums of squares (SS) comes mostly from how you can obtain the SS manually using different model comparisons in R. For example, to obtain the Type III SS associated with z, you would compare a model including x and x:z to a model including x, z, and x:z. More details on this procedure can be found in this answer.
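As a minimal sketch of the comparison described above, assuming x and z are factors: the toy data, the seed, and the names `d`, `full`, `reduced`, and `ss_z` are all hypothetical and only for illustration. One caveat I am fairly sure of is that, for factors, writing `y ~ x + x:z` in a formula does not give the intended reduced model, because R recodes `x:z` so that it absorbs the z main effect. The sketch therefore deletes the z columns from the model matrix directly, using sum-to-zero contrasts.

```r
# Hypothetical toy data: two factors, unbalanced but with no empty cells
set.seed(1)
d <- expand.grid(x = factor(1:2), z = factor(1:3), rep = 1:4)
d$y <- rnorm(nrow(d))
d <- d[-c(1, 2, 5), ]  # drop one row from each of three cells to unbalance

# Full model with sum-to-zero contrasts
full <- lm(y ~ x * z, data = d,
           contrasts = list(x = contr.sum, z = contr.sum))

# Reduced model: same model matrix with the z main-effect columns removed
X <- model.matrix(full)
z_cols <- which(attr(X, "assign") == 2)  # terms are numbered x = 1, z = 2, x:z = 3
X_reduced <- X[, -z_cols]
reduced <- lm(d$y ~ X_reduced - 1)       # "- 1" because X_reduced already holds the intercept

# SS for z = increase in residual SS when the z columns are removed
ss_z <- deviance(reduced) - deviance(full)
ss_z
```

If the car package is installed, the value should agree with the z row of `Anova(full, type = 3)`, as long as the same sum-to-zero contrasts are used.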
I understand this model comparison method (for Type III SS) as throwing away any variation that cannot be unambiguously assigned to only one term. I can apply this method to data with missing cells, and the Anova function in the car package in R will compute the anova table for such data, but I have read in some places (e.g. here) that Type III SS should not be used for data with missing cells.
- Is it really wrong to use Type III SS with missing cells, or just not normally appropriate, and why?
- How is the interpretation of an anova table using Type III SS with missing cells different from one without missing cells (but still having unbalanced data)?
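To make "missing cells" concrete, here is a small sketch (hypothetical toy data and names, not my real data) of what an empty cell does to the model with interaction: the cell count is zero, the design matrix loses rank, and `lm()` reports one interaction coefficient as `NA`.

```r
# Hypothetical toy data with one cell (x = 2, z = 3) emptied on purpose
set.seed(2)
d <- expand.grid(x = factor(1:2), z = factor(1:3), rep = 1:3)
d$y <- rnorm(nrow(d))
d_missing <- subset(d, !(x == "2" & z == "3"))

# Cell counts: the emptied cell shows up as 0
cell_counts <- xtabs(~ x + z, data = d_missing)
cell_counts

# Interaction model: 6 coefficients requested, but only 5 non-empty cells
fit <- lm(y ~ x * z, data = d_missing)
n_aliased <- sum(is.na(coef(fit)))  # number of coefficients that cannot be estimated
n_aliased
```

This only shows the mechanics of an empty cell; it does not by itself say whether the Type III tests are still meaningful.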
EDIT
I ran the following simulation in R to compare Type I error rates for different patterns of missing observations in synthetic data: balanced, observations missing completely at random, cells missing completely at random, and both cells and observations missing at random, all with the same number of observations. From the results, I don't see a problem with using Type III SS on data with missing cells: the Type I error rates don't seem to be inflated for any of the data sets.
For one run I got the following observed Type I error rates:
                         intercept   effect A   effect B
balanced                   .056        .058       .039
missing obs                .052        .052       .039
missing cells              .050        .055       .044
missing obs and cells      .056        .050       .049
In a different setup I added a main effect for B, and neither the power nor the observed alpha level seems to be affected by missing cells.
Am I not understanding something important here?
Here is the code I used:
library(car) # for Type III SS
# Data properties (simulation may not work if changing these)
nLevelsA <- 5
nLevelsB <- 5
nreps <- 5
factA <- rep(1:5, times = nLevelsA*nreps)
factB <- rep(1:5, each = nLevelsB*nreps)
cell <- factor(paste(factA, factB, sep = ""))
nSims <- 1000 # Number of repetitions of the simulation
# Empty output objects
mod1P <- matrix(rep(NA, times = nSims * 3), ncol = 3)
mod2P <- matrix(rep(NA, times = nSims * 3), ncol = 3)
mod3P <- matrix(rep(NA, times = nSims * 3), ncol = 3)
mod4P <- matrix(rep(NA, times = nSims * 3), ncol = 3)
# Create four datasets, all based on the same values, but different patterns of cell counts
for(i in 1:nSims){
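# A and B are converted to factors so that lm() treats them as categorical, not as numeric covariates
myData0 <- data.frame(y = rnorm(nLevelsA*nLevelsB*nreps, 0, 1), A = factor(factA), B = factor(factB), cell = cell)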
# Use this to add a main effect for B
# for(m in 1:nLevelsB){
# myData0[which(myData0$B == m), 1] <- myData0[which(myData0$B == m), 1] + m
# }
# Randomly remove one observation from each cell
myData1 <- myData0
for(j in 1:length(levels(cell))){
# randomly pick one row from each cell
rowNums <- 1:nrow(myData1)
rowToDrop <- sample(rowNums[which(myData1$cell == levels(cell)[j])], 1)
myData1 <- myData1[-rowToDrop, ]
}
# Randomly remove observations without regard to cell
myData2 <- myData0[sample(1:nrow(myData0), nrow(myData1), replace = FALSE), ]
# Randomly empty cells from balanced data
myData3 <- myData0
cellToDrop <- sample(levels(myData0$cell), 5)
for(k in 1:length(cellToDrop)){
myData3 <- myData3[which(myData3$cell != cellToDrop[k]), ]
}
# Randomly empty cells from unbalanced data
myData4 <- myData0
cellToDrop <- sample(levels(myData0$cell), 3)
for(l in 1:length(cellToDrop)){
myData4 <- myData4[which(myData4$cell != cellToDrop[l]), ]
}
myData4 <- myData4[sample(1:nrow(myData4), nrow(myData1), replace = FALSE), ]
# nrow(myData0) # basis to start with to get the same number of observations in each other dataset
# nrow(myData1) # balanced
# nrow(myData2) # observations missing at random
# nrow(myData3) # cells missing at random
# nrow(myData4) # cells and observations missing at random
#
#
# xtabs(~ A + B, data = myData0)
# xtabs(~ A + B, data = myData1)
# xtabs(~ A + B, data = myData2)
# xtabs(~ A + B, data = myData3)
# xtabs(~ A + B, data = myData4)
mod1 <- lm(y ~ A + B, data = myData1)
mod2 <- lm(y ~ A + B, data = myData2)
mod3 <- lm(y ~ A + B, data = myData3)
mod4 <- lm(y ~ A + B, data = myData4)
# P-values for intercept, factor A, and factor B
mod1P[i, ] <- Anova(mod1, type = 3)$'Pr(>F)'[1:3]
mod2P[i, ] <- Anova(mod2, type = 3)$'Pr(>F)'[1:3]
mod3P[i, ] <- Anova(mod3, type = 3)$'Pr(>F)'[1:3]
mod4P[i, ] <- Anova(mod4, type = 3)$'Pr(>F)'[1:3]
}
# Count how many times a significant test result is found
p1 <- mod1P <= .05
p2 <- mod2P <= .05
p3 <- mod3P <= .05
p4 <- mod4P <= .05
# Get proportion of times a significant test result is found
pobs1 <- apply(p1, MARGIN = 2, FUN = sum) / nSims
pobs2 <- apply(p2, MARGIN = 2, FUN = sum) / nSims
pobs3 <- apply(p3, MARGIN = 2, FUN = sum) / nSims
pobs4 <- apply(p4, MARGIN = 2, FUN = sum) / nSims
# Proportion of experiments observed as significant at 0.05 level.
# Intercept, then factor A, then factor B
pobs1 # balanced
pobs2 # observations missing at random
pobs3 # cells missing at random
pobs4 # cells and observations missing at random