Let's say I have a data frame that looks like this:
groups <- floor(runif(1000, min=1, max=5))
activity <- rep(c("A1", "A2", "A3", "A4"), times= 250)
endorsement <- floor(runif(1000, min=0, max=2))
value1 <- runif(1000, min=1, max=10)
area <- rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
df <- data.frame(groups, activity, endorsement, value1, area)
Printed:
> head(df)
  groups activity endorsement   value1 area
1      1       A1           0 7.443375    A
2      1       A2           0 4.342376    A
3      1       A3           0 4.810690    A
4      4       A4           0 3.494974    A
5      3       A1           1 6.442354    B
6      1       A2           0 9.794138    C
I want to run a logistic regression (predicting endorsement from groups), but if you look at the area variable, area A is very well represented, whereas areas B and E are not.
I'm not interested in the area variable itself, but the stats will be driven by the areas that have high representation in the dataset, so I need to weight the data. I'm just not sure of the correct way to do it.
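To make the imbalance concrete, here is a minimal sketch that counts rows per area in the example data and builds one possible set of inverse-frequency weights. The names area_counts, w, and weighted_totals are hypothetical, and the rescaling step is just one convention. A vector like w could be passed to the weights argument of glm, but whether that gives valid inference is part of what I'm unsure about.

```r
# Rebuild only the area column from the example above
area <- rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)

# Rows per area in the example: A = 400, B = 100, C = 200, D = 200, E = 100
area_counts <- table(area)
print(area_counts)

# Hypothetical inverse-frequency weights (an assumption, not a vetted method):
# each row gets 1 / (number of rows in its area), then the weights are
# rescaled so that they sum to the total number of rows
w <- 1 / as.numeric(area_counts[area])
w <- w * length(area) / sum(w)

# With these weights every area carries the same total weight
weighted_totals <- tapply(w, area, sum)
print(weighted_totals)
```

In this toy example each area ends up with the same total weight, so rows from A are weighted down and rows from B and E are weighted up.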
This is the model I'd like to run:
library(lsmeans)
model <- glm(endorsement ~ factor(groups), data=df, family=binomial(logit))
anova(model, test = "Chisq")
lsmeans(model, pairwise ~ groups)
Without any adjustment, the "main effect" of groups and any pairwise differences will primarily be driven by effects found in the most represented area (in the actual dataset, area A has about 100x more subjects than any other area).
What's the correct way to adjust for the unbalanced area representation? I thought about upsampling the minority areas (or even downsampling the majority area), but I feel like this would have adverse/artificial effects on the power of the test.
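For clarity, this is roughly what I mean by downsampling: a sketch only, which assumes that cutting every area down to the size of the smallest one is the goal. The seed and the names n_min, keep, and df_down are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Recreate the example data from above
groups <- floor(runif(1000, min = 1, max = 5))
activity <- rep(c("A1", "A2", "A3", "A4"), times = 250)
endorsement <- floor(runif(1000, min = 0, max = 2))
value1 <- runif(1000, min = 1, max = 10)
area <- rep(c("A", "A", "A", "A", "B", "C", "C", "D", "D", "E"), times = 100)
df <- data.frame(groups, activity, endorsement, value1, area)

# Size of the smallest area (100 rows for B and E in this example)
n_min <- min(table(df$area))

# Downsample: draw n_min row indices at random, without replacement,
# from within each area
keep <- unlist(lapply(split(seq_len(nrow(df)), df$area),
                      function(idx) sample(idx, n_min)))
df_down <- df[keep, ]

# Every area now contributes the same number of rows
print(table(df_down$area))
```

In this example the balanced data frame keeps only 500 of the 1000 rows, which is exactly the loss of information I'm worried about.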