Controlling for a variable with unequal sample size

Question

I'm dealing with a Poisson regression model for count data. My DV is the number of Covid infections in a particular city. My explanatory variables are both categorical and continuous. One of them is the region which the city belongs to. Of course, I cannot have the same number of cities for each region.

My strategy is to create a multilevel factor called "Region" (equivalently, a set of dummy variables equal to "1" if the i-th city belongs to the k-th region and "0" otherwise), but I wonder whether the unequal distribution of cities across regions (a couple of regions have just 1 or 2 cities, while others have 9-10) may affect my results, and in which way.
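As a minimal sketch of the coding described above, the snippet below builds 0/1 indicator columns for a region factor, dropping one reference level, and tallies how many cities fall in each region. The city names, region names, and the `dummy_code` helper are hypothetical and only illustrate the idea; they are not taken from the actual data.

```python
from collections import Counter

# Hypothetical city-to-region assignments (illustrative only).
city_region = {
    "city_a": "North", "city_b": "North", "city_c": "North",
    "city_d": "South", "city_e": "South",
    "city_f": "Islands",
}

def dummy_code(regions, reference):
    """Return (levels, rows): one 0/1 column per non-reference level."""
    levels = sorted(set(regions) - {reference})
    rows = [[1 if r == level else 0 for level in levels] for r in regions]
    return levels, rows

regions = list(city_region.values())
levels, rows = dummy_code(regions, reference="North")

# Number of cities per region: this is the imbalance the question asks about.
sizes = Counter(regions)

print(levels)
for city, row in zip(city_region, rows):
    print(city, row)
print(dict(sizes))
```

A region with a single city gets its own column that is 1 for exactly one row, so its coefficient is determined by that one observation.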

Do you have both region AND city within region in your model? If so, are you using a nested model and are you accounting for the homogeneity within cities and regions using some sort of GEE/marginal model or a mixed effects model? If not, you probably should. Otherwise, your variance estimates are likely to be quite poor and too small. — StatsStudent, Jul 18 '20 at 17:50
I just have the variable "city" and the variable "region", a factor with 20 levels. I have other variables too, but I want to control for the region effect as well. Isn't the mixed effects model used for panel data? My idea was just to get 20 different intercepts, one per region, although I'm also thinking of aggregating the regions into bigger "clusters", because 20 intercepts (not counting the other predictors) seem like a lot to me when I only have about 200 observations. — maestus, Jul 26 '20 at 09:14

score 1 · Answer 1 · answered Jul 18 '20 at 17:17

If region is one of your factors of interest and you'd like to examine whether cases are more frequent in one region than in another, then it's fine to have the regions represented in unequal proportions. It is true that power is maximized when groups are of equal size, but this is usually not achievable with observational data. If region A is bigger, it will have more cities, and there is nothing we can do about that. What's more important is that your data are comprehensive and representative.
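To give a rough feel for what unequal group sizes cost, here is a small sketch under strong simplifying assumptions: a pure Poisson model with region as the only predictor and no overdispersion. In that case the standard error of a region's estimated log rate is approximately 1/sqrt(total cases in that region). The case totals below are hypothetical.

```python
import math

def log_rate_se(total_cases):
    """Approximate SE of a region's log rate under a pure Poisson model."""
    return 1.0 / math.sqrt(total_cases)

# Hypothetical totals: a region with 1 city vs. a region with 10 cities.
small_region_cases = 40
large_region_cases = 400

se_small = log_rate_se(small_region_cases)
se_large = log_rate_se(large_region_cases)

# SE of the log rate ratio between the two regions (independent counts).
se_contrast = math.sqrt(se_small**2 + se_large**2)

print(round(se_small, 3), round(se_large, 3), round(se_contrast, 3))
```

The contrast is dominated by the smaller region, so comparisons involving regions with only 1 or 2 cities will have wide confidence intervals, but the estimates themselves are not biased by the imbalance.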

When analyzing disease incidence, we should also be mindful of two other items: i) what is the time frame? And ii) what is the population at risk? The Poisson distribution lets us estimate the chance of seeing a certain number of events given their mean rate over a certain period of time. To properly compare disease risk across cities, we need to make sure that space and time are comparable, so that the counts are comparable too.

For time, make sure the statistics to be presented agree with the data inclusion criteria. Are you looking at total prevalence (all cases ever)? Or the incidence rate in June 2020? Whatever it is, make sure the data cover the same period across the cities.

For space, make sure you have included the population at risk in the regression model as an offset, in the form of the natural log of the population (see also this thread). Otherwise you'll just be computing the ratio of raw counts rather than the ratio of rates.
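The difference the offset makes can be sketched with a toy example. Assume a Poisson model whose only predictor is a two-level region indicator. Without an offset, the exponentiated region coefficient equals the ratio of mean counts per city. With the offset, i.e. log E[y_i] = log(pop_i) + b0 + b1 * [region B], it equals the ratio of cases per person. The counts and populations below are hypothetical.

```python
# Hypothetical (cases, population) pairs per city, grouped by region.
data = {
    "A": [(120, 100_000), (80, 50_000)],
    "B": [(300, 600_000)],
}

def totals(cities):
    """Sum cases and population over the cities of one region."""
    cases = sum(c for c, _ in cities)
    pop = sum(p for _, p in cities)
    return cases, pop

cases_a, pop_a = totals(data["A"])
cases_b, pop_b = totals(data["B"])

# No offset: exp(b1) is the ratio of mean counts per city (B vs. A).
count_ratio = (cases_b / len(data["B"])) / (cases_a / len(data["A"]))

# Offset log(population): exp(b1) is the ratio of rates per person (B vs. A).
rate_ratio = (cases_b / pop_b) / (cases_a / pop_a)

print(f"count ratio B/A: {count_ratio:.3f}")
print(f"rate ratio  B/A: {rate_ratio:.3f}")
```

Here region B looks three times worse on raw counts, yet its rate per person is lower than region A's, which is why the population offset matters.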

Thank you! For space, I thought that my offset variable would be the region area (rather than the density, which will be a "usual" predictor); for time, I'm going to work with raw counts (the total infections over 10 weeks of observation for each city). Is this a good approach? — maestus, Jul 25 '20 at 08:37
And I forgot to mention (but I don't see how it could affect the results) that my model controls for geographical proximity (it's a Poisson cross-regression model), so I will have a weights matrix with a set of spatially lagged predictors too. — maestus, Jul 25 '20 at 08:42
