General setting: I am interested in estimating the effect of $D$ and would like to use within-group variation, which is more plausibly random in this setting.
Are small and unbalanced groups a concern in this case (unbiasedness, consistency, etc.)? For example, assume N is 200k, there are 56k groups, the average group size is 3.6 (median is 2.0), and almost 50% of groups have just one observation. Of course, there is no within-group variation for groups with one observation.
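To make the last point concrete, here is a sketch of the standard within (demeaning) estimator. The indexing is my own notational assumption: $i$ denotes an observation in group $g$, $n_g$ is the size of group $g$, and $\bar D_g$, $\bar y_g$ are group means.

$$
\hat\beta_{FE} = \frac{\sum_{g}\sum_{i=1}^{n_g} (D_{ig}-\bar D_g)(y_{ig}-\bar y_g)}{\sum_{g}\sum_{i=1}^{n_g} (D_{ig}-\bar D_g)^2}
$$

For a group with $n_g = 1$ we have $D_{1g} = \bar D_g$ and $y_{1g} = \bar y_g$, so its terms are zero in both the numerator and the denominator. The same holds for any larger group in which $D$ does not vary.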
Any literature that discusses this case would be helpful. Wooldridge has brief discussions of the "large N and small T" case in his books but they don't answer my questions.
Note: There are various related questions such as this, but the discussion there focuses on random-effects rather than fixed-effects models.
Illustration with simulated data
This code simulates data with a similar group structure (a large number of groups with one observation and unbalanced group sizes):
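library("tidyverse")
library("gamlss.dist")
library("fixest")

set.seed(123)

n_groups <- 50000
prop_one <- 0.25  # zero-inflation probability passed to rZIP

# Group sizes: zero draws become 1; positive draws are exponentiated and rounded
groups <- tibble(
  group = 1:n_groups,
  obs = rZIP(n_groups, 0.9, prop_one)
) %>%
  mutate(obs = round(ifelse(obs == 0, 1, exp(obs))))

b <- 1.2  # true effect of D

# Expand to one row per observation, then draw treatment and outcome
data <- groups %>%
  uncount(obs) %>%
  mutate(
    D = rbinom(n(), 1, 0.5),
    y = D * b + rnorm(n(), mean = 0, sd = 10)
  )

# Fixed effects (within) regression with group fixed effects
model <- feols(y ~ D | group, data = data)
summary(model)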
In this setting, my question is whether estimating the effect of $D$ with a within estimator, such as the fixed effects regression model $y_{ig} = \alpha_g + \beta D_{ig} + \epsilon_{ig}$ (observation $i$ in group $g$), is problematic because a large proportion of groups have one observation and many other groups are small. Of course, as the number of groups with one observation increases (a higher prop_one parameter), uncertainty in the estimates will increase as well. But does the estimator remain unbiased and consistent? Are there any other concerns? Is there any literature that discusses this situation and might point to thresholds?
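One way to probe part of this empirically is to compare the within estimate on the full sample with the estimate after dropping single-observation groups. The sketch below is only an illustration under my own assumptions: it uses base R with a hand-rolled demeaning estimator rather than fixest, a simpler group-size distribution than the simulation above, and it adds group effects `alpha` that the original simulation does not have. The names `within_beta`, `sizes`, and `keep` are hypothetical.

```r
set.seed(123)

# Assumed design: about half of all groups are singletons
n_groups <- 5000
sizes <- sample(1:5, n_groups, replace = TRUE,
                prob = c(0.5, 0.2, 0.15, 0.1, 0.05))
group <- rep(seq_len(n_groups), times = sizes)
n <- length(group)

b <- 1.2
alpha <- rnorm(n_groups)[group]  # group effects (assumption, not in the original code)
D <- rbinom(n, 1, 0.5)
y <- alpha + b * D + rnorm(n, sd = 10)

# Within estimator: demean y and D by group, then regress through the origin
within_beta <- function(y, D, group) {
  y_dm <- y - ave(y, group)
  D_dm <- D - ave(D, group)
  sum(D_dm * y_dm) / sum(D_dm^2)
}

# Flag observations that belong to groups with more than one observation
size_of_obs <- ave(rep(1, n), group, FUN = sum)
keep <- size_of_obs > 1

beta_all <- within_beta(y, D, group)
beta_no_singletons <- within_beta(y[keep], D[keep], group[keep])

# Compare the two point estimates
print(c(all = beta_all, no_singletons = beta_no_singletons))
print(all.equal(beta_all, beta_no_singletons))
```

Because singleton groups demean to zero, the two point estimates should agree up to numerical tolerance. That suggests the singletons reduce the effective sample rather than enter the point estimate, although this says nothing by itself about how standard errors or degrees-of-freedom corrections treat them.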