
General setting: I am interested in estimating the effect of D and would like to use within-group variation, which is more plausibly random in this setting.

Are small and unbalanced groups a concern in this case (unbiasedness, consistency, etc.)? For example, assume N is 200k, there are 56k groups, the average group size is 3.6 (the median is 2.0), and almost 50% of groups have just one observation. Of course, there is no within variation for groups with one observation.

Any literature that discusses this case would be helpful. Wooldridge has brief discussions of the "large N and small T" case in his books but they don't answer my questions.

Note: There are various related questions such as this one, but the discussion there focuses on random rather than fixed effects models.

Illustration with simulated data

This code simulates data with a similar group structure (a large number of groups with 1 observation, unbalanced group sizes):

library("tidyverse")
library("gamlss.dist")
library("fixest")

set.seed(123)
n_groups <- 50000
prop_one <- 0.25
groups <- tibble(
        group = 1:n_groups,
        obs   = rZIP(n_groups, 0.9, prop_one),
    ) %>%
    mutate(obs = round(ifelse(obs == 0, 1, exp(obs))))

b <- 1.2
data <- groups %>%
    uncount(obs) %>%
    mutate(
        D = rbinom(n(), 1, 0.5),
        y = D * b + rnorm(n(), mean = 0, sd = 10)
    )

model <- feols(y ~ D | group, data = data)
summary(model)

In this setting, my question is whether estimating the effect of $D$ with a within estimator such as the fixed effects regression model $y_{ig} = \alpha_g + \beta D_{ig} + \epsilon_{ig}$ is problematic because a large proportion of groups has only 1 observation and many groups have a small N. Of course, with an increasing number of groups with one observation (a higher prop_one parameter), uncertainty in the estimates will increase as well. But does the estimator remain unbiased and consistent? Are there any other concerns? Any literature that discusses this situation and might point at thresholds?
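As a rough sketch of how one could check this empirically (the helper sim_once() and the 200 replications are arbitrary illustrative choices, not part of the setup above), one can re-run the simulation and compare the average estimate to the true b:

# Sketch: repeat the simulation above to look at bias of the within estimator
sim_once <- function() {
    groups <- tibble(
        group = 1:n_groups,
        obs   = rZIP(n_groups, 0.9, prop_one)
    ) %>%
        mutate(obs = round(ifelse(obs == 0, 1, exp(obs))))
    data <- groups %>%
        uncount(obs) %>%
        mutate(
            D = rbinom(n(), 1, 0.5),
            y = D * b + rnorm(n(), mean = 0, sd = 10)
        )
    coef(feols(y ~ D | group, data = data))["D"]
}

estimates <- replicate(200, sim_once())
mean(estimates) - b   # average deviation from the true effect
sd(estimates)         # Monte Carlo spread of the estimates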

greg

1 Answer


You will have a problem with estimating fixed group effects, especially when you have just one observation per group. To illustrate:

Group   D   value
    1   1      10
    2   0       0
    3   1      11
    4   0       1

If you fit a regression model like $\text{value}_i = \alpha_{\text{group}(i)} + \beta D_i + \epsilon_i$, then the parameters of the model are not identifiable. E.g. $\hat{\alpha}_1 = 10$, $\hat{\alpha}_2 = 0$, $\hat{\alpha}_3 = 11$, $\hat{\alpha}_4 = 1$ and $\hat{\beta}=0$ gives a perfect fit/maximizes the likelihood, but so does e.g. $\hat{\alpha}_1 = 5$, $\hat{\alpha}_2 = 0$, $\hat{\alpha}_3 = 6$, $\hat{\alpha}_4 = 1$ and $\hat{\beta}=5$, as well as an infinite set of other solutions. If you then want to estimate the within-group variation and want to allow it to vary across groups, things only get worse: with one observation per group, you simply cannot tell within-group (residual) variability apart from between-group variability.
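To make this concrete, here is a minimal sketch in R (using base lm() on the toy data from the table above; lm() resolves the ambiguity by dropping one aliased coefficient and reporting it as NA):

# Toy data from the table above: one observation per group
toy <- data.frame(
    group = factor(1:4),
    D     = c(1, 0, 1, 0),
    value = c(10, 0, 11, 1)
)

# With an intercept, 3 group dummies and D there are 5 coefficients but only
# 4 observations; the coefficient on D cannot be identified and comes out NA
coef(lm(value ~ group + D, data = toy))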

How does one resolve that issue? Options:

  1. Use random effects that make assumptions about how intercepts vary across groups (ideally after adjusting for all key aspects in which groups could differ in a way that would lead to a different outcome). This kind of setting with many groups and very few records per group is one of the key motivations for random effects models; a key term to look for is "exchangeability". Note that random effects models are not limited to normally distributed random effects, but software is typically most mature for that setting. See the sketch after this list.
  2. Go Bayesian and use prior information so that all model parameters are at least weakly identified, which resolves the non-identifiability in the sense of leading to, e.g., unique maximum-a-posteriori estimates.
  3. A combination of (1) and (2).
  4. Use some form of regularization (e.g. ridge, LASSO, or elastic net regression). This will in some sense be equivalent to Bayesian maximum-a-posteriori estimation with a particular prior.
  5. Some kind of embedding approach as seen e.g. in the machine learning literature, where you find a low-dimensional numeric representation for each group (e.g. one, two, three or more continuous covariates instead of a fixed effect for every single group; group 1 might be [0.732, -0.442, 2.013], group 2 [-1.234, 2.224, 0.0277], and so on). Such embeddings can sometimes be created via other tasks or learned in a neural network (with an embedding layer) when training for the task at hand (however, this would again need regularization in your case).
  6. Ad-hoc solutions like combining small groups, e.g. based on some kind of similarity criterion. In some particular circumstances this can be quite reasonable.
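For option (1), a minimal sketch on the simulated data from the question (this assumes the lme4 package, which is not part of the question's code; a random intercept per group with otherwise default settings):

library("lme4")

# Random intercept per group instead of a fixed effect per group; groups with
# a single observation still contribute to estimating the effect of D, with
# their intercepts shrunk towards the overall mean
re_model <- lmer(y ~ D + (1 | group), data = data)
summary(re_model)
fixef(re_model)["D"]   # compare with the within estimate from feols()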
Björn
  • Thanks! I just updated my question with more details. Your example literally has zero within-group variation, which of course does not work. Aside from increased uncertainty in the estimates, what is the problem in situations with small N per group and many groups with 1 observation? – greg Jan 25 '22 at 15:38