
This may be a basic question, but I can't find a straightforward explanation of the issue. When constructing a dichotomous dummy variable, how many observations should each category have? I'm working on a dataset of autocracies with 20 countries in total, and I was thinking about including a regional dummy variable. However, some regions, for example Latin America or Eastern Europe/Central Asia, have only one to three countries classified as autocracies in the cross-section. I suspect that including such a dummy would make credible estimation of the other coefficients impossible. Could someone give me a straightforward answer?

develarist
  • You could use regularization, or look at the answers at https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Sep 13 '20 at 17:13

1 Answer


There is no minimum sample size requirement, although problems can arise with small samples. For example, normality of the errors becomes more and more important for inference as the sample size shrinks. In the extreme case where a category contains only one observation, the dummy absorbs that observation completely: its residual is exactly zero, and the fitted coefficient equals the amount by which that single observation deviates from the fit implied by the rest of the data. Significance of the coefficient in the classic regression model then essentially boils down to whether that deviation is extreme relative to the normal distribution assumed for the remaining residuals. If the errors are not normal, the inference can be poor, and the central limit theorem cannot help, because the estimate is based on a single observation rather than an average. Bootstrapping can alleviate some of these problems.
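To make the single-observation case concrete, here is a minimal sketch on simulated data (the variable names, the seed, and the data-generating process are all made up for illustration, not taken from the question's dataset). It fits ordinary least squares with NumPy and checks that a dummy covering one observation leaves that observation with a zero residual, while the dummy's coefficient is just that observation's prediction error from a fit on the other 19.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

n = 20                               # same size as the question's cross-section
x = rng.normal(size=n)               # hypothetical continuous covariate
d = np.zeros(n)
d[0] = 1.0                           # dummy category containing a single observation
y = 1.0 + 0.5 * x + rng.normal(size=n)  # simulated outcome (assumed model)

# Full model: intercept, covariate, and the one-observation dummy.
X_full = np.column_stack([np.ones(n), x, d])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
resid_full = y - X_full @ beta_full

# Reduced model: drop observation 0 and the dummy entirely.
X_rest = np.column_stack([np.ones(n - 1), x[1:]])
beta_rest, *_ = np.linalg.lstsq(X_rest, y[1:], rcond=None)

# Prediction error of the dropped observation under the reduced fit.
pred_error = y[0] - (beta_rest[0] + beta_rest[1] * x[0])

print("residual of the dummied observation:", resid_full[0])
print("dummy coefficient:", beta_full[2])
print("prediction error from the other 19:", pred_error)
print("other coefficients, full vs reduced:", beta_full[:2], beta_rest)
```

In this sketch the intercept and slope are the same with or without the dummied observation, so a one-observation dummy is equivalent to dropping that country from the estimation of the other coefficients. With two or three countries in a category the equivalence is no longer exact, but the category still contributes very little information.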

BTW, this setup (one observation in a dummy category) is the basis for certain kinds of "event study" tests done in financial analysis, so it is not unusual.

Of course, as with all of statistics, a larger sample size (properly chosen) is better than a smaller one.

BigBendRegion
  • (+1) If a value is rare then often a bootstrap sample will not even include it and the coefficient in question can’t be calculated as the predictor is constant. – Nick Cox Sep 13 '20 at 13:17
  • Since the residual is zero for the case of one observation, you have to bootstrap the remainder. Reference: https://academic.oup.com/jfec/article-abstract/2/3/451/775603?redirectedFrom=fulltext – BigBendRegion Sep 13 '20 at 13:25
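As a rough illustration of the point about rare values dropping out of bootstrap samples, the sketch below assumes a plain case-resampling bootstrap (rows drawn with replacement, resample size equal to the original n). Under that assumption, the chance that a resample contains none of the k observations in a rare category is (1 - k/n)^n. The function name is hypothetical.

```python
def prob_category_missing(n: int, k: int) -> float:
    """Probability that a case-resampling bootstrap sample of size n,
    drawn with replacement from n rows, contains none of the k rows
    belonging to the rare category."""
    return (1 - k / n) ** n


# n = 20 countries, with 1 to 3 of them in the rare region.
for k in (1, 2, 3):
    print(k, round(prob_category_missing(20, k), 3))
```

For n = 20 this gives roughly 36%, 12%, and 4% for k = 1, 2, and 3. So with a single country in a region, about one resample in three cannot estimate that dummy at all, which is why resampling the remaining residuals instead of whole rows is suggested for the one-observation case.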