I am currently thinking about a theoretical problem I cannot get my head around, so I am hoping to find a statistical mastermind here who can help me with it.

A classical thing to do in regression analysis is to include an interaction term of two variables. To see whether the effect of one variable differs for a specific subsample, one also has to include the main effects of the variables themselves. So, let us assume we want to check the effect of gender (1 = male) and marital status (1 = married) on wage. Furthermore, we want to check whether the effect of gender on wage is more pronounced for married people. We would estimate something like

wage ~ gender + marital status + gender*marital status

With this kind of setup we can easily get coefficient estimates for each of the variables.
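
As a minimal sketch of why this works, one can build the design matrix for a toy data set and check its rank. The data below are made up (three observations in each of the four gender/marital-status cells), and the variable name married is only a stand-in for marital status:

```r
# Hypothetical toy data: every gender x marital-status cell is observed.
cells <- expand.grid(gender = c(0, 1), married = c(0, 1))
d <- cells[rep(1:4, each = 3), ]   # three observations per cell

# Intercept, both main effects and their product.
X <- model.matrix(~ gender * married, data = d)

colnames(X)   # "(Intercept)" "gender" "married" "gender:married"
ncol(X)       # number of columns in the design matrix
qr(X)$rank    # equals ncol(X) here, so every coefficient is identified
```

The four columns describe four cell means, so nothing is redundant in this specification.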

However, let us now think about a different example, where we want to build a specification with fixed effects for, let us say, year and state (it doesn't really matter, because I think my question is universally applicable), to control for unobserved heterogeneity between states and over time. If we also wanted to control for different time trends in each of the states, we would need to include state x year fixed effects. The model to estimate would then look something like this:

wage ~ factor(year) + factor(state) + factor(year*state)

So to me, it looks very similar to the first one. However, you cannot estimate the coefficients of the year dummies and state dummies. This is why in econometric research, when the FE for the interaction of, e.g., year and state is included, the single FEs typically are not.

This is what I cannot get my head around. The two designs look basically the same: in both we have the single variables and an interaction between them. Yet for one model you can estimate all coefficients, and for the other you cannot.

My guess is that it has to do with the fact that in the first case we include gender*marital status as one new variable, which is always 0 when, e.g., gender is female. This means that, within the 0-values of the new variable gender*marital status, we still have variation in marital status, which is why we can estimate the coefficient.

For the second case, we include a dummy for every year-state combination (all of which are in the model), so within each combination there is of course no variation across states left, which is why we cannot include the state dummies as well.

So to sum up, it feels to me like it has to do with the fact that in the second case we include all possible cases as factors, whereas in the first case we create a new variable that is zero whenever one of the variables is zero, without distinguishing between 0 - 0 and 0 - 1.

Can anyone confirm these assumptions? Maybe someone can even write it down in a mathematical way that makes this clearer, or refer me to some books that explain it. Everything helps.

Max

1 Answer

Not really an answer, but too long for a comment: isn't this a bit like the other question we discussed some time ago?

The way I think of state-year interactions is as follows (I am not fully sure, though, whether this is precisely how state-year interactions are to be generated):

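# Two states observed over four years: rows 1-4 are state 1, rows 5-8 are state 2
state.dummies <- matrix(c(rep(1,4), rep(0,4), rep(0,4), rep(1,4)), ncol=2)
# One dummy per year; the four-year pattern repeats for each state
time.dummies <- matrix(c(rep(c(1,0,0,0),2), rep(c(0,1,0,0),2), rep(c(0,0,1,0),2), rep(c(0,0,0,1),2)), ncol=4)
# One dummy per state-year cell: each state dummy times every year dummy
state.time.dummies <- cbind(state.dummies[,1]*time.dummies, state.dummies[,2]*time.dummies)

# State, year and state-year dummies side by side (2 + 4 + 8 = 14 columns)
all.effects <- cbind(state.dummies,
      time.dummies,
      state.time.dummies
)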

Giving:

> all.effects <- cbind(state.dummies,
+       time.dummies,
+       state.time.dummies
+ )

> state.dummies
     [,1] [,2]
[1,]    1    0
[2,]    1    0
[3,]    1    0
[4,]    1    0
[5,]    0    1
[6,]    0    1
[7,]    0    1
[8,]    0    1

> time.dummies
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
[5,]    1    0    0    0
[6,]    0    1    0    0
[7,]    0    0    1    0
[8,]    0    0    0    1

> all.effects
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,]    1    0    1    0    0    0    1    0    0     0     0     0     0     0
[2,]    1    0    0    1    0    0    0    1    0     0     0     0     0     0
[3,]    1    0    0    0    1    0    0    0    1     0     0     0     0     0
[4,]    1    0    0    0    0    1    0    0    0     1     0     0     0     0
[5,]    0    1    1    0    0    0    0    0    0     0     1     0     0     0
[6,]    0    1    0    1    0    0    0    0    0     0     0     1     0     0
[7,]    0    1    0    0    1    0    0    0    0     0     0     0     1     0
[8,]    0    1    0    0    0    1    0    0    0     0     0     0     0     1

> qr(all.effects)$rank
[1] 8
> qr(state.time.dummies)$rank
[1] 8

So basically, with interactions, you allow for a separate fixed effect for each state-year combination (eight of them here), while the conventional two-way effect model has an additive structure giving rise to only 2+4=6 effects (one of which is lost due to collinearity).
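
The same point can be written down as a short derivation. The notation is my own and not taken from the question: $D^{s}_{i}$ and $D^{t}_{i}$ are the state and year dummies for observation $i$, $D^{st}_{i} = D^{s}_{i} D^{t}_{i}$ are the state-year dummies, and $\alpha_s$, $\gamma_t$, $\delta_{st}$ are the corresponding effects. Since every observation belongs to exactly one state and one year:

```latex
% Each state (year) dummy is the sum of the cell dummies for that state (year):
D^{s}_{i} = \sum_{t} D^{st}_{i}, \qquad D^{t}_{i} = \sum_{s} D^{st}_{i}

% Substituting into the model with all three sets of dummies:
y_i = \sum_{s} \alpha_s D^{s}_{i} + \sum_{t} \gamma_t D^{t}_{i}
      + \sum_{s,t} \delta_{st} D^{st}_{i} + \varepsilon_i
    = \sum_{s,t} \left( \alpha_s + \gamma_t + \delta_{st} \right) D^{st}_{i} + \varepsilon_i

% Gender example for comparison: four cells, four parameters
E[\text{wage} \mid g, m] = \beta_0 + \beta_1 g + \beta_2 m + \beta_3 g m
```

Only the sum $\alpha_s + \gamma_t + \delta_{st}$ is pinned down by the data, not its three components. In the gender example, by contrast, the four cell means ($\beta_0$, $\beta_0 + \beta_1$, $\beta_0 + \beta_2$, $\beta_0 + \beta_1 + \beta_2 + \beta_3$) map one-to-one onto the four coefficients, so that specification is just a reparametrisation of one mean per cell and nothing is redundant.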

If you included both (as in all.effects, which has 14 columns), you would still only have rank 8, so there is no way to additionally estimate conventional state and time effects.
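
The rank argument can also be seen directly in a regression fit. The following is only a sketch on simulated data (a hypothetical balanced panel with pure-noise wages, so only the design matters), using interaction() to build the state-year cells:

```r
# Hypothetical balanced panel: 2 states x 4 years, 3 observations per cell.
set.seed(1)
d <- expand.grid(state = factor(1:2), year = factor(1:4), rep = 1:3)
d$wage <- rnorm(nrow(d))   # pure noise; only the design matters here

# Conventional additive two-way fixed effects.
fit.add <- lm(wage ~ state + year, data = d)
fit.add$rank               # 5: intercept + 1 state + 3 year contrasts

# Additive effects plus a full set of state-year cell dummies.
fit.all <- lm(wage ~ state + year + interaction(state, year), data = d)
fit.all$rank               # 8: one free parameter per state-year cell
sum(is.na(coef(fit.all)))  # 4 coefficients are aliased and reported as NA
```

lm() does not fail here; it silently drops the redundant columns and reports NA for them, and which coefficients end up as NA depends on the order of the terms in the formula.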

Christoph Hanck