
I'm trying to estimate a difference-in-differences model with pooled cross-sectional data. To test the parallel trends assumption, I'm following the approach in the "Common trend assumption" post and running

$$ Y_{it} = \text{Household FE} + \text{time FE} + \sum_{j \neq k} \delta_j \cdot \text{Treated} \cdot I(t = j) + X'b + \text{error} $$

My question is the following: in order to add household fixed effects with pooled cross-sectional data, should I create a unique 'household' factor for each year and then combine, or do so with the already combined data?

  • Welcome. I would combine all years and estimate separate dummies for each household. A unique ‘household-year’ effect might fully saturate the model. Do you observe individuals embedded *within* these households? If possible, could you show us a small subset of your data? – Thomas Bilach Jul 27 '20 at 22:45
  • Yes, you can observe individuals within households. Also, from year to year, only about 25% of the households change, but the data is anonymized, so I can't quite follow people through the years. I'm just wondering how to implement household fixed effects given that. Also, if I may ask, how do I share parts of my data? – luisdiego24 Jul 28 '20 at 03:46
  • It is not necessary to do this. How many years of data do you have? And, is treatment at the household level? – Thomas Bilach Jul 28 '20 at 03:58
  • I have 22 years' worth of data, with a staggered intervention at the district level that started in year 9 @ThomasBilach – luisdiego24 Jul 28 '20 at 17:23
  • So it’s households embedded *within* districts? Please clarify. In settings with staggered treatment adoption, the model requires a full set of dummies for all households *and* a full set of dummies for all years. – Thomas Bilach Jul 28 '20 at 17:36
  • That is correct, households within districts, and the intervention is at the district level. – luisdiego24 Jul 28 '20 at 17:46
  • Is there a well-defined level of aggregation at the household level? If one particular district receives the intervention, are all households within that district considered treated? If it is well-defined, then household dummies are the way to go. – Thomas Bilach Jul 28 '20 at 18:17
  • I must confess I don't entirely understand your first question, but for your second the answer is yes. I will create the household dummies, though my initial question was whether to create those dummies for households in each year and then put it all together into one big data set, or to create the dummies in the whole pooled cross-sectional data set. Thanks again! – luisdiego24 Jul 28 '20 at 18:45
  • You're estimating a difference-in-differences equation. In most settings, the data is usually 'aggregated up' to a higher level. Treatment in this case is at the *district* level. Why not use *district* and *year* fixed effects? Also, it appears a new subset of households is treated each year as you obtain a new cross-section, though most will be treated multiple times. Thus, across all years a household might only appear once in the dataset. A dummy for that household would be a singleton, and clustering at the household level would be problematic. Are district fixed effects not appropriate? – Thomas Bilach Jul 28 '20 at 21:28
  • I believe you're right. I'll try district fixed effects. I suppose I was trying to find a way to implement household fixed effects only to run the regression specified in the "Common trends assumption" link to check for that assumption. That said, I realized it makes little sense to talk about household fixed effects without using a panel, so I'll use district effects instead. Thanks for your help! – luisdiego24 Jul 28 '20 at 21:53
  • I included a response which I think might help. Let me know if anything is unclear. – Thomas Bilach Jul 28 '20 at 22:56

1 Answer


You're estimating a difference-in-differences (DiD) equation. In most settings, the data is 'aggregated up' to a higher level. Treatment in this case is at the district level, so it should affect all households within each treated district. I would recommend estimating your model using a full set of dummies for all districts and a full set of dummies for all years.

Here is what I think you want to estimate:

$$ y_{idt} = \gamma_{d} + \lambda_{t} + \delta D_{dt} + \theta X_{idt} + \epsilon_{idt}, $$

where you observe $i$ households within $d$ districts across $t$ years. $\gamma_{d}$ and $\lambda_{t}$ are fixed effects for districts and years, respectively. Your treatment dummy $D_{dt}$ varies at the district-year level. All we need is a sample of households in the relevant districts $d$ in the various years $t$. The intervention (treatment) is well-defined at this higher level of aggregation; it affects all households embedded within districts. The coefficient $\delta$ on $D_{dt}$ is your treatment effect.
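To make the district-level specification concrete, here is a minimal sketch on simulated repeated cross-sections. Everything in it is an assumption for illustration: the sample sizes, the adoption schedule (five districts adopting in years 9 to 13, the rest never treated), the true effect of 2.0, and the omission of the covariates $X_{idt}$. It estimates $\delta$ by plain least squares on district dummies, year dummies, and $D_{dt}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_districts, n_years, n_households = 10, 22, 30  # hypothetical sizes
true_delta = 2.0                                 # hypothetical treatment effect

# Hypothetical staggered adoption: districts 0-4 adopt in years 9-13,
# districts 5-9 are never treated.
adoption_year = np.full(n_districts, np.inf)
adoption_year[:5] = np.arange(9, 14)

# One row per sampled household; a fresh cross-section is drawn each year,
# so no household identifier is needed.
d = np.repeat(np.arange(n_districts), n_years * n_households)
t = np.tile(np.repeat(np.arange(1, n_years + 1), n_households), n_districts)
D = (t >= adoption_year[d]).astype(float)        # district-year treatment dummy

gamma = rng.normal(size=n_districts)             # district effects
lam = rng.normal(size=n_years)                   # year effects
y = gamma[d] + lam[t - 1] + true_delta * D + rng.normal(size=d.size)

# Design matrix: all district dummies, year dummies (first year dropped
# to avoid collinearity), and the treatment dummy in the last column.
district_dummies = (d[:, None] == np.arange(n_districts)).astype(float)
year_dummies = (t[:, None] == np.arange(2, n_years + 1)).astype(float)
X = np.column_stack([district_dummies, year_dummies, D])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
delta_hat = coef[-1]
print(f"estimated delta: {delta_hat:.3f}")
```

Note that the regression never uses a household identifier; only the district and year of each sampled household enter the model. Standard errors are left out of this sketch.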

In your question, however, you indicate that you want to estimate household fixed effects. You could certainly estimate a model with fixed effects at the $i$-th level, but it will not yield the same DiD estimate.

The equation you are considering is the following:

$$ y_{it} = \alpha_{i} + \lambda_{t} + \delta D_{it} + \theta X_{it} + \epsilon_{it}, $$

where $\alpha_{i}$ now represents household fixed effects. $\gamma_{d}$ is not included in this specification; it is absorbed by the household fixed effects. There are occasions where including the individual (household) effects yields identical DiD coefficients. I encourage you to review this post for an example of this.

Your setting is different. We do not observe the same households over time. In one particular year $t$ you might observe a sample of households $i$ from district $d$. In year $t+1$ you sample a new cross-section of households, though it is likely that many households will be sampled again the following year. But you indicated in the comments that nearly one-quarter of each cross-section is a completely new subset of households. Thus, you are not observing the same households over time. Because of this, your estimates will differ. See the second paragraph under Section 1.5 of these lecture notes for more information.

We can think about this more simply. Suppose you sample two households from a treated district in 2018, which I will call H1 and H2. In 2019, you resample households and observe H1 and H3. You repeat this process in a third year and obtain H1 and H2. Note that H3 is never sampled again. If you included dummies for all unique households, then the dummy for H3 is a singleton: that household is observed in only one time period. Again, you could estimate this model, but it will not return the same DiD estimate as the former model, where the data is 'aggregated up' to the district level. It also makes assessing parallel trends difficult, as the composition of your treatment group changes over the years.
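The toy example above can be checked in a few lines. This sketch counts how many periods each household appears in and flags the singletons; labelling the third sampling round as 2020 is an assumption, since only 2018 and 2019 are named.

```python
from collections import Counter

# Households sampled in each round. The year label 2020 for the third
# round is assumed for illustration.
samples = {
    2018: ["H1", "H2"],
    2019: ["H1", "H3"],
    2020: ["H1", "H2"],
}

# Number of periods in which each household is observed.
counts = Counter(hh for households in samples.values() for hh in households)

# A household seen in exactly one period would get a singleton dummy.
singletons = sorted(hh for hh, n in counts.items() if n == 1)
print(counts)
print(singletons)
```

A singleton dummy fits its one observation exactly, so that household adds no within-household variation to the estimate of $\delta$.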

In sum, you could still estimate this model using household fixed effects. If you sampled a large number of households in each year, and most will be resampled anyway, then you could restrict your sample to households where you have repeated observations over the 22-year period. This ensures you are observing the same households pre- and post-intervention. I also recommend clustering at the district level!
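As a rough sketch of the sample restriction described above, the following keeps only households observed at least once before and at least once after the intervention. It assumes a usable household identifier exists, which may not hold for anonymized data. The records are made up, and applying a single cutoff (year 9, taken from the comments) to every district is a simplification of the staggered design.

```python
from collections import defaultdict

# Single cutoff for all districts: a simplifying assumption. With staggered
# adoption, each district would have its own first treatment year.
FIRST_TREATMENT_YEAR = 9

# Hypothetical (household_id, district, year) records.
rows = [
    ("H1", "A", 7), ("H1", "A", 10), ("H1", "A", 12),
    ("H2", "A", 8), ("H2", "A", 11),
    ("H3", "A", 10),                  # singleton, post-period only
    ("H4", "B", 5), ("H4", "B", 6),   # pre-period only
    ("H5", "B", 8), ("H5", "B", 9),
]

# Collect the set of years in which each household is observed.
years_by_household = defaultdict(set)
for household, _district, year in rows:
    years_by_household[household].add(year)

def spans_intervention(years, cutoff):
    """True if the household is seen both before and at/after the cutoff."""
    return min(years) < cutoff <= max(years)

keep = {
    household
    for household, years in years_by_household.items()
    if spans_intervention(years, FIRST_TREATMENT_YEAR)
}
restricted = [row for row in rows if row[0] in keep]
print(sorted(keep))
```

The restricted sample could then be used for the household fixed effects model, with standard errors clustered at the district level as recommended above.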

Thomas Bilach