Fixed Effects for fractual response variable with many zero observations

Question

I am investigating the impact of some independent variables on educational expenditure shares, defined as the ratio $\frac{educational\_expenditure_i}{total\_expenditures_i}$. The response variable (share) has $N = 2614$ observations, of which $890$ are zero. The data form an unbalanced panel consisting of two waves. Here is a histogram of the response variable:

The zero observations are most likely distinct consumer choices not to invest any money in education, hence I want them to be represented in my model.

Since the data comes in (unbalanced) panel form, I'd like to estimate the model with and without Fixed Effects. I have $T = 2$ time periods and $K = 1565$ groups.

What I have done so far in Stata:

1) Tobit Model: Assuming that there is an underlying latent variable that drives the consumer to make a certain (zero-consumption) choice. Even though this gave me reasonable and overall very significant results, there is some criticism that the Tobit model should only be used if observations below the threshold are theoretically possible. Also, estimating Fixed Effects in the Tobit framework is apparently not recommended, since its theoretical properties are poor.
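
For reference, a sketch of the standard Tobit setup with censoring at zero. The notation ($y_{it}^*$ for the latent share, $x_{it}$ for the regressors, $u_{it}$ for the error) is introduced here for illustration only:

$$y_{it}^* = x_{it}\beta + u_{it}, \qquad u_{it} \mid x_{it} \sim N(0, \sigma^2), \qquad y_{it} = \max(0,\, y_{it}^*)$$

Under this setup a zero share corresponds to $y_{it}^* \le 0$, that is, a latent share at or below zero. This is where the criticism applies: a negative expenditure share is hard to interpret.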

2) OLS Fixed Effects: As I understand it, standard OLS does not fully account for the characteristics of the response variable. The results are also statistically insignificant.

xtreg share IV i.year, fe vce(cluster ID)

Here IV is a placeholder for the list of independent variables.

3) Fractional logit model: Following Papke and Wooldridge (2008). This allows taking the fractional nature of the response, as well as the zero observations, into account. However, estimating Fixed Effects in this framework seems to be cumbersome, and I have not found a satisfying solution yet.
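
As a sketch of what the fractional logit assumes (notation introduced here for illustration, with $x_{it}$ the regressors and $\Lambda$ the logistic function), only the conditional mean is modelled:

$$E[y_{it} \mid x_{it}] = \Lambda(x_{it}\beta) = \frac{\exp(x_{it}\beta)}{1 + \exp(x_{it}\beta)}$$

and $\beta$ is estimated by maximizing the Bernoulli quasi-log-likelihood

$$\ell_{it}(\beta) = y_{it}\,\log \Lambda(x_{it}\beta) + (1 - y_{it})\,\log\bigl[1 - \Lambda(x_{it}\beta)\bigr]$$

This expression is well defined for $y_{it} = 0$ (and $y_{it} = 1$), which is why the zero observations can stay in the sample without any transformation.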

Panel regression, which led to similar results (in the marginal effects) to the Tobit regression, but with much smaller coefficients:

glm share IV i.year, link(logit) family(binomial) vce(cluster ID) nolog eform
margins, dydx(*)
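
For orientation, a sketch of the quantity that margins, dydx(*) averages over the sample after a logit-link model, for a continuous regressor $x_{itj}$ (notation assumed here, with $\Lambda(z) = \exp(z)/(1+\exp(z))$):

$$\frac{\partial E[y_{it} \mid x_{it}]}{\partial x_{itj}} = \beta_j\, \Lambda(x_{it}\beta)\,\bigl[1 - \Lambda(x_{it}\beta)\bigr]$$

Since $\Lambda(1 - \Lambda) \le 1/4$, the marginal effect is at most a quarter of $\beta_j$ in absolute value, so coefficients and marginal effects are not on the same scale.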

Fixed effects (taken from https://www.statalist.org/forums/forum/general-stata-discussion/general/1446970-fractional-logit-model-unbalanced-panel-two-way-fixed-effects):

xtgee share IV c.hhID i.year c.ID##i.year, family(bin) link(logit) corr(independent) i(ID) t(year) vce(robust)

where ID is the panel variable and year the time variable. I have trouble understanding the model specification and interpreting the results correctly, specifically the interaction term and categorical variables: c.hhID i.year c.ID##i.year. I have also read that Fixed Effects in the xtgee framework should be applied to a balanced panel. However, I ran the Verbeek-Nijman test on my panel, which indicated that the attrition is random, and therefore proceeded with the unbalanced panel.

My questions are:

1) How would you model such a response variable? I've also read about Zero-inflated beta models, or Poisson regressions. I'd like to first run a regression on the panel and then apply Fixed Effects to compare the results.

2) What is the best approach to applying Fixed Effects in the GLM framework, or in your proposed framework?

Please let me know if I should add any results, figures, or further information.

Could you please explain what an "educational expenditure share" is and exactly how it is measured? How would this relate to the term "fractual" in the title? — whuber, Jun 23 '18 at 12:41
It is given as the proportion of $\frac{educational\_expenditure_i}{total\_expenditures_i}$. It may be good to refer in the title to proportional data then? — XsLiar, Jun 23 '18 at 12:47
I don't think it's quite what people think of "proportional": *ratio* might be better. There is a possibility that this ratio could lie outside the interval from $0$ to $1$ or even be undefined. In many cases ratios of random variables have highly skewed distributions, making them challenging to use directly with most regression procedures, whereas the logarithms of those ratios typically come closer to what is assumed of responses in regression-fitting procedures. — whuber, Jun 23 '18 at 19:50
It seems like I do not grasp this point. Using logarithms (log-transformation) will always raise the problem of how to handle the 0 values. From my understanding it is important to check whether the 0 observations are actual data values or missing values. In my case, they are certainly actual data values, so I am afraid that a log transformation of my dependent variable will not take this into account. — XsLiar, Jun 24 '18 at 12:25
That has been addressed in many threads: see, *inter alia,* https://stats.stackexchange.com/questions/4831, https://stats.stackexchange.com/questions/30728, https://stats.stackexchange.com/questions/49443, and--generally--many of the threads found by searching https://stats.stackexchange.com/search?tab=votes&q=user%3a919%20log%20regression%20zero. — whuber, Jun 24 '18 at 14:08
