I am working on analyzing panel data of countries with several independent variables. I am aware from previous literature on panel data that an OLS model could be performed. However, because the estimation results for count data will suffer bias in OLS where dependent values are allowed to take both negative and positive values, I chose the Poisson model. I have dependent variables that are of count data, over dispersed, and having excess of zeroes.
I performed a count data hurdle regression with Poisson and to prevent multicollinearity, variance inflation factors are checked to arrive at my optimal model. Every single output that I got (be it univariate or multivariate analysis) had narrow confidence intervals, low coefficents and extremely low p values (< 2.2e-16). Here is my final model -
Call:
hurdle(formula = Y ~ A+B+C+
offset(log(Pop.Density)) | 1, data = Data, dist = "poisson")
Pearson residuals:
Min 1Q Median 3Q Max
-0.7931 -0.7807 -0.7401 0.2798 301.3965
Count model coefficients (truncated poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.694339 0.004247 -163.47 <2e-16 ***
A -0.396599 0.005468 -72.53 <2e-16 ***
B -0.328605 0.004792 -68.58 <2e-16 ***
C -0.240072 0.004170 -57.58 <2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4635 0.0393 -11.79 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 18
Log-likelihood: -1.381e+05 on 5 Df
> vif(Mv7)
A B C
3.564376 2.419123 3.663317
My primary question is, what could be causing these p values to be so low in my output? My dataset is not extremely huge, but I would say fairly large (3083 obs of 15 variables). A, B,C have values that range from negative to positive.
Could there be a time effect that effect my results that I did not consider?
> Call:
hurdle(formula = Y ~ A+B+C+ offset(log(Pop.Density)) | 1, data = Data, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-0.6743 -0.3575 -0.3489 -0.2177 37.8530
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.67061 0.06041 11.100 < 2e-16 ***
A 0.16976 0.10964 1.548 0.1215
B -0.43295 0.09676 -4.474 7.66e-06 ***
C -0.13336 0.07475 -1.784 0.0744 .
Log(theta) -1.06590 0.06579 -16.201 < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4635 0.0393 -11.79 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 0.3444
Number of iterations in BFGS optimization: 25
Log-likelihood: -8014 on 6 Df
> vif(Mv7)
A B C
5.387540 3.981045 2.848343
When I use population total as an offset, I get an error in R - Error: no valid set of coefficients has been found: please supply starting values
When I use log(Population Total) as an offset this is what I get -
>
Call:
hurdle(formula = Y ~ A+B+C+
offset(log(Pop.Total)), data = Data, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-0.64145 -0.40354 -0.23862 -0.07396 250.78934
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.98678 0.05473 -219.016 < 2e-16 ***
A -0.33166 0.09544 -3.475 0.000511 ***
B -0.22369 0.09157 -2.443 0.014567 *
C 0.55690 0.07117 7.825 5.07e-15 ***
Log(theta) -0.84623 0.06062 -13.960 < 2e-16 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.44459 0.05300 -310.294 < 2e-16 ***
A -0.38690 0.09239 -4.188 2.82e-05 ***
B 0.12640 0.08411 1.503 0.1329
C -0.13685 0.07855 -1.742 0.0815 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 0.429
Number of iterations in BFGS optimization: 43
Log-likelihood: -7389 on 9 Df
> AIC(Mv8)
[1] 14795.51
> vif(Mv8)
A B C
3.815796 3.371319 3.406296
This is my outcome without using an offset.
Call:
hurdle(formula = Y~A+B+C,
data = Data, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-0.8101 -0.4350 -0.3287 -0.1220 21.8952
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.27904 0.04229 101.191 < 2e-16 ***
A -0.01475 0.07926 -0.186 0.8524
B -0.11324 0.05358 -2.113 0.0346 *
C -0.49852 0.06001 -8.308 < 2e-16 ***
Log(theta) -0.28185 0.04783 -5.893 3.8e-09 ***
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.55545 0.04305 -12.904 <2e-16 ***
A 0.09582 0.07769 1.233 0.217
B 0.06450 0.06931 0.931 0.352
C -0.97937 0.06872 -14.251 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 0.7544
Number of iterations in BFGS optimization: 18
Log-likelihood: -7558 on 9 Df
> AIC(Mv7)
[1] 15133.16
> vif(Mv7)
A B C
3.683669 2.178420 2.897415
Offset with population total scaled to 1. ( I divided each country's population total with largest population) -
Call:
hurdle(formula = Y~A+B+C+
offset(Pop.Total.Test), data = POP, dist = "negbin", link = "logit")
Pearson residuals:
Min 1Q Median 3Q Max
-0.88522 -0.45546 -0.34142 -0.06619 18.90660
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.15278 0.03762 110.386 <2e-16 ***
A -0.11584 0.06897 -1.680 0.093 .
B 0.02039 0.05119 0.398 0.690
C -0.45935 0.05182 -8.864 <2e-16 ***
Log(theta) -0.07232 0.04557 -1.587 0.113
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.58007 0.04310 -13.458 <2e-16 ***
A 0.08308 0.07774 1.069 0.285
B 0.06679 0.06962 0.959 0.337
C -0.95615 0.06867 -13.923 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 0.9302
Number of iterations in BFGS optimization: 13
Log-likelihood: -7433 on 9 Df
> AIC(Mv7)
[1] 14884.03
For my second dependent variable (occurrence) - My optimal model was A+B+log(pop.density). To calculate occurrence rate, I offset it with log(pop.total). The code follows below.
Call:
hurdle(formula = Occurence ~ A +B+ log(Pop.Density) +
offset(log(Pop.Total)), data = Data, dist = "negbin", link = "logit")
Pearson residuals:
Min 1Q Median 3Q Max
-1.1211 -0.5798 -0.2758 0.0748 20.7457
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.08087 0.15542 -103.467 < 2e-16 ***
A -0.35991 0.07428 -4.846 1.26e-06 ***
B -0.13195 0.06224 -2.120 0.03399 *
log(Pop.Density) -0.21688 0.03435 -6.314 2.73e-10 ***
Log(theta) 0.38368 0.12342 3.109 0.00188 **
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.02224 0.16485 -97.194 < 2e-16 ***
A -0.28540 0.07408 -3.852 0.000117 ***
B -0.09963 0.07744 -1.287 0.198255
log(Pop.Density) -0.08668 0.03829 -2.264 0.023590 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 1.4677
Number of iterations in BFGS optimization: 12
Log-likelihood: -3090 on 9 Df
> AIC(Mvfrequency)
[1] 6198.895
> vif(Mvfrequency)
A B log(Pop.Density)
2.776789 3.469304 11.629956
The output when I remove population density as a factor.
Call:
hurdle(formula = Occurence ~ A+B + offset(log(Pop.Total)),
data = Data, dist = "negbin", link = "logit")
Pearson residuals:
Min 1Q Median 3Q Max
-1.04306 -0.57164 -0.26873 0.05331 18.46993
Count model coefficients (truncated negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -17.02648 0.06424 -265.048 < 2e-16 ***
A -0.38514 0.07957 -4.840 1.3e-06 ***
B -0.13069 0.06732 -1.941 0.0522 .
Log(theta) 0.14126 0.12476 1.132 0.2575
Zero hurdle model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.38060 0.05235 -312.893 < 2e-16 ***
A -0.31557 0.07265 -4.344 1.4e-05 ***
B -0.07864 0.07668 -1.026 0.305
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta: count = 1.1517
Number of iterations in BFGS optimization: 10
Log-likelihood: -3115 on 7 Df
> AIC(Mvfrequency)
[1] 6243.296
> vif(Mvfrequency)
A B
2.694534 3.328758