Suspiciously low p values and narrow CIs

Question

I am analyzing country-level panel data with several independent variables. From the panel-data literature I know that an OLS model could be fitted. However, OLS allows the fitted dependent variable to take both negative and positive values, which biases the estimates for count data, so I chose a Poisson model. My dependent variables are counts that are overdispersed and have an excess of zeroes.

I fitted a count-data hurdle regression with a Poisson count component and, to guard against multicollinearity, checked variance inflation factors to arrive at my optimal model. Every output I got (univariate or multivariate) had narrow confidence intervals, small coefficients and extremely low p values (< 2.2e-16). Here is my final model:

    Call:
    hurdle(formula = Y ~ A+B+C+ 
        offset(log(Pop.Density)) | 1, data = Data, dist = "poisson")
    
    Pearson residuals:
         Min       1Q   Median       3Q      Max 
     -0.7931  -0.7807  -0.7401   0.2798 301.3965 
    
    Count model coefficients (truncated poisson with log link):
    
                              Estimate Std. Error z value Pr(>|z|)    
    (Intercept)              -0.694339   0.004247 -163.47   <2e-16 ***
    A                        -0.396599   0.005468  -72.53   <2e-16 ***
    B                        -0.328605   0.004792  -68.58   <2e-16 ***
    C                        -0.240072   0.004170  -57.58   <2e-16 ***
    
    Zero hurdle model coefficients (binomial with logit link):
    
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  -0.4635     0.0393  -11.79   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Number of iterations in BFGS optimization: 18 
    Log-likelihood: -1.381e+05 on 5 Df
    > vif(Mv7)
                       A                         B                       C
                    3.564376                 2.419123                 3.663317

My primary question is: what could be causing these p values to be so low in my output? My dataset is not huge, but I would say fairly large (3083 obs. of 15 variables). A, B and C take values that range from negative to positive.

Could there be a time effect that affects my results and that I did not consider?

    Call:
    hurdle(formula = Y ~ A+B+C+ offset(log(Pop.Density)) | 1, data = Data, dist = "negbin")
    
    Pearson residuals:
        Min      1Q  Median      3Q     Max 
    -0.6743 -0.3575 -0.3489 -0.2177 37.8530 
    
    Count model coefficients (truncated negbin with log link):
                             Estimate Std. Error z value Pr(>|z|)    
    (Intercept)               0.67061    0.06041  11.100  < 2e-16 ***
    A                         0.16976    0.10964   1.548   0.1215    
    B                        -0.43295    0.09676  -4.474 7.66e-06 ***
    C                        -0.13336    0.07475  -1.784   0.0744 .  
    Log(theta)               -1.06590    0.06579 -16.201  < 2e-16 ***
    Zero hurdle model coefficients (binomial with logit link):
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  -0.4635     0.0393  -11.79   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Theta: count = 0.3444
    Number of iterations in BFGS optimization: 25 
    Log-likelihood: -8014 on 6 Df
    > vif(Mv7)
                 A                    B                         C
           5.387540                 3.981045                 2.848343

When I use population total as an offset, I get an error in R - Error: no valid set of coefficients has been found: please supply starting values

When I use log(Population Total) as an offset this is what I get -

    Call:
    hurdle(formula = Y ~ A+B+C+
        offset(log(Pop.Total)), data = Data, dist = "negbin")
    
    Pearson residuals:
          Min        1Q    Median        3Q       Max 
     -0.64145  -0.40354  -0.23862  -0.07396 250.78934 
    
    Count model coefficients (truncated negbin with log link):
                              Estimate Std. Error  z value Pr(>|z|)    
    (Intercept)              -11.98678    0.05473 -219.016  < 2e-16 ***
    A                        -0.33166    0.09544   -3.475 0.000511 ***
    B                        -0.22369    0.09157   -2.443 0.014567 *  
    C                         0.55690    0.07117    7.825 5.07e-15 ***
    Log(theta)               -0.84623    0.06062  -13.960  < 2e-16 ***
    Zero hurdle model coefficients (binomial with logit link):
                              Estimate Std. Error  z value Pr(>|z|)    
    (Intercept)              -16.44459    0.05300 -310.294  < 2e-16 ***
    A                         -0.38690    0.09239   -4.188 2.82e-05 ***
    B                          0.12640    0.08411    1.503   0.1329    
    C                         -0.13685    0.07855   -1.742   0.0815 .  
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Theta: count = 0.429
    Number of iterations in BFGS optimization: 43 
    Log-likelihood: -7389 on 9 Df
    > AIC(Mv8)
    [1] 14795.51
    > vif(Mv8)
              A                        B                        C
           3.815796                 3.371319                 3.406296

This is my outcome without using an offset.

    Call:
    hurdle(formula = Y~A+B+C, 
        data = Data, dist = "negbin")
    
    Pearson residuals:
        Min      1Q  Median      3Q     Max 
    -0.8101 -0.4350 -0.3287 -0.1220 21.8952 
    
    Count model coefficients (truncated negbin with log link):
                             Estimate Std. Error z value Pr(>|z|)    
    (Intercept)               4.27904    0.04229 101.191  < 2e-16 ***
    A                        -0.01475    0.07926  -0.186   0.8524    
    B                        -0.11324    0.05358  -2.113   0.0346 *  
    C                        -0.49852    0.06001  -8.308  < 2e-16 ***
    Log(theta)               -0.28185    0.04783  -5.893  3.8e-09 ***
    Zero hurdle model coefficients (binomial with logit link):
                             Estimate Std. Error z value Pr(>|z|)    
    (Intercept)              -0.55545    0.04305 -12.904   <2e-16 ***
    A                         0.09582    0.07769   1.233    0.217    
    B                         0.06450    0.06931   0.931    0.352    
    C                        -0.97937    0.06872 -14.251   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Theta: count = 0.7544
    Number of iterations in BFGS optimization: 18 
    Log-likelihood: -7558 on 9 Df
    > AIC(Mv7)
    [1] 15133.16
    > vif(Mv7)
                         A                        B                        C
                    3.683669                 2.178420                 2.897415

Offset with population total scaled to a maximum of 1 (I divided each country's population total by the largest population):

    Call:
    hurdle(formula = Y~A+B+C+ 
        offset(Pop.Total.Test), data = POP, dist = "negbin", link = "logit")
    
    Pearson residuals:
         Min       1Q   Median       3Q      Max 
    -0.88522 -0.45546 -0.34142 -0.06619 18.90660 
    
    Count model coefficients (truncated negbin with log link):
                             Estimate Std. Error z value Pr(>|z|)    
    (Intercept)               4.15278    0.03762 110.386   <2e-16 ***
    A                        -0.11584    0.06897  -1.680    0.093 .  
    B                         0.02039    0.05119   0.398    0.690    
    C                        -0.45935    0.05182  -8.864   <2e-16 ***
    Log(theta)               -0.07232    0.04557  -1.587    0.113    
    Zero hurdle model coefficients (binomial with logit link):
                             Estimate Std. Error z value Pr(>|z|)    
    (Intercept)              -0.58007    0.04310 -13.458   <2e-16 ***
    A                         0.08308    0.07774   1.069    0.285    
    B                         0.06679    0.06962   0.959    0.337    
    C                        -0.95615    0.06867 -13.923   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Theta: count = 0.9302
    Number of iterations in BFGS optimization: 13 
    Log-likelihood: -7433 on 9 Df
       
    > AIC(Mv7)
    [1] 14884.03

For my second dependent variable (occurrence), my optimal model was A + B + log(Pop.Density). To model the occurrence rate, I used log(Pop.Total) as an offset. The output follows.

    Call:
    hurdle(formula = Occurence ~ A +B+ log(Pop.Density) + 
        offset(log(Pop.Total)), data = Data, dist = "negbin", link = "logit")
    
    Pearson residuals:
        Min      1Q  Median      3Q     Max 
    -1.1211 -0.5798 -0.2758  0.0748 20.7457 
    
    Count model coefficients (truncated negbin with log link):
                           Estimate Std. Error  z value Pr(>|z|)    
    (Intercept)           -16.08087    0.15542 -103.467  < 2e-16 ***
    A                     -0.35991    0.07428   -4.846 1.26e-06 ***
    B                     -0.13195    0.06224   -2.120  0.03399 *  
    log(Pop.Density)       -0.21688    0.03435   -6.314 2.73e-10 ***
    Log(theta)              0.38368    0.12342    3.109  0.00188 ** 
    Zero hurdle model coefficients (binomial with logit link):
                           Estimate Std. Error z value Pr(>|z|)    
    (Intercept)           -16.02224    0.16485 -97.194  < 2e-16 ***
    A                     -0.28540    0.07408  -3.852 0.000117 ***
    B                     -0.09963    0.07744  -1.287 0.198255    
    log(Pop.Density)       -0.08668    0.03829  -2.264 0.023590 *  
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Theta: count = 1.4677
    Number of iterations in BFGS optimization: 12 
    Log-likelihood: -3090 on 9 Df
    > AIC(Mvfrequency)
    [1] 6198.895
    > vif(Mvfrequency)
                     A                     B      log(Pop.Density) 
                 2.776789              3.469304             11.629956

The output when I remove population density as a predictor:

    Call:
    hurdle(formula = Occurence ~ A+B + offset(log(Pop.Total)), 
        data = Data, dist = "negbin", link = "logit")
    
    Pearson residuals:
         Min       1Q   Median       3Q      Max 
    -1.04306 -0.57164 -0.26873  0.05331 18.46993 
    
    Count model coefficients (truncated negbin with log link):
                           Estimate Std. Error  z value Pr(>|z|)    
    (Intercept)           -17.02648    0.06424 -265.048  < 2e-16 ***
    A                     -0.38514    0.07957   -4.840  1.3e-06 ***
    B                     -0.13069    0.06732   -1.941   0.0522 .  
    Log(theta)              0.14126    0.12476    1.132   0.2575    
    Zero hurdle model coefficients (binomial with logit link):
                           Estimate Std. Error  z value Pr(>|z|)    
    (Intercept)           -16.38060    0.05235 -312.893  < 2e-16 ***
    A                     -0.31557    0.07265   -4.344  1.4e-05 ***
    B                     -0.07864    0.07668   -1.026    0.305    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
    
    Theta: count = 1.1517
    Number of iterations in BFGS optimization: 10 
    Log-likelihood: -3115 on 7 Df
    > AIC(Mvfrequency)
    [1] 6243.296
    > vif(Mvfrequency)
                   A                     B 
                 2.694534              3.328758

Are you referring to both parts of the model or is this just for the count part or just for the hurdle part? It might help to post output from a simple model with just one predictor. – mdewey Jul 30 '16 at 15:32
You seem to have (at least) one outlying observation. The largest residual looks suspicious to me. Perhaps you should check that? – mdewey Jul 30 '16 at 17:09
You are right @mdewey. There are outliers in my data. Here is a screenshot of my plot of my Y (Y axis) with A (X axis) - I added it to my post. I see these outliers. I don't know how to define these outliers so I can filter them to fit in my model. – Sofia Jul 30 '16 at 19:26
First thing is to see whether this is just overdispersion. Try using negative binomial rather than Poisson. – mdewey Jul 30 '16 at 20:59
@mdewey, I fitted a negative binomial model and the results are included in my post. (1) I did not remove the outliers. (2) What does a negative Log(theta) mean? I have already read the post on theta on this forum (http://stats.stackexchange.com/questions/10419/what-is-theta-in-a-negative-binomial-regression-fitted-with-r), including Joseph Hilbe's detailed comment, but I still don't understand what a negative log(theta) means. Could you please succinctly explain the significance of a negative log of theta? – Sofia Jul 30 '16 at 21:55

mdewey · Answer 1 · 2016-08-03T08:32:53.813

The problem here is that the distribution of the counts does not follow a Poisson. This was evidenced by the very large residuals. A Poisson model assumes the variance equals the mean, so when the counts are overdispersed its standard errors are too small, which is what produces the very low p values and narrow confidence intervals. Fitting a negative binomial improves that aspect but, as is shown by the very small value of $\theta$, the counts are still very concentrated in the low values but with a few large ones. Investigating these large counts would prove instructive. They may be (a) errors, (b) values from another process which should have been excluded, or (c) really, really interesting values which cast light on aspects of the process hitherto unknown. Only careful examination will reveal which is which.
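To see how overdispersion alone can shrink Poisson standard errors, here is a minimal sketch on simulated data (not the data from the question). It uses a plain Poisson GLM from base R rather than a hurdle model, and the sample size, coefficients and `size` value are assumptions chosen only for illustration.

```r
# Simulated overdispersed counts fitted with a Poisson model (illustrative values only)
set.seed(1)
n  <- 3000
x  <- rnorm(n)
mu <- exp(0.5 - 0.3 * x)
# Negative binomial counts with a small theta (size), so the variance far exceeds the mean
y  <- rnbinom(n, size = 0.35, mu = mu)

fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# Pearson dispersion statistic: close to 1 when the Poisson assumption holds
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)

# Same coefficient estimates, but the quasi-Poisson fit rescales the standard errors
se_pois  <- coef(summary(fit_pois))["x", "Std. Error"]
se_quasi <- coef(summary(fit_quasi))["x", "Std. Error"]

print(dispersion)
print(c(poisson = se_pois, quasipoisson = se_quasi))
```

The quasi-Poisson standard error is the Poisson one multiplied by the square root of the dispersion statistic, so a dispersion well above 1 means the Poisson z values are inflated by that factor.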

To get a feel for how $\theta$ affects the shape, try plotting the probabilities from dnbinom (where $\theta$ is the size argument when the mean is given as mu) for a few example values of $\theta$ and see what happens.
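One way to do this is sketched below with base R only. The mean of 2 and the particular values of $\theta$ are assumptions picked for illustration; 0.35 is close to the fitted value in the question, where Log(theta) = -1.0659 corresponds to $\theta = e^{-1.0659} \approx 0.344$. A negative Log(theta) therefore just means $\theta < 1$.

```r
# Negative binomial probabilities for counts 0..20 at a fixed mean and several theta values.
# In dnbinom(), theta corresponds to the `size` argument when `mu` is supplied.
counts <- 0:20
mu     <- 2                      # illustrative mean, not taken from the question
thetas <- c(0.35, 1, 5, 100)     # small theta = strong overdispersion

probs <- sapply(thetas, function(th) dnbinom(counts, size = th, mu = mu))
colnames(probs) <- paste0("theta=", thetas)

matplot(counts, probs, type = "b", pch = 1:4, lty = 1, col = 1:4,
        xlab = "count", ylab = "probability")
legend("topright", legend = colnames(probs), pch = 1:4, col = 1:4, lty = 1)

# Probability of a zero and of a count above 20, for each theta
p_zero <- probs[1, ]
p_tail <- pnbinom(20, size = thetas, mu = mu, lower.tail = FALSE)
print(round(rbind(p_zero, p_tail), 4))
```

With the mean held fixed, a smaller $\theta$ puts more probability on zero and more in the far tail at the same time, and for large $\theta$ the distribution approaches the Poisson.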

The role of the offset may need some attention too. Recall that an offset is simply a covariate whose coefficient is known to be 1. The most common example is in Poisson regression where a count is being modelled via a log link and the denominator is included as its log. So if we model the number of traffic accidents in cities we use log(citysize) as an offset, since we believe that if the city is twice as large it will have twice as many accidents. This is a testable assumption: we can include log(citysize) as a covariate in addition to using it as an offset, and if its coefficient as a covariate is different from zero it means our simple assumption about the relationship is wrong.
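The check described above can be sketched as follows, on simulated cities where the proportionality assumption holds by construction. With the offset the model is $\log E[Y] = \beta_0 + \beta_1 x + 1 \cdot \log(\text{citysize})$. All variable names and parameter values here are hypothetical, and a plain Poisson GLM stands in for the hurdle model.

```r
# Simulated cities (hypothetical data): accidents proportional to city size
set.seed(42)
n         <- 500
citysize  <- round(exp(rnorm(n, mean = 11, sd = 1)))  # populations, roughly 10^4 to 10^6
x         <- rnorm(n)
rate      <- exp(-8 + 0.4 * x)                        # accidents per inhabitant
accidents <- rpois(n, lambda = rate * citysize)

# Offset: the coefficient of log(citysize) is fixed at 1
fit_offset <- glm(accidents ~ x + offset(log(citysize)), family = poisson)

# Test of the proportionality assumption: add log(citysize) as a covariate as well.
# Its coefficient estimates the departure from 1, so it should be near 0 here.
fit_test <- glm(accidents ~ x + log(citysize) + offset(log(citysize)), family = poisson)

print(coef(summary(fit_test)))
```

Note that the offset enters the linear predictor on the log scale. An unlogged offset such as offset(Pop.Total) adds the raw population to the linear predictor, so the fitted mean involves the exponential of a number in the millions, which may be why R fails to find valid starting values. An offset scaled to lie between 0 and 1 and entered without a log shifts the linear predictor by at most 1, so it no longer expresses the count as a rate per inhabitant.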

edited Aug 03 '16 at 08:32

answered Aug 01 '16 at 08:35

mdewey

Thank you @mdewey! This is quite helpful. The large counts in my dataset are (c) interesting values which originate from highly populated countries. What would you recommend to improve my model in this case? The options I see are (1) to add another variable to explain the large values, but if I use population total I have a correlation issue; (2) to transform my dependent variable by applying a log function, but because of the zeroes this doesn't seem feasible; (3) to remove outliers by applying a threshold. Is there a way (AIC?) to measure which is the best distribution, Poisson or NB? – Sofia Aug 01 '16 at 10:47
@Sofia is there any reason why you are using density as an offset rather than total? – mdewey Aug 02 '16 at 07:56
I thought population density would mean that the denser the area, the more people would be affected. However, I am probably wrong, and population total is the right choice to model affectedness/occurrence rates. – Sofia Aug 02 '16 at 15:21
@mdewey I fitted a negative binomial hurdle with and without population total as an offset (I added the results to the initial post). I am very surprised to see that the results with an offset show much higher residuals, a smaller theta, and only a slight AIC improvement. Could the log function applied to the population be the reason? I am not able to offset directly by the population total because of the R error. – Sofia Aug 02 '16 at 15:31
@Sofia I am at a loss now, I think you have tried everything which I can think of. It seems that the model does not fit the data very well so you may have to report that and explain about the outliers. – mdewey Aug 02 '16 at 15:48
Thank you so much for your help - I feel like I took great steps in understanding my analysis because of your input. I had an idea to scale the population total to the range 0-1 (I posted the result above). To me this result makes a lot of sense and the model fit seems to improve. Does this technique, instead of using a log function, make sense? Or is it completely wrong to do that? – Sofia Aug 02 '16 at 16:55
@Sofia I have edited my answer again to include material about the role of the offset – mdewey Aug 03 '16 at 08:33
Thank you @mdewey. For another dependent variable, occurrence, I found that the optimal model had A, B and log(Pop.Density) as predictors with low VIF. When I used an offset of log(Pop.Total) with this optimal model, it gave me a high VIF for Pop.Density and a lower AIC (by about 100) compared to using the same offset without Pop.Density. It is logical that the VIF would be high for Pop.Density. But should I still keep it in my model? – Sofia Aug 05 '16 at 10:07
