Explanation of a censored regression model

Question

I'm wondering how to explain the logSigma and its p.value in a censored regression model:

require(cenReg)
data( "Affairs", package = "AER" )
estResult <- censReg( affairs ~ age + yearsmarried + religiousness +
                           occupation + rating, data = Affairs )
summary(estResult)

Call:
censReg(formula = affairs ~ age + yearsmarried + religiousness + 
    occupation + rating, data = Affairs)

Observations:
         Total  Left-censored     Uncensored Right-censored 
           601            451            150              0 

Coefficients:
              Estimate Std. error t value  Pr(> t)    
(Intercept)    8.17420    2.74145   2.982  0.00287 ** 
age           -0.17933    0.07909  -2.267  0.02337 *  
yearsmarried   0.55414    0.13452   4.119 3.80e-05 ***
religiousness -1.68622    0.40375  -4.176 2.96e-05 ***
occupation     0.32605    0.25442   1.282  0.20001    
rating        -2.28497    0.40783  -5.603 2.11e-08 ***
logSigma       2.10986    0.06710  31.444  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Newton-Raphson maximisation, 7 iterations
Return code 1: gradient close to zero
Log-likelihood: -705.6 on 7 Df

My question is how to explain the logSigma and its significant p.value in the above model.

From the documentation, I gather `occupation` is categorical, thus your model appears to be misspecified. What is `logSigma`? I don't see that in the help file. — gung - Reinstate Monica, Jun 16 '14 at 20:10
From the documentation for `summary.censReg()`, I gather it is a "logical value indicating whether the variance(s) of the model should be printed logarithmized". There is a vignette listed on the package's [webpage on CRAN](http://cran.r-project.org/web/packages/censReg/index.html). The `logSigma` argument is discussed in section 2.6. I don't know enough about the topic to say more. — gung - Reinstate Monica, Jun 16 '14 at 20:52
@gung Your comment confounds the `logSigma` used as an argument to `summary` with the `logSigma` reported in the output: although you describe the former, this question is about the latter. I believe that all the values printed after the estimate of 2.10986 are merely artifacts of the vectorized code used to produce the t values and p values. After all, it scarcely is of interest to test whether a log variance equals zero! See the manual page or the code for `summary.maxLik`, which is what actually produces this output. — whuber, Jun 16 '14 at 21:50
David, I am curious about the meaning of these variables. One might guess that `affairs` is a count of extramarital affairs. If so, how could it be left-censored? That would indicate most of your data are of the form "so-and-so reports $n$ or fewer affairs" with values of $n\ge 1$ (for after all, any count of zero or fewer is just...zero.) I am wondering if perhaps you might be applying censored regression in a situation where some other procedure might be called for, such as a zero-inflated generalized linear model. — whuber, Jun 16 '14 at 21:56
@whuber The censored regression and the corner solution models both have the same Tobit empirical representation. Wooldridge has an example of utility maximization for charitable giving, which has a lower limit at zero because for some agents giving nothing to charity is optimal. — dimitriy, Jun 16 '14 at 22:26
Thank you, @Dimitriy. What set me wondering is the *huge* value of `logSigma` alluded to in the "pretty bad" comment in your answer: it corresponds to an unrealistic SD in the residuals, if indeed the response is counting something as scarce as extramarital affairs. What you say, then, seems to imply that a Tobit or censored approach to this analysis might not work at all and some other model altogether is called for. — whuber, Jun 16 '14 at 22:38

dimitriy · Accepted Answer · 2014-06-17T20:12:49.587

By default, the estimated standard deviation of the residuals ($\sigma$) is returned as $\ln(\sigma)$ since that is how the Tobit log likelihood maximization is performed. If you use coef(estResult,logSigma = FALSE), you will get $\sigma$ instead, which is analogous to the square root of the residual variance in OLS regression. That value can be compared to the standard deviation of affairs. If it is much smaller, you may have a reasonably good model. Or you can do the exponentiation yourself with a calculator and use delta method for the variance. You will also need $\sigma$ to construct some of the marginal effects.

I don't think the hypothesis test about $\ln \sigma$ and the corresponding p-value have a clear interpretation, whereas the other coefficients can be interpreted as the marginal effects on the uncensored outcome, so the p-value on the null that the ME is zero makes sense for them. I believe R is just treating $\ln \sigma$ as another parameter.

Here's my replication of your analysis in Stata (where I am also treating the categorical variables as continuous) confirming what I wrote above.

First we load the affairs data:

. ssc install bcuse
checking bcuse consistency and verifying not already installed...
all files already exist and are up to date.

. bcuse affairs

Contains data from http://fmwww.bc.edu/ec-p/data/wooldridge/affairs.dta
  obs:           601                          
 vars:            19                          22 May 2002 11:49
 size:        15,626                          
-------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------------------------------
id              int     %9.0g                 identifier
male            byte    %9.0g                 =1 if male
age             float   %9.0g                 in years
yrsmarr         float   %9.0g                 years married
kids            byte    %9.0g                 =1 if have kids
relig           byte    %9.0g                 5 = very relig., 4 = somewhat, 3 = slightly, 2 = not at
                                                all, 1 = anti
educ            byte    %9.0g                 years schooling
occup           byte    %9.0g                 occupation, reverse Hollingshead scale
ratemarr        byte    %9.0g                 5 = vry hap marr, 4 = hap than avg, 3 = avg, 2 = smewht
                                                unhap, 1 = vry unhap
naffairs        byte    %9.0g                 number of affairs within last year
affair          byte    %9.0g                 =1 if had at least one affair
vryhap          byte    %9.0g                 ratemarr == 5
hapavg          byte    %9.0g                 ratemarr == 4
avgmarr         byte    %9.0g                 ratemarr == 3
unhap           byte    %9.0g                 ratemarr == 2
vryrel          byte    %9.0g                 relig == 5
smerel          byte    %9.0g                 relig == 4
slghtrel        byte    %9.0g                 relig == 3
notrel          byte    %9.0g                 relig == 2
-------------------------------------------------------------------------------------------------------
Sorted by:  id

Here's the Stata equivalent of your censReg:

. tobit naffair age yrsmarr relig occup ratemarr , ll(0)

Tobit regression                                  Number of obs   =        601
                                                  LR chi2(5)      =      78.32
                                                  Prob > chi2     =     0.0000
Log likelihood = -705.57622                       Pseudo R2       =     0.0526

------------------------------------------------------------------------------
    naffairs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |  -.1793326   .0790928    -2.27   0.024    -.3346672    -.023998
     yrsmarr |   .5541418   .1345172     4.12   0.000     .2899564    .8183273
       relig |   -1.68622   .4037495    -4.18   0.000    -2.479165   -.8932758
       occup |   .3260532   .2544235     1.28   0.201    -.1736224    .8257289
    ratemarr |  -2.284973   .4078258    -5.60   0.000    -3.085923   -1.484022
       _cons |   8.174197   2.741432     2.98   0.003     2.790155    13.55824
-------------+----------------------------------------------------------------
      /sigma |    8.24708   .5533582                      7.160311    9.333849
------------------------------------------------------------------------------
  Obs. summary:        451  left-censored observations at naffairs<=0
                       150     uncensored observations
                         0 right-censored observations

Stata reports $\sigma$ rather than $\ln \sigma$, but we can take logs too:

. nlcom logSigma: ln(_b[/sigma])

    logSigma:  ln(_b[/sigma])

------------------------------------------------------------------------------
    naffairs |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    logSigma |   2.109859   .0670975    31.44   0.000     1.978351    2.241368

Note that this matches your R output. The z stat and the p-value are for the null that the log standard deviation of the residual is zero, which is definitely not the case here.

Here are the summary stats for the outcome for comparison to $\sigma$:

. sum naffairs        
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
        naffairs |       601    1.455907    3.298758          0         12

In this case, the model looks pretty bad, which is often the case with Tobit models, especially "toy" ones meant to illustrate syntax.

I'm not sufficiently familiar w/ this, but I gather from the help that the line in the summary output & the associated p-value is a test of whether you *ought* to be using the ln(s) instead of s for some reason. ln(s) would then presumably be multiplied by the vcov matrix to calculate the SEs of the coefficients etc. I don't know why it would be that you should be using ln(s) instead of s in this way, but that is my read of the materials. — gung - Reinstate Monica, Jun 16 '14 at 21:15
I think gung's comment is also what I would like to understand. — David Z, Jun 17 '14 at 12:28
I checked around and it seems the marginal effects are not that straight forward like described [here](http://stats.stackexchange.com/questions/44274/standardized-solution-for-censored-regression). I think I need more knowledge to understand through. — David Z, Jun 17 '14 at 12:50
@DavidZ I don't believe @gung's comment to be correct, so I unfortunately I cannot explain it to you. You can simulate some data where you know $\sigma$ and use `censreg` on it to make sure I am correct. The various marginal effects formulas can be found in most econometrics textbooks, but I don't know how to get them in R automatically. — dimitriy, Jun 17 '14 at 19:58
@DavidZ, Dimitriy may well be right here. He certainly seems familiar with this topic. My comment was based on 10 minutes of skimming the documentation, which I may well have misinterpreted. — gung - Reinstate Monica, Jun 17 '14 at 21:14
Thanks, Dimitriy! It's a totally fresh topic to me and I think I need gather more knowledge around. — David Z, Jun 18 '14 at 00:56
Thanks, Dimitriy! You do had me a great explanation. I got it now :). — David Z, Jun 18 '14 at 13:16

Explanation of a censored regression model

1 Answers1

Linked