
As far as I understand, the Wald test in the context of logistic regression is used to determine whether a certain predictor variable $X$ is significant or not. It tests the null hypothesis that the corresponding coefficient is zero.

The test consists of dividing the value of the coefficient by its standard error $\sigma$.

What I am confused about is that $X/\sigma$ is also known as the $z$-score, which indicates how likely it is that a given observation comes from the normal distribution (with mean zero).

user695652
  • Possible duplicate of [Wald test in regression (OLS and GLMs): t- vs. z-distribution](https://stats.stackexchange.com/questions/56066/wald-test-in-regression-ols-and-glms-t-vs-z-distribution) – Firebug Nov 27 '17 at 21:50
  • Perhaps it could be the other way around though, as the answer in this one is more developed. – Firebug Nov 27 '17 at 21:51

1 Answer


The estimates of the coefficients and the intercepts in logistic regression (and any GLM) are found via maximum-likelihood estimation (MLE). These estimates are denoted with a hat over the parameters, something like $\hat{\theta}$. Our parameter of interest is denoted $\theta_{0}$, and this is usually 0, as we want to test whether the coefficient differs from 0 or not. From the asymptotic theory of MLE, we know that, under the null hypothesis, the difference between $\hat{\theta}$ and $\theta_{0}$ will be approximately normally distributed with mean 0 (details can be found in any mathematical statistics book, such as Larry Wasserman's *All of Statistics*).

Recall that standard errors are nothing else than standard deviations of statistics (Sokal and Rohlf write in their book *Biometry*: "a statistic is any one of many computed or estimated statistical quantities", e.g. the mean, median, standard deviation, correlation coefficient, regression coefficient, ...). Dividing a normally distributed quantity with mean 0 and standard deviation $\sigma$ by its standard deviation yields the standard normal distribution with mean 0 and standard deviation 1. The Wald statistic is defined as (e.g. Wasserman (2006): *All of Statistics*, pages 153, 214-215):
$$
W=\frac{\hat{\beta}-\beta_{0}}{\widehat{\operatorname{se}}(\hat{\beta})}\sim \mathcal{N}(0,1)
$$
or
$$
W^{2}=\frac{(\hat{\beta}-\beta_{0})^2}{\widehat{\operatorname{Var}}(\hat{\beta})}\sim \chi^{2}_{1}
$$
The second form arises from the fact that the square of a standard normal random variable follows the $\chi^{2}_{1}$-distribution with 1 degree of freedom (the sum of two squared standard normal variables would follow a $\chi^{2}_{2}$-distribution with 2 degrees of freedom, and so on).

Because the hypothesized parameter value is usually 0 (i.e. $\beta_{0}=0$), the Wald statistic simplifies to
$$
W=\frac{\hat{\beta}}{\widehat{\operatorname{se}}(\hat{\beta})}\sim \mathcal{N}(0,1)
$$
which is what you described: the estimate of the coefficient divided by its standard error.
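As a minimal sketch of this computation in R (using the built-in mtcars data purely for illustration; this particular model is not from the question or the answer):

# Fit a logistic regression on the built-in mtcars data (illustrative only)
fit <- glm(am ~ wt + hp, data = mtcars, family = "binomial")

# Wald statistic by hand: estimate divided by its standard error
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
W   <- est / se

# Two-sided p-values from the standard normal distribution
p <- 2 * pnorm(-abs(W))

# These reproduce the "z value" and "Pr(>|z|)" columns of summary(fit)
cbind(W, p)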


When is a $z$-value and when a $t$-value used?

The choice between a $z$-value and a $t$-value depends on how the standard error of the coefficients has been calculated. Because the Wald statistic is asymptotically distributed as a standard normal, we can use the $z$-score to calculate the $p$-value. When, in addition to the coefficients, we also have to estimate the residual variance, a $t$-value is used instead of the $z$-value. In ordinary least squares (OLS, normal linear regression), the variance-covariance matrix of the coefficients is $\operatorname{Var}[\hat{\beta}\mid X]=\sigma^2(X'X)^{-1}$, where $\sigma^2$ is the variance of the residuals (which is unknown and has to be estimated from the data) and $X$ is the design matrix. In OLS, the standard errors of the coefficients are the square roots of the diagonal elements of this variance-covariance matrix. Because we don't know $\sigma^2$, we have to replace it by its estimate $\hat{\sigma}^{2}=s^2$, so $\widehat{\operatorname{se}}(\hat{\beta}_{j})=\sqrt{s^2\left[(X'X)^{-1}\right]_{jj}}$, where $[\cdot]_{jj}$ denotes the $j$-th diagonal element. And that's the point: because we have to estimate the variance of the residuals to calculate the standard error of the coefficients, we need to use a $t$-value and the $t$-distribution.
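To make this concrete, here is a small sketch in R that reproduces the OLS standard errors and $t$-values by hand (using the swiss data from the OLS example further below):

# Fit the OLS model and extract the design matrix
fit <- lm(Fertility ~ ., data = swiss)
X   <- model.matrix(fit)
n   <- nrow(X)
k   <- ncol(X)

# Estimate the residual variance: s^2 = RSS / (n - k)
s2 <- sum(residuals(fit)^2) / (n - k)

# Standard errors: square roots of the diagonal of s^2 (X'X)^{-1}
se <- sqrt(s2 * diag(solve(t(X) %*% X)))

# t-values and p-values from the t-distribution with n - k degrees of freedom
tval <- coef(fit) / se
pval <- 2 * pt(-abs(tval), df = n - k)
cbind(se, tval, pval)  # matches summary(fit)$coefficients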

In logistic (and Poisson) regression, the variance of the residuals is related to the mean. If $Y\sim \mathrm{Bin}(n, p)$, the mean is $E(Y)=np$ and the variance is $\operatorname{Var}(Y)=np(1-p)$, so the variance and the mean are related. In logistic and Poisson regression, but not in regression with Gaussian errors, we know the expected variance and don't have to estimate it separately. The dispersion parameter $\phi$ indicates whether we have more or less than the expected variance. If $\phi=1$, we observe exactly the expected amount of variance, whereas $\phi<1$ means that we have less than the expected variance (called underdispersion) and $\phi>1$ means that we have extra variance beyond the expected (called overdispersion). The dispersion parameter in logistic and Poisson regression is fixed at 1, which means that we can use the $z$-score. In other regression types, such as normal linear regression, we have to estimate the residual variance, and thus a $t$-value is used for calculating the $p$-values. In R, look at these two examples:

Logistic regression

# Read the admissions data and treat rank as a categorical predictor
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)

# Logistic regression of admission on GRE score, GPA and school rank
my.mod <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(my.mod)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

Note that the dispersion parameter is fixed at 1, and thus we get $z$-values.
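As a side note (this check goes beyond the original answer): one common way to gauge whether the fixed dispersion of 1 is plausible is to estimate $\phi$ from the Pearson residuals, or to refit with family = "quasibinomial", which estimates $\phi$ instead of fixing it and then reports $t$-values:

# Pearson-based estimate of the dispersion parameter phi
phi.hat <- sum(residuals(my.mod, type = "pearson")^2) / df.residual(my.mod)
phi.hat  # values near 1 are consistent with the assumed binomial variance

# Refit with an estimated dispersion parameter; summary() now shows t-values
my.mod.quasi <- glm(admit ~ gre + gpa + rank, data = mydata,
                    family = "quasibinomial")
summary(my.mod.quasi)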


Normal linear regression (OLS)

# OLS regression of Fertility on all other variables in the swiss data
summary(lm(Fertility ~ ., data = swiss))

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
Examination      -0.25801    0.25388  -1.016  0.31546    
Education        -0.87094    0.18303  -4.758 2.43e-05 ***
Catholic          0.10412    0.03526   2.953  0.00519 ** 
Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.165 on 41 degrees of freedom

Here, we have to estimate the residual variance (reported as the "Residual standard error") and hence we use $t$-values instead of $z$-values. Of course, in large samples the $t$-distribution approximates the normal distribution and the difference doesn't matter.
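As a quick numerical check (reusing the $t$-value of the Catholic coefficient from the output above):

# p-value from the t-distribution with 41 residual degrees of freedom
2 * pt(-abs(2.953), df = 41)   # approx. 0.0052, as in the summary output

# Normal approximation: close, but not identical, at this sample size
2 * pnorm(-abs(2.953))         # approx. 0.0031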

Another related post can be found here.

COOLSerdash
  • Thank you very much for this nice post which answers all my questions. – user695652 May 26 '13 at 21:41
  • So, practically, regarding the first part of your excellent answer: If for some reason I'd have as an output the odds ratio and the Wald statistic, I could then calculate the standard error from these as: SE = (1/Wald-statistic)*ln(OR). Is this correct? Thanks! – Sander W. van der Laan Aug 10 '15 at 20:50
  • @SanderW.vanderLaan Thanks for your comment. Yes, I believe that's correct. If you perform a logistic regression, the Wald statistic will be the z-value. – COOLSerdash Aug 11 '15 at 16:13
  • Such a great answer! I do have some revision suggestions: I personally feel this answer is mixing up details with the punch lines. I would put the details of how linear regression uses the variance of the residuals in a separate paragraph. – Haitao Du Apr 09 '17 at 01:56
  • Also for the dispersion parameter and the connection to the R code, maybe we can open another section or a separation line to talk about it. – Haitao Du Apr 09 '17 at 01:58
  • Just a side note about this answer: the specific formula given for the variance-covariance matrix is from ordinary least squares regression, *not* from logistic regression, which does not use the residual standard error but instead involves a diagonal matrix with the individual Bernoulli variances from the predicted probability for each observation along the diagonal. – ely Aug 09 '18 at 18:39
  • @ely Thanks for the heads-up. Actually, I mentioned that the presented result is for OLS in the paragraph, although it could be made more prominent, I admit. I added a short note to emphasize the fact. – COOLSerdash Aug 10 '18 at 07:00
  • @COOLSerdash "In logistic and poisson regression but not in regression with gaussian errors, we know the expected variance and don't have to estimate it separately." How do we calculate the expected variance (and how come we know this exactly)? I see that the variance of $Y$ is related to the mean, but 1) don't we need the variance of the coefficient we are testing? and 2) if the variance depends on $p$ -- as does $\operatorname{Var}(Y\vert X)$ -- we do not know the true value of $p$? – user106860 Apr 27 '20 at 21:23
  • @COOLSerdash Thank you for the excellent answer. A brief question: How does the fixed variance/dispersion (of 1) fit in with over/under dispersion that can still occur in logistic regressions? Sorry this is posted as an answer. I couldn't comment as I have 0 reputation. EDIT: Actually, I think this from Hox et al. (2017) may be the answer: "If the scale factor is set to one, the assumption is made that the observed errors follow the theoretical binomial error distribution exactly. If the scale factor is significantly higher or lower than one, there is overdispersion or underdispersion. Under- a – WhiteSwanBlackSwan Jul 15 '21 at 09:29