If I have two OLS models with the same number of parameters, all of them with zero p-values, then the next thing I look at is which one has the larger $R^2$.

But in the case of a GLM, how do I decide which of two models is better? (Again, assuming that both have the same number of parameters, all with zero p-values.)

oneloop
  • A P-value is never zero. What you may be looking at is all zeros as printed to the number of decimal places your software is showing, but such P-values are not zero. So, for example, 0.000 just means <0.0005. – Nick Cox Mar 26 '18 at 00:04
  • https://stats.stackexchange.com/questions/68066/coefficient-of-determination-for-binary-responses and https://stats.stackexchange.com/questions/232471/why-does-the-glm-function-does-not-return-an-r2-value are two of several posts here giving a meaning to $R^2$ for GLMs. – Nick Cox Mar 26 '18 at 00:09
  • Yes, of course, but it seemed to me that "zero p-value" would be more economical than "the p-value equals 10 to the power of minus a large number". Thanks for the links. – oneloop Mar 26 '18 at 08:50
  • Good that you know, but (1) for many readers new to and/or struggling with statistics there is no "of course" about that statement, and (2) in several fields, notably bioinformatics, tracking very small P-values is of prime interest. – Nick Cox Mar 26 '18 at 09:42
  • Also check [AIC](https://en.wikipedia.org/wiki/Akaike_information_criterion) and [BIC](https://en.wikipedia.org/wiki/Bayesian_information_criterion). – Zhubarb Mar 26 '18 at 10:23
  • @Berkmeister I agree, but since the models have the same number of parameters, your proposal as well as Alexis's answer below just boil down to comparing the likelihoods. – Knarpie Mar 26 '18 at 10:26

1 Answer


You can use the generalized $\mathbf{R^{2}}$ proposed by Maddala for binomial models, extended by Magee, independently developed by Cox & Snell, and refined by Nagelkerke:

$$R^{2}_{\text{CS}} = 1 - \left(\frac{L\left(0\right)}{L(\hat{\theta})}\right)^{\frac{2}{n}}$$

where $L(0)$ is the likelihood of the null model (i.e. $\text{link}(y) = \beta_{0}$), and $L(\hat{\theta})$ is the likelihood of the model fitted on your predictors.
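As a concrete illustration (a minimal sketch, not from the original answer; the simulated data and variable names are hypothetical), you can compute $R^{2}_{\text{CS}}$ in Python with statsmodels by fitting the model and its intercept-only null and combining their log-likelihoods:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: two predictors and a binary response
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
eta = 0.8 * X[:, 0] - 0.5 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# Fitted model (intercept + predictors) and null model (intercept only)
fitted = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
null = sm.GLM(y, np.ones((n, 1)), family=sm.families.Binomial()).fit()

# R^2_CS = 1 - (L(0)/L(theta_hat))^(2/n), computed on the log scale
# (llf is the log-likelihood), which avoids under/overflow
r2_cs = 1.0 - np.exp((2.0 / n) * (null.llf - fitted.llf))
print(f"Cox-Snell R^2: {r2_cs:.4f}")
```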

Nagelkerke says of the $R_{\text{CS}}^{2}$: "It is easily found that this definition of $R^{2}$ has the following properties.

  1. It is consistent with classical $R^{2}$, that is the general definition applied to e.g. linear regression yields the classical $R^{2}$.

  2. It is consistent with maximum likelihood as an estimation method, i.e. the maximum likelihood estimates of the model parameters maximize $R^{2}$.

  3. It is asymptotically independent of the sample size $n$.

  4. It has an interpretation as the proportion of explained 'variation', or rather, $1-R^{2}$ has the interpretation of the proportion of unexplained 'variation'. [More nuanced things in the article]

  5. It is dimensionless, i.e. it does not depend on the units used.

  6. Replacing the factor $2/n$… by $k/n$ yields a generalization of the proportion of the $k$th central moment explained by the model.

  7. Let $y$ have a probability density $P(y|\beta)$, then using Taylor expansion, it can be shown that to a first order approximation, $R^{2}$ is the square of the Pearson correlation between $x$ and the efficient score of the model $p(.)$, that is the derivative with respect to $\beta$ of $\log\left(p(y|\beta x +\alpha ) \right)$ at $\beta = 0$."
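To make property 1 concrete (this check is not in Nagelkerke's list, but it is a standard result): for a Gaussian linear model with the error variance profiled out by maximum likelihood, both likelihoods reduce to functions of sums of squares,

$$L(\hat{\theta}) = \left(2\pi\hat{\sigma}_{1}^{2}\right)^{-n/2} e^{-n/2},\quad \hat{\sigma}_{1}^{2} = \frac{\text{SS}_{\text{res}}}{n}, \qquad L(0) = \left(2\pi\hat{\sigma}_{0}^{2}\right)^{-n/2} e^{-n/2},\quad \hat{\sigma}_{0}^{2} = \frac{\text{SS}_{\text{tot}}}{n},$$

so that

$$\left(\frac{L(0)}{L(\hat{\theta})}\right)^{2/n} = \frac{\hat{\sigma}_{1}^{2}}{\hat{\sigma}_{0}^{2}} = \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} \quad\Longrightarrow\quad R^{2}_{\text{CS}} = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}},$$

which is exactly the classical $R^{2}$.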

However, Nagelkerke observed that the Cox-Snell formulation gives a maximum $R^{2}<1$, and proposed the form:

$$\overline{R}^{2}=\frac{R^{2}_{\text{CS}}}{1-L(0)^{\frac{2}{n}}}$$

which is 0 when $L(\hat{\theta})=L(0)$, and is 1 when the model perfectly fits the data.
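Continuing the hypothetical Python sketch from above, Nagelkerke's adjustment is one extra line, since the maximum attainable $R^{2}_{\text{CS}}$ is $1 - L(0)^{2/n}$:

```python
# Nagelkerke's adjustment: rescale by the maximum attainable R^2_CS,
# i.e. 1 - L(0)^(2/n) = 1 - exp((2/n) * llf_null)
r2_nagelkerke = r2_cs / (1.0 - np.exp((2.0 / n) * null.llf))
print(f"Nagelkerke R^2: {r2_nagelkerke:.4f}")
```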

References

Cox, D. R.; Snell, E. J. (1989). The Analysis of Binary Data (2nd ed.). Chapman and Hall.

Maddala, G. S. (1983). Limited-Dependent and Qualitative Variables in Econometrics. Cambridge, UK: Cambridge University Press.

Magee, L. (1990). $R^{2}$ measures based on Wald and likelihood ratio joint significance tests. The American Statistician, 44(3):250–253.

Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78(3):691–692.

Alexis
  • Hi, thanks for your answer. I guess this is the most canonical answer, but for completeness I'm linking a website that I found with multiple "pseudo R^2" alternatives: https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/ – oneloop Mar 26 '18 at 08:23
  • Actually, there's something I'm missing. You didn't specify what $n$ is. And if $n$ is the sample size, like in the Wikipedia article about it, then for a very large sample size does $R^2$ go to zero? – oneloop Mar 26 '18 at 08:28
  • @oneloop $n$ is the sample size. The $R^{2}$ ***does not*** go to zero (notice the $1-$ portion of the formula, and notice the portion that is being raised to $2/n$). – Alexis Mar 26 '18 at 17:34
  • @oneloop Not at all! Note that in the link you provided, Nagelkerke and Uhler make an adjustment to the Cox-Snell-Magee-Maddala $R^{2}$ that ensures a 0–1 range... appending it to my answer now. – Alexis Mar 27 '18 at 15:03