
My intention is to test the power of a single predictor x in predicting two different responses: y1, which is continuous, and y2, which is presence/absence.

I therefore fitted two models (I work in R environment):

  1. lm(y1 ~ x)
  2. glm(y2 ~ x, family = binomial())

Now I want to compare how x performed in the two cases relative to one another.

It is clear to me that comparing the R^2 from lm() with a pseudo-R^2 from glm() is not valid at all. But what if I use a function that derives a pseudo-R^2 for both lm() and glm(), such as rcompanion::nagelkerke(), and specify the null models as lm(y1 ~ 1) and glm(y2 ~ 1), respectively? Do I then get values that I can compare with each other, saying that x is more efficient at predicting y1 than y2, or vice versa?
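For concreteness, a sketch of what I have in mind (my variables are assumed to sit in a data frame dat; the names are only placeholders):

fit1 <- lm(y1 ~ x, data = dat)                         # continuous response
fit2 <- glm(y2 ~ x, data = dat, family = binomial())   # presence/absence response

# Nagelkerke pseudo-R^2 for each fit, against explicit intercept-only null models
rcompanion::nagelkerke(fit1, null = lm(y1 ~ 1, data = dat))
rcompanion::nagelkerke(fit2, null = glm(y2 ~ 1, data = dat, family = binomial()))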

  • Interesting question +1. I might rephrase the question (particularly the title) to be more explicit in asking how to determine which variable `x` is better at predicting, [since that seems to be your real question, not anything specific to pseudo $R^2$.](https://en.wikipedia.org/wiki/XY_problem) – Dave Jan 28 '22 at 14:21
  • True. If you have an idea of how to make the title more explicitly linked to the question, please share it. – Kryštof Chytrý Jan 30 '22 at 14:29
  • I wouldn't attempt to use any pseudo r-squared measure to compare among different types of models. My advice would be to try to figure out exactly what criterion you want to use to determine how well both a continuous variable and a dichotomous variable are predicted. Off the top of my head, I can't think of a way to do this, except for maybe dichotomizing the continuous variable and counting the number of correct predictions for each model (essentially the Count pseudo r-squared described here: https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/ ). – Sal Mangiafico Feb 02 '22 at 14:56

1 Answer


Instead of comparing two different models on a metric that is scaled similarly (i.e., estimating two separate models and comparing their Nagelkerke $R^2$ values), I would advise choosing a model in which both prediction equations can be incorporated simultaneously and rescaled directly; the relative impact of the predictor on each outcome is then far more comparable.

Consider a pair of models, quite similar to the variables discussed in the original post:

lm(mpg ~ cyl, data = datasets::mtcars)

glm(vs ~ cyl, data = datasets::mtcars, family = binomial(link = "probit"))

Using the {pscl} package's pR2() function, the pseudo-$R^2$ results are fairly inconclusive across the three $R^2$s reported (i.e., "McFadden", "r2ML", and "r2CU"). Unfortunately, mpg is scaled very differently than vs, which is clear from the very different underlying log-likelihoods reported below.

> lm(mpg ~ cyl, data = mtcars) |> pscl::pR2()
fitting null model for pseudo-r2
         llh      llhNull           G2     McFadden         r2ML         r2CU 
 -81.6532086 -102.3777581   41.4490989    0.2024321    0.7261800    0.7273903 

> glm(vs ~ cyl, data = mtcars, family = binomial(link = "probit")) |> pscl::pR2()
fitting null model for pseudo-r2
        llh     llhNull          G2    McFadden        r2ML        r2CU 
 -8.8896779 -21.9300546  26.0807534   0.5946349   0.5573711   0.7470935 

If, however, both equations are included in a single model and a standardized metric of their effects can be computed, the results, I would argue, are far more comparable.

For instance, {lavaan} (https://lavaan.ugent.be/) can accommodate both linear and probit models and offers methods that standardize the results to assist in their comparability. Specifically, {lavaan} estimates covariance structure/structural equation models, which allow a standardized solution to be produced by scaling the estimates by the estimated observed or latent variances of the outcomes, assuming both follow an underlying normal distribution.

> lavaan("mpg ~ cyl; vs ~ cyl", data = mtcars, ordered = "vs") |> standardizedsolution()
   lhs  op rhs est.std    se        z pvalue ci.lower ci.upper
1  mpg   ~ cyl  -0.982 0.004 -239.905  0.000   -0.990   -0.974
2   vs   ~ cyl  -0.846 0.089   -9.460  0.000   -1.022   -0.671
3   vs   |  t1  -2.743 0.733   -3.744  0.000   -4.178   -1.307
4  mpg  ~~ mpg   0.037 0.008    4.547  0.000    0.021    0.052
5   vs  ~~  vs   0.284 0.151    1.872  0.061   -0.013    0.580
6  cyl  ~~ cyl   1.000 0.000       NA     NA    1.000    1.000
7   vs ~*~  vs   1.000 0.000       NA     NA    1.000    1.000
8  mpg  ~1       0.000 0.000       NA     NA    0.000    0.000
9   vs  ~1       0.000 0.000       NA     NA    0.000    0.000
10 cyl  ~1       3.465 0.000       NA     NA    3.465    3.465

In particular, rows '1' and '2' provide a standardized metric for the effect of cyl on both mpg and vs that can be interpreted in the same way---the standard deviation change in the outcome (in the case of vs, in the underlying latent variable) per standard deviation change in cyl.

Here, it is clear that cyl does slightly better at predicting mpg than vs in terms of the magnitude of the coefficients when both are standardized by their estimated variances.

I anticipate a similar approach could be adopted for your situation to more directly compare these two predictive models/equations.
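
As a minimal sketch along those lines (assuming your predictor and both responses live in a data frame dat, with y1 continuous and y2 presence/absence; the names are placeholders):

library(lavaan)

# Joint model: a linear equation for y1 and a probit equation for y2 (declared via ordered =)
fit <- lavaan("y1 ~ x; y2 ~ x", data = dat, ordered = "y2")

# Standardized coefficients of x on each outcome, comparable on the same scale
standardizedsolution(fit)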


Appendix

Note that lavaan's unstandardized results match those obtained from the lm and glm models.

> lavaan("mpg ~ cyl; vs ~ cyl", data = mtcars, ordered = "vs") |> coef()
mpg~cyl  vs~cyl 
 -2.876  -0.890

> lm(mpg ~ cyl, data = mtcars)

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Coefficients:
(Intercept)          cyl  
     37.885       -2.876 

> glm(vs ~ cyl, data = mtcars, family = binomial(link = "probit"))

Call:  glm(formula = vs ~ cyl, family = binomial(link = "probit"), data = mtcars)

Coefficients:
(Intercept)          cyl  
       5.15        -0.89  

Degrees of Freedom: 31 Total (i.e. Null);  30 Residual
Null Deviance:      43.86 
Residual Deviance: 17.78    AIC: 21.78
jluchman