On page 4 of https://www.sagepub.com/sites/default/files/upm-binaries/21121_Chapter_15.pdf, the authors state the following strength of generalized linear models, which I don't quite understand.
Indeed, one of the strengths of the GLM paradigm - in contrast to transformations of the response variable in linear regression - is that the choice of linearizing transformation is partly separated from the distribution of the response, and the same transformation does not have to both normalize the distribution of Y and make its regression on the Xs linear. The specific links that may be used vary from one family to another and also—to a certain extent—from one software implementation of GLMs to another. For example, it would not be promising to use the identity, log, inverse, inverse-square, or square-root links with binomial data, nor would it be sensible to use the logit, probit, log-log, or complementary log-log link with nonbinomial data.
I understand the transformation that makes the regression linear is the link function. But what do they mean by the transformation that normalizes the distribution of Y?
What would it mean, concretely, if a single transformation had to do both jobs? What would the distribution of Y look like in that case?
How do the examples justify the stated property? They seem to describe cases where it is *not* advisable to pair an arbitrary link function with a given distribution, yet the claimed strength is that the link can be chosen somewhat independently of the distribution.
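To make my question concrete, here is a small sketch of what I understand the separation to mean (my own simulation, plain NumPy, fitting a Poisson GLM with a log link by iteratively reweighted least squares). The log link makes the regression of E[Y] on x linear, while the Poisson family describes the distribution of Y directly; notably, no transformation of y itself could make it normal, since y is a count and is often exactly 0, so log(y) is not even defined for every observation:

```python
import numpy as np

# Simulated count data: log E[Y] is linear in x, but Y itself is Poisson.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0.0, 2.0, n)
X = np.column_stack([np.ones(n), x])      # design matrix with intercept
beta_true = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ beta_true))    # y contains zeros, so log(y) fails

# Poisson GLM with log link, fitted by IRLS: the link linearizes the
# regression of E[Y] on x; the family (not a transform of y) handles
# the distribution of Y.
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta                        # linear predictor
    mu = np.exp(eta)                      # inverse link: E[Y] = exp(eta)
    z = eta + (y - mu) / mu               # working response
    w = mu                                # working weights (Var(Y) = mu for Poisson)
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))

print(beta)  # should be close to beta_true
```

Is this the right way to read the passage, i.e. the link is chosen to linearize the mean while the family is chosen to match the distribution, and neither choice forces the other?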