Two simple questions regarding GLM

Question

I'm currently doing a modelling project. However, I haven't taken a bunch of statistics classes, so I have to teach myself generalized linear models. I'm reading Generalized Linear Models for Insurance Data (Heller and de Jong, 2008, CUP), and I have two questions:

1. On page 64, it says:

Given a response $y$, the generalized linear model is $f(y)=c(y,\phi)\exp\left\{\frac{y\theta - a(\theta)}{\phi}\right\}$. The equation for $f(y)$ specifies that the distribution of the response is in the exponential family.

Is that the equation for the distribution of $E[y_i|x_i]$ or some other thing? If it's the distribution for $y$ corresponding to a fixed $x_i$, is it possible that even if the plot of $y$ against $x$ looks like a straight line, I should still use GLM instead of simple regression?

update: I guess I should clarify myself a little bit. Currently I have a dataset and my dependent variable is $y$. I made a histogram of $y$ (with frequency on the y-axis) and it looks like a gamma curve fits well. Does that essentially imply that I should choose $f(y)$ to be gamma? I kinda doubt it because I suppose $y_i|_{X=x_i}$ and $Y$ are essentially two different things. I hope I'm not confusing you guys.

2. The book suggests that when assuming response $y$ follows a gamma distribution, it is a common practise to use a logarithmic link function. I don't quite understand the reason behind that.

Any suggestion would be great. Thanks!

Generally better to give a reasonably complete reference to save ambiguity. You mean the 2008 book by Piet de Jong and Gillian Heller? — Glen_b, Jul 29 '14 at 00:49

score 2 · Answer 1 · edited Apr 13 '17 at 12:44


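1. That is the equation for the distribution of $y$. Its mean $E[y]$ is a function of the parameters; once that mean is tied to the linear predictor $\beta^TX_i$ through the link function (set equal to it under the identity link), we write it $E[y_i|X_i]$.
2. The logarithmic link is the "canonical link" for the Poisson. For the gamma the canonical link is the inverse, $1/\mu$, so the log link there is a common convention rather than the canonical choice.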

This setup is developed in chapter 4 of Categorical Data Analysis by Agresti. It's the book I used in my GLM class and it's also not bad for self-study, imo.


answered Jul 28 '14 at 16:44

shadowtalker


So the distribution of $y_i$ is conditioned on a certain value of $x$, is that correct? – 3x89g2 Jul 30 '14 at 20:18
Yes. If you're ever having trouble with this on an intuitive level, just remember that your typical old linear regression is equivalent to a GLM with a Gaussian $y$ and identity link. – shadowtalker Jul 30 '14 at 23:22
That brings up another (probably silly) question. I first generated a histogram of $y$ (which is severity in my case), with frequency on the y-axis. "It looks like gamma. Let's use gamma distribution" is what my manager told me. But if gamma refers to the conditional distribution, then I don't see why his suggestion makes sense... – 3x89g2 Jul 30 '14 at 23:49
Well, the first moment of the distribution of $y_i$ is conditioned on $x_i$, and the parameters of $f(y_i)$ thereby. Assuming that the conditional and marginal distributions are in the same family is a bit less severe than assuming they're actually the same. I've always been a bit uncomfortable with that process myself, but that's what people seem to typically do. – shadowtalker Jul 31 '14 at 01:08

score 2 · Accepted Answer · edited Apr 13 '17 at 12:44

The equation is a general form for the broad class of densities in the exponential family (i.e. that's the pdf).
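As a worked illustration (my own algebra, using the usual mean/shape parameterisations rather than anything quoted from the book), here is how two familiar densities fit the form $f(y)=c(y,\phi)\exp\left\{\frac{y\theta - a(\theta)}{\phi}\right\}$. For a normal with mean $\mu$ and variance $\sigma^2$:

$$f(y)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{y^2}{2\sigma^2}\right\}\exp\left\{\frac{y\mu-\mu^2/2}{\sigma^2}\right\},\qquad \theta=\mu,\quad a(\theta)=\frac{\theta^2}{2},\quad \phi=\sigma^2 .$$

For a gamma with mean $\mu$ and shape $\nu$:

$$f(y)=\frac{\nu^{\nu}y^{\nu-1}}{\Gamma(\nu)}\exp\left\{\frac{y\left(-\frac{1}{\mu}\right)-\log\mu}{1/\nu}\right\},\qquad \theta=-\frac{1}{\mu},\quad a(\theta)=-\log(-\theta),\quad \phi=\frac{1}{\nu} .$$

In both cases $E[y]=a'(\theta)=\mu$ and $\operatorname{Var}(y)=\phi\,a''(\theta)$, which gives $\sigma^2$ for the normal and $\mu^2/\nu$ for the gamma.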

If it's the distribution for y corresponding to a fixed $x_i$, is it possible that even if the plot of $y$ against $x$ looks like a straight line, I should still use GLM instead of simple regression?

The equation for the conditional density is unrelated to the form of the relationship between $y$ and $x$. It is perfectly possible to fit a linear function (via the identity link) with an exponential family conditional density. Which is to say, yes, you can still use GLMs when it's a straight line. Indeed the Gaussian is in the exponential family, so you can still do regression there also.
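A quick way to convince yourself of the Gaussian case is to fit the same straight line both ways. This is only a minimal sketch on R's built-in `cars` data; the object names `fit_lm` and `fit_glm` are mine.

```r
# Ordinary least squares fit
fit_lm <- lm(dist ~ speed, data = cars)

# The same model written as a GLM: Gaussian response, identity link
fit_glm <- glm(dist ~ speed, data = cars,
               family = gaussian(link = "identity"))

# The two sets of coefficients agree up to numerical tolerance
print(coef(fit_lm))
print(coef(fit_glm))
print(all.equal(coef(fit_lm), coef(fit_glm)))
```

So "simple regression" is just one particular member of the GLM family, not an alternative to it.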

e.g. here's a straight line fit with a Gamma response:

> summary(glm(dist~speed,cars,family=Gamma(link=identity)))

Call:
glm(formula = dist ~ speed, family = Gamma(link = identity), 
    data = cars)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.07986  -0.29703  -0.06053   0.22879   0.87150  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -7.5843     2.1292  -3.562 0.000843 ***
speed         3.2106     0.2556  12.563  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 0.1617597)

    Null deviance: 22.4827  on 49  degrees of freedom
Residual deviance:  8.0945  on 48  degrees of freedom
AIC: 411.79

Number of Fisher Scoring iterations: 8

[Figure: scatterplot of dist against speed for the cars data, with the two fitted lines overlaid]

The Gamma linear fit is in red, the least squares fit is in blue. See the discussion here as to why a gamma model, even with identity link, is better in this case (essentially, none of the fitted stopping distances are negative). It's still less than perfect (it suggests a 0 and then negative stopping distance at a positive speed), but its fit is at least plausible within the range of $x$ values we actually have, which is a very useful property to have.
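If you want to reproduce the comparison yourself, something along these lines should do it. It is a sketch rather than the exact code behind the plot; the names `gamma_fit` and `ls_fit` are mine.

```r
# Gamma GLM with identity link (the fit summarised above) and least squares
gamma_fit <- glm(dist ~ speed, data = cars, family = Gamma(link = "identity"))
ls_fit    <- lm(dist ~ speed, data = cars)

# Range of fitted stopping distances at the observed speeds
print(range(fitted(gamma_fit)))
print(range(fitted(ls_fit)))

# Overlay both lines on the data: Gamma fit in red, least squares in blue
plot(dist ~ speed, data = cars)
abline(coef = coef(gamma_fit), col = "red")
abline(coef = coef(ls_fit), col = "blue")
```

At the lowest observed speed the least squares line dips below zero, while the Gamma line stays positive over the whole observed range.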

(Of course, even better would be to fit a more plausible model.)

The book suggests that when assuming response y follows a gamma distribution, it is a common practise to use a logarithmic link function.

In insurance and many other financial applications certainly. Partly that's because the relationships involving things like money tend to be multiplicative, and are broadly understood in that form.
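To spell out the "multiplicative" point with a small worked equation (two generic covariates $x_{i1}, x_{i2}$, used purely for illustration): with a log link the linear predictor is the log of the mean, so

$$\log\mu_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}\quad\Longleftrightarrow\quad \mu_i=e^{\beta_0}\,e^{\beta_1x_{i1}}\,e^{\beta_2x_{i2}} .$$

A one-unit increase in $x_{i1}$ multiplies the mean by $e^{\beta_1}$ whatever the other covariates are, which is how rating factors are usually read. The log link also keeps $\mu_i>0$ for any values of the coefficients, which a gamma mean has to be; the identity link does not guarantee that.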

I made a histogram of y (with frequency on y-axis) and it looks like a gamma curve fits well.

It may look like that, but it's not necessarily very meaningful, and doesn't relate to the assumption.

Does that essentially imply that I should choose f(y) to be gamma? I kinda doubt it because I suppose yi|X=xi and Y are essentially two different things.

You're correct to doubt it. It's the conditional distribution that's assumed to be gamma.

If $y$ depends on $x$, the unconditional distribution of $y$ will be a mixture of those conditional distributions and may not be meaningful. It could look completely different from gamma; a different pattern of $x$ values could change the y-histogram dramatically, while leaving the conditional distributions unchanged.
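Here is a small simulation sketch of that last point. Everything in it is made up for illustration: the shape value, the coefficients 1 and 0.4, and the two designs for $x$. The conditional distribution of $y$ given $x$ is the same gamma in both samples; only the pattern of $x$ values differs.

```r
set.seed(1)
n     <- 10000
shape <- 20                                  # hypothetical gamma shape (1 / dispersion)
cond_mean <- function(x) exp(1 + 0.4 * x)    # hypothetical log-link conditional mean

# Two different patterns of x values
x_a <- runif(n, 0, 10)                       # spread evenly over [0, 10]
x_b <- sample(c(0, 10), n, replace = TRUE)   # only the two extremes

# Same conditional law in both cases: y | x ~ Gamma with mean cond_mean(x)
y_a <- rgamma(n, shape = shape, rate = shape / cond_mean(x_a))
y_b <- rgamma(n, shape = shape, rate = shape / cond_mean(x_b))

# The marginal histograms look nothing alike
op <- par(mfrow = c(1, 2))
hist(y_a, breaks = 50, main = "x uniform on [0, 10]")
hist(y_b, breaks = 50, main = "x in {0, 10}")
par(op)

# ...yet a gamma GLM with log link recovers much the same coefficients from each
fit_a <- glm(y_a ~ x_a, family = Gamma(link = "log"))
fit_b <- glm(y_b ~ x_b, family = Gamma(link = "log"))
print(rbind(design_a = coef(fit_a), design_b = coef(fit_b)))
```

The second histogram is bimodal, with a gap where the first has plenty of mass, so neither marginal shape tells you much about whether the conditional distribution is gamma.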

So here the gamma distribution refers to the distribution of $y$ for a fixed value of $x$ (say 10), not the density curve we fit to the histogram of $y$. Is that correct? — 3x89g2, Jul 31 '14 at 01:51
Yes, the conditional distribution of $[y|x]$, that's right. If $y$ depends on $x$, the unconditional distribution of $y$ will be a mixture of those conditional distributions and may not be meaningful at all. — Glen_b, Jul 31 '14 at 02:08
OK. So now if I can fit a nice gamma curve on my histogram of $y$, that does not necessarily imply we should choose gamma distribution from exponential families, right? — 3x89g2, Jul 31 '14 at 02:36
I can only say "yes" so many times, and in so many ways. After that, there's nothing to add. — Glen_b, Jul 31 '14 at 02:43
