
I am debating this with some friends at the moment and would like to know what the Stack Exchange community can add to the discussion.

Some say that the choice of distribution and link function for a GLM is not that important, even if the outcome variable does not follow the selected distribution. I beg to differ, because a wrong choice would invalidate all statistical results obtained from the GLM. However, it seems quite common to ignore this as long as the outcome just 'seems' to follow a particular distribution.

So my question is: how important is it that the distribution is chosen properly, for example by checking it with a statistical test (e.g. a KS test)?

If anyone could also provide an example as to why this is/is not important I would appreciate it.

Thanks!

  • When we're talking about discrete distributions such as Bernoulli or Binomial the ks.test can tell you nothing more than you already know; there is a yes or no answer to whether or not they can be used. What kind of GLM did you have in mind? – Digio Jul 29 '17 at 21:08
  • Hi, indeed, I was thinking more along the lines of continuous distributions... Gamma/normal – Charl Francois Marais Jul 29 '17 at 22:04
  • Gaussian GLM with identity link is very much like Gaussian-noise linear regression with MLE. The distribution should be diagnosed on the residuals, not the response. Still, I think the other assumptions of linear regression on model structure and the residuals are more important than the distribution. So I'm afraid my opinion is closer to that of your friends. – Digio Jul 31 '17 at 08:24
  • @Digio alright, this is something that is not entirely clear to me. What you are saying is that the choice of the distribution is made on the residuals and not the outcome variable? Hence, you will base your decision regarding the choice of the distribution only after you have fit the first model? – Charl Francois Marais Jul 31 '17 at 09:10
  • Yes, a parametric model is based on assumptions and the validity of these assumptions can only be assessed a posteriori (after estimation). I would, however, use the term "choice of model" rather than "choice of distribution" since you start off by evaluating a model structure and not a distribution (this can be expanded even to discrete response GLM). – Digio Jul 31 '17 at 09:45
  • Awesome advice, Digio! Would love to discuss this in more detail, just because it is super interesting... So here are my thoughts: – Charl Francois Marais Jul 31 '17 at 10:08
  • If we fit a GLM to the outcome variable alone, i.e. glm(y~1, family=gaussian(link=identity)), then in essence we are fitting a normal distribution: instead of y = b0 + b1x1 + ... + e we only have y = b0 + e, so we say that y ~ N(mu, sigma^2) with mu = b0 and sigma^2 the variance of e. Therefore the distribution of y is normal, just like the errors. If we were to include other terms, the distribution would remain the same. – Charl Francois Marais Jul 31 '17 at 10:17
  • The distribution of the residuals will obviously be affected by the inclusion of independent variables. Anyway, I think it's best to start a new question on this because the comment section is not supposed to serve the purposes of general discussion. – Digio Jul 31 '17 at 14:16
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/63144/discussion-between-charl-francois-marais-and-digio). – Charl Francois Marais Aug 01 '17 at 08:10
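
To make the point from the comments concrete (diagnose the distribution on the residuals, not on the response), here is a minimal sketch in Python, standard library only, instead of R. Everything in it is assumed for illustration: the binary covariate x, the coefficients 1 and 6, the noise sd of 1, and the helper names ks_normal and fit_line. It simulates data from a Gaussian identity-link model, fits it by least squares (which gives the same coefficients as the Gaussian MLE), and compares the KS distance to a fitted normal for the raw response and for the residuals.

```python
import random
from statistics import NormalDist, fmean, pstdev


def ks_normal(values):
    """KS distance between a sample and a normal with the sample's own mean and sd."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(fmean(xs), pstdev(xs))
    d = 0.0
    for i, v in enumerate(xs):
        cdf = fitted.cdf(v)
        # The empirical CDF jumps from i/n to (i+1)/n at v; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(1)  # fixed seed so the run is reproducible
n = 2000

# Assumed data-generating process: y | x ~ N(1 + 6x, 1) with a binary x.
x = [rng.choice([0.0, 1.0]) for _ in range(n)]
y = [1.0 + 6.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

b0, b1 = fit_line(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

ks_y = ks_normal(y)          # marginal response: a mixture of two normals
ks_resid = ks_normal(resid)  # residuals: what the Gaussian assumption is about

print(f"intercept={b0:.3f} slope={b1:.3f}")
print(f"KS distance, raw response: {ks_y:.3f}")
print(f"KS distance, residuals:    {ks_resid:.3f}")
```

In this setup the raw response is bimodal, so its KS distance should come out much larger than that of the residuals, even though the Gaussian model is correctly specified. The distances are descriptive only: because the normal's mean and sd are estimated from the same data, the standard KS critical values do not apply.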
