
When building models with the glm function in R, one needs to specify the family. A family specifies an error distribution (or variance) function and a link function. For example, when I perform a logistic regression, I use the binomial(link = "logit") family.

What do the error distribution (or variance) function and the link function represent in R?

I assume that the link function is the type of model built (hence the logit function for logistic regression), but I am not too sure about the error distribution function.

I had a look at R's documentation but could not find detailed information other than how to use them and what parameters can be specified.

user5365075

2 Answers


You don't specify the "error" distribution, you specify the conditional distribution of the response.

When you type the name of the family (such as binomial), that specifies the conditional distribution to be binomial, which in turn implies the variance function (e.g. in the case of the binomial it is $\mu(1-\mu)$). If you choose a different family you get a different variance function (for Poisson it's $\mu$, for Gamma it's $\mu^2$, for Gaussian it's constant, for inverse Gaussian it's $\mu^3$, and so on).
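As a quick check of the variance functions just listed, here is a minimal sketch in R. It relies on the `variance` component that family objects carry; the vector `mu` and the `v_*` names are illustrative choices, not anything from the question.

```r
# Family objects store their variance function as a plain R function of the mean.
mu <- c(0.2, 0.5, 0.8)  # illustrative mean values

v_binomial <- binomial()$variance(mu)          # mu * (1 - mu)
v_poisson  <- poisson()$variance(mu)           # mu
v_gamma    <- Gamma()$variance(mu)             # mu^2
v_invgauss <- inverse.gaussian()$variance(mu)  # mu^3
v_gaussian <- gaussian()$variance(mu)          # constant: 1 for every mu

# One row per family, one column per value of mu
print(rbind(v_binomial, v_poisson, v_gamma, v_invgauss, v_gaussian))
```

Note that these are variance functions of the mean; the actual conditional variance also involves a dispersion parameter (fixed at 1 for the binomial and Poisson families).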

[For some cases (e.g. logistic regression) you can take a latent-variable approach to the GLM - and in that case, you might possibly regard the distribution of the latent variable as a form of "error distribution".]
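The latent-variable view can be illustrated with a small simulation. This is only a sketch under assumptions of my own choosing (the sample size, the coefficients `beta0` and `beta1`, and the seed are hypothetical): a latent variable with standard logistic noise is thresholded at zero, and an ordinary logistic regression is fitted to the resulting binary outcome.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
n     <- 5000                    # hypothetical sample size
beta0 <- -0.5                    # hypothetical true intercept
beta1 <- 1.2                     # hypothetical true slope

x      <- rnorm(n)
# Latent variable: linear predictor plus a standard logistic "error"
y_star <- beta0 + beta1 * x + rlogis(n)
# We only observe whether the latent variable exceeds zero
y      <- as.integer(y_star > 0)

# P(y = 1 | x) = plogis(beta0 + beta1 * x), so a logit-link binomial GLM applies
fit <- glm(y ~ x, family = binomial(link = "logit"))
print(coef(fit))                 # should land near beta0 and beta1
```

Here the logistic distribution of the latent noise is what one might loosely call the "error distribution"; swapping `rlogis` for `rnorm` would correspond to a probit link instead.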

The link function determines how the mean ($\mu$) and the linear predictor ($\eta=X\beta$) are related. Specifically, if $\eta=g(\mu)$ then $g$ is called the link function.
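To make the relation $\eta = g(\mu)$ concrete, the following sketch pulls the link and its inverse out of a family object via its `linkfun` and `linkinv` components. The values in `mu` are arbitrary.

```r
fam <- binomial(link = "logit")
mu  <- c(0.1, 0.5, 0.9)          # arbitrary means in (0, 1)

eta     <- fam$linkfun(mu)       # g(mu) = log(mu / (1 - mu))
mu_back <- fam$linkinv(eta)      # g^{-1}(eta) = 1 / (1 + exp(-eta))

print(cbind(mu, eta, mu_back))   # mu_back reproduces mu
```

In a fitted model, `predict(fit, type = "link")` returns values on the $\eta$ scale and `predict(fit, type = "response")` returns them on the $\mu$ scale.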

You can find tables of the variance functions and the canonical link functions (which have some convenient properties) for commonly-used members of the exponential class in many standard books as well as all over the place on the internet. Here's a small one:

\begin{array}{llll}
\textit{Family} & \textit{Variance function} & \textit{Canonical link function} & \textit{Other common links} \\ \hline
\text{Gaussian} & \text{constant} & \mu \quad \text{(identity)} & \\
\text{Binomial} & \mu(1-\mu) & \log\left(\frac{\mu}{1-\mu}\right) \quad \text{(logit)} & \text{probit, cloglog} \\
\text{Poisson} & \mu & \log(\mu) \quad \text{(log)} & \text{identity} \\
\text{Gamma} & \mu^2 & 1/\mu \quad \text{(inverse)} & \log \\
\text{Inverse Gaussian} & \mu^3 & 1/\mu^2 & \log
\end{array}

(R implements these in fairly typical fashion, and in the cases mentioned above will use the canonical link if you don't specify one)
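A short way to confirm those defaults is to call each family constructor with no arguments and read its `link` component. The vector name `default_links` is just an illustrative choice.

```r
# Calling a family constructor with no arguments uses its default link
default_links <- c(
  gaussian         = gaussian()$link,
  binomial         = binomial()$link,
  poisson          = poisson()$link,
  Gamma            = Gamma()$link,
  inverse.gaussian = inverse.gaussian()$link
)
print(default_links)
```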

gung - Reinstate Monica
Glen_b
  • My recollection is that some of my social science professors refer to the "error distribution" of GLMs. Not sure why... – Sycorax May 13 '16 at 17:50
  • The error distribution would be the conditional distribution (i.e. for a given x) of y-E(y|x), but in general it would be a different distribution at each distinct value of $\eta$. Consider, for example, logistic regression: at a specific combination of the predictors, the probability is on two points, but it's on a different two points each time. It's not usually a productive way to think about GLMs. However, for some cases you can take a latent-variable approach and you could arguably talk about that as an error distribution. I should mention that in my answer. – Glen_b May 13 '16 at 17:59
  • My first boss's largest pet peeve was calling it an "error distribution". He discussed this with me on my second day of work. At the time, I didn't get it, but I'm happy to say that I get it now! – Matthew Drury May 13 '16 at 19:28
  • Great explanation, thank you very much! After reading the Wikipedia articles and your answer, I get it :) – user5365075 May 13 '16 at 22:07
  • In a certain sense, GLM converts a problem with the conditional distribution into a linear problem with a Gaussian noise/error term where the mean and the variance (which is now not constant) are transformed. This is then solved iteratively. So, from that point of view, GLM does have something to do with the term 'error distribution'. – Sextus Empiricus Oct 13 '21 at 14:30

In R, if you read the documentation for the function ?family, you will see the default links in a list at the top:

Usage

family(object, ...)  

binomial(link = "logit")  
gaussian(link = "identity")  
Gamma(link = "inverse")  
inverse.gaussian(link = "1/mu^2")  
poisson(link = "log")  
quasi(link = "identity", variance = "constant")  
quasibinomial(link = "logit")  
quasipoisson(link = "log")

You might notice that the default links tend to be the canonical links for the various distributions. However, you can specify alternative links (e.g., family=binomial(link="probit")), if you prefer. Any function that maps the range of the parameter being fitted (e.g., for logistic regression $\pi_i \in (0, 1)$) to the possible range of the model's right-hand side (always $(-\infty, \infty)$) can be acceptable. In fact, you can use a function that doesn't meet this standard so long as the data in your sample don't cause the fitted parameter to go outside of the acceptable range. (For instance, people sometimes use the identity function as their link with count data, or with proportions such as polling results, when the fitted values aren't near the bounds.)
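Both points can be tried on simulated data. The sketch below is illustrative only: the seed, sample size, and true coefficients are hypothetical choices, and the count means are deliberately kept well away from zero so that the identity link never produces an invalid (non-positive) fitted mean.

```r
set.seed(42)                     # arbitrary seed
n <- 2000                        # hypothetical sample size
x <- runif(n, 0, 5)

# Binary response generated from a probit model, fitted with two links
y_bin      <- rbinom(n, size = 1, prob = pnorm(-1 + 0.6 * x))
fit_logit  <- glm(y_bin ~ x, family = binomial(link = "logit"))
fit_probit <- glm(y_bin ~ x, family = binomial(link = "probit"))

# Count response with an identity link; true means run from 10 to 20,
# far from the lower bound of 0
y_cnt  <- rpois(n, lambda = 10 + 2 * x)
fit_id <- glm(y_cnt ~ x, family = poisson(link = "identity"))

print(coef(fit_probit))          # should land near -1 and 0.6
print(coef(fit_id))              # should land near 10 and 2
# The two binary fits differ in coefficients but give similar probabilities
print(max(abs(fitted(fit_logit) - fitted(fit_probit))))
```

If the data did push fitted means towards zero, `glm` with the identity link could fail or need explicit starting values, which is the caveat described above.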

I suspect you would benefit from an overview of the generalized linear model and link functions. It may help you to read my answer here: Difference between logit and probit models, which ends up doing some of that even though it was written in a different context. You can also peruse some of the threads categorized under the tag.

gung - Reinstate Monica
  • Thank you for the great answer! I want to upvote it but don't have enough reputation to do so :( I will as soon as I have enough points. The other answer was also great and I ended up understanding it slightly better than yours, but I appreciate the time taken to write it. Thank you so much! – user5365075 May 13 '16 at 22:10
  • You're welcome, @user5365075. Glen_b's answer is better here, & should be the accepted one, so there's no problem. I answered after him; I just added a couple extra points to supplement his answer. – gung - Reinstate Monica May 13 '16 at 22:56