
I am struggling to understand what a saturated model is.

As far as I know, a saturated model is a model that has as many parameters as data points. But I don't know how to build it, or what its exact form is.

For illustration, I have the following example:

$(Y, X_1, X_2, X_3, X_4) = (1, 2, 3, 4, 5),\ (2, 3, 4, 5, 6),\ (3, 4, 5, 6, 7),\ (3, 5, 6, 7, 8)$

where $Y$ follows a Poisson distribution and $X_1, X_2, X_3, X_4$ are the independent variables. The link function is the log.

I think the saturated model may have the form:
$\log(E[Y]) = coef_1 X_1 + coef_2 X_2 + coef_3 X_3 + coef_4 \cdot ???$

(4 coefficients/parameters, since we have 4 data points)

And to fill in $???$, I think there are many possible choices: $X_1 X_2$, $X_1 / X_3$, or even $X_1 X_2 X_3 / X_4$...

So I have 2 questions:

  1. How many saturated models are there for a given dataset in a GLM? (i.e. we fix the assumption that $Y$ follows some distribution and also fix the dataset.) What is its form? Is there any general principle for constructing the saturated model?

  2. If there is more than one saturated model, which is the "real" saturated model? (As far as I know, the saturated model is defined as the model that fits the dataset perfectly.)

Thank you very much for your help!

1 Answer


The question is not specific to GLMs or to Poisson models; it applies to any regression model.

A saturated model is one in which there are as many estimated parameters as observations, as you say. By definition, this will lead to a perfect fit, but will be of little use statistically.

  1. How many saturated models are there for a given dataset in a GLM?

Given an arbitrary dataset, you can construct as many saturated models as you wish. If there are insufficient variables, then you can add higher-order terms, interactions, or other derived variables such as logarithms or fractional powers. Once the number of parameters equals the number of observations, the model will be saturated.
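To make this concrete, here is a minimal sketch (not how any particular package does it) that builds two different saturated models for the four observations in the question, using only $X_1$ and derived terms. The two sets of terms, `basis_a` and `basis_b`, are my own illustrative choices. When the design matrix is square and invertible and all $y_i > 0$, the Poisson maximum-likelihood fit with a log link reproduces the data exactly, so the coefficients can be found by solving the linear system $X\beta = \log y$ instead of running an iterative fit.

```python
import math

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Data from the question: Y and X_1 for the four observations.
y = [1, 2, 3, 3]
x1 = [2, 3, 4, 5]

# Two hypothetical choices of four terms (illustrative, not canonical).
basis_a = [lambda x: 1.0, lambda x: x, lambda x: x ** 2, lambda x: x ** 3]
basis_b = [lambda x: 1.0, lambda x: x, lambda x: x ** 2, lambda x: 1.0 / x]

def fitted_means(basis):
    """Fit the saturated log-link model and return the fitted means exp(X beta)."""
    design = [[f(x) for f in basis] for x in x1]
    coef = solve(design, [math.log(v) for v in y])
    return [math.exp(sum(c * d for c, d in zip(coef, row))) for row in design]

mu_a = fitted_means(basis_a)
mu_b = fitted_means(basis_b)
print(mu_a)
print(mu_b)
```

The two models have different coefficients and different terms, but both return fitted means equal to the observed $Y$ (up to floating-point error), which is what "saturated" means here.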

  2. If there is more than one saturated model, which is the "real" saturated model? (As far as I know, the saturated model is defined as the model that fits the dataset perfectly.)

I don't think the term "real saturated" is well-defined. If you have $n$ observations and $p$ variables (assuming a model with no intercept), then with $n > p$ you can include all $p$ variables and add further terms until you reach saturation. With $n = p$ you can include all $p$ variables and the model will be saturated, and with $n < p$ you can choose whichever variables you want (as well as derived terms) to achieve saturation.

Robert Long
  • Hi, thank you very much for your answer. It helps me a lot. Actually, the purpose behind this question is to find out how to compute the log-likelihood of a saturated model (in order to eventually calculate the deviance). My concern is that if there are many saturated models, there will be many saturated-model log-likelihoods. But it seems to me there is only one log-likelihood of the saturated model, right? Or if there is more than one, do you know which one statistical software uses to calculate the deviance? Thank you!! – InTheSearchForKnowledge Jun 07 '21 at 19:46
  • You're welcome. I'm not sure what you mean. There is no unique saturated model. What software are you referring to that calculates a "saturated model deviance" ? Could you provide an example ? – Robert Long Jun 07 '21 at 19:48
  • I use R. The definition of the deviance of a GLM is $2 (\log L_{\text{Saturated Model}} - \log L_{\text{Model}})$. I know that we can compute $\log L_{\text{Model}}$. So to compute the deviance, we have to compute $\log L_{\text{Saturated Model}}$, but since there are many saturated models, I don't know how to compute it. – InTheSearchForKnowledge Jun 07 '21 at 19:56
  • Ahh, OK, I see what you mean. We usually use deviance to compare 2 models. The saturated log-likelihood (however we define it) will be the same for all models fitted to this dataset, so we can think of it as a constant; when comparing the deviance of two models, the constant cancels out. So I don't think you need to actually compute the likelihood of the saturated model. – Robert Long Jun 07 '21 at 20:04
  • Hi, thank you very much for your help. But as shown in this post https://stats.stackexchange.com/questions/108995/interpreting-residual-and-null-deviance-in-glm-r I can see that after I run the GLM regression, the output of the model includes the "residual deviance" (which is defined as in my previous comment). So I think we have to know how to calculate the log-likelihood of the saturated model. What do you think? – InTheSearchForKnowledge Jun 07 '21 at 20:18
  • OK this is interesting. But I think the log likelihood for any saturated logistic regression model will be zero. – Robert Long Jun 07 '21 at 20:26
  • I don't think so; see https://stats.stackexchange.com/questions/184753/in-a-glm-is-the-log-likelihood-of-the-saturated-model-always-zero This confuses me a lot. – InTheSearchForKnowledge Jun 07 '21 at 20:27
  • It confuses me too. I was actually thinking about a logistic model, and the answer by @Taylor in that thread does show that it is zero for a logistic model (with ungrouped data), but you have Poisson data... – Robert Long Jun 07 '21 at 20:31
  • I would recommend that you ask a new question, specifically about how to calculate the deviance for a Poisson GLM. I think my answer explains how to build a saturated model (as per the question), but hopefully you will get answers about the deviance if you ask specifically about that. – Robert Long Jun 08 '21 at 08:10
  • Hi Robert, after extensive research, I think that given a model, there can be many saturated models. But given a model and an assumption about the distribution of the explained variable (i.e. for example, I say that $Y_i$ follows $Poisson(\lambda_i)$), I think there is only 1 saturated model (because the saturated model is fitted by maximum likelihood estimation, and the result is unique). What do you think? – InTheSearchForKnowledge Jun 09 '21 at 20:27
  • That makes sense to me! I've seen your new question, so it might be worth adding that to the question. There are a number of experts in GLMs on here who can hopefully shed some light on this. – Robert Long Jun 09 '21 at 20:29
  • There was a typo in my last comment. I meant that given a dataset, there are many saturated models. And given a dataset plus an assumption on the distribution of the explained variable, there should be only 1. Thank you very much for your kind support! It helps a lot, as I have to self-study without any professor to ask. – InTheSearchForKnowledge Jun 09 '21 at 20:32
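
As a footnote to the comment thread above, here is a minimal sketch of the conclusion reached there, assuming a Poisson response. Whatever terms are used to reach saturation, the fitted means equal the observed counts ($\hat\mu_i = y_i$), so the saturated log-likelihood depends only on the data and the distributional assumption. The deviance is then $2(\log L_{\text{saturated}} - \log L_{\text{model}})$. The function names below are my own, and the intercept-only model (fitted mean equal to the sample mean) is used purely as an example of a non-saturated fit.

```python
import math

def poisson_loglik(y, mu):
    """Poisson log-likelihood of counts y under means mu (terms with y == 0 contribute -mu)."""
    total = 0.0
    for yi, mi in zip(y, mu):
        total += (yi * math.log(mi) if yi > 0 else 0.0) - mi - math.lgamma(yi + 1)
    return total

def saturated_loglik(y):
    """Log-likelihood when every fitted mean equals its observation (mu_i = y_i)."""
    total = 0.0
    for yi in y:
        total += (yi * math.log(yi) if yi > 0 else 0.0) - yi - math.lgamma(yi + 1)
    return total

def deviance(y, mu):
    """Deviance = 2 * (saturated log-likelihood - model log-likelihood)."""
    return 2.0 * (saturated_loglik(y) - poisson_loglik(y, mu))

# Response values from the question.
y = [1, 2, 3, 3]

# Example non-saturated fit: the intercept-only Poisson model, whose
# maximum-likelihood fitted mean is the sample mean for every observation.
mean_y = sum(y) / len(y)
mu_null = [mean_y] * len(y)

ll_sat = saturated_loglik(y)
dev_null = deviance(y, mu_null)
print("saturated log-likelihood:", ll_sat)
print("null deviance:", dev_null)
```

Note that the saturated log-likelihood here is not zero, unlike the ungrouped logistic case discussed in the comments, because the Poisson probability of observing $y_i$ when the mean is $y_i$ is less than 1.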